Data Pipelining

Data pipelining, in a data science context, refers to setting up your data so that it is fed to the model one chunk at a time: feed in a chunk, train the model, save the model state, feed in the next chunk, train the model further, save the model state, rinse and repeat.
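
To make the idea concrete, here is a minimal sketch of chunked training. The Keras model, the file names, and the data_chunks() generator are assumptions for illustration; the generator simply stands in for a real data source like the image datasets used in the notebooks below.

import numpy as np
import tensorflow as tf

# Stand-in for a real data source: yields the data one small chunk at a time
# instead of loading everything into memory at once.
def data_chunks(n_chunks=5, chunk_size=1000, n_features=10):
    rng = np.random.default_rng(0)
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, n_features)).astype("float32")
        y = (X.sum(axis=1) > 0).astype("float32")
        yield X, y

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

for i, (X, y) in enumerate(data_chunks()):
    model.fit(X, y, epochs=1, verbose=0)         # train further on this chunk
    model.save_weights(f"chunk_{i}.weights.h5")  # save the model state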

Sometimes you will see "data pipelining" used in a data engineering context, where it refers to setting up back-end databases, automated systems, and/or controlling data flow between applications, such as cloud-based interactive systems. To make matters even more confusing, feeding in data to train a model further sometimes requires some data engineering! The terms overlap, and which meaning is intended depends on the context.

Motivation

With some large datasets, it can be hard to hold all of the data in memory at once while training the model. The precise calculation (and corresponding memory requirement) depends on your computing equipment and software, but if you only have 16 GB of RAM, that is essentially your ceiling; you may also have GPU RAM that can be used for the calculations, depending on your configuration. Regardless of how much computing equipment you have, combining an incredibly large dataset with complex calculations (like computing gradients for millions of parameters in a neural network) makes it easy to run into a computational brick wall, or at least one that will never finish while you are alive.

Other times, we don’t yet have all the data we will eventually train on. For instance, if we are building a self-driving vehicle that uses neural networks trained on camera and car-diagnostic data, we want to be able to include data from future car rides, not just the test rides we’ve already done.

Yet another motivation for data pipelining is the speed at which data can be read in. Data pipelines, when built correctly, can speed up the I/O operations needed for learning.

Software to Build Data Pipelines

Most of the major data science toolkits provide tools for building data pipelines, including PyTorch, TensorFlow, Caffe, scikit-learn, and more.
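
As one small illustration, here is a sketch of a tf.data pipeline in TensorFlow that shuffles, batches, and prefetches data so that I/O overlaps with training. The random in-memory data here is purely illustrative; in practice the pipeline would read from files or TFRecords.

import tensorflow as tf

# Toy in-memory data; in practice this would come from files or TFRecords.
features = tf.random.normal([1000, 10])
labels = tf.random.uniform([1000], maxval=2, dtype=tf.int32)

dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .shuffle(buffer_size=1000)    # randomize the order of examples
           .batch(32)                    # group examples into batches
           .prefetch(tf.data.AUTOTUNE))  # prepare batches while the model trains

for batch_features, batch_labels in dataset.take(1):
    print(batch_features.shape, batch_labels.shape)  # (32, 10) (32,)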

Code Examples

All of the code examples are written in Python, unless otherwise noted.

Containers

These code examples are Jupyter notebooks running in containers that come with all the data, libraries, and code you’ll need to run them. Click here to learn why you should be using containers, along with how to do so.
Quickstart: Download Docker, then run the commands below in a terminal.

Creating TFRecords

We create TFRecords from scratch using an image dataset. TFRecords can be used to minimize I/O read times, implement data sharding, build data pipelines, and store data efficiently, among other things.

#pull container, only needs to be run once
docker pull ghcr.io/thedatamine/starter-guides:creating-tfrecords

#run container
docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:creating-tfrecords
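
For a sense of what goes on inside the notebook, here is a minimal sketch of writing a TFRecord file. The file name images.tfrecord, the feature names, and the randomly generated stand-in images are assumptions for illustration; the notebook itself works with a real image dataset.

import tensorflow as tf

# Each record is a serialized tf.train.Example holding raw PNG bytes and a label.
def make_example(image_bytes, label):
    features = tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    })
    return tf.train.Example(features=features)

with tf.io.TFRecordWriter("images.tfrecord") as writer:
    for label in (0, 1):
        # Stand-in for a real image: a small random uint8 tensor encoded as PNG.
        fake_image = tf.cast(tf.random.uniform([8, 8, 3], maxval=256, dtype=tf.int32), tf.uint8)
        image_bytes = tf.io.encode_png(fake_image).numpy()
        writer.write(make_example(image_bytes, label).SerializeToString())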

Convolutional Neural Networks, Anomaly Detection, and Reading TFRecords

Here we explore using convolutional neural networks (CNNs) to classify images of concrete by their anomaly status (that is, whether or not they have a crack). Our data is supplied as TFRecords, so we look at how to read them in for training and testing. TFRecords are an easy way to build efficient data pipelines, and this notebook explores how to set that up.

#pull container, only needs to be run once
docker pull ghcr.io/thedatamine/starter-guides:cnn-anomaly-reading-tfrecords

#run container
docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:cnn-anomaly-reading-tfrecords
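
As a companion to the writing sketch above, here is a minimal sketch of reading TFRecords back in with tf.data; it assumes the images.tfrecord file and the feature names used in that earlier sketch, not the notebook's actual dataset.

import tensorflow as tf

# Describe the features stored in each serialized Example.
feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_png(parsed["image"], channels=3)  # raw bytes back to a tensor
    return image, parsed["label"]

dataset = (tf.data.TFRecordDataset("images.tfrecord")
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(2)
           .prefetch(tf.data.AUTOTUNE))

for images, labels in dataset:
    print(images.shape, labels.numpy())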