The Data Modeling Process

The pages in this section are a recommendation, not the law. They serve as a general guideline, but you may find that skipping one is the right call for your project. If you are new to data modeling, though, try going through these pages in order, starting with this introduction.

What follows is a template for taking data in, processing it, building a model from it, and then analyzing the results. You will find that there is a general rhythm to this process, but that rhythm sometimes gets tossed out quickly depending on the needs of the project. Sometimes we get squeaky-clean data, already formatted as CSV with no missing values, ready for model building. Other times we get the opposite, and lots of it. Sometimes we aren’t sure what the best model to build is, or even how to get started. This template is meant to give you things to think through, roughly from start to finish.

Here is a brief synopsis of each page, in order.

All pages are optional, but you will find some are more optional than others. The ones that are very situational are marked (Optional).

Data Wrangling

You know where your data is, but you have to get it into a usable form before we can even look at it. For instance, maybe we want to convert a bunch of .jpg images to .png, or a .csv file to JSON. Then it’s time to load the data into memory in a form we can work with.
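
As a rough sketch, those two example conversions might look like this in Python, using pandas and Pillow (the file names here are just placeholders):

```python
import pandas as pd
from PIL import Image

# Convert a .csv file to JSON.
df = pd.read_csv("measurements.csv")
df.to_json("measurements.json", orient="records")

# Convert a .jpg image to .png (Pillow infers the format from the extension).
Image.open("photo.jpg").save("photo.png")
```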

EDA: Exploratory Data Analysis

You’ve got data loaded and ready to use, whether that’s as a tibble in R, a pandas DataFrame, a NumPy array, or however else you have your data stored in memory. Now it’s time to take a first glance at it and see what’s there. This is where we discover potential problems with the data: missing values, poorly formatted strings, values that just don’t make any sense and are wrong for some reason, and much more. We should also get a general feel for the data, by using functions like head() to glance at it and by making box and whisker plots to see roughly what the distributions look like.
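
A first glance in pandas might look something like this sketch (the file name is hypothetical):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("measurements.csv")  # hypothetical file

# Glance at the first few rows and the summary statistics.
print(df.head())
print(df.describe())

# Count missing values per column.
print(df.isna().sum())

# Box and whisker plots of the numeric columns.
df.plot.box()
plt.show()
```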

Thinking About The Output

So we’ve seen roughly what the data looks like, and now we have to think: what are we trying to get out of this data? What should the output look like? Will it be a prediction or an inference? Are we outputting a single variable, or many? What data type should the output be? Here we try to imagine what we would like to get out of the data, and roughly how the model we build will use the data to serve that end purpose.

Preprocessing

So we’ve discovered some problems with our data; here’s where we clean it up. We also manipulate it further to get it into a form we can build models from, and we base this manipulation on what we expect the output to be.
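
As a sketch of what this cleanup might look like in pandas (the file and column names are made up for illustration):

```python
import pandas as pd

df = pd.read_csv("measurements.csv")  # hypothetical file

# Fill in missing numeric values with the column median.
df["height"] = df["height"].fillna(df["height"].median())

# Drop rows that make no sense, e.g. a negative height.
df = df[df["height"] > 0]

# One-hot encode a categorical column so a model can use it as input.
df = pd.get_dummies(df, columns=["species"])
```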

Building Data Pipelines (Optional)

Sometimes, with lots of data, it’s hard to train the model on the entire training set at once. Pipelining is a process where we feed in a chunk of data, train the model on it, save the state, feed in another chunk, train the model further, save the state, and rinse and repeat.
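
One way to do this in Python is with scikit-learn estimators that support partial_fit, reading the data in chunks; a minimal sketch, assuming a hypothetical big_dataset.csv with a binary label column:

```python
import joblib
import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()

# Read the CSV in chunks instead of all at once.
for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    X = chunk.drop(columns=["label"])
    y = chunk["label"]
    # partial_fit trains incrementally, carrying state between chunks.
    model.partial_fit(X, y, classes=[0, 1])
    # Save the state after each chunk.
    joblib.dump(model, "model_checkpoint.joblib")
```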

Data Modeling

Now comes the fun part, where all our hard work in setting this up pays off. We set up the model, connect the data to it, and train it.
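
Here is a minimal sketch with scikit-learn, assuming a hypothetical cleaned-up file with a target column; we hold out a test set now so we can measure fit on the next page:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("clean_measurements.csv")  # hypothetical cleaned data
X = df.drop(columns=["target"])
y = df["target"]

# Hold out 20% of the data for measuring fit later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Set up the model, connect the data to it, and train it.
model = LinearRegression()
model.fit(X_train, y_train)
```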

Measuring Model Fit

So we’ve trained our model, but is it good enough? Here we verify how well we did.
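
One common check is to score predictions on held-out test data; a sketch continuing the hypothetical regression from the modeling step:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("clean_measurements.csv")  # hypothetical cleaned data
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# Compare predictions against data the model never saw during training.
print("mean squared error:", mean_squared_error(y_test, predictions))
print("R^2:", r2_score(y_test, predictions))
```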

Model Deployment (Optional)

Sometimes we build models so they can be used continuously for some purpose. Here we learn how to take the model we trained and make it accessible to others, so they can draw inferences from it or make predictions with it.
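
One common pattern is to save the trained model to disk and serve predictions over HTTP; a minimal sketch using joblib and Flask (the route and payload shape are illustrative, not a standard):

```python
import joblib
from flask import Flask, jsonify, request

# Load a model saved earlier, e.g. with joblib.dump(model, "model.joblib").
model = joblib.load("model.joblib")

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[1.0, 2.0, 3.0]]}.
    features = request.get_json()["features"]
    return jsonify(predictions=model.predict(features).tolist())

if __name__ == "__main__":
    app.run()
```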