Data Wrangling

Data wrangling is the process of getting hold of your data and getting it into a usable form. What this involves depends on the context and the overall goal of the analysis: it might be as simple as importing a library (some libraries ship with datasets built in), or it might mean downloading a large .csv file, feeding JSON into a pandas data frame, and so on. Data wrangling is sometimes called "munging".

The steps presented below are optional. Think about your specific problem and apply them as needed.

Data Retrieval

First, get your dataset. Sometimes datasets are fairly large and cannot be stored locally, so we need to understand how to analyze a dataset that is stored remotely. Other times we are dealing with sensitive data that should never leave the premises, in which case we have to think about how to use and analyze that data without it ever leaving its secured location.

Where is your data? Where are you performing the model building or analysis? How are you going to move the data from where it is to where it needs to go? What are the security concerns, if any? Will you run into networking problems when downloading large datasets?
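For example, a small remote dataset can often be read straight into pandas without saving a copy locally. Below is a minimal sketch; the URL is just a placeholder, and pandas is assumed to be installed.

# a minimal sketch; the URL below is a hypothetical placeholder
import pandas as pd

url = "https://example.com/data.csv"  # hypothetical remote dataset
df = pd.read_csv(url)                 # pandas can read a .csv directly from a URL
print(df.shape)                       # quick sanity check on rows and columns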

Data Format Conversion

Here, we have data in one format and need to convert it to another. Depending on your problem, this is sometimes where you load the data into memory, for instance with pandas using read_csv(). Other times, we have to convert the file format before loading the data in. Maybe we have data in an Excel spreadsheet that we need to export to a .csv file and then read into a pandas data frame. Or perhaps we have .jpg images we want to convert to .png.

What format is your data in? How will you hold the data in memory for the model building or analysis? What kind of format conversion do you need to do, if any?
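For example, the Excel-to-.csv path described above might look like the following minimal sketch; the file name is a placeholder, and reading .xlsx files requires the openpyxl package alongside pandas.

# a minimal sketch; sales.xlsx is a hypothetical file
import pandas as pd

df = pd.read_excel("sales.xlsx")     # load the spreadsheet (needs openpyxl)
df.to_csv("sales.csv", index=False)  # export it to a .csv file
df = pd.read_csv("sales.csv")        # read the .csv into a data frame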

You can find a list of common data formats here.

Combining Data

Sometimes we have many datasets that we need to combine to make the analysis easier.
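For example, pandas can stack datasets that share the same columns, or join datasets on a common key. Below is a minimal sketch; the file names and the id column are hypothetical.

# a minimal sketch; the file names and the "id" column are hypothetical
import pandas as pd

part1 = pd.read_csv("part1.csv")
part2 = pd.read_csv("part2.csv")
combined = pd.concat([part1, part2], ignore_index=True)  # stack rows that share columns

other = pd.read_csv("other.csv")
merged = combined.merge(other, on="id", how="left")      # join on a shared key column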

Data Cleaning

Sometimes, we know in advance that some portions of the dataset are poor. Here we can do basic steps to get our data ready for analysis. For instance, maybe our dataset contains a number of images that we know are bad; we can simply toss them out. Or perhaps we have a .csv file in which two datasets with different columns were merged together, leaving empty values; separating them back into two datasets might help. Sometimes, removing HTML tags (<>) from all of the data is useful.
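For example, dropping rows with empty values or stripping HTML tags takes only a few lines of pandas. Below is a minimal sketch; the file name and the text column are hypothetical.

# a minimal sketch; data.csv and its "text" column are hypothetical
import pandas as pd

df = pd.read_csv("data.csv")
df = df.dropna()  # toss out rows with empty values
df["text"] = df["text"].str.replace(r"<[^>]+>", "", regex=True)  # strip HTML tags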

Data cleaning at this step is optional, and often we won’t know what kind of data cleaning we need to do until we do EDA.

Code Examples

All of the code examples are written in Python, unless otherwise noted.

Containers

These are code examples in the form of Jupyter notebooks running in a container that comes with all the data, libraries, and code you’ll need to run them. Click here to learn why you should be using containers, along with how to do so.
Quickstart: Download Docker, then run the commands below in a terminal.

Creating TFRecords

We create TFRecords from scratch using an image dataset.

# pull the container (only needs to be run once)
docker pull ghcr.io/thedatamine/starter-guides:creating-tfrecords

# run the container
docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:creating-tfrecords
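For reference, the core idea looks roughly like the minimal sketch below, assuming TensorFlow is installed and an images/ directory of .jpg files exists; the notebook in the container walks through the full workflow.

# a minimal sketch; the images/ directory is a hypothetical example
import glob
import tensorflow as tf

def bytes_feature(value):
    # wrap raw bytes in a tf.train.Feature
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

with tf.io.TFRecordWriter("images.tfrecord") as writer:
    for path in glob.glob("images/*.jpg"):
        image_bytes = tf.io.read_file(path).numpy()  # raw JPEG bytes
        example = tf.train.Example(features=tf.train.Features(feature={
            "image": bytes_feature(image_bytes),
            "filename": bytes_feature(path.encode("utf-8")),
        }))
        writer.write(example.SerializeToString())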