Thinking About The Output

What are we building? What kind of data are we analyzing? What do we want to make happen? Are we trying to build a model that can classify, or regress? These and many other questions will be considered here, as we think about what kind of output we seek.

Choosing a Model

You may have seen the Starter Guides on choosing models. Naturally, a model choice limits the outputs. Once we have chosen a model, we then have to choose from the range of outputs it can offer, then permutate from there, or rethink the model(s) we have chosen.

Imagining The Output

You may recall from the commentary on what a function means in data science that we are looking to get some output $Y$: we put data ($X$) in, try to minimize the reducible error by modifying our function $f$, and get some output $Y$.

Is $Y$ a fitted regression line? Is it a prediction, a probability representing the chance that an image contains a stop sign in it? Is it multi label output, namely, are the labels that can be applied to it non-exclusive? Is the output a number, a string, an image, or what kind of data type? Are we trying to deploy the model? Are we just building a graph one time and thats it? If you can go through each model choosing page and identify roughly where your output should lie there, often the model and output will naturally choose itself.

For instance, image classification is a very common neural network task. The output in that case would be (implicitly) a model which can make predictions, and the individual probability predictions themselves, a numerical value between 0 and 1. These predictions might come in arrays, or they might be singular one-off predictions.

Being really specific about the form of output is essential. Maybe you aren’t sure when you start, but over time you will get used to knowing roughly what you want as an output- it comes with practice.

Multi-Label vs. Multi-Class Output

For classification problems, there is a notion of multi-label vs. multi-class output. Multi-class refers to having more than 2 classes as input or output (for instance, with images of dogs, you might have 5 different breeds represented, but a dog can only be 1 breed at once); here the output is exclusive to one label, despite there being more than 2 labels. Multi-label refers to having more than 2 of the labels attached to each input or output (for instance, when classifying flower colors, there could be multiple different colors on the flower).

The type of output here has a direct impact on the metrics used to measure performance. For instance, categorical cross entropy is commonly used for multi-label output.

Summarizing Output Types

Try to summarize the following things to get a clear idea of your output:

What kind of model(s) am I building? K-nearest neighbors, multivariable regression, etc?
What data types (floats, strings etc) will be outputted?
What shape will the data take on when its outputted? Will it be a scalar? An image? An array? A list of probabilities with the likelihood for each label?
What kind of analysis am I doing in general? Computer vision, NLP, etc?
What kind of metrics should I be using to measure the output?
Are we building dashboards, interactive webapps, graphs, databases, predictions, etc out of this? What form do they need the data in to be able to feed in to them?
Go through each of the pages in the choosing models section, and write out if your problem seems to prefer a particular type of model. Maybe you think a supervised model is best, or you’re sure that regression is the best type of analysis for your problem. If you aren’t sure, try a few different ways and see if you can develop intuition!