As with many other processes in life, there isn’t a single perfect way to develop a predictive model. As you build your data science skills, you’ll learn what works best for you and how to design an effective and accurate model. This section contains suggestions that our team hopes will help you on that path.
Don’t focus solely on the data! Now repeat that 3 times.
Data is very important, but it isn’t the only factor in a successful project.
As mentioned above, one of the first things most people ask when starting a project is “What data will we have?” It’s totally understandable to ask this question. Data is important, and without good data the models won’t be able to tell you much. However, when you’re first starting a project, think through some of the questions below:
How will my business user/customer use the model?
Is the model ethical?
Which modeling category does this fall into? (Regression, Classification, Other)
Are there any modeling techniques I can research to learn more about the goal?
What tools or apps am I going to use for collaboration?
How often will I communicate findings back to the business?
This isn’t a full list of questions, but it shows that there is a lot to think through when starting a project.
In our experience it is always beneficial to take time and think through a project. Review the who, what, when, where, and why. Brainstorm different processes. Do this before you even look at the dataset.
These items will all be adjusted as the project evolves and you learn more, but taking time to think through them helps to ensure that everyone is on the same page and that you are designing the correct product for the project. Trust us, it’s time well spent.
Just like project planning, Agile should be adapted to your use case. The principles of Agile can be very beneficial, but they aren’t relevant in all cases. That being said, they can be applied to data science in a way that helps to track tasks and iterate in a positive way.
There are lots of resources both in The Examples Book and online focused on Agile in data science. We wanted to talk about it here because it can occasionally feel a bit clumsy when run on a predictive modeling project. A few of our experiences are included below.
The core idea of Agile project management is to break work down into sprint-level tasks. Sprints are often 2 weeks long but can vary by organization and team. The team focuses on executing these deliverables, and at the end of the sprint they review the deliverables and address any issues that the team faced.
Applying Agile to a process such as exploratory data analysis (EDA) can feel a bit weird, primarily because EDA is just that: exploratory. You don’t necessarily have a deliverable or goal other than learning more about the data. However, Agile often operates with the idea that you need to have a product at the end of a sprint. Even though it may feel uncomfortable, it can be helpful to still hold to Agile when going through EDA.
In this case your deliverable becomes a report or summary to the team about the current findings. Even if the EDA is going to continue it drives positive behavior in that the team is more aware of different trends or interesting findings in the data. In addition, it helps to address issues if the analyst feels stuck during the EDA.
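As a rough illustration of what a sprint-end EDA summary could look like, here is a minimal sketch using only the Python standard library. The column names, toy records, and helper functions (`summarize_column`, `eda_summary`, `format_report`) are hypothetical and exist only for this example; a real project would likely lean on its own tooling and data.

```python
import statistics


def summarize_column(values):
    """Return basic summary statistics for one numeric column, skipping missing values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present) if present else None,
        "median": statistics.median(present) if present else None,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


def eda_summary(rows, columns):
    """Build a per-column summary from a list of dict records."""
    return {col: summarize_column([row.get(col) for row in rows]) for col in columns}


def format_report(summary, open_questions):
    """Turn the summary and any open questions into a short plain-text report."""
    lines = ["EDA sprint summary"]
    for col, stats in summary.items():
        lines.append(
            f"{col}: n={stats['count']}, missing={stats['missing']}, "
            f"median={stats['median']}, range=[{stats['min']}, {stats['max']}]"
        )
    lines.append("Open questions:")
    for question in open_questions:
        lines.append(f"- {question}")
    return "\n".join(lines)


# Hypothetical toy records, made up purely for illustration.
rows = [
    {"age": 34, "monthly_spend": 120.0},
    {"age": 41, "monthly_spend": None},
    {"age": 29, "monthly_spend": 80.0},
    {"age": None, "monthly_spend": 100.0},
]

summary = eda_summary(rows, ["age", "monthly_spend"])
report = format_report(summary, ["Why is monthly_spend missing for some customers?"])
print(report)
```

Listing open questions next to the numbers keeps the report useful even when the exploration is unfinished, and it gives an analyst who feels stuck a natural place to ask for help.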
One of the most beneficial ideas of Agile is iteration. Iterating on the model concepts, the communication with stakeholders, and many of the other project components is a very helpful process. Agile helps to reinforce this with many of its built-in processes, such as team stand-ups. These are short meetings where each team member covers what they’ve done, what they plan to do, and any issues that they face.
Even if Agile doesn’t fit within your organization, setting up meetings to touch base with project stakeholders and review plans is highly recommended.
One of the most common situations in data science starts with a great kick-off meeting. Everyone is excited and looking forward to seeing the project. Six months later the teams meet back up and the data science team unveils their shiny new model. Two minutes later they realize that the model isn’t what the business was looking for, and everyone is a lot less excited.
Iterative check-in meetings help to avoid these awkward misunderstandings by reviewing and revising project expectations as you go. Spending this occasional time can help to avoid much larger issues down the road.
The last thought for model design is the idea of sharing knowledge within and between teams. In many situations we are encouraged to hoard knowledge. We don’t want to look silly or let someone get the edge on us.
While this likely works in some environments, in many others it creates silos and builds animosity between teams. Collaborative teams have a best practice of reviewing the team’s work and providing feedback.
These aren’t hypercritical teardowns (it’s not a thesis defense), but rather collaborative sessions where team members can share lessons learned or questions in a positive and learning-focused environment.
This also avoids a common issue where data scientists become so focused on their model and process that they get tunnel vision. This can lead them to miss important factors or even potentially unethical practices.
Promoting this open, collaboration-minded environment can be a major benefit to an analytics team.
One common challenge that many data scientists face is the idea that our models must be successful. This is reinforced by the way academics are structured and many businesses are run. It’s important to keep in mind that most data science questions are unanswered. The research is focused on finding out if it’s possible to build a predictive model.
If it turns out that it’s not possible, that is still valuable knowledge. Often this leads to good conversations around why it’s not possible. Maybe additional data needs to be collected in a different way. Maybe there are concerns about how the model would be used or the public perception. It’s critical as researchers to be honest about the model’s performance. As we’ll discuss further in the section on data science ethics, trying to force the model to have positive results can lead to many different issues.