Introduction to Data Science
Data science is a popular topic that is growing rapidly in both interest and complexity. Rather than diving right into technical details our goal is to help build skills that allow readers to think through data science problems.
Breaking down and understanding these problems are critical to building a good foundation for a successful data science project.
There are many definitions for a data science problem. Many different groups debate the specific definition, but to our team data science and analytics is the focus on using data and scientific thinking to solve a problem.
As with any discipline this has many layers. These layers can be both technical (modeling algorithms) or conceptual (ethics in data science). It’s important to learn both to be a successful data practitioner.
Breaking it down ever further we can think about what constitutes data. The awesome and amazing thing about data science is that data is everywhere. Anything from the weather to the type of cereal you had for breakfast, or how many steps you took to work is valuable data.
Think about your day-to-day life. What data would you be interested in?
How long did you sleep?
How much have you eaten?
What topic did you learn about that you found most interesting?
These are all data points!
Data science is everywhere. From the products and shows you’re recommended to the route home and your online order of groceries it all uses data science.
If you want a fun challenge, try to think of a discipline that doesn’t use data science and then look-up
that term + data science on the internet. I bet you’ll be surprised.
Because data science is everywhere it’s also used in almost everything. When a practice becomes that pervasive it’s important to ensure that people who use it understand it and use it ethically.
These can be difficult challenges in a world driven by profit and tight deadlines. Learning how to work with data well and in the right way are continuing journeys for any data science practitioners (our team included).
There are tons of different ways to manage data science projects. In this section we will provide an overview of how we often see data science projects progress as well as some of the tips and tricks we’ve picked up.
This is not the only source of information on these types of projects, and we encourage you to find the techniques that work for you.
If you wan’t to learn more about a specific phase we encourage you to check out the resources below:
The requirements gathering phase of a data science process is both the most difficult and the most important. As we discuss in the model thinking page it’s important to spend time understanding the business requirements and how the deliverable will be used.
However, this vital iteration and planning takes time. One of the biggest tips that we have for new projects is to give yourself time to plan and research. This isn’t always possible in industry, but if you have the time to review with business partners, research the modeling focus, and dive into the data it will pay off later in your project.
Once you’ve got a good idea of your deliverables, the data science lifecycle is relatively straightforward at a high-level, but much more complicated in practice. It often follows the form
explore → model → deploy, but with a few loops added in.
In practice this is a continuous iteration of learning about your data, talking with subject matter experts, attempting modeling techniques, finding out that they don’t work as expected, revisiting your data, creating new features, modeling again, finding out how to deploy, finding out that deploying is harder than you thought, setting up testing to ensure it deploys as expected, monitoring to make sure your model is up to date. The scary thing is that it isn’t even an exhaustive list.
We say all this not to scare you away from analytics, but to hopefully illustrate that these processes are fun and challenging. It takes multiple teams to get them to work and it’s important to keep perspective and give yourself time and patience when working through them.
Coming from a traditional school mindset many data scientists often think of models as "EDA this week, modeling next week, and deployed by the end of the month". If that timeframe works then it’s amazing, but the reality is often messier and more difficult.
That’s the science part of data science. Many times, these are unknowns that we are attempting to model. It’s an awesome and fun experience, but it’s not always something you can solve in 3 days.
Communicate with everyone! It’s better to double check something then to have an issue at the end of the project.
Take time for project planning. It’s tempting to dive into the data, but planning will help keep the project on track.
Review the literature. Often data science problems have many similar solutions. Learning from others will help to better your solution.
Always iterate. Never assume that you are done with a phase. If you’re having trouble modeling maybe, you need to do something else with the data.
Test everything. Especially when moving to production. A robust testing phase will help to reduce model issues for business users.
Use frameworks. Having set meeting times, deliverable reviews, and regular goals help teams to stay on track and make progress.
Don’t be afraid to ask for help. Data science is a huge field. We can’t know everything at all times. If you need help don’t be afraid to ask for it.