
Data Science Pipelines

A topic that comes up fairly regularly amongst data science professionals is the idea of pipelines. And I can imagine that all of the casual talk about pipes and pipelines probably makes it seem like data scientists are something more akin to plumbers than anything else… which wouldn’t be the worst characterization of the job I’ve ever heard.

Oftentimes, the goal of a data scientist is to build pipelines which might, for example:

  • Format raw data into datasets which can be quickly combined and used for modeling purposes.
  • Build, train, and test a variety of models to identify which ones are most promising.
  • Deploy, monitor, and continually update models which are used for decision-making purposes.
  • Tie the three pipelines above together to create a seamless source of information that is readily available for making decisions.
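
To make the list above a bit more concrete, here is a minimal sketch of the first two kinds of pipeline chained together: format some raw records, fit two candidate models, and compare them on held-out rows. Everything in it is an assumption made for illustration: the `RAW` records, the field names `hours` and `score`, and the helper functions are all hypothetical, and a real pipeline would lean on proper libraries and a more careful train/test split.

```python
from statistics import mean

# Hypothetical raw records; field names and values are made up for illustration.
RAW = [
    {"hours": "1", "score": "52"},
    {"hours": "2", "score": "55"},
    {"hours": "3", "score": "61"},
    {"hours": "4", "score": "64"},
    {"hours": "", "score": "70"},  # missing value, dropped during formatting
    {"hours": "5", "score": "68"},
    {"hours": "6", "score": "73"},
]


def format_data(raw):
    """Step 1: turn raw string records into clean (x, y) number pairs."""
    rows = []
    for rec in raw:
        if rec["hours"] and rec["score"]:
            rows.append((float(rec["hours"]), float(rec["score"])))
    return rows


def fit_mean(train):
    """Candidate A: always predict the average training score."""
    avg = mean(y for _, y in train)
    return lambda x: avg


def fit_line(train):
    """Candidate B: a least-squares straight line."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in train) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def mean_abs_error(model, rows):
    """Average absolute gap between predictions and observed values."""
    return mean(abs(model(x) - y) for x, y in rows)


def run_pipeline(raw):
    """Step 2: train each candidate, test on held-out rows, pick the best."""
    rows = format_data(raw)
    train, test = rows[:4], rows[4:]  # simple ordered split, for illustration only
    candidates = {"mean": fit_mean(train), "line": fit_line(train)}
    errors = {name: mean_abs_error(m, test) for name, m in candidates.items()}
    best = min(errors, key=errors.get)
    return best, errors


if __name__ == "__main__":
    best, errors = run_pipeline(RAW)
    print(best, errors)
```

The point of the sketch is the shape rather than the models: each step is a small function with a clear input and output, so any one of them can be swapped out or monitored without rewriting the rest.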

What’s common throughout this (definitely non-exhaustive) list of pipelines is that data science is a field about building and evaluating processes, and I think one obvious question this brings up is, “How do we prepare and train students or young/new data scientists to build these types of processes?” We prepare them by teaching them to be critical thinkers and consumers of data first.

Teaching critical data thinkers

Training students and/or new data scientists is a pipeline problem in and of itself. Ideally, we’d get students from diverse backgrounds and perspectives interested in the field, give them a sample of the field to drive their interest, and once we’ve “hooked” them, motivate them to acquire specialized training at universities or through online coding programs. Why then is learning to be a critical thinker with data so important? Because it’s the thought process which underlies all successful data science pipelines.

People who are trained to think critically about data will spend more time thinking about how data has been sourced, who might be represented in such data, as well as who might not be represented. These thoughts can then guide how they assess the value of new data sources, how they format those sources, and how they represent the information those sources contain.

Teaching students to be critical with data also teaches them how to represent or summarize information so that it’s honestly and easily interpreted, skills that are entirely necessary when it comes to evaluating competing models, monitoring model performance over time, or even just justifying business decisions to non-data experts.

The IDS to DS pipeline

One of the things I have always loved about the Introduction to Data Science high school math curriculum has been that critical thinking about data has always come first. Students get experience and exposure to data topics that are as relevant today to data scientists as they were when the curriculum was initially written. Then they get to experience, in an authentic and meaningful way, how data scientists apply these critical thinking skills via writing code.

Will students interested in a career in data science, at some point, need to learn lots of math, statistics, probability, and calculus? Without a doubt, just like students need more than one high school biology class if they want to become doctors. So, is getting a PhD in statistics a necessary step before we can start getting people interested in data science? Absolutely not. In fact, I would argue that giving students a glimpse of what lies at the end of a mathematics pipeline guides more students into the field than trying to patch up an existing pipeline which is already leaking.

Data science is a great career which has already benefited immensely from data scientists coming into the field with diverse backgrounds such as economics, physics, computer science, mathematics, and more. My hope is that, with courses like IDS, we’ll continue to bring in new data scientists with different views and perspectives as we continue to grow the field into the future.

About the author:

Dr. James Molyneux is a data scientist for Swyfft, LLC, where he specializes in evaluating/developing new data sources and building risk/underwriting models and workflows. He is also courtesy faculty in the Department of Mathematics at Oregon State University.