4 Behaviours Separating Good Data Science From Great Data Science

For data scientists and data owners looking to improve their approach to data science projects, this post looks at a few ways to take good data science practices and elevate them to great ones.

Data science, at its heart, is a methodology for extracting insights and information from raw data. It seeks to identify hidden patterns and relationships in data, quantify the outcomes and implications, and then condense them into a story that humans can tell and understand.

When starting from a mass of unstructured, unclean data in tens or hundreds of different locations, a relatively humble effort can yield noticeable outputs in terms of useful trends, illuminating correlations or smoking-gun anomalies. However, taking the step from finding some useful insights to extracting the full wealth of information can be tough, and it often requires a more nuanced approach to the data science workflow. In other words, it’s the difference between good data science and great data science.

1. Asking the Right Question

We’ll begin with the real basics. When we start a project, we’re often seeking to provide an answer to some question or requirement posed by our clients. It may be something specific, such as ‘how do we quickly identify customer service calls which need to be escalated to another department’, or something as broad as ‘which train services are likely to experience long delays next week’, and answering this question is the motivation for every activity to follow. Therefore, if we want to deliver value and meet the needs of our customers, we have to answer the question effectively.

The Good:

Fully understand the question being asked, identify the relevant metrics by which success will be judged, and find the subject matter experts who hold the data.

The Great:

First engage with the client to understand if the question asked is the one they really need to answer.

Now this might sound, on first hearing, like a patronising thing to do. Our clients know what they want, otherwise they wouldn’t be at the stage of engaging with a data scientist, right? Well, yes and no. It’s likely that they have a complete understanding of the problem they wish to solve, but the words with which they’ve communicated that problem may not encode all of the requirements and necessary information. Equally, the request could be unconsciously biased by what they perceive the best solution to be, or it could be addressing a symptom rather than the underlying cause.

By taking the question at face value, it’s easy to commit time and resources to doing brilliant, rigorous work on a model, an analysis or a service, only to find that it doesn’t really help to resolve the underlying issue. In the worst case, it’s only upon review at the end of the project that the mismatch between what was requested and what was needed comes to light. Asking why the question needs to be answered, or why the requirement needs to be satisfied, can be instrumental in guiding our efforts in the right direction.

Let’s take the example of ‘how do we quickly identify customer service calls which need to be escalated to another department’. We could answer this question by modelling whether queries are escalated or not based on the topic of the call, the demographic of the caller, the customer service representative, the length of the call and so on. On the back of this, our client could then implement a new script for customer service representatives which asks the relevant questions, feeds the results into the model, and sends the caller to the required department. However, if it then comes to light that the issue isn’t the literal time spent talking to a customer service representative, but rather the difficulty of screening out the calls which consume other departments’ time, then our perfectly valid answer to the question is far less useful.
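To make that first approach concrete, here is a minimal sketch of such an escalation classifier. The file calls.csv, its column names and the choice of logistic regression are all illustrative assumptions, not details from a real engagement.

```python
# A minimal sketch of the escalation model described above. The file
# calls.csv, its column names and the choice of logistic regression are
# all illustrative assumptions, not details from a real engagement.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

calls = pd.read_csv("calls.csv")  # hypothetical export of call records
X = calls[["topic", "caller_demographic", "representative", "call_length"]]
y = calls["was_escalated"]

# One-hot encode the categorical features and scale the numeric one.
preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"),
     ["topic", "caller_demographic", "representative"]),
    ("numeric", StandardScaler(), ["call_length"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)
model.fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```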

Instead, we could have spent that time analysing the topics which frequently require escalation and then constructing a chatbot which leads our client’s customers to the information they require. Clearly, this would be vastly different to the initial solution offered, but would do a much better job of satisfying the true need. It’s this kind of close inspection and interrogation of the question that can be the difference between a successful project and an interesting but fruitless piece of work.

2. Continuous Collaboration

As data scientists, our primary skills lie in manipulating data, whether that means ingesting it, separating relevant features from sources of noise, choosing the best algorithms for prediction, or communicating our results. This means that we have the tools to work with data from any industry and any domain. However, we won’t necessarily have the subject matter knowledge to make inferences on what information we should be harvesting, what’s most important, or where we should be looking for additional features.

Luckily, our clients are the subject matter experts in their own data and their own fields, and by working with them we can leverage their knowledge and understanding of the domain to help focus our approach.

The Good:

Engage with subject matter experts early in the process, and identify the most likely predictors or any additional features of interest.

The Great:

Engage with subject matter experts continually throughout the project, ensuring that any insights can be shared and captured as the project evolves.

Data science is an iterative process, and so our engagement with subject matter experts must be iterative as well. We engineer features, train models, assess performance and importance, and then try again until we’ve optimised our solution. During that process, we may make discoveries about which features are predictive, and then seek out other features related to, or derived from, the most important ones.
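As a rough illustration of that assess-and-iterate step, the sketch below ranks features by permutation importance, the kind of output that hands us concrete questions to take back to the subject matter experts. The data here are synthetic stand-ins rather than anything from a real project.

```python
# A rough sketch of the 'assess importance' step: permutation importance
# measures how much held-out performance drops when each feature is
# shuffled. Synthetic data stand in for a real project dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=6, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)

# List features from most to least predictive, prompting follow-up
# conversations about the ones that carry the most signal.
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f}")
```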

Although we can research the subject area independently and search for new sources of information, we will never be able to replicate the experience and knowledge of subject matter experts, nor should we presume that we managed to ask every relevant question at project kickoff. By keeping an open loop of communication between ourselves and the subject matter experts, we enable ourselves to identify and access the relevant extra details sooner, thereby reducing the chance that highly predictive features are missed.

3. Knowing When to Step Back

Following the data science pipeline to an optimised model is an iterative process that can take many attempts over a period of time. We can re-engineer features, adjust anomaly thresholds, change imputation strategies, and try different encoding approaches. We can test a whole variety of different models. We can tune every hyper-parameter, and test every subset of features. In other words, if our model isn’t performing as well as we’d like then there’s a huge number of dials for us to turn and modifications for us to try, a process which can go on and on for as long as our patience allows.
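For a flavour of those dials, here is a minimal sketch of one common move: a cross-validated search over a small hyper-parameter grid. The model, the grid values and the synthetic data are all illustrative assumptions.

```python
# A minimal sketch of hyper-parameter tuning via cross-validated grid
# search. The model and grid are illustrative; synthetic data stand in
# for a real project dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={
        "n_estimators": [100, 300],
        "max_depth": [2, 3, 4],
        "learning_rate": [0.05, 0.1],
    },
    cv=5,  # five-fold cross-validation for each parameter combination
)
search.fit(X, y)
print(search.best_params_, f"cross-validated score: {search.best_score_:.3f}")
```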

The Good:

Draw on experience to narrow down the set of appropriate models, and perform many rounds of tuning and model selection to optimise performance.

The Great:

Know when no amount of minor changes to the input data or modelling strategy will yield good predictions.

Sometimes, the data we have available just aren’t sufficiently predictive of the target for us to train a reliable model. Some quantities are governed by inherently random processes, or by circumstances so complex and difficult to quantify that there would be no reasonable chance to gather data on them. In other cases, there are important quantities which were not recorded and cannot be retroactively populated. Whatever the reason, if we don’t have information that accurately predicts the target then we’re doomed from the start. Rather than training an increasingly complex model which yields predictions no better than guessing, it can sometimes be the considered choice to step back and acknowledge that there is no benefit in producing a model on that dataset.

This is obviously a disappointing conclusion to reach, but it can still provide value in terms of shaping our clients' information capture processes and identifying potential blind spots in their data. One way to address this possibility more proactively is to begin data science projects with an upfront feasibility study, fitting some low-optimisation models to get a sense of how predictive the data are of the target. This sets expectations for both us and our clients, and prevents the waste of time and money on both sides. Ultimately, we are all better served by focusing efforts on something we can achieve, rather than delivering a model which provides poor inferences and leads to poor decision-making.
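As a sketch of what such a feasibility check might look like, the snippet below compares a quick, untuned model against a no-information baseline on synthetic stand-in data; a negligible gap between the two is the signal to step back.

```python
# A sketch of an upfront feasibility check: if a quick, untuned model
# cannot beat a no-information baseline, the data likely lack predictive
# signal. Synthetic data stand in for the client's dataset.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1_000, n_informative=2, random_state=0)

baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
quick_model = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(f"baseline accuracy:    {baseline.mean():.3f}")
print(f"quick model accuracy: {quick_model.mean():.3f}")
```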

4. Telling the Story

The final stage of a data science project is for us to communicate what was learned or created from the exercise. It might include a summary of our model’s prediction accuracy, or a confusion matrix outlining model performance against each category, or a report on how our model clustered the data points and what that might imply for the underlying patterns and trends. Whatever form it takes, it has to explain the work which was carried out and summarise the outputs of our project for a diverse audience.
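As a small illustration of one such output, the sketch below produces a confusion matrix and a per-category report for a fitted classifier; the data and model are stand-ins rather than artefacts from a real project.

```python
# A minimal sketch of per-category reporting: a confusion matrix plus
# precision, recall and F1 for each class. Data and model are stand-ins.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
predictions = model.predict(X_test)

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
```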

The Good:

Generate reports and visualisations to explain the dataset and the resultant model.

The Great:

Construct a human-driven narrative from the process and the results, leading the reader from data to inference with visualisations and key details, and providing insight into the next steps.

Clear communication is subjective, and any report or summary needs to be targeted to its audience. The report we give to another data scientist will be different from the report we give to a senior leader, or to someone new to the domain. The common thread, however, is the need to tell a story, taking our audience from the question to the answer in a way which sticks in their memory.

No matter how rigorous the analysis or how clear the visualisations, if our communication does not engage the reader and impress upon them how the output of the project satisfies their need, then it is far less likely that the maximum value will be extracted from our work. The best way to do that is to show how our model will impact the people in our client’s organisation.

We can demonstrate the impact of the problem on someone’s work, and then show how our model’s predictions can mitigate it. We can show how a person’s duties are made easier or more efficient using our model. We can quantify how much money or time can be saved in terms that matter to our audience. Whatever the industry or application, when we highlight clearly how the model will impact the people in our client’s organisation, it becomes the hook that sticks the value of our project in their mind, and enables them to secure buy-in from other members of their organisation.

Further to this, our reports can clearly describe and acknowledge the limitations of our models, and identify any extensions and follow-on work which could be carried out. By proactively thinking about how more insights could be extracted from our clients' data, we highlight how the story might be continued, opening opportunities for future collaboration and helping them to further improve their business decision-making.

Summary

This was a quick overview of four behaviours and approaches which separate good data science from great data science.

By being sure to ask the right question, we ensure that our efforts are focused appropriately to address the true need driving the project.

By continuously collaborating with subject matter experts when answering that question, we ensure that their domain knowledge and irreplaceable experience are leveraged in the iterative process of feature engineering.

By stepping back from the details and taking a holistic view of the project, we can ensure that our work remains contextualised and that we do not chase a futile outcome.

Finally, by telling the story of the data and modelling, we ensure that our clients have a full appreciation of how the output satisfies their need, and how more value might be extracted in future.