Tuesday, June 9, 2015

Why your predictive analytics solution might just not work

My previous post, on how the predictive analytics journey can feel a bit like a roller coaster, drew a fair bit of interest and comments.

To summarize, the four main phases in building a predictive solution are:
- Data munging => Sorrow
- Getting the first model output => Joy
- Realizing that the model still needs a lot of work => Sorrow
- Applying a seemingly imperfect model and seeing impressive customer or business impact => Joy

I also want to add a fifth phase to this: deploying the solution. That can, in itself, be even more challenging and frustrating than actually building the solution.

In many organizations, the systems supporting the development of the predictive solution and the systems required to implement or deploy it are quite different. Analytical data warehouses (usually the foundation for model builds) are optimized for generating insights and capturing additional insights for others to leverage, so they contain lots of transformed variables and derived data. When these analytic stores are used to build models, those transformed variables invariably make their way into the model. Operational systems, on the other hand, are optimized for speed of transaction processing and are therefore quite removed from the analytical data. The picture looks somewhat like the one below.
Fig 1: How analytic and operational data can diverge from source systems

This disconnect between analytic systems and operational systems can be a BIG problem when it comes to monetizing the analysis. When you go and try to deploy the cool analytical model you just built, it is either really difficult to productionize, or even worse, it cannot be implemented without considerable watering down.

This could be because:
- The transformations in the analytic data need to be re-done using the operational data, and that introduces delays in the deployment process (see the sketch after this list)
- Some of the data elements available in the analytic store might be altogether missing in the operational store
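To make the gap concrete, here is a minimal sketch, assuming a pandas-based scoring job. The avg_spend_90d feature, the column names and the constant-imputation fallback are all hypothetical, but they illustrate both re-building an analytic-store transformation from raw operational data and the watering down that happens when a feature is simply missing.

```python
import pandas as pd
import numpy as np

def build_avg_spend_90d(transactions: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Recreate a hypothetical analytic-store feature from raw operational transactions."""
    window = transactions[transactions["txn_date"] >= as_of - pd.Timedelta(days=90)]
    return (window.groupby("customer_id")["amount"]
                  .mean()
                  .rename("avg_spend_90d")
                  .reset_index())

def score(operational_df: pd.DataFrame, model, expected_features: list) -> np.ndarray:
    """Score with a crude fallback when an analytic feature has no operational equivalent."""
    missing = [f for f in expected_features if f not in operational_df.columns]
    for f in missing:
        # Watered-down fallback: impute with a constant. In practice this is
        # exactly the kind of compromise that erodes the model's value.
        operational_df[f] = 0.0
    return model.predict_proba(operational_df[expected_features])[:, 1]
```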

So you either lose time getting to market, or only a fraction of the true value gets realized.

As a person who passionately believes in how data and analytics can change consumers' lives for the better, this is the most frustrating trough in the overall journey. Interestingly, this last trough is a massive blind spot for several well-meaning predictive modelers. Unless you carry the battle scars of trying to deploy these models in production, it is often hard to even conceptualize that this gap can exist.

One thing to remember, though: unless you are able to deploy your solution to a place where it touches real people, things really don't matter a whole lot. My next post will cover how to close some of these gaps.

Monday, June 1, 2015

Journey through building a Predictive Analytics solution

I have now spent nearly 10 years building predictive models. These have ranged from detailed segment- or population-level models built in Excel to cutting-edge individual-level models for credit risk, response propensity, life-event predictions and so on, using statistical packages in R and Python and, back in the day, SAS. At some point in the last ten years, I also made a foray into text analytics, trying to make sense of unstructured data.

Building predictive models is tough and takes time and energy. It is also emotionally consuming and exhausting. Over the years, I have been able to identify four distinct phases in the build-out of any model where I have gone through alternating cycles of depression and joy. Thought I'd share some of that and see whether these cycles resonate with others in the same line of work. It starts to look somewhat like the picture below.



Phase 1: “Getting the data is really REALLY hard!!!!”

The first phase of depression covers roughly the first third of any project. You are all excited about this really cool customer problem or business challenge in front of you, and you have just discovered the perfect predictive modeling solution for it. Now all you need to do is get the data, and you will be off to Predictive Modeling Superstardom.

Except that it never happens that way. Data is always messy and needs careful, laborious curation. And if you take shortcuts and are sloppy about how you handle and manage the data, it almost always comes back to extract its pound of flesh. I am sure there is a phase in the life of every model where you are just frustrated at the amount of time and effort it takes to get the data right, and even then you are not entirely sure whether you got it right.
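For what it's worth, that grind tends to look something like the minimal sketch below, assuming a pandas workflow; the file and column names are hypothetical stand-ins for whatever your source systems hand you.

```python
import pandas as pd

# Hypothetical cleanup of the kind that eats up the first third of a project.
raw = pd.read_csv("customer_history.csv", parse_dates=["signup_date"])

clean = (
    raw.drop_duplicates(subset="customer_id", keep="last")                       # repeated records
       .assign(income=lambda df: pd.to_numeric(df["income"], errors="coerce"))   # stray text in numeric fields
       .dropna(subset=["customer_id"])                                           # rows with no usable key
)

# Every imputation is a judgment call you will have to defend later.
clean["income"] = clean["income"].fillna(clean["income"].median())
```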

Phase 2: “WOW! I actually built a model and managed to create some predictions!”

The first light at the end of the tunnel appears when you have managed to get the data right, set up the modeling equation correctly (after several attempts at hacking away at the code) and actually run the model to produce some predictions. And the predictions by and large look good and seem to actually make sense: standard metrics such as precision and recall hold up, and when you create deciles or twenty-tiles of the prediction range you can see some decently good predictions! That feeling of excitement and joy is amazingly good.
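A minimal sketch of those sanity checks, assuming a scikit-learn-style setup; the function name, the 0.5 threshold and the held-out arrays it expects are my own assumptions, not a prescription.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score

def sanity_check(y_true: np.ndarray, y_score: np.ndarray, threshold: float = 0.5) -> pd.DataFrame:
    """Standard metrics plus a decile view of the prediction range."""
    y_pred = (y_score >= threshold).astype(int)
    print("precision:", precision_score(y_true, y_pred))
    print("recall:   ", recall_score(y_true, y_pred))

    # Bucket the scores into deciles; if the model ranks well, the event rate
    # per bucket should climb steadily from the bottom decile to the top one.
    df = pd.DataFrame({"actual": y_true, "score": y_score})
    df["decile"] = pd.qcut(df["score"], q=10, labels=False, duplicates="drop")
    return df.groupby("decile")["actual"].agg(["count", "mean"])
```

Run on a held-out set, the returned decile table is what gives you that "the model actually makes sense" moment.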

Phase 3: “Well, actually my predictions aren’t that good!”

The next low comes when you examine your predictions at an individual level and discover that, by and large, they are not very accurate at all. Overall, the model seems to work well, but at an individual level the predictions are really off. There are a few cases where the model absolutely nails the prediction, but in nearly 60-70% of the cases the prediction is off in one direction or the other. Not catastrophically off, but off enough to cause some anxiety to the perfectionist in you.
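One way to quantify that gap between "works well overall" and "off for most individuals" is sketched below, assuming you have predicted probabilities for a held-out set; the 0.3 tolerance is an arbitrary assumption, not a standard.

```python
import numpy as np

def aggregate_vs_individual(y_true: np.ndarray, y_score: np.ndarray, tolerance: float = 0.3) -> None:
    """Contrast the overall picture with per-record errors."""
    # Aggregate view: predicted vs actual event rate over the whole sample.
    print("predicted rate:", y_score.mean())
    print("actual rate:   ", y_true.mean())

    # Individual view: how far each prediction sits from the observed outcome.
    individual_error = np.abs(y_true - y_score)
    share_off = (individual_error > tolerance).mean()
    print(f"share of cases off by more than {tolerance:.0%}: {share_off:.0%}")
```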

Phase 4: “Phew! Actually the model didn’t do too badly”

Then you actually take the model and apply it to the problem you are trying to solve. Maybe you are looking at customer call transcripts and trying to predict the likelihood of a follow-up call. Or you are looking at thousands and thousands of loan records and trying to predict the probability of default. Or you are looking at thousands of visitors and trying to predict their propensity to adopt a digital experience. (Of course, I am assuming all along that the overall problem was structured correctly, that the problem you are solving is indeed worth solving, and so on.)

Based on the model, you are hoping to take one action or another. And then you find that the model actually makes a difference: you have been able to create happier customers, more profitable loans or better adoption of a digital experience overall. This feeling, that the predictive modeling solution you built made an impact in the real world, is absolutely spectacular.

I have had these moments of great satisfaction at the end of several such initiatives, and there have certainly been situations when things have not exactly followed that script. In my next post, I will talk about the steps you can take as a predictive modeler that lead to great outcomes, and certain other things that lead to less satisfactory ones.
