Wednesday, August 3, 2011

Tips for data mining - part 4 out of 4

My labor of love, which started nearly seven months back, is finally drawing to a close. In previous pieces, I have talked about some of the lessons I have learned in the field of data mining. The first two pieces of advice, covered in this post, were:
1. Define the problem and the design of the solution
2. Establish how the tool you are building is going to be used

The next two pieces were covered in this post:
3. Frame the approach before jumping to the actual technical solution
4. Understand the data

In the third post in this epic story (and it has really started feeling like an epic, even though it has just been three medium-length posts so far), I covered:
5. Beware the "hammer looking for a nail"
6. Validate your solution

Now based on everything I have talked about so far, you actually go and get some data and build a predictive model. The model seems to be working exceptionally well and showing high goodness-of-fit with the data. And there, you have reached the seventh lesson about data mining which is

7. Beware the "smoking gun"
Or, when something is too good to be true, it probably is not true. When the model is working so well that it seems to answer every question being asked, there is usually something insidious going on - the model is not really predicting anything but just passing the input straight through to the output. It could be that a field that is just another representation of the target variable is being used as a predictor. Let's take an example. Say we are trying to predict the likelihood that a person is going to close their cellphone plan, or in business parlance, the likelihood that the customer is going to attrite. Also, let's say one of the predictors used is whether the customer called the service cancellation queue through customer service. By using "called service cancellation queue" as a predictor, we are in effect using the outcome of the call (service cancellation) as both a predictor and the target variable. Of course the model is going to slope extremely nicely and rank everyone who met the service cancellation queue condition as the most likely to attrite. This is an example of a spurious model; it is not even a bad or an overfit model. The most common reason spurious models get built is that the modeler does not understand the available predictors (or rather, does not pay attention to the way the data is collected) and does not justify why each one is selected for the model. So when you see something too good to be true, watch out.
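One rough way to catch this kind of leakage before building the model is to check whether any single predictor nearly reproduces the target by itself. The sketch below is only an illustration, not a method from this series: the function name `leakage_screen`, the field names, the toy data and the 0.9 threshold are all hypothetical, and it assumes every field is coded 0/1.

```python
def leakage_screen(rows, target, threshold=0.9):
    """Flag binary predictors that, on their own, almost reproduce the target.

    rows: list of dicts with 0/1 values (hypothetical layout).
    Returns {predictor_name: agreement_score} for suspicious predictors.
    """
    suspects = {}
    predictors = [name for name in rows[0] if name != target]
    for name in predictors:
        agree = sum(1 for r in rows if r[name] == r[target])
        rate = agree / len(rows)
        # A predictor that matches the target, or inverts it, is suspicious.
        score = max(rate, 1 - rate)
        if score >= threshold:
            suspects[name] = score
    return suspects


# Hypothetical toy data: 1 = yes, 0 = no.
customers = [
    {"called_cancellation_queue": 1, "high_usage": 0, "attrited": 1},
    {"called_cancellation_queue": 1, "high_usage": 1, "attrited": 1},
    {"called_cancellation_queue": 1, "high_usage": 0, "attrited": 1},
    {"called_cancellation_queue": 0, "high_usage": 1, "attrited": 0},
    {"called_cancellation_queue": 0, "high_usage": 0, "attrited": 0},
    {"called_cancellation_queue": 0, "high_usage": 1, "attrited": 0},
    {"called_cancellation_queue": 0, "high_usage": 0, "attrited": 0},
    {"called_cancellation_queue": 0, "high_usage": 1, "attrited": 0},
    {"called_cancellation_queue": 0, "high_usage": 0, "attrited": 1},
    {"called_cancellation_queue": 0, "high_usage": 1, "attrited": 0},
]

suspects = leakage_screen(customers, target="attrited")
print(suspects)
```

A flagged field is not automatically wrong, but it is a prompt to go back and ask how and when that field gets recorded relative to the outcome.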

8. Establish the value upside and generate buy-in
Now let's say you manage to avoid the spurious model trap and actually build a good model: one that is not overspecified, has been independently validated, and uses a set of predictors that have been tested for quality and are well understood by the modeler (you). The model should now be translated into business value in order to get the support of the business stakeholders who are going to use the model or will need to support its deployment. A good understanding of the economics of the underlying business is required to value the greater predictive capability afforded by the model. It is usually not too difficult to come up with this value estimate, though it might seem like an extra step at the end of a long and arduous model build. But it is a critically important step to get right. Hard-nosed business customers are not likely to be impressed by the technical strengths of the model - they will want to know how it adds business value by increasing revenue, reducing costs, or decreasing the exposure to unexpected losses or risk.
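To make the value estimate concrete, here is a back-of-the-envelope sketch for the attrition example. Everything in it is an assumption made up for illustration: the function `campaign_value`, the attrition rates, the save rate, the customer value and the contact cost are hypothetical placeholders, and the real figures have to come from the economics of the business in question.

```python
def campaign_value(n_targeted, attrition_rate, save_rate,
                   customer_value, cost_per_contact):
    """Expected net value of a retention offer sent to n_targeted customers.

    attrition_rate: share of the targeted group that would otherwise leave.
    save_rate: share of those would-be leavers the offer retains.
    """
    expected_saves = n_targeted * attrition_rate * save_rate
    return expected_saves * customer_value - n_targeted * cost_per_contact


# Hypothetical numbers: contact 1,000 customers either way.
# Without the model the group is picked at random (5% would attrite);
# with the model the top-scored group is picked (20% would attrite).
baseline = campaign_value(1000, 0.05, 0.2, 500.0, 5.0)
with_model = campaign_value(1000, 0.20, 0.2, 500.0, 5.0)
uplift = with_model - baseline

print(f"random targeting: {baseline:,.0f}")
print(f"model targeting:  {with_model:,.0f}")
print(f"value attributable to the model: {uplift:,.0f}")
```

The difference between the two scenarios is the number to put in front of the business stakeholders, since it ties the model's better ranking directly to money saved.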

So, there. A summary of all that I have learned in the last 4-5 years of being very close to predictive analytics and data mining. It was enjoyable writing this down (even if it took seven months) and I hope the aspiring data scientist gets at least a fraction of the enjoyment I have had in writing this piece.


Randall G. Smith said...

Thank you!

Andre A. Sandusky said...

Now mostly people who are doing stock exchange work in different markets are interested in data mining (due to bitcoins or other cryptocurrency), but I am excited to read such wonderful and amazing tips that you've shared with us. Here I just offer you inductive data analysis, which is the best option for all.