
Wednesday, August 3, 2011

Tips for data mining - part 4 out of 4

My labor of love, which started nearly seven months back, is finally drawing to a close. In previous pieces, I have talked about some of the lessons I have learned in the field of data mining. The first two pieces of advice, covered in this post, were:
1. Define the problem and the design of the solution
2. Establish how the tool you are building is going to be used

The next two pieces, covered in this post, were:
3. Frame the approach before jumping to the actual technical solution
4. Understand the data

In the third post in this epic story (and it has really started feeling like an epic, even though it has only been three medium-length posts so far), I covered:
5. Beware the "hammer looking for a nail"
6. Validate your solution

Now, based on everything I have talked about so far, you actually go and get some data and build a predictive model. The model seems to be working exceptionally well, showing a high goodness-of-fit with the data. And there you have reached the seventh lesson about data mining, which is

7. Beware the "smoking gun"
Or: when something looks too good to be true, it probably is not true. When the model is working so well that it seems to be answering every question that is being asked, something insidious is usually going on - the model is not really predicting anything but simply transferring the input straight through to the output. It could be that a field that is just another representation of the target variable is being used as a predictor. Let's take an example. Say we are trying to predict the likelihood that a person is going to close their cellphone plan, or in business parlance, the likelihood that the customer is going to attrite. Also, let's say one of the predictors used is whether someone called the service cancellation queue through customer service. By using "called service cancellation queue" as a predictor, we are in effect using the outcome of the call (service cancellation) as both a predictor and the target variable. Of course the model is going to rank-order extremely nicely and flag everyone who hit the service cancellation queue as the ones most likely to attrite. This is an example of a spurious model - it is not even a bad or an overfit model. Not understanding the different predictors available (or rather, not paying attention to the way the data is being collected) and not justifying why each one is being selected as a predictor is the most common reason why spurious models get built. So when you see something too good to be true, watch out.
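
To make this concrete, here is a minimal leakage screen in Python - a sketch of my own, not a prescribed recipe - assuming a pandas DataFrame df with a binary attrited target column; the column names and the 0.95 threshold are illustrative assumptions. The idea is to score each candidate predictor on its own against the target before any model is built: a field with near-perfect standalone AUC (like "called service cancellation queue") is almost always another representation of the outcome rather than a genuine predictor.

```python
# Minimal leakage screen: score each candidate predictor on its own against the
# target before building any model. A single feature with near-perfect AUC is
# usually a leaked proxy for the outcome, not a genuinely predictive signal.
import pandas as pd
from sklearn.metrics import roc_auc_score

def leakage_screen(df: pd.DataFrame, target: str, threshold: float = 0.95) -> pd.DataFrame:
    rows = []
    for col in df.columns.drop(target):
        x = pd.to_numeric(df[col], errors="coerce")   # non-numeric values become NaN
        mask = x.notna()
        if mask.sum() == 0 or df.loc[mask, target].nunique() < 2:
            continue
        auc = roc_auc_score(df.loc[mask, target], x[mask])
        auc = max(auc, 1 - auc)                        # direction does not matter
        rows.append({"feature": col, "standalone_auc": auc, "suspect": auc >= threshold})
    return pd.DataFrame(rows).sort_values("standalone_auc", ascending=False)

# Example usage (hypothetical column names):
# report = leakage_screen(df, target="attrited")
# print(report[report["suspect"]])
```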

8. Establish the value upside and generate buy-in
Now let's say you manage to avoid the spurious model trap and actually build a good model: a model that is not overspecified, that has been independently validated, and that uses a set of predictors which have been tested for quality and are well understood by the modeler (you). Now the model should be translated into business value in order to get the support of the different business stakeholders who are going to use the model or will need to support its deployment. A good understanding of the economics of the underlying business model is required to value the greater predictive capability afforded by the model. Coming up with this value estimate is usually not too difficult, though it can seem like an extra step at the end of a long and arduous model build. It is, however, a critically important step to get right. Hard-nosed business customers are not likely to be impressed by the technical strengths of the model - they will want to know how it adds business value by increasing revenue, reducing costs, or decreasing the exposure to unexpected losses or risk.
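
As a purely illustrative sketch (every number below is an assumption, not a benchmark from any real engagement), a back-of-the-envelope value estimate for the attrition model might compare retention offers targeted by the model against offers spread at random:

```python
# Back-of-the-envelope value of an attrition (churn) model: compare the expected
# value of retention offers targeted by the model with offers made at random.
# Every number below is a hypothetical assumption.
offers_per_month = 5_000
offer_cost       = 5.0      # cost of one retention offer
retained_value   = 600.0    # value of keeping one customer who would have left
save_rate        = 0.30     # fraction of contacted would-be attriters who stay

random_hit_rate  = 0.02     # ~2% of the base attrites, so random targeting hits ~2%
model_hit_rate   = 0.15     # assumed hit rate when the model picks the targets

def monthly_value(hit_rate):
    saved = offers_per_month * hit_rate * save_rate     # customers actually retained
    return saved * retained_value - offers_per_month * offer_cost

uplift = monthly_value(model_hit_rate) - monthly_value(random_hit_rate)
print(f"Incremental value of using the model: ${uplift:,.0f} per month")
```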

So, there. A summary of all that I have learned in the last 4-5 years of working closely with predictive analytics and data mining. It was enjoyable writing this down (even if it took seven months), and I hope the aspiring data scientist gets at least a fraction of the enjoyment out of reading it that I have had in writing it.

Sunday, June 7, 2009

Stress testing your model - Part 2/3

Continuing on the topic of risk management for models: after building a model, how do you make sure it remains robust under working conditions? More importantly, how do you make sure it works well under extreme conditions? We discussed the importance of independent validation for empirical models in a previous post. In my experience, model failures have been most frequent when the validation process has been superficial.

Now I want to move on to sensitivity analysis. Sensitivity analysis involves understanding the variability of the model output as the inputs to the model are varied. The inputs are changed by, say, ±10% to ±50% and the output is recorded. The range of outputs gives a sense of the various outcomes that can be anticipated and need to be prepared for. Sensitivity analysis can also be used to stress test the other components of the system that the model drives. For example, let's say the output of the model is a financial forecast that feeds a system used to drive deposit generation. The sensitivity analysis output gives an opportunity to check the robustness of that downstream system: by knowing that one might occasionally need to generate deposits at 4-5 times the usual monthly volume, one can prepare accordingly.
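
In code, the basic input sweep can be as simple as the sketch below; the model callable, the baseline inputs and the shock sizes are hypothetical placeholders to be replaced with your own.

```python
# One-at-a-time sensitivity sweep: shock each input by a set of percentages and
# record how the model output moves relative to the baseline.
import itertools

def sensitivity_sweep(model, baseline, shocks=(-0.5, -0.25, -0.1, 0.1, 0.25, 0.5)):
    base_out = model(baseline)
    results = []
    for name, shock in itertools.product(baseline, shocks):
        scenario = dict(baseline)                     # copy, then shock one input
        scenario[name] = baseline[name] * (1 + shock)
        out = model(scenario)
        results.append((name, shock, out, out - base_out))
    return results

# Toy forecast model and baseline inputs (purely illustrative):
forecast = lambda x: x["deposits"] * x["margin"] - x["loan_losses"]
baseline = {"deposits": 1_000.0, "margin": 0.03, "loan_losses": 12.0}
for name, shock, out, delta in sensitivity_sweep(forecast, baseline):
    print(f"{name:12s} {shock:+.0%} -> output {out:8.2f} (delta {delta:+.2f})")
```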

Now, sensitivity analysis is one piece of stress testing that is usually misdirected and incomplete. Good sensitivity analysis looks at both the structural components of the model and the inputs to the model. Most sensitivity analyses I have encountered stress only the structural components. What is the difference between the two?

Let's say you have a model to project the performance of a bank's balance sheet. One of the typical stresses one would apply is to the expected level of losses on the bank's loan portfolio. A stress of a 20-50%, and sometimes even 100%, increase in losses is applied and the model outputs are assessed. When this is done consistently across all the other components of the balance sheet, you get a sense of the sensitivity of the model to the various components.

But that's not the same as the sensitivity to inputs. Because inputs are grounded in real-world phenomena, their impact is usually spread across multiple components of the model. For example, if the 100% increase in losses were driven by a recession in the economy, there would be other impacts to worry about. A recession is usually accompanied by a flight to quality among investors, so under a recessionary outlook the value of equity holdings could crash as well, as investors move out of equities (selling) and into more stable instruments. A third impact could be that of higher capital requirements on the value of traded securities: as other banks face the same recessionary environment, their losses could increase to such an extent that a call to raise capital becomes inevitable. How does one raise capital? The easiest route is to liquidate existing holdings, driving a further fall in the market prices of traded securities and thus putting additional stress on the balance sheet.

So the scenario of simply running a 50% increase in loan losses, in isolation, is a purely illusory one. When loan losses increase, one has to contend with what the fundamental driver could be and how that fundamental driver can impact other portions of the balance sheet.
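
A toy comparison of the two kinds of stress might look like the following sketch; the balance-sheet components, the numbers and the propagation rules (how far equities and traded securities fall in a recession) are all hypothetical assumptions, chosen only to show that a scenario-driven stress moves several components at once while a single-factor stress moves only one.

```python
# Single-factor stress vs. scenario-driven stress on a toy balance sheet.
# All components, numbers and propagation rules are hypothetical.
baseline = {"loan_losses": 100.0, "equity_holdings": 500.0, "traded_securities": 800.0}

def net_position(b):
    return b["equity_holdings"] + b["traded_securities"] - b["loan_losses"]

# Single-factor stress: only loan losses move.
single = dict(baseline, loan_losses=baseline["loan_losses"] * 1.5)

# Scenario stress: a recession drives loan losses up AND depresses equities and
# traded securities (as other banks liquidate holdings to raise capital).
recession = dict(baseline)
recession["loan_losses"]       *= 1.5    # same 50% loss shock
recession["equity_holdings"]   *= 0.75   # assumed flight-to-quality selloff
recession["traded_securities"] *= 0.90   # assumed forced-liquidation discount

print("baseline :", net_position(baseline))
print("single   :", net_position(single))
print("scenario :", net_position(recession))
```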

The other place where sensitivity analysis is often incomplete is in ignoring upstream and downstream processes and strategies. A model is never a stand-alone entity: it has upstream sources of data and downstream uses of its output. So if the model has to face situations where the inputs take extreme values, what could be the implications for upstream and downstream strategies? These are the questions any serious model builder should be asking.

This discussion of sensitivity analysis has hopefully been eye-opening for modeling practitioners. We will go on to a third technique, Monte Carlo simulation, in another post. But before we go there, what are other examples of sensitivity analyses that you have seen in your work? How has this analysis been used effectively (or otherwise)? What are good graphical ways of sharing the output?

Saturday, June 6, 2009

Stress testing your model - Part 1/3

So, you've built a model. You have been careful about understanding your data, you have transformed it appropriately, used the right modeling technique, done an independent validation (if it is an empirical model), and now you are ready to use the model to make forecasts, drive decisions, etc.

Wait. Not so fast. Before the model is ready for prime time, you need to make sure that the model is robust. What defines a robust model?
- the inputs should cover not just the expected events but also extreme events
- the model should not break down (i.e., badly mispredict) when the inputs turn extreme. (Of course, no model can be expected to perform superbly when the inputs turn extreme - if models could do that, the events wouldn't be called extreme events. But the worst thing a model can do is provide an illusion of normal outputs when the inputs are extreme.)

I want to share some of the techniques that are used for understanding the robustness of the models, what I like about them and what I don't.

1. When it comes to empirical models, one of the most useful techniques is Out of Sample Validation. This is done by building the model on one data set and validating the algorithm on another. For the truest validation, the validation dataset should be independent of the build sample and drawn from a different time period. "Check-the-box" validation is when you validate the model on a portion of the build sample itself. Such validation often passes, and just as often offers a false sense of security, because in real terms you have not validated anything. A minimal sketch of an out-of-time validation appears after the caveat below.
Caveat: Out of sample validation is of no use if the future is going to look very different from the past. Validating a model that predicts the probability of mortgage default using conventional-mortgage data would have been of no use in a world where no-documentation mortgages and other exotic-term mortgages were being marketed.
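
Here is that sketch, assuming a pandas DataFrame with a datetime observation-date column and a binary target; the column names, the cutoff date and the choice of logistic regression are illustrative assumptions rather than a prescribed recipe.

```python
# Out-of-time validation: build on an earlier period, validate on a later,
# non-overlapping period, and compare performance on the two samples.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def out_of_time_validation(df, features, target="defaulted",
                           date_col="obs_date", cutoff="2008-01-01"):
    build = df[df[date_col] <  cutoff]   # earlier period: build sample
    valid = df[df[date_col] >= cutoff]   # later period: validation sample
    model = LogisticRegression(max_iter=1000).fit(build[features], build[target])
    return {
        "build_auc": roc_auc_score(build[target], model.predict_proba(build[features])[:, 1]),
        "valid_auc": roc_auc_score(valid[target], model.predict_proba(valid[features])[:, 1]),
    }

# A large gap between build_auc and valid_auc is the warning sign that
# "check-the-box" in-sample validation would have missed.
```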

The other two approaches I want to discuss are Sensitivity Analysis and Monte Carlo Simulation. I will cover them in subsequent posts.
