Big Data Analytics: Stress testing your model

We discussed two techniques for ensuring the robustness of models in the two previous posts: out-of-sample validation in the first, and sensitivity analysis in the second. I find sensitivity analysis to be a really valuable technique for ensuring the robustness of model outputs and of decisions driven by models - but only when it is done right.

Another, more computing-intensive technique for ensuring the robustness of model output is Monte Carlo simulation. Monte Carlo simulation basically involves running the model literally thousands of times and changing each of the inputs a little with every run. With advances in computing power, and with that power now within reach of most modelers and researchers, it has become fairly easy to set up and run such a simulation.
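To make the idea concrete, here is a minimal sketch of such a run in Python, using only the standard library. Everything specific in it is an assumption for illustration: the `toy_model` profit function, the base-case values, and the choice to nudge each input uniformly within 5% of its base value.

```python
import random

def toy_model(price, volume, cost):
    # Hypothetical profit model; stands in for whatever model is being tested.
    return price * volume - cost

rng = random.Random(0)  # fixed seed so the run is repeatable
base = {"price": 10.0, "volume": 500.0, "cost": 3000.0}  # assumed base case

def perturb(value, spread=0.05):
    # Nudge a base value by a random amount within +/- spread (uniform, assumed).
    return value * (1 + rng.uniform(-spread, spread))

# Run the model 5000 times, changing every input a little on each run.
results = sorted(
    toy_model(perturb(base["price"]), perturb(base["volume"]), perturb(base["cost"]))
    for _ in range(5000)
)

# Summarise the spread of outputs with the 5th and 95th percentiles.
low = results[int(0.05 * len(results))]
high = results[int(0.95 * len(results))]
print(f"base case: {toy_model(**base):.0f}")
print(f"90% of runs fall between {low:.0f} and {high:.0f}")
```

The useful output is the width of that band rather than any single number: a model whose answer swings wildly under small input changes deserves a closer look before it drives a decision.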

So let's say we have a model with 3 inputs, and let's assume that each input is varied in 10 steps over its entire valid range. The model will now produce 1000 different outputs, one for each combination of input values (1000 = 10 x 10 x 10). If every step of every input is equally likely and the inputs are independent, each output has a theoretical probability of 0.001.
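The 10 x 10 x 10 arithmetic can be sketched in a few lines of Python. The `toy_model` function and the three input ranges below are made up for the example; the point is only the enumeration of every combination.

```python
import itertools

def toy_model(a, b, c):
    # Hypothetical stand-in for a real model: any function of three inputs.
    return a * b - c

def grid(low, high, steps=10):
    # Evenly spaced values across the valid range [low, high], endpoints included.
    width = (high - low) / (steps - 1)
    return [low + i * width for i in range(steps)]

# Assumed valid ranges for the three inputs (illustrative only).
ranges = [(0.0, 1.0), (10.0, 20.0), (-5.0, 5.0)]
grids = [grid(lo, hi) for lo, hi in ranges]

# One model run per combination of input values.
outputs = [toy_model(a, b, c) for a, b, c in itertools.product(*grids)]

# Equal weight per scenario, valid only if every step is equally likely.
weight = 1 / len(outputs)
print(len(outputs), weight)
```

Note how quickly this grows: with 6 inputs at 10 steps each the same loop would need a million runs, which is one reason to sample inputs at random rather than enumerate them.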

How are the inputs varied?
Typically using a distribution that varies the inputs in a probabilistic manner. The input distribution is the most important assumption that goes into a Monte Carlo simulation. The typical approach is to assume that most inputs are normally distributed. But the reality is that the normal distribution is usually observed only in natural phenomena. In most business applications, distributions are usually skewed in one direction. (Take loan sizes on a financial services product, like a credit card. The distribution is skewed towards the higher side, with a long tail of large balances, as balances cannot be less than zero but can take really large positive values.)
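A quick way to see why the choice matters is to draw the same input from a normal and from a skewed distribution and compare. The sketch below uses a lognormal as one possible skewed shape; that choice, and all the parameter values, are assumptions for illustration, not a recommendation for any particular portfolio.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable
N = 10_000

# Assumed parameters, chosen only for illustration.
normal_draws = [rng.gauss(1000, 800) for _ in range(N)]        # symmetric
skewed_draws = [rng.lognormvariate(6.5, 0.9) for _ in range(N)]  # right-skewed

# A normal input happily produces impossible negative balances.
negative_normal = sum(x < 0 for x in normal_draws)
negative_skewed = sum(x < 0 for x in skewed_draws)

# In a right-skewed sample the long tail pulls the mean above the median.
mean_skewed = statistics.mean(skewed_draws)
median_skewed = statistics.median(skewed_draws)

print(f"negative draws: normal={negative_normal}, lognormal={negative_skewed}")
print(f"lognormal mean={mean_skewed:.0f}, median={median_skewed:.0f}")
```

In practice the shape should come from the data: plot a histogram of the historical input before settling on any distribution.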

Correlation or covariance of the inputs
In a typical business model, inputs are seldom independent; they have various degrees of correlation. It is important to keep this correlation in mind while running the scenarios. When the covariance of the inputs is factored in explicitly, the simulation weights its output towards the combinations of input values that are actually likely to occur together.
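Here is a minimal sketch of one way to build correlation into the draws, for the simple case of two normally distributed inputs. The correlation of 0.8, the additive `toy_model`, and the helper names are all assumptions for the example; for more inputs one would normally take a Cholesky factor of the full covariance matrix with a numerical library.

```python
import math
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable
RHO = 0.8               # assumed correlation between the two inputs

def correlated_pair(rho):
    # Two standard normal draws with correlation rho
    # (the 2x2 Cholesky factor written out by hand).
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    return z1, rho * z1 + math.sqrt(1 - rho**2) * z2

def pearson(xs, ys):
    # Sample Pearson correlation, to check the draws behave as intended.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def toy_model(a, b):
    # Hypothetical model whose output is driven by both inputs together.
    return a + b

N = 20_000
independent = [correlated_pair(0.0) for _ in range(N)]
correlated = [correlated_pair(RHO) for _ in range(N)]

sample_rho = pearson(*zip(*correlated))
spread_independent = statistics.stdev([toy_model(a, b) for a, b in independent])
spread_correlated = statistics.stdev([toy_model(a, b) for a, b in correlated])

print(f"sample correlation: {sample_rho:.2f}")
print(f"output stdev, independent inputs: {spread_independent:.2f}")
print(f"output stdev, correlated inputs:  {spread_correlated:.2f}")
```

For this toy model the theory says the output standard deviation should be about 1.41 with independent inputs and about 1.90 with a correlation of 0.8. Ignoring a positive correlation here would understate the range of outcomes by roughly a quarter.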

Of course, as with any piece of modeling, there are ways in which this technique can be misused. Some of my pet gripes about MC simulation will form the subject of a later post.

Friday, June 12, 2009

Stress testing your model - Part 3/3
