Sunday, June 26, 2011

Simple vs complex models

I came across the subject of simple models (or crude models - somehow I didn't like the word crude; it sounded... well, crude) vs more complicated models in this very interesting blog by John D Cook called "The Endeavour". The link to the article is here. There was a good discussion of the pros and cons of simple versus complex models, so I thought I'd add some of my own thoughts on the matter.

First, some definitions. We are talking about reduced-form predictive analytics models here. Simple or crude models are ones that use a small number of predictors and interactions between them; complex models use many more predictors and get closer to the line between a perfectly specified and an overspecified model. John Cook makes the article come alive with an interesting quote from Schumpeter: "... there is an awful temptation to squeeze the lemon until it is dry and to present a picture of the future which through its very precision and verisimilitude carries conviction."


Simple models have their benefits and uses. They are usually quicker to build, easier to implement, and easier to interpret and update. I particularly like the easier-to-implement and easier-to-interpret/update bits. I have seldom come across a model that was so good and so reliable that it needed no interpretation or updating. The fact of the matter is that any model captures some of the peculiarities of the training data set used to build it, and is therefore, by definition, somewhat over-specified for that dataset. There is never a sample that is a perfect microcosm of the world - if there were, it wouldn't be a sample at all; it would effectively be the world it is supposed to represent. So any sample, and therefore any model built off it, is going to have biases. A model builder would be well served to understand and mitigate those biases and build an understanding that is more robust and less cute.

Also, the implementation of the model should be straightforward. Model complexity should not lead to implementation headaches whose resolution ends up costing a significant portion of the purported benefits from the model.

Another reason why I prefer simpler models is their relative transparency when it comes to their ultimate use. Models invariably get applied in contexts very different from the ones they were designed for. They are frequently scored on different populations (i.e., different from the training set) and used to make predictions and decisions that are far removed from what they were originally intended for. In those situations, I much prefer having the ability to understand what the model is saying and why, and then apply the corrections that my experience and intuition suggest - versus relying on a model that is such a "black box" that it is impossible to understand, which leads to the very dangerous train of thought that says "if it is so complex, it must be right".
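To make that concrete, here is a rough Python sketch (using scikit-learn) of what "understanding what the model is saying" can look like with a simple model: the fitted coefficients of a logistic regression can be read directly and sanity-checked against intuition. The feature names and data are made up for illustration, not from any real project.

    # A minimal sketch of the transparency of a simple model: fit a
    # logistic regression and read its coefficients directly, checking
    # that their signs and rough magnitudes make business sense.
    # Feature names and data are purely illustrative.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(42)
    feature_names = ["tenure_months", "num_purchases", "support_calls"]

    # Synthetic training data: 500 observations, 3 predictors, binary target.
    X = rng.normal(size=(500, 3))
    true_coefs = np.array([0.8, 1.2, -0.6])
    p = 1 / (1 + np.exp(-(X @ true_coefs)))
    y = rng.binomial(1, p)

    model = LogisticRegression().fit(X, y)

    # Each coefficient answers "what is this model actually saying?"
    for name, coef in zip(feature_names, model.coef_[0]):
        print(f"{name:>15s}: {coef:+.2f}")

A black-box model offers no such quick read; with a simple one, a nonsensical sign or an implausibly large coefficient is immediately visible and can be questioned before the model is used somewhere new.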

Friday, June 17, 2011

Tips for data mining - Part 3 out of 4


I wrote two pieces earlier in the year about data mining tips. They covered the first four tips to keep in mind while undertaking any data-mining project:
1. Define the problem and the design of the solution
2. Establish how the tool you are building is going to be used
3. Frame the approach before jumping to the actual technical solution
4. Understand the data

The links for the first two parts can be found here and here. Now let me talk about the fifth and sixth tips I have learned.

5. Beware the "hammer looking for a nail"
This lesson is basically about making sure the complexity and sophistication of the analytical solution are appropriate for the problem at hand. It is very easy to get excited about one particular analytical technique and try to apply it to every single problem you come across. But approaching a business problem like a "hammer looking for a nail" creates a set of issues. One, application of the technique becomes more important than understanding the problem. When that happens, the desire to implement the technique successfully becomes more important than solving the actual problem. Two, the solution can reach a level of complexity that the problem does not really need - on several occasions, simple solutions work best.

6. Validate your solution
One of the most common mistakes a data miner can make when confronted with a problem is to produce an overfit model. An overfit model is an overspecified model - many of the relationships between the predictor and target variables implied by the model are not real. They are an artifact of the dataset used to build the model. The problem with overfit models is that they tend to fail spectacularly when applied to a different situation. Therefore, it is crucial to do out-of-sample validation of the model. If the model does not validate well on the validation sample, it usually means the model is overspecified. (Holdouts from the original build dataset - the typical two-thirds/one-third split between the build and validation datasets - don't really result in an independent validation.) The model might need to be simplified. One way to do that is to examine all the relationships between the predictor variables and the target variable and make sure they are sensible and believable. Another way is to keep only linear relationships in the model. To be clear, this is often a gross oversimplification - but it is sometimes better than an overspecified model that is unusable.
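Here is a rough Python sketch of what out-of-sample validation looks like in practice, using scikit-learn with synthetic data and an illustrative pair of models (one simple linear model, one more flexible one). The data, the split and the model choices are all assumptions for illustration; the telltale sign of an overspecified model is a large gap between build and validation performance.

    # A minimal sketch of the build/validation holdout, assuming a
    # scikit-learn workflow. Data and models are illustrative only.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    # Synthetic data with only a handful of truly informative predictors.
    X, y = make_classification(n_samples=1500, n_features=40,
                               n_informative=5, random_state=0)

    # The typical two-thirds build / one-third validation split.
    X_build, X_val, y_build, y_val = train_test_split(
        X, y, test_size=1/3, random_state=0)

    models = {
        "simple (linear)": LogisticRegression(max_iter=1000),
        "complex": RandomForestClassifier(n_estimators=300, random_state=0),
    }

    for name, model in models.items():
        model.fit(X_build, y_build)
        auc_build = roc_auc_score(y_build, model.predict_proba(X_build)[:, 1])
        auc_val = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        # A large gap between build and validation performance is the
        # classic signature of an overspecified (overfit) model.
        print(f"{name:>16s}: build AUC={auc_build:.3f}, "
              f"validation AUC={auc_val:.3f}")

As noted above, a holdout from the same dataset is still not a truly independent test - scoring the model on data from a different time period or population is a stronger check when it is available.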

So those were tips #5 and 6. I will soon close out with the last two tips. Thanks for your patience with my slow pace of writing on this. This is the link to Tom Breuer's tips on building predictive models - a lot of good ideas there as well.

Wednesday, June 1, 2011

Principal Components - the math behind it

A really delightful tutorial on the mathematical basis for Principal Components Analysis. It really clarified a lot of the basics for me.
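For reference, the core computation the tutorial works through boils down to an eigendecomposition of the covariance matrix of the mean-centered data. Here is a short numpy sketch of that idea; the synthetic data are my own illustration, not from the tutorial.

    # A minimal sketch of PCA from first principles: the principal
    # components are the eigenvectors of the covariance matrix of the
    # mean-centered data, ordered by eigenvalue (explained variance).
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[0, 0],
                                cov=[[3.0, 1.5], [1.5, 1.0]],
                                size=200)

    # 1. Mean-center the data.
    X_centered = X - X.mean(axis=0)

    # 2. Covariance matrix and its eigendecomposition.
    cov = np.cov(X_centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: cov is symmetric

    # 3. Sort components by explained variance (largest eigenvalue first).
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    # 4. Project the data onto the principal components (the "scores").
    scores = X_centered @ eigenvectors

    print("explained variance ratio:", eigenvalues / eigenvalues.sum())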
