Sunday, June 26, 2011

Simple vs complex models

I came across the subject of simple models (or crude models - somehow I didn't like the word crude; it sounded... well, crude) versus more complicated models on John D. Cook's very interesting blog, "The Endeavour". The link to the article is here. There was a good discussion of the pros and cons of simple and complex models, so I thought I'd add some of my own thoughts on the matter.

First, some definitions. We are talking about reduced-form predictive analytics models here. Simple or crude models use a small number of predictors and interactions between them; complex models use many more predictors and get closer to that line between a perfectly specified and an overspecified model. John Cook makes the article come alive with an interesting quote from Schumpeter: "... there is an awful temptation to squeeze the lemon until it is dry and to present a picture of the future which through its very precision and verisimilitude carries conviction."


Simple models have their benefits and uses. They are usually quicker to build, easier to implement, and easier to interpret and update. I particularly like the easier-to-implement and easier-to-interpret/update bits. I have seldom come across a model that was so good and so reliable that it needed no interpretation or updating. The fact of the matter is that any model captures some of the peculiarities of the training data set used to build it, and is therefore, by definition, somewhat over-specified for that dataset. There is never a sample that is a perfect microcosm of the world - if there were, it wouldn't be a sample at all; it would practically be the world it is supposed to represent. So any sample, and therefore any model built off it, is going to have biases. A model builder would be well served to understand and mitigate those biases and build a model that is more robust and less cute.
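The over-specification point above can be sketched with a toy experiment - this is just an illustration of the general idea, with made-up data, not anything from the article: fit polynomials of increasing degree to noisy data and score them on a held-out split standing in for "the world". The simple fit tends to hold up; the complex one memorizes quirks of the training sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: an assumed noisy linear relationship, for illustration only.
x = np.linspace(0.0, 1.0, 30)
y = 2.0 * x + rng.normal(scale=0.3, size=x.size)

# Hold out every third point to stand in for the world outside the sample.
train = np.arange(x.size) % 3 != 0
test = ~train

def holdout_error(degree):
    """Fit a polynomial of the given degree on the training split
    and return its mean squared error on the held-out split."""
    coeffs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coeffs, x[test])
    return float(np.mean((y[test] - pred) ** 2))

for degree in (1, 3, 9):
    print(f"degree {degree}: holdout MSE = {holdout_error(degree):.4f}")
```

A low-degree fit usually lands near the noise floor here, while a high-degree fit chases the training points and generalizes less reliably - the "squeeze the lemon" temptation in miniature.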

Also, the implementation of the model should be straightforward. Model complexity should not lead to implementation headaches whose resolution ends up costing a significant portion of the purported benefits of the model.

Another reason I prefer simpler models is their relative transparency when it comes to their ultimate use. Models invariably get applied in contexts far removed from what they were designed for. They are frequently scored on populations different from the training set and used to make predictions and decisions that, again, are far from their original intent. In those situations, I much prefer having the ability to understand what the model is saying and why, and then apply the corrections that my experience and intuition suggest - versus relying on a model that is such a "black box" that it is impossible to understand, which leads to the very dangerous train of thought that says "if it is so complex, it must be right".
