Big Data Analytics: Tips for data mining

Earlier this year I wrote two pieces on data mining tips. They covered the first four tips to keep in mind when undertaking any data mining project:
1. Define the problem and the design of the solution
2. Establish how the tool you are building is going to be used
3. Frame the approach before jumping to the actual technical solution
4. Understand the data

The links for the first two parts can be found here and here. Now let me cover the fifth and sixth tips from what I have learned.

5. Beware the "hammer looking for a nail"
This lesson is about matching the complexity and sophistication of the analytical solution to the problem at hand. It is very easy to get excited about one analytical technique and try to apply it to every problem you come across. But approaching a business problem like a "hammer looking for a nail" creates two issues. One, applying the technique becomes more important than understanding the problem. When that happens, the desire to implement the technique successfully takes priority over actually solving the problem. Two, the solution can reach a level of complexity that the problem does not really need - on several occasions, simple solutions work the best.

6. Validate your solution
One of the most common mistakes a data miner can make is to produce an overfit model. An overfit model is an overspecified model - many of the relationships between the predictor and target variables implied by the model are not real. They are an artifact of the dataset used to build the model. The problem with overfit models is that they tend to fail spectacularly when applied to a different situation. Therefore, it is crucial to do out-of-sample validation of the model. (Holdouts from the original build dataset - the typical two-thirds/one-third split between the build and validation datasets - don't really result in an independent validation.) If the model does not hold up on the validation sample, it usually means the model is overspecified and might need to be simplified. One way to simplify is to examine all the relationships between the predictor variables and the target variable and make sure they are sensible and believable. Another way is to keep only linear relationships in the model. To be clear, this is often a gross oversimplification - but it is sometimes better than an overspecified model that is unusable.
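
To make the overfitting point concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the data is synthetic (a made-up linear relationship plus noise), the "overspecified" model is a nearest-neighbour lookup that simply memorises the build data, and the validation sample is a fresh draw standing in for data collected independently of the build dataset. It is not a recipe for any particular tool, just a way to see the gap between build and validation error.

```python
import random


def make_sample(n, rng):
    """Draw n (x, y) pairs from y = 2x + 1 + noise (a made-up relationship)."""
    sample = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        y = 2.0 * x + 1.0 + rng.gauss(0.0, 1.0)
        sample.append((x, y))
    return sample


def fit_linear(sample):
    """Ordinary least squares for one predictor; returns a predict function."""
    n = len(sample)
    mean_x = sum(x for x, _ in sample) / n
    mean_y = sum(y for _, y in sample) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in sample)
    sxx = sum((x - mean_x) ** 2 for x, _ in sample)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_nearest_neighbour(sample):
    """Overspecified model: predict the y of the closest build point."""
    def predict(x):
        nearest = min(sample, key=lambda point: abs(point[0] - x))
        return nearest[1]
    return predict


def mse(model, sample):
    """Mean squared error of a predict function on a sample."""
    return sum((model(x) - y) ** 2 for x, y in sample) / len(sample)


rng = random.Random(42)
build = make_sample(200, rng)        # data used to build the models
validation = make_sample(200, rng)   # stand-in for an independent sample

linear = fit_linear(build)
memoriser = fit_nearest_neighbour(build)

for name, model in [("linear", linear), ("nearest neighbour", memoriser)]:
    print(f"{name}: build MSE = {mse(model, build):.2f}, "
          f"validation MSE = {mse(model, validation):.2f}")
```

The memorising model reproduces the build data perfectly, noise included, so its build error says nothing about how it behaves on new data. The simple linear model has a higher build error, but its validation error stays close to it. A large gap between the two numbers is the warning sign.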

So those were tips #5 and #6. I will soon close out with the last two tips. Thanks for your patience with my slow pace of writing on this. This is the link to Tom Breuer's tips on building predictive models. A lot of good ideas there as well.

Friday, June 17, 2011

Tips for data mining - Part 3 out of 4
