Saturday, March 5, 2011

Tips for data mining - Part 2 out of many

Writing after a long time on the blog. Blame it on regular writer's cramp - a marked reluctance and inertia of sorts to put pen to paper, or rather fingers to keyboard.

My last post introduced the idea of defining the problem as the first step for any data mining exercise aiming to achieve success. This is the ability to state the problem you are trying to solve in terms of business outcomes that are measurable. After that comes the step of envisioning the solution and expressing it in a really simple form. The aim should be to create a path from input to output - the output being a set of decisions that will ultimately result in the measurable business outcomes mentioned above. The next step involves establishing how the developed solution would be used in the real world. Not doing this early enough or with enough clarity could result in the creation of a library curio. Defining how the solution will be used will also point to other needs, such as training the users on the right way to use the solution, the skills expected of the end users and so on.

In this post, we will discuss the third and fourth steps. These are:
3. Frame the approach before jumping to the actual technical solution
4. Understand the data

Frame the approach before jumping to the actual technical solution. Once the business problem has been defined, it is tempting to point the closest tool at hand at the data and start hacking away. But often, the most obvious answer is not necessarily the right answer. It is valuable to construct the nuts and bolts of the approach on a whiteboard or sheet of paper before getting started. Taking the example of some recent text-mining work I have been involved in, one of the important steps was to create an industry-specific lexicon or dictionary. While creating a comprehensive dictionary is often tedious and dull work, this step is an important building block for any data mining effort and hence deserves the upfront attention. We would not have seen the value of this step but for the exercise of comprehensively thinking through the solution. This is also the place where prototyping using sandbox tools like Excel or JMP (the "lite" statistical software from the SAS stable) becomes extremely valuable. Framing the approach in detail allows the data miner to budget for all the small steps along the way that are critical for a successful solution. It also makes it possible to put something tangible in front of decision makers and stakeholders, which can be invaluable in getting their buy-in and sponsorship for the solution.
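To make the lexicon idea concrete, here is a minimal Python sketch of how such a dictionary might be used to tag text. Everything in it is an assumption for illustration: the terms, the concept names and the tag_concepts function are hypothetical and are not taken from the project described above.

```python
import re
from collections import Counter

# Hypothetical industry lexicon: each surface term maps to a canonical concept.
# A real lexicon would be far larger and built with domain experts.
LEXICON = {
    "apr": "interest_rate",
    "interest rate": "interest_rate",
    "late fee": "fee",
    "annual fee": "fee",
    "chargeback": "dispute",
}


def tag_concepts(text, lexicon=LEXICON):
    """Count how often each lexicon concept is mentioned in a piece of text."""
    # Lowercase and strip punctuation so "APR," and "apr" match the same entry.
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    counts = Counter()
    for term, concept in lexicon.items():
        # Word boundaries stop a term from matching inside a longer word.
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[concept] += len(re.findall(pattern, normalized))
    # Report only the concepts that actually appear.
    return {concept: n for concept, n in counts.items() if n > 0}


comment = "The APR went up and I was charged a late fee. Annual fee too!"
print(tag_concepts(comment))
```

Even a toy version like this shows where the effort goes: the matching logic is a few lines, while the quality of the result depends almost entirely on how complete the dictionary is.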

Understand the data. This is such an obvious step that it has almost become a cliche; having said that, incomplete understanding of the data continues to be the most common reason why data mining projects falter in attempting to fulfill their potential and meet the business goal. Some data checks, such as data distributions and variable attributes like the mean, standard deviation and missing rate, are quite obvious, but I want to call out a couple of critical steps here that might be somewhat non-obvious. The first is to focus extensively on data visualization or exploratory data analysis. On this blog, I have written a few pieces before on data visualization which can be found here. Another good example of this type of visualization is from the Junk Charts blog. The second is to track data lineage - in other words, where did the data come from and how was it gathered? Also, is it going to be gathered in the same way going forward? This step is important in understanding whether there have been biases in the historical data. There could be coverage bias, where only part of the population is represented in the data, or responder bias, where people are invited or requested to provide information and only some of them do. In both these cases, the analytical reads are usually specific to the data collected and cannot be easily extrapolated to non-responders or people outside the coverage of the historical data.
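As a rough illustration of the basic checks and the responder-bias question, here is a small Python sketch using only the standard library. The function names, the use of None to mark a missing value and the sample numbers are all assumptions made for the example, not a prescribed method.

```python
import statistics


def profile_column(values):
    """Summarize one numeric column; None marks a missing value (an assumption)."""
    present = [v for v in values if v is not None]
    missing_rate = 1 - len(present) / len(values) if values else 0.0
    return {
        "n": len(values),
        "missing_rate": missing_rate,
        "mean": statistics.mean(present) if present else None,
        # The sample standard deviation needs at least two observed values.
        "stdev": statistics.stdev(present) if len(present) > 1 else None,
    }


def response_rate_by_segment(invited, responded):
    """Share of invited people in each segment who actually responded.

    Large gaps between segments hint at responder bias: the data collected
    over-represents the segments that respond more readily.
    """
    return {
        segment: responded.get(segment, 0) / count
        for segment, count in invited.items()
        if count > 0
    }


# Made-up numbers, purely for illustration.
print(profile_column([10, 20, None, 30]))
print(response_rate_by_segment(
    invited={"18-34": 200, "35+": 100},
    responded={"18-34": 20, "35+": 50},
))
```

If the response rates differ sharply across segments, as they do in this invented example, conclusions drawn from responders should not be extended to the wider invited population without some adjustment.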

This covers the background work that needs to take place before the solution build can be taken up in earnest. In the next few posts, I will share some thoughts on the things to keep in mind while building out the actual data mining solution.