Saturday, January 15, 2011

Tips for data mining - part 1 out of many!

Having spent a good part of the last four years mining data and working in business and predictive analytics more generally, I thought I'd take a step back and summarize some of the lessons I have learnt through data mining. I was inspired to do this by a very revealing article by Tom Breur, principal at XLT Consulting. More on Tom and his writings later.

So what have I worked on these past years? As a management consultant in my previous life, working with data, and tons of it, was a given. Using data extensively, building business analytics models that aim to replicate real-world processes, and establishing objective criteria for taking decisions are all bread-and-butter ways of solving problems in the consulting world. In fact, I'd go as far as to say that consultants seriously suffer from a lack of self-confidence when they consider out-of-the-box solutions that do not include any or all of the above. My consulting experience was mainly in consumer goods distribution and marketing. I then moved to retail financial services, where my main areas of experience have been credit risk modeling and consumer cash flow modeling: modeling how much of these events (credit risk, cash flow) is driven by internal factors and how much by exogenous occurrences. I have also modeled consumer response to marketing products. A recent foray has been into text mining unstructured responses from applicants.

So what have I learnt? I will try to summarize it over a few posts, with potential reader fatigue in due consideration. At a summary level, these are the eight steps to data mining salvation.
1. Define the problem and the design of the solution
2. Establish how the tool you are building is going to be used
3. Frame the approach before jumping to the actual technical solution
4. Understand the data
5. Beware the "hammer looking for a nail"
6. Validate your solution
7. Beware the "smoking gun"
8. Establish the value upside and generate buy-in

I will tackle each of these steps in some detail now.

1. Define the problem and the design of the solution
This is the first step. Define the problem that you are really trying to solve. The key to defining the problem well is to frame the solution in terms of business outcomes.

Complete the following sentence: "Solving this problem will lead to an x% increase in sales, a y% decrease in costs, an improvement in efficiency and speed by a factor of z", and so on. If this sentence does not flow easily, then I am afraid you have not spent enough time defining the overall problem.

Understand the context surrounding the problem: why it has been difficult to crack over the years, where the data is going to come from, where it has come from in the past, and whether there are going to be changes to how it will be available in the future. Speak to others who have taken a crack at this problem and get their views on where the constraints lie.

Once you have defined the problem well enough, envision what the solution is going to look like. Which parts of the solution are pure process excellence, and where do you need advanced analytics? This establishes clearly where the analytical piece of the solution (the part a reader of this blog is primarily interested in) fits. The aim should be to create a simple block diagram of how one goes from input (usually data) to output (ideally, a set of decisions) and of all the pieces that come in between.
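To make this concrete, here is a minimal sketch of such an input-to-decision pipeline in Python. Everything in it is hypothetical: the scoring rules, the field names, and the thresholds are stand-ins for a real analytical model, meant only to show how the "analytical piece" slots between data in and decisions out.

```python
def score_applicant(applicant):
    """Analytical piece: a toy additive scorecard (a placeholder
    for whatever statistical model the project actually builds)."""
    score = 0
    score += 30 if applicant["years_at_job"] >= 2 else 0
    score += 40 if applicant["debt_to_income"] < 0.35 else 0
    score += 30 if not applicant["prior_default"] else 0
    return score

def decide(score, approve_at=70, review_at=40):
    """Decision piece: simple rules turning a score into an action."""
    if score >= approve_at:
        return "approve"
    if score >= review_at:
        return "manual review"
    return "decline"

# Input (data) -> analytics (score) -> output (a decision)
applicants = [
    {"years_at_job": 5, "debt_to_income": 0.2, "prior_default": False},
    {"years_at_job": 1, "debt_to_income": 0.5, "prior_default": True},
]
decisions = [decide(score_applicant(a)) for a in applicants]
print(decisions)  # ['approve', 'decline']
```

The value of drawing (or sketching in code) the pipeline this early is that each box can later be swapped out independently: the scorecard for a proper model, the decision rules for a production decision engine.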

2. Establish how the tool you are building is going to be used
In the previous step, the problem was defined and the solution scoped at a high level. The next step is to put some detail into how the analytical solution is actually going to be used. Will the model be used mainly for understanding or for exploratory analysis, with the results of the analysis implemented through simple decisioning rules or heuristics? Or is the plan to use the model in the "live" production environment for decision making? It is important to get good answers, or at least good likely answers, to all of these questions, because they play a very important role in determining the actual tools that will be used to build the solution, the checks and audits that need to be built into the overall system of decision making, the process for overrides, the infrastructure and technology needed to make the solution effective, and so on. The people aspect also needs to be considered at this step. The conditions of use will determine the type of user training that needs to be provided, the skills that end-users need to have, and how much of those skills can be imparted by on-the-job training versus which skills are entry conditions for the job.

More about all of this in subsequent posts.

Thursday, January 13, 2011

The Joy of Stats is finally online

My first post of 2011. I have been writing this blog for nearly two years now and am happy to still have the energy and enthusiasm to keep at it. As I have said before about why I blog, this is a way for me to keep abreast of the latest developments in the fields of data mining, analytics and visualization.

The Joy of Stats program was aired on BBC4 in December 2010, and the video is now available on Hans Rosling's Gapminder website. This was the program that highlighted the data visualization examples I have mentioned before: the mapping of San Francisco crimes, the graphics made by Florence Nightingale, and the Gapminder visualization of the economic and demographic statistics of different countries over the last 200 years.

Another example of nifty graphics: The New York Times has been a trendsetter in putting up very clever and informative graphics to support its news stories. Amanda Cox of the NYT graphics department recently gave a presentation on some of the examples the NYT has used in its print as well as its online media. It is a long presentation, but worth sitting through.

Hopefully you will enjoy both these presentations!