Monday, August 3, 2009

Why individual-level data analysis is difficult

I recently completed a piece of data analysis using individual-level data. The project was one of the more challenging pieces of analysis I have undertaken at work, and I was (happily, for myself and everyone else who worked on it) able to turn it into something useful. And there were some good lessons learned at the end of it all, which I want to share in today's post.

So, what is unique and interesting about individual-level data, and why is it valuable?
- With any dataset that you want to derive insights from, there are a number of attributes about the unit being analyzed (individual, group, state, country) and one or more attributes that you are trying to explain. Let's call the first group predictors and the second group target variables. Individual-level data has a much wider range of predictor and target variables, and a much wider range of interactions between those predictors. For example, while on average older people tend to be wealthier, individual-level data reveals that there are older people who are broke and younger people who are millionaires. Because of these wide ranges and the different types of interactions between variables (H-L, M-M, M-H, L-H ... you get the picture), it is possible to understand fundamental relationships between the predictors and the targets, and interactions among the predictors themselves. Digging a little deeper into the age vs. wealth example, what this might tell you is that what really matters for your level of wealth is your education level, the type of job you do, and so on. This level of variation is simply not available with group-level data. In other words, group-level data is just not as rich (see the sketch after this list for a concrete contrast).
- Now, along with the upside comes a downside. The richness of individual-level predictors means that the data is occasionally messy. What is messy? Messy means wrong values for some individuals, and missing or null values for others. At a group level, many of these mistakes average themselves out, especially if the errors are distributed evenly around zero. But at the individual level, the possibility of errors has to be managed as part of the analysis work. With missing data, the challenge is magnified. Is missing data truly missing? Or did it get dropped during some data-gathering step? Is there something systematic to the missing data, or is it random? Should missing data be treated as missing, or should it be imputed to some average value? Or imputed to a most likely value? These are all questions that can materially impact the analysis and therefore deserve due consideration.
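To make both points concrete, here is a minimal sketch in Python with pandas. Everything in it is made up for illustration: the column names, the values, and the age bands are assumptions, not data from the project.

```python
import numpy as np
import pandas as pd

# Hypothetical individual-level data: age, wealth, and education,
# with a couple of missing wealth values thrown in.
people = pd.DataFrame({
    "age":       [25, 27, 34, 45, 52, 61, 63, 70],
    "wealth":    [1_200_000, 15_000, np.nan, 80_000,
                  120_000, 30_000, np.nan, 2_000_000],
    "education": ["grad", "high_school", "college", "college",
                  "high_school", "high_school", "college", "grad"],
})

# Group-level view: averaging within age bands hides both the broke
# 61-year-old and the 25-year-old millionaire.
people["age_band"] = pd.cut(people["age"], bins=[20, 40, 60, 80])
print(people.groupby("age_band", observed=True)["wealth"].mean())

# Individual-level view: how much data is missing, and is it
# concentrated in one group (systematic) or spread around (random)?
print(people["wealth"].isna().sum(), "missing wealth values")
print(people.groupby("education")["wealth"]
            .apply(lambda s: s.isna().mean()))

# One possible treatment: impute the median within each education
# group. Whether that is appropriate depends on why values are missing.
people["wealth_imputed"] = (people.groupby("education")["wealth"]
                                  .transform(lambda s: s.fillna(s.median())))
```

The group-level averages look perfectly reasonable on their own; only the individual-level rows reveal the broke sixty-something and the young millionaire, and only the missingness check tells you whether imputing a median is even defensible.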

Now to the analysis itself. What were some of my important lessons?
- Problem formulation has to be crystal clear, and that in turn should drive the technique.
Problem formulation is the most important step of the problem-solving exercise. What are we trying to do with the data? Are we trying to build a predictive model with all the data? Are we examining interactions between predictors? Are we studying the relationship between each predictor and the target, one at a time? All of these outcomes require different analytical approaches. Sometimes analysts learn a technique and then go looking for a nail to hit with it. But judgment is needed to make sure the appropriate technique is used. The analyst needs the willingness to learn a technique he or she is not yet familiar with, and, by the same token, the discipline to use a simpler technique where appropriate.
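As a loose illustration of how the question shapes the tool, here is a sketch in Python (pandas plus statsmodels). The file name and column names are placeholders I have invented; the point is only that the three questions above lead to three different pieces of code.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical individual-level dataset; file and columns are made up.
df = pd.read_csv("individuals.csv")

# Question 1: how does wealth move with one predictor at a time?
print(df["wealth"].corr(df["age"]))

# Question 2: how do several predictors jointly explain wealth,
# as the start of a predictive model? (Plain OLS as one simple option.)
X = sm.add_constant(df[["age", "years_of_education", "hours_worked"]])
print(sm.OLS(df["wealth"], X).fit().summary())

# Question 3: do predictors interact? Add an explicit interaction term.
df["age_x_education"] = df["age"] * df["years_of_education"]
X2 = sm.add_constant(df[["age", "years_of_education", "age_x_education"]])
print(sm.OLS(df["wealth"], X2).fit().summary())
```

A single correlation, a multi-predictor fit, and an explicit interaction term are three different analyses; deciding which one the problem actually calls for is the formulation step.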

- Spending time to understand the data is a big investment that is completely worth it.
You cannot spend too much time understanding the data. Let me repeat that for effect. You cannot spend too much time understanding the data. And I have come to realize that, far from being drudgery, understanding the data is one of the most fulfilling and value-added parts of any type of analysis. The most interesting part of understanding data (for me) is the sheer number of data points that sit far away from the mean or median of the sample. So if you are looking at people with mortgages and the average mortgage amount is $150,000, the number of cases where the mortgage amount exceeds $1,000,000 lends a completely new perspective on the types of people in your sample.
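Here is the kind of quick check I mean, sketched in Python with pandas; the file name, column names, and thresholds are hypothetical stand-ins for whatever your data actually contains.

```python
import pandas as pd

# Hypothetical mortgage data; treat the file and columns as placeholders.
loans = pd.read_csv("mortgages.csv")

# The usual summary statistics...
print(loans["mortgage_amount"].describe())

# ...and then the tails, which are where the interesting stories live.
# If the average is around $150,000, how many loans exceed $1,000,000?
big = loans[loans["mortgage_amount"] > 1_000_000]
print(len(big), "loans over $1M out of", len(loans))

# Look at who these borrowers are, not just how many there are.
print(big[["income", "age", "occupation"]].describe(include="all"))
```

A handful of lines like this, run early, tells you whether the tail is a data error, a niche segment, or the most interesting part of the story.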

- Explaining the results in a well-rounded manner is a critical close-out at the end.
The result of a statistical analysis is usually a set of predictors that have met the criteria for significance. Or it could be a simple two-variable correlation that is above a certain threshold. But whatever the results of the analysis, it is important to ground them in real-life insights that can be understood by the audience. So, if the insight is that people with large mortgages have a higher propensity to pay off their loans, further clarification will be useful around the income level of these people, their education levels, the types of jobs they hold, and so on. All these ancillary data points are ways of closing out the profile of the "thing" that the analysis has revealed.
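A sketch of what that close-out might look like in code, again with invented column names and cut-offs: compare the segment the analysis surfaced against everyone else on the ancillary attributes the audience will ask about.

```python
import pandas as pd

# Hypothetical: suppose the analysis flagged large mortgages as a
# significant predictor of early payoff. The file, the columns, and
# the $500,000 cut-off are placeholders, not the project's actual values.
loans = pd.read_csv("mortgages.csv")
segment = loans[loans["mortgage_amount"] > 500_000]
others = loans[loans["mortgage_amount"] <= 500_000]

# The headline result: early-payoff rates for the segment vs. the rest.
print(segment["paid_off_early"].mean(), others["paid_off_early"].mean())

# The well-rounded part: who are these borrowers?
for col in ["income", "education", "occupation"]:
    print(col)
    print("  segment:", segment[col].describe().to_dict())
    print("  others :", others[col].describe().to_dict())
```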

- Speed is critical to get the results in front of the right people before they lose interest.
And finally, if you are going to spend a lot of everyone's (and your own) precious time doing all of the above, the results need to be delivered quickly, while people are still interested in what you are doing. In today's information-saturated world, it only takes the next headline in the WSJ for people to start talking about something else. So you basically need to do the analysis in a smart manner, and it needs to be super-fast. Delivered yesterday, as the cliché goes.

In hindsight, this project gave me an appreciation of why data analysis or statistical analysis using individual-level data is one of the more challenging analytical exercises, and why it is so difficult to get right.
