
Saturday, March 5, 2011

Tips for data mining - Part 2 out of many

Writing on the blog after a long time. Blame it on a recurring case of writer's cramp - a marked reluctance and inertia of sorts to put pen to paper, or rather fingers to keyboard.

My last post introduced the idea of defining the problem as the first step of any data mining exercise that aims to succeed: the ability to state the problem you are trying to solve in terms of measurable business outcomes. After that comes the step of envisioning the solution and expressing it in a really simple form. The aim should be to create a path from input to output - the output being a set of decisions that will ultimately produce the measurable business outcomes mentioned above. The next step involves establishing how the developed solution will be used in the real world. Not doing this early enough, or with enough clarity, could result in the creation of a library curio. Defining how the solution will be used also points to other needs, such as training the users on the right way to use the solution, the skills expected of the end users, and so on.

In this post, we will discuss the third and fourth steps. These are
3. Frame the approach before jumping to the actual technical solution
4. Understand the data

Frame the approach before jumping to the actual technical solution. Once the business problem has been defined, it is tempting to point the closest tool at hand at the data and start hacking away. But oftentimes, the most obvious answer is not necessarily the right answer. It is valuable to construct the nuts and bolts of the approach on a whiteboard or sheet of paper before getting started. Taking the example of some recent text-mining work I have been involved in, one of the important steps was to create an industry-specific lexicon or dictionary. While creating a comprehensive dictionary is often tedious and dull work, this step is an important building block for any data mining effort and hence deserves the upfront attention. We would not have seen the value of this step but for the exercise of comprehensively thinking through the solution. This is also where prototyping using sandbox tools like Excel or JMP (the "lite" statistical software from the SAS stable) becomes extremely valuable. Framing the approach in detail allows the data miner to budget for all the small steps along the way that are critical for a successful solution. It also enables putting something tangible in front of decision makers and stakeholders, which can be invaluable in getting their buy-in and sponsorship for the solution.
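To make the lexicon-building step a little more concrete, here is a minimal Python sketch (rather than the Excel/JMP sandboxing mentioned above) of how one might seed an industry-specific dictionary from a sample of documents. The stopword list and the toy documents are purely illustrative; a real lexicon would still need manual review by a domain expert.

```python
from collections import Counter
import re

# A handful of generic stopwords; a real list would be much longer.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "was", "have"}

def seed_lexicon(documents, min_count=5):
    """Return candidate lexicon terms and their frequencies.

    documents: iterable of raw text strings (hypothetical input).
    Terms appearing at least min_count times are kept as candidates
    for manual review by a domain expert.
    """
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        counts.update(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return {term: n for term, n in counts.items() if n >= min_count}

# Example usage with made-up documents
docs = [
    "Premium waiver rider added to the whole life policy",
    "Policy lapsed after premium non-payment; rider cancelled",
]
print(seed_lexicon(docs, min_count=2))
```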

Understand the data. This is such an obvious step that it has almost become a cliche; having said that, an incomplete understanding of the data remains the single biggest reason why data mining projects falter in fulfilling their potential and solving the business goal. Some of the data checks - distributions, variable attributes like means, standard deviations and missing rates - are quite obvious, but I want to call out a couple of critical steps that might be somewhat less obvious. The first is to focus extensively on data visualization or exploratory data analysis. I have written a few pieces on data visualization before on this blog, which can be found here. Another good example of this type of visualization is from the Junk Charts blog. The second is to track data lineage - in other words, where did the data come from and how was it gathered? And is it going to be gathered in the same way going forward? This step is important for understanding whether there have been biases in the historical data. There could be coverage bias, or responder bias where people are invited or requested to provide information. In both cases, the analytical reads are usually specific to the data collected and cannot be easily extrapolated to non-responders or to people outside the coverage of the historical data.
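As an illustration of these checks, here is a minimal pandas sketch. The file name and column names (customers.csv, income, source_system, snapshot_date) are hypothetical stand-ins, not from any project referenced above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical customer-level extract; file and column names are illustrative.
df = pd.read_csv("customers.csv")

# Missing rates per column, sorted from worst to best.
print(df.isna().mean().sort_values(ascending=False))

# Quick exploratory look at a key variable's distribution.
df["income"].plot.hist(bins=50)
plt.title("Income distribution")
plt.show()

# A crude lineage check: do key fields look different by data source or by
# collection period? Large shifts can signal a change in how the data was
# gathered (coverage or responder bias) rather than a change in the population.
print(df.groupby("source_system")["income"].agg(["count", "mean", "std"]))
print(df.groupby(pd.to_datetime(df["snapshot_date"]).dt.year)["income"].mean())
```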

This covers the background work that needs to take place before the solution build can be taken up in earnest. In the next few posts, I will share some thoughts on the things to keep in mind while building out the actual data mining solution.

Saturday, December 18, 2010

Visualization of the data and animation - part II

I had written a piece earlier about Hans Rosling's animation of country-level data using the Gapminder tool. Here are some more extremely cool examples of data animation.

At the start of this series, there is more animation from the Joy of Stats program that Rosling hosted on the BBC. The landing page links to a clip that shows crime data plotted for downtown San Francisco, and how this visual overlay on the city topography provides valuable insights on where one might expect to find crime. This is a valuable tool for police departments (to try and prevent crime that is local to an area and has some element of predictability), residents (to research neighbourhoods before they buy property, for example) and tourists (who might want to double-check a part of the city before deciding on a really attractive Priceline.com hotel deal). The researchers who created this tool, which overlays the crime data on maps, talk in the clip about how tools such as this can be used to improve citizen power and government accountability. Another good example of crime data, this time reported by police departments across the US, can be found here. Finally, towards the end of the clip, the researchers mention what could be the Holy Grail of this kind of visualization: real-time data posted on social media and networking sites like Facebook and Twitter (geo-tagged, perhaps) feeding directly into these maps. This would certainly have been in the realm of science fiction only a few years back, but suddenly it doesn't seem as impossible.
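As a rough illustration of this kind of overlay (not the actual tool shown in the clip), here is a minimal matplotlib sketch that plots geo-tagged incidents by category over city coordinates; the data is made up.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical incident-level extract with latitude/longitude and a category.
incidents = pd.DataFrame({
    "lon": [-122.41, -122.42, -122.40, -122.43],
    "lat": [37.77, 37.78, 37.76, 37.77],
    "category": ["theft", "assault", "theft", "vandalism"],
})

fig, ax = plt.subplots(figsize=(6, 6))
for category, group in incidents.groupby("category"):
    # Semi-transparent markers so dense areas read as darker clusters.
    ax.scatter(group["lon"], group["lat"], label=category, alpha=0.6, s=40)

ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Incidents overlaid on city coordinates (illustrative data)")
ax.legend()
plt.show()
```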

The San Francisco crime mapping link has a few other really impressive videos as you scroll further down. I really like the one on Florence Nightingale, whose graphs during the Crimean War helped reveal important insights on how injuries and deaths were occurring in hospitals. It is interesting to learn that the Lady with the Lamp was not just renowned for tending to the sick, but was also a keen student of statistics. Her graphs separating deaths that were accidental, deaths caused by war injuries and wounds, and deaths that were preventable (caused by the poor hygiene prevalent at the time) created a very powerful image of the high incidence of preventable deaths and the need to address this area with the right focus.

Why are visualization and animation of data so helpful, and such critical tools in the arsenal of any serious data scientist? For a few reasons.

For one, it tells a story far better than equations or tables of data do. That is essential for conveying the message to people who are not experts at reading the tables, but who are nevertheless important influencers and stakeholders and need to be educated on the subject. Think of how an advertisement (either a picture or a moving image) is more powerful at conveying the strength of a brand than boring old text.
The other reason, in my opinion, is that graphical depiction and visualization of the data allows the human brain (which is far more powerful than any computer at pattern recognition) to take over the part of data analysis that it is really good at and computers generally are not: forming hypotheses on the fly about the data being displayed, reaching conclusions based on visual patterns, and hooking into remote memory banks to form linkages. While machine learning and AI are admirable goals, there is still some way to go before computers can match the sheer ingenuity and flexibility of thought that the human brain possesses.

Monday, August 3, 2009

Why individual level data analysis is difficult

I recently completed a piece of data analysis using individual level data. The project was one of the more challenging pieces of analysis I have undertaken at work, and I was (happily, for myself and everyone else who worked on it) able to turn it into something useful. And there were some good lessons learned at the end of it all, which I want to share in today's post.

So, what is unique and interesting about individual level data and why is it valuable?
- With any dataset that you want to derive insights from, there are a number of attributes describing the unit being analyzed (individual, group, state, country) and one or more attributes that you are trying to explain. Let's call the first group predictors and the second group target variables. Individual level data has a much wider range of predictor and target values, and a much wider range of interactions between the predictors. For example, while on average older people tend to be wealthier, individual level data reveals that there are older people who are broke and younger people who are millionaires. As a result of these wide ranges and the different types of interactions between variables (H-L, M-M, M-H, L-H ... you get the picture), it is possible to understand fundamental relationships between the predictors and the targets, and interactions among the predictors. Digging a little deeper into the age vs wealth data, this might tell you that what really matters for your level of wealth is your education level, the type of job you do, and so on. This level of variation is not available with group level data; in other words, group level data is just not as rich.
- Now, along with the upside comes downside. The richness of individual level predictors means the data is occasionally messy: wrong values at the individual level, and sometimes missing or null values. At a group level, many of these mistakes average themselves out, especially if the errors are distributed evenly around zero. At the individual level, the likelihood of errors has to be managed as part of the analysis work. With missing data, the challenge is magnified. Is missing data truly missing, or did it get dropped during some data gathering step? Is there something systematic about the missingness, or is it random? Should missing data be treated as missing, imputed to some average value, or imputed to a most likely value? These choices can materially impact the analysis and therefore deserve due consideration. (The short sketch after this list illustrates both points.)
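Here is a minimal pandas sketch, on made-up data, of the two points above: how a group-level average hides individual variation, and how to check whether missingness looks systematic before choosing an imputation strategy.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)

# Made-up individual-level data: age band, wealth, and a field with gaps.
n = 1000
df = pd.DataFrame({
    "age_band": rng.choice(["young", "middle", "older"], size=n),
    "wealth": rng.lognormal(mean=11, sigma=1.0, size=n),
})
# Make income missing more often for one group, i.e. systematically missing.
df["income"] = rng.normal(60000, 15000, size=n)
df.loc[(df["age_band"] == "young") & (rng.random(n) < 0.4), "income"] = np.nan

# 1. Group-level averages mask the spread within each group.
print(df.groupby("age_band")["wealth"].mean())                  # tidy averages
print(df.groupby("age_band")["wealth"].quantile([0.05, 0.95]))  # huge spread

# 2. Is the missingness random or systematic? Compare missing rates by group.
print(df.assign(income_missing=df["income"].isna())
        .groupby("age_band")["income_missing"].mean())
```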

Now to the analysis itself. What were some of my important lessons?
- Problem formulation has to be crystal clear and that in turn should drive the technique.
Problem formulation is the most important step of the problem solving exercise. What are we trying to do with the data? Are we trying to build a predictive model with all the data? Are we examining interactions between predictors? Are we studying the relationship between each predictor and the target, one at a time? All of these outcomes require different analytical approaches. Sometimes analysts learn a technique and then go looking for a nail to hit with it, but judgment is needed to make sure the appropriate technique is used. The analyst needs the desire to learn a technique he or she is not yet aware of, and, by the same token, the discipline to use a simpler technique where appropriate.

- Spending time to understand the data is a big investment that is completely worth it.
You cannot spend too much time understanding the data. Let me repeat that for effect. You cannot spend too much time understanding the data. And I have come to realize that, far from being a drudge, understanding the data is one of the most fulfilling and value-added pieces of any type of analysis. The most interesting part for me is the sheer number of data points located far away from the mean or median of the sample. So if you are looking at people with mortgages and the average mortgage amount is $150,000, the number of cases where the mortgage amount exceeds $1,000,000 lends a completely new perspective on the types of people in your sample.
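As a quick, hedged example of this kind of check, here is one way to flag observations that sit far from the center of the sample using an interquartile-range rule; the mortgage figures are invented.

```python
import pandas as pd

def flag_far_from_center(series, k=3.0):
    """Flag values more than k interquartile ranges outside the middle 50%."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

mortgages = pd.Series([150_000, 175_000, 140_000, 160_000, 1_250_000, 155_000])
print(mortgages[flag_far_from_center(mortgages)])  # surfaces the $1.25M loan
```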

- Explaining the results in a well-rounded manner is a critical close-out at the end.
The result of a statistical analysis is usually a set of predictors that have met the criteria for significance, or perhaps a simple two-variable correlation above a certain threshold. But whatever the results of the analysis, it is important to ground them in real-life insights that the audience can understand. So, if the insight reveals that people with large mortgages have a higher propensity to pay off their loans, further clarification will be useful around the income levels of these people, their education levels, the types of jobs they hold, and so on. All these ancillary data points help close out the profile of the "thing" revealed by the analysis.
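A sketch of that kind of close-out, assuming a hypothetical scored dataset with a model flag and a few ancillary attributes (all names invented): compare the flagged segment against everyone else.

```python
import pandas as pd

# Hypothetical scored dataset: a flag from the model plus ancillary attributes.
df = pd.read_csv("scored_customers.csv")

segment = df["large_mortgage_high_payoff"] == 1

# Numeric attributes: averages for the flagged segment vs the rest.
profile = df.groupby(segment)[["income", "years_education", "age"]].mean()
profile.index = ["rest of sample", "flagged segment"]
print(profile)

# Categorical attributes: job type mix within the segment vs the rest.
print(pd.crosstab(segment, df["job_type"], normalize="index"))
```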

- Speed is critical to get the results in front of the right people before they lose interest.
And finally, if you are going to spend a lot of everyone's (and your own) precious time doing all of the above, the results need to be delivered quickly enough for people to keep their interest in what you are doing. In today's information-saturated world, it only takes the next headline in the WSJ for people to start talking about something else. So you need to do the analysis in a smart manner, and it also needs to be super-fast. Delivered yesterday, as the cliche goes.

In hindsight, this project gave me an appreciation of why data analysis or statistical analysis using individual level data is one of the more challenging analytical exercises, and why it is so difficult to get right.

Saturday, July 18, 2009

Data visualization


An example of a really well-done graphic is from the NOAA website. Science and particularly math aficionados seem to have a particular affinity for following weather science. (I wonder whether it is a visceral reaction to global warming naysayers who, the scientists think, are possibly insulting their learning.)

The graph is a world temperature graph, and graphs of this type have appeared in so many different forms that it is difficult not to have seen one. What I like about this one is the elegant and non-intrusive way in which the overlays are done.
• By using dots and varying their size, the creator of the graph makes sure that the underlying geographic details still come through (important in a world map, where great detail needs to be captured in a small area and you therefore cannot use very thick lines for country borders).
• The other thing I liked is some of the simplifications the creator has made. The dots are equally spaced, though I am pretty sure that's not exactly how the data was gathered. But to tell the story, that detail is not as important. (A rough sketch of this dot-size-and-spacing idea follows below.)
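For illustration only (this is not the NOAA graphic itself), here is a minimal matplotlib sketch of the same idea: anomalies on a coarse, evenly spaced grid, with marker area encoding magnitude and color encoding sign, and semi-transparent dots so an underlying map would still show through.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Evenly spaced grid of "stations" with made-up temperature anomalies (deg C).
lon_grid, lat_grid = np.meshgrid(np.arange(-180, 181, 30), np.arange(-60, 91, 30))
lons, lats = lon_grid.ravel(), lat_grid.ravel()
anomalies = rng.normal(0.4, 0.8, size=lons.shape)
colors = np.where(anomalies >= 0, "tab:red", "tab:blue")  # warm vs cool

fig, ax = plt.subplots(figsize=(8, 4))
ax.scatter(
    lons, lats,
    s=np.abs(anomalies) * 120,  # dot area proportional to |anomaly|
    c=colors,
    alpha=0.6,                  # keeps underlying map details visible
)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Dot-size encoding of temperature anomalies (illustrative data)")
plt.show()
```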

The graphic came from Jeff Masters' weather blog, which is one of the best of its kind. Here's a link if you are interested.

Friday, May 29, 2009

Data handling - the heart of good analysis - Part 1

I have been building consumer behaviour models using regression and classification tree techniques for the last 4 years now. Most of this work has been in SAS. There are a large number of interesting SAS procedures that are only slightly different from one another, and many of them can be used interchangeably, like PROC REG and PROC GLM.

But the single most important lesson for me over this period has been that you cannot spend enough time understanding and transforming the data. Very many interesting and potentially promising pieces of analysis go nowhere because the researcher has not spent enough time understanding the data and then, having understood it, transforming it into a form relevant to the problem at hand.

One of the seminal pieces on understanding data and plotting it in useful ways is John Tukey's "Exploratory Data Analysis". This book introduces some unique and important ways of graphing data and understanding what it is trying to say. Two of my personal favorite SAS procedures are PROC MEANS and PROC UNIVARIATE - and of course PROC GPLOT. My advice to the budding social scientist and quantitative practitioner is to learn to use these before moving on to the cooler procedures like PROC LOGISTIC and linear models. This was one of the first things I learnt in my own journey as a statistical modeler, and I have some very good and experienced colleagues to thank for making sure I learnt the basics first.
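For readers outside the SAS world, here is a rough pandas/matplotlib analogue of what those basic procedures provide - not a one-to-one translation, and the file and column names are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customer_behaviour.csv")  # hypothetical modeling dataset

# Roughly what PROC MEANS provides: n, mean, std, min, max per variable.
print(df.describe().T[["count", "mean", "std", "min", "max"]])

# Roughly what PROC UNIVARIATE adds: detailed quantiles for one variable.
print(df["spend"].quantile([0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]))

# Roughly what PROC GPLOT covers: a simple scatter of two variables.
df.plot.scatter(x="tenure_months", y="spend", alpha=0.4)
plt.show()
```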

Over the next several posts, I am going to share some of my favorite forms of data depiction. The next several days will be a very interesting read, I promise.
