Friday, May 29, 2009

Data handling - the heart of good analysis - Part 1

I have been building consumer behaviour models using regression and classification tree techniques for the last 4 years. Most of this work has been in SAS. There are a large number of interesting SAS procedures that are only slightly different from one another. Many of them can be used interchangeably for common tasks, like PROC REG and PROC GLM for fitting an ordinary linear regression.

But the single most important learning for me over this period has been that you can't spend enough time understanding and transforming the data. Very many interesting and potentially promising pieces of analysis go nowhere because the researcher has not spent enough time understanding the data and then, having understood it, transforming it into a form that is relevant to the problem at hand.

One of the seminal works on understanding data and plotting it in useful ways is John Tukey's "Exploratory Data Analysis". This book introduces some unique and important ways of graphing and understanding what the data is trying to say. Two of my personal favorite SAS procedures are PROC MEANS and PROC UNIVARIATE. And of course PROC GPLOT. My advice to the budding social scientist and quantitative practitioner is to learn to use these techniques before learning the cooler procedures like PROC LOGISTIC and the linear model procedures. This was one of the first things I learnt in my own journey as a statistical modeler and I have some very good and experienced colleagues to thank for making sure I learnt the basics first.
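
To make the idea concrete, here is a minimal sketch of the kind of summary that procedures like PROC MEANS and PROC UNIVARIATE report: count, mean, standard deviation, quartiles and extremes. It is written in Python with only the standard library rather than in SAS, the describe function and the spend figures are hypothetical, and the quartile method chosen here is an assumption that may not match the default of any particular SAS procedure.

```python
import statistics


def describe(values):
    """Return basic summary statistics for a list of numbers."""
    data = sorted(values)
    # "inclusive" treats the data as the whole population when interpolating
    # quartiles; other methods give slightly different cut points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "n": len(data),
        "mean": statistics.mean(data),
        "std": statistics.stdev(data),  # sample standard deviation
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
    }


# Hypothetical monthly spend for ten customers; the last value is an outlier.
spend = [12, 15, 15, 18, 20, 22, 25, 27, 30, 250]
summary = describe(spend)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

In this made-up sample the mean sits well above the third quartile, while the median stays close to the bulk of the values. A gap like that is exactly the sort of thing worth noticing before fitting any model, because a single extreme customer can dominate a regression.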

Over the next several posts, I am going to share some of my favorite forms of data depiction. The next several days will be a very interesting read, I promise.
