Sunday, May 31, 2009

The counter-intuitiveness of probability - small sample sizes

Sports enthusiasts amongst you who read this website are bound to be into sports statistics. My earliest memories of cricket are not of playing my first impressive air cover-drive, or of charging in (in my shorts) and delivering a toe-crushing yorker fired at the base of the leg-stump. My most vivid early memories are of buying cricket and sports magazines hot off the presses and reading sheets and sheets of cricket statistics.

These statistics covered a wide range of topics. There was the usual highest-number-of-runs, highest-number-of-wickets type stuff. There were also ratio-type statistics: number of 5WIs (five-wicket innings) per game, number of centuries per innings, proportion of winning games in which a certain batsman scored a century. With a lot of the ratio metrics, there was usually a minimum threshold: the number of matches or innings a player should have played before being included in the statistics. To my unschooled mind, it was a way of eliminating the one-match wonders, the flukes, from the more serious practitioners.

Recalling some of those memories and looking at them afresh with my relatively better-schooled analytical mind, it struck me that what the statistician (or, more precisely, the compiler of statistics) was trying to do was use the law of large numbers (and large sample sizes) to produce a distribution centered around the true mean. Put another way, when the sample is small, one is more apt to get extreme values. So, if the "true" average number of innings per century is 4.5, there could be stretches of 3-4 innings where the player scores consecutive hundreds, pushing the average well down. And if such a stretch occurs (by chance) at the start of someone's career, it can lead to wrong conclusions about the ability of the batsman.

A simple exercise: how many Heads should one expect out of 5 tosses of an unbiased coin? One would say between 2 and 3 Heads. But if you ran just one trial (one set of 5 tosses) versus several, what would you expect? What would the mean look like, and what would the distribution be? For the sake of simplicity, let's label 2 and 3 Heads as Central values, 1 and 4 as Off-Central, and 0 and 5 as Extreme values.

I did a quick simulation. Following are the results, reported as (mean number of Heads, count of Central values, count of Off-Central values, count of Extreme values).
With 1 trial, the results were: 3,1,0,0.
With 3 trials, the results were: 1.66,0,2,1.
With 5 trials, the results were: 1.8,1,4,0.
With 10 trials, the results were: 2.7,10,0,0.
Now, 10 is no magic number, but it is easy to see how one gets a greater proportion of central values (values closer to the mean) as the number of trials gets larger. And hence the push to eliminate outliers by requiring a minimum number of trials. I would love to get a cool snip of SAS or R code that can do this simulation.
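In the meantime, here is a rough, untested sketch in SAS of what such a simulation might look like (the trial counts and the seed are arbitrary):

```sas
/* Untested sketch: simulate sets of 5 fair coin tosses for several   */
/* trial counts and classify each set as Central (2-3 Heads),         */
/* Off-Central (1 or 4 Heads) or Extreme (0 or 5 Heads).              */
data sim;
  length category $11;
  call streaminit(2009);                    /* arbitrary seed          */
  do n_trials = 1, 3, 5, 10, 100;
    do trial = 1 to n_trials;
      heads = rand('BINOMIAL', 0.5, 5);     /* Heads out of 5 tosses   */
      if heads in (2, 3)      then category = 'Central';
      else if heads in (1, 4) then category = 'Off-Central';
      else                         category = 'Extreme';
      output;
    end;
  end;
run;

/* Mean Heads and the category mix, by number of trials */
proc means data=sim mean maxdec=2;
  class n_trials;
  var heads;
run;

proc freq data=sim;
  tables n_trials*category / nopercent norow nocol;
run;
```

With a few hundred trials per run, the Central share settles close to its theoretical 62.5% (20 of the 32 equally likely outcomes of 5 tosses).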

Now this is the paradox of small trials. When the number of trials is small, when you have fewer shots at observing something, chances are greater that you will actually see extreme values, and their frequency cannot be reliably predicted. What does this mean for risk management? Does one try to manage the greater volatility at the unit level, or the lower volatility at the system level? And how do you make sure the greater volatility at the unit level does not sink your business?

Friday, May 29, 2009

Data handling - the heart of good analysis - Part 1

I have been building consumer behaviour models using regression and classification-tree techniques for the last 4 years now. Most of this work has been in SAS. Now, there are a large number of interesting SAS procedures that are only slightly different from one another. Many of them can be used interchangeably, like PROC REG and PROC GLM.
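As a tiny illustration of that interchangeability, the same linear model can be fit either way; the dataset and variable names below are made up, but the parameter estimates from the two procedures will match:

```sas
/* The same linear model fit two ways; the estimates will agree.      */
/* CUST, SPEND, INCOME and TENURE are made-up names for illustration. */
proc reg data=cust;
  model spend = income tenure;
run;
quit;

proc glm data=cust;
  model spend = income tenure;
run;
quit;
```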

But the single most important learning for me over this period has been that you cannot spend enough time understanding and transforming the data. Very many interesting and potentially promising pieces of analysis go nowhere because the researcher has not spent enough time understanding the data and then, having understood it, transforming it into a form that is relevant to the problem at hand.

One of the seminal pieces on understanding data and plotting it in useful ways is John Tukey's "Exploratory Data Analysis". This book introduces some unique and important ways of graphing and understanding what the data is trying to say. Two of my personal favorite SAS procedures are PROC MEANS and PROC UNIVARIATE. And of course PROC GPLOT. My advice to the budding social scientist and quantitative practitioner is to learn to use these before moving on to the cooler procedures like PROC LOGISTIC and the linear-models family. This was one of the first things I learnt in my own journey as a statistical modeler and I have some very good and experienced colleagues to thank for making sure I learnt the basics first.
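For what it's worth, a bare-bones first pass with those procedures might look something like this (the dataset CUST and the variables BALANCE and TENURE are placeholders, not a real dataset):

```sas
/* A bare-bones first look at a hypothetical dataset CUST with        */
/* hypothetical variables BALANCE and TENURE: look before you model.  */
proc means data=cust n nmiss mean std min p25 median p75 max;
  var balance tenure;
run;

proc univariate data=cust;
  var balance;
  histogram balance;        /* shape, skew, outliers                  */
run;

proc gplot data=cust;
  plot balance*tenure;      /* quick look at the bivariate relationship */
run;
quit;
```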

Over the next several posts, I am going to share some of my favorite forms of data depiction. The next several days will be a very interesting read, I promise.

Tuesday, May 26, 2009

Macro-economic indicators - a good source

I found a good source of macro-economic indicators on the Internet. This is on the NY Times web-site. Go to the Blogs section and look for a blog called Economix. There is a really good graphic along the right side of the page. The graphic covers important macro-economic metrics such as the unemployment rate, inventory-to-sales ratio, GDP growth, consumer price index (or inflation), factory orders, durable goods orders, etc. Click for a link here.

Why are these metrics important? Looking across a broad swathe of metrics gives a good blend of the various viewpoints one should consider when forming a view of the economy and where it is headed. And it is clear that while some of the indicators seem to have stabilized and point to a bottom having been reached, the picture is by no means consistent across indicators.

We have merely gone from all bad news to mostly bad news with some stable news thrown in.

Sunday, May 24, 2009

Simple regression applications - estimating flu impacts

Came across this interesting piece around the estimation of flu impacts. From Slate. One of my favourite web-sites.

You must have come across the news articles which say that flu caused so many thousand deaths in a certain year. Now, attributing deaths to flu is not as straightforward as it would seem. Flu is not listed as the "Cause of Death" that often on a death certificate. Flu usually kills by causing secondary conditions, like pneumonia or heart disease, which the enfeebled body is not able to resist. So one finds relatively few cases where the cause of death can be directly attributed to the flu. So how is the estimation done? The answer is simple regression, using deaths as the dependent variable and the number of flu cases as the independent variable.

One piece of data is the number of deaths in the US. This can be broken down by week or by month for the flu season (approximately October to April). The other piece of data is the number of flu cases tracked by various testing labs across the country. This information is also available broken down by week or by month; the CDC website is a ready source of such morbidity data, going back at least to the early 90s! Then it is a matter of running a simple regression to create a link between flu cases and deaths. The regression takes the form: deaths = intercept + coefficient * number of flu cases, the intercept being the number of deaths one can expect from other, baseline causes.
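As a sketch, the whole thing boils down to a couple of lines of SAS (FLU_WEEKLY, DEATHS and FLU_CASES are placeholder names for the weekly series described above):

```sas
/* Sketch only: FLU_WEEKLY, DEATHS and FLU_CASES are placeholder names */
/* for the weekly deaths and lab-confirmed flu-case series.            */
proc reg data=flu_weekly;
  model deaths = flu_cases / clb;   /* CLB prints confidence limits    */
run;                                /* for the intercept and slope     */
quit;
```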

It almost seems too simple to be true. How can you be sure that deaths in a certain month can be linked to flu cases from that period? Or does one assume a certain lag between flu and mortality? How do we know we have normalized for everything else? What are the confidence intervals of the estimates? Check out this paper by William W. Thompson for more details!

Now with the emergence of the potentially more deadly H1N1 flu, how can one go about estimating its impact?

Saturday, May 23, 2009

Demise of Chrysler's dealers and a modeling problem

An interesting thought experiment.

Chrysler is closing many of its dealers over the next 3 weeks as part of its planned bankruptcy. If you are a dealer facing the axe, how might you go about liquidating your inventory at the best possible return? Or, more realistically, with the least loss?

One idea might be to hold an auction for individual buyers. Define a strike price (don't disclose it, of course!), promise at least as good a price as other dealers are willing to offer for comparable cars, and let the bidding begin. There might be some marketing spend to print out a few hundred mailers and send them out to the neighborhood.

A parallel strategy could be to run a reverse auction for other dealers, which can then be volume-driven. Assuming that demand for Chryslers does not plummet to zero while its factories are shut and it goes through bankruptcy, there will be a months-long phase when at least some Chrysler dealers will not have enough inventory to meet their demand. The reverse auction you set up can promise to drive down the average price of the vehicle if the purchasing dealer is willing to pick up volume. So, internally, have a strike price of (say) $20K per vehicle if you buy one Jeep, but the price drops to $18.5K if you buy 2, $17.75K if you are willing to buy 3, and so on.
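As a toy sketch (not a real pricing model), that kind of schedule can be generated by starting at the single-vehicle strike price and shrinking the volume discount with each additional unit; the numbers below are the illustrative ones from the paragraph above:

```sas
/* Toy sketch of a volume-tiered price schedule: start at the strike  */
/* price for one vehicle and halve the incremental discount with each */
/* additional unit (numbers are illustrative, from the text above).   */
data schedule;
  price = 20000;          /* strike price for a single vehicle        */
  step  = 1500;           /* first volume discount                    */
  do qty = 1 to 8;
    total_deal = qty * price;
    output;
    price = price - step; /* next tier's per-vehicle price            */
    step  = step / 2;     /* discount shrinks as volume grows         */
  end;
  keep qty price total_deal;
run;

proc print data=schedule noobs;
run;
```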

But any other ideas on how to model this? What might a good algorithm be? Does one use game theory to solve the problem?

Sunday, May 3, 2009

It's been a long time

It's been a long time since I shared my views on this blog. The excuses are multiple, but none is particularly credible.

But I think this is a great time to start again. In the last 2 years, the world has been painfully exposed to the fallibility of models. From being the engines of modern finance and the economy at large, models have gone to being blamed as the #1 reason for the economy's collapse. Even worse, model builders have become a bit of a laughing stock in post-Wall Street society.

Here's an attempt to set models and statistics back where they belong. So let's see how it goes this time around.
