Big Data Analytics: The counter-intuitiveness of probability

The sports enthusiasts amongst you who read this website have to be into sports statistics. My earliest memories of cricket were not of playing my first impressive air cover-drive, or of charging in (in my shorts) and delivering a toe-crushing yorker fired at the base of the leg stump. My most vivid early memories were of buying cricket and sports magazines hot off the presses and reading sheets and sheets of cricket statistics.

These statistics covered a wide range of topics. There was the usual highest number of runs, highest number of wickets type stuff. There were also ratio-type statistics: number of five-wicket innings (5WI) per game, number of centuries per innings, proportion of wins in which a certain batsman scored a century. With a lot of the ratio metrics, there was usually a qualifying minimum: a number of matches or innings the player should have played before being included in the statistics. To my unschooled mind, it was a way of separating the one-match wonders and the flukes from the more serious practitioners.

With the gift of hindsight, and looking at those memories afresh with my relatively better-schooled analytical mind, it struck me that what the statistician (or more precisely, the compiler of statistics) was trying to do was to lean on the law of large numbers: as the number of matches or innings grows, a player's observed average settles closer to his true mean. Put another way, when the sample is small, one is more apt to get extreme values. So, if the "true" average number of innings per century is 4.5, there could still be stretches of 3-4 innings in which the player scores consecutive hundreds, pushing the observed average well below 4.5. And if these stretches occur (by chance) at the start of someone's career, they are apt to lead to wrong conclusions about the ability of the batsman.

A simple exercise. How many Heads should you expect out of 5 tosses of an unbiased coin? One would say between 2 and 3 Heads (the expected value is 2.5). But what would you expect from a single trial of 5 tosses versus several such trials? What would the mean look like, and what would the distribution be? For the sake of simplicity, let's label 2 and 3 Heads as Central values, 1 and 4 as Off-Central and 0 and 5 as Extreme values.
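Before simulating anything, the exact odds for a single trial can be worked out from the binomial distribution. The following is a minimal Python sketch of that arithmetic; the function names (`heads_probability`, `category_probabilities`) are my own illustrative choices and not part of the original exercise.

```python
from math import comb

TOSSES = 5   # tosses per trial, as in the exercise
P_HEAD = 0.5 # unbiased coin

def heads_probability(k, n=TOSSES, p=P_HEAD):
    """Exact binomial probability of seeing k Heads in n tosses."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def category_probabilities():
    """Probability that one trial of 5 tosses lands in each labelled group."""
    return {
        "central": heads_probability(2) + heads_probability(3),
        "off_central": heads_probability(1) + heads_probability(4),
        "extreme": heads_probability(0) + heads_probability(5),
    }

# Expected number of Heads in one trial: sum of k * P(k Heads).
expected_heads = sum(k * heads_probability(k) for k in range(TOSSES + 1))

if __name__ == "__main__":
    print("Expected Heads per trial:", expected_heads)
    for name, prob in category_probabilities().items():
        print(f"{name}: {prob:.4f}")
```

So any single trial has a 20-in-32 chance of being Central, a 10-in-32 chance of being Off-Central and a 2-in-32 chance of being Extreme. These are the proportions a large number of trials should settle towards.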

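I did a quick simulation. For each run, the results below give the mean number of Heads per trial, followed by the number of trials that landed on Central, Off-Central and Extreme values.
With 1 trial: mean 3; Central 1, Off-Central 0, Extreme 0.
With 3 trials: mean 1.66; Central 0, Off-Central 2, Extreme 1.
With 5 trials: mean 1.8; Central 1, Off-Central 4, Extreme 0.
With 10 trials: mean 2.7; Central 10, Off-Central 0, Extreme 0.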
Now 10 is no magic number (and a run of 10 trials landing entirely on Central values is itself a bit of luck), but it is easy to see how one tends to get a greater proportion of central values, or values closer to the mean, as the number of trials gets larger. Hence the push to eliminate outliers by insisting on a minimum number of trials. I would love to get a cool snip of SAS or R code that can do this simulation.
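In lieu of SAS or R, here is a rough Python sketch of the same kind of simulation. It is not the code behind the numbers above; the function name `simulate`, the seed and the larger trial counts (100 and 10,000) are my own assumptions, added to show where the proportions head as the number of trials grows.

```python
import random
from collections import Counter

TOSSES = 5  # tosses per trial, as in the exercise

# Labels from the exercise: 2-3 Heads Central, 1 or 4 Off-Central, 0 or 5 Extreme.
CATEGORY = {
    0: "extreme", 1: "off_central", 2: "central",
    3: "central", 4: "off_central", 5: "extreme",
}

def simulate(trials, rng=None):
    """Run `trials` trials of 5 fair coin tosses.

    Returns (mean Heads per trial, central count, off-central count,
    extreme count), the same four figures reported in the text.
    """
    if rng is None:
        rng = random.Random()
    heads_counts = [
        sum(rng.random() < 0.5 for _ in range(TOSSES))
        for _ in range(trials)
    ]
    tally = Counter(CATEGORY[h] for h in heads_counts)
    mean = sum(heads_counts) / trials
    return mean, tally["central"], tally["off_central"], tally["extreme"]

if __name__ == "__main__":
    rng = random.Random(2009)  # arbitrary seed, only for repeatability
    for trials in (1, 3, 5, 10, 100, 10_000):
        mean, central, off_central, extreme = simulate(trials, rng)
        print(f"{trials:>6} trials: mean={mean:.2f} "
              f"central={central} off-central={off_central} extreme={extreme}")
```

Re-running with different seeds shows the small runs jumping around from one run to the next, while the large runs barely move.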

Now this is the paradox of small trials. When the number of trials is small, when you have fewer shots at observing something, chances are greater that what you observe sits far from the true mean, with extreme values turning up in proportions that are hard to predict. What does this mean for risk management? Does one try to manage the greater volatility at the unit level, or the lesser volatility at the system level? And how do you make sure the greater volatility at the unit level does not sink your business?


Sunday, May 31, 2009

The counter-intuitiveness of probability - small sample sizes
