Big Data Analytics: July 2009

Tuesday, July 21, 2009

More data visualization - this time about books

Ever wonder where the proof was that reading, ummm, erotica is bad for you? Here it is. Check this link out.

An interesting study was done that went somewhat like this.
- Get the ten most frequent "favorite books" at every college using the college's Network Statistics page on Facebook. Possibly these books represent the intellectual calibre of the college.
- Get the SAT/ACT scores for those colleges.
- You can now get a relationship between the types of books read and scholastic achievement.

The results are pretty impressive, though still somewhat dubious. According to the study, Classics are usually good for you (I agree with that), while Erotica is way bad. Controversially, so are African-American literature and chick-lit. In the link, check out the visual that stacks the books by genre.

Make what you want of this, but be careful to distinguish correlation from causation.
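For the curious, the aggregation behind a study like this can be sketched in a few lines of Python. This is only a guess at the mechanics, not the study's actual method: the college names, book titles and scores below are invented placeholders, and `mean_sat_by_book` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy inputs: invented colleges, books and scores, for illustration only.
favorite_books = {
    "College A": ["Book X", "Book Y"],
    "College B": ["Book Y", "Book Z"],
    "College C": ["Book X", "Book Z"],
}
avg_sat = {"College A": 1300, "College B": 1100, "College C": 1200}


def mean_sat_by_book(favorite_books, avg_sat):
    """Average the SAT score of every college that lists a given book."""
    scores = defaultdict(list)
    for college, books in favorite_books.items():
        for book in books:
            scores[book].append(avg_sat[college])
    return {book: mean(values) for book, values in scores.items()}


book_scores = mean_sat_by_book(favorite_books, avg_sat)

# Rank books from highest to lowest associated score.
for book, score in sorted(book_scores.items(), key=lambda kv: -kv[1]):
    print(f"{book}: {score}")
```

Note that the score attached to a book here describes the colleges where it is popular, not the individual readers, so a ranking like this can only ever show correlation.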

Saturday, July 18, 2009

Data visualization

An example of a really well-done graphic is from the NOAA website. Science and particularly math aficionados seem to have a particular affinity for following weather science. (I am wondering whether it is a visceral reaction to global warming naysayers who, the scientists think, are possibly insulting their learning.)

The graph shows world temperatures, and this type of graph has come in so many different forms that it is difficult not to have seen one. What I like about this one is the elegant and non-intrusive way in which the overlays are done.
• By using dots and varying the size of the dots, the creator of the graph is making sure that the underlying geographic details (important in a world map where there is great detail that needs to be captured in a small area, therefore you cannot use very thick lines for country borders) still come through.
• The other thing that I liked is some of the simplifications the creator has made. The dots are equally spaced, but I am pretty sure that is not exactly how the data was gathered. To tell the story, though, that detail is not as important.

The graphic came from Jeff Masters' weather blog, which is one of the best of its kind. Here's a link if you are interested.

Wednesday, July 15, 2009

Two great finds for physics fans

Back after a long break in the posts. Call it a mixture of home responsibilities, writer's block and just some plain old laziness.

One of my other interests (apart from statistics and social science) is physics and technology. I really enjoy reading about emerging applications of technology in various spheres of social and economic importance. The Technology Quarterly of the Economist is one of my treasured reads (though I end up reading very little of it, because I keep wanting to set aside "quality time" to do the reading).

I want to share two recent finds in the science and physics space. One is a really good book called "The Great Equations" by Robert Crease. The book covers ten of the seminal equations in physics and basically spins a story around how the formulator of each equation came to create it. There is usually a little mathematical proof behind the story, but most of the book is about the professional journey made by the scientist from an existing view of the world (or an older paradigm, to be more exact) to a new paradigm. And the paradigm is usually encapsulated in the form of an equation.

I found a couple of aspects of the journey extremely interesting. One, it was fascinating to have a window into the minds of physics greats (Newton, Maxwell, Einstein, Schrodinger, to name a few) and see how they synthesized the various world views around them to arrive at their respective equations. The ability to deal with all the complexity of observed phenomena and the different philosophies and world views, and to come up with something as elegant as a great equation: that defines genius for me. The second aspect that I found extremely interesting was that there were usually years and years of experimentation or mathematical work that preceded arriving at the great equation. One might be inclined to think that the great equations (given their utter simplicity) happen through a flash of inspiration. Nothing could be further from the truth.

The next find was the Feynman lectures. Now, many of us have read some of the Feynman lectures or have seen the lectures on a place like YouTube. But how cool would it be to have these lectures annotated by Bill Gates? Check this link out at the Microsoft Research website. And happy watching!

I am guessing this blog has a fair share of aspiring or one-time physics and engineering fans. How do you keep your engineering bone tickled? I'd love to hear your pet indulgences.

Wednesday, July 8, 2009

Market chills

I have argued in a number of recent posts (here, here and here) that we are nowhere close to the bottom when it comes to this economic downturn. The jobless numbers are back to sliding downwards at an accelerating pace after one month of deceleration.

And the markets seem to have caught the chills.

We discussed this at work a few months back. Someone who is very well-respected in banking circles and who has seen a few past recessions called out that you can tell that a recovery is underway when there is a sustained period where the indicators yo-yo between good and bad news. We seem to be entering this phase now.

Friday, July 3, 2009

Best Coffee Survey and a research methodology question

A recent Zagat survey rated the best coffee in the US. The best coffee rating went (expectedly, I guess) to Starbucks. Even though I have had better coffee at other places, I guess Starbucks combines great coffee with ubiquitous presence and therefore ends up getting the top rating. Now, I think Starbucks coffee is good and the baristas are extremely friendly, but in terms of pure coffee flavour, I would rate Panera's Hazelnut coffee higher. Some of the Kona coffees that you find at places like WaWa are also really good. Any kind of place serving Jamaica's Blue Mountain coffee will obviously be great. So what makes Starbucks special? Are there other factors at play beyond the pure taste of the coffee?

One hypothesis is that the national-level presence of Starbucks could be contributing towards the voting going for Starbucks. In places where Starbucks has to compete with other chains like Peet's (San Francisco) and Dunkin' Donuts (New England), comparative ratings between Starbucks and other chains show a narrower gap. In places where Starbucks has no competition, however, it is likely to get disproportionately good ratings.

Let us say you are one of the contributors to the survey and are in St. Louis, MO. The competition for Starbucks in St. Louis is likely to be (I guess) the burnt robusta coffee at the local restaurant. In such a market, Starbucks will enjoy a clear advantage, both for the quality of the coffee and for the ambience. So, let's say you had to rate Starbucks on a scale of 1-5. It is likely you would give Starbucks a 4 or 5 in a non-competitive market such as St. Louis, in the absence of valid benchmarks or competition to compare against. In a competitive market with multiple brands, the difference between Starbucks and other brands is likely to be narrower. The assertion can also be made that a more discerning audience (having had the opportunity to sample multiple chains) is less likely to give extremely high scores (4s and 5s) to any of the choices under consideration.

Therefore, the sampling design and the analysis methodology become extremely critical for surveys like this. To avoid the "no-competition" bias, there could be a number of questions a market research analyst would need to ask herself:
1. Should we use only data points from places where there are multiple chains in the same geography? (Doesn't sound fair. We will be throwing away data, which a lot of sensible people have explained is a cardinal sin. We should probably weight the information in some way).
2. Should we consider data for the analysis only where a person has provided ratings about multiple chains voluntarily or penalize when people have not rated a chain that could have been rated?
3. Or are there modeling solutions available to manage this conundrum? Topic of my next post!
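As a rough illustration of the weighting idea in question 1, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the ratings are invented, and weighting each rating by the number of chains available in the rater's market is just one possible scheme, not something the Zagat survey is known to do.

```python
# Hypothetical ratings: (market, chains available in that market, Starbucks rating 1-5).
# All values are invented for illustration.
ratings = [
    ("St. Louis", 1, 5),
    ("St. Louis", 1, 4),
    ("San Francisco", 3, 3),
    ("New England", 2, 4),
]


def raw_mean(ratings):
    """Plain average: every rating counts equally."""
    return sum(score for _, _, score in ratings) / len(ratings)


def competition_weighted_mean(ratings):
    """Weight each rating by how many chains the rater could compare against."""
    total_weight = sum(chains for _, chains, _ in ratings)
    weighted_sum = sum(chains * score for _, chains, score in ratings)
    return weighted_sum / total_weight


raw = raw_mean(ratings)
weighted = competition_weighted_mean(ratings)
print(f"raw mean: {raw:.2f}, competition-weighted mean: {weighted:.2f}")
```

No data is thrown away here; ratings from single-chain markets still count, they just carry less weight than ratings from people who had real alternatives to compare against.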