Sunday, June 28, 2009

Research funding, or why we still haven't found a cure for cancer

Cancer has been known to medicine since the time of Hippocrates, and modern medicine has known and studied the causes of cancer since the mid-18th century. In 1971, Nixon announced a project to create a cure for cancer and (a la Kennedy, with regard to the moon mission) promised a definite cure within the next five years. Today, nearly 40 years and $105 billion of public investment later (private investment can be assumed to be at least a significant fraction of the public figure), we are no closer to finding a cure. In fact, after adjusting for the age and size of the population, the cancer death rate has dropped by only 5% in the last 50 years. Compare this with the nearly 60% drop in the death rates from diseases like pneumonia and influenza. Why is this the case?

Part of the reason is that cancer has multiple causes and we are not really sure about the true causal linkage between the various factors and the cancer cells misbehaving. Environmental factors cause some types, exposure to radioactivity causes others, tobacco is a well-known factor in mouth and lung cancer, and viruses cause still other types. The common thread linking all of these causal factors and the various types of cancer they cause is difficult to isolate. And therefore, while we continue to make improvements around the margin (getting people to live for a few additional months or years), a true cure has been elusive.

But another likely cause is the way in which research funding agencies have made investment prioritization decisions. The funds have invariably gone to small-budget, incremental-improvement projects, usually along previously established avenues of inquiry. Truly innovative approaches, and especially proposals that are risky from a project-success standpoint, have seldom obtained funding. The processes developed to select research projects have been good at avoiding funding truly bad research. By the same token, however, they have continued to fund projects that are conventional and low risk, and that as a result contribute only marginal improvements. A recent article in the New York Times sheds some more light on this topic.

My view is that this is quite a common problem (sub-optimal funds allocation) when funds are limited. It is true not only for cancer research in particular, or medical research in general, but even in the financial services industry that I am part of. The funds allocation agency feels pressure not to waste the limited funds and also to make sure that the maximum number of research projects get the benefit of the fund. Hence the push to fund projects that come from proven areas and are set up to make incremental improvements to those areas. This also leads to a tendency to parcel the funds out and distribute small quantities across a large number of projects. Paradoxically, what they should be doing (given the shortage of funds) is making the bold bet and funding those areas which may not be as proven but show the highest promise for overall success. Again, this happens in fields well beyond cancer research.

Challenging the budget status quo, the established way of thinking, is not easy to do. There will be people who will say no and be discouraging, rarely because they have something to lose but mostly because the tendency is to play safe. People usually do not get fired for taking the safe, conventional-wisdom-driven decision. It is the risk-takers who get panned if the risks do not play out as expected.

Thursday, June 25, 2009

Why I blog

I have been regular at maintaining the blog for the past month or so. My feelings have been mixed so far. On the one hand, it takes effort to keep up the writing day after day. One of my goals is to make sure that the blog remains fresh for future readers, and that freshness is totally a function of keeping up the effort of adding new and interesting material. With my day job and with the challenges of keeping up with the ever-growing demands of our 5-month-old, writing is not always easy.

But being a glass-half-full kind of guy, what has this exercise brought me?

For starters, it has got me to start writing again. I am a firm believer that writing is a great way of organizing your thoughts and making them more logical and structured. And it is a habit that I had at one time, lost at some point and am keen on regaining. Communication is an important skill in today's world. With all the clutter, the media-generated noise, the terabytes of data and messages flowing back and forth, and the Internet-driven distractions, it is important to cut through it all and reach out powerfully with one's words to make a difference. Gandhi had a difficult enough time getting his word out to millions of his countrymen and getting them united against the British. But that was close to a century back. Imagine Obama's difficulty in getting his thoughts out to people in today's hyper-information age. And the way you get better at communication is by keeping at it through weekdays and weekends, through work deadlines and my daughter's shrieks of excitement. Hopefully, I am getting better at this stuff.

The other big positive for me is that I am beginning to learn a lot more, and at a much faster pace, about my professional interests: math models and statistical inference. As Hal Varian, chief economist at Google, has famously remarked, the statistician's job is going to be the sexy job for the next ten years. And this field is evolving so rapidly that it is extremely critical to keep updating one's knowledge and skills and to remain ahead of the curve. In order to provide a stream of meaningful material for the audience of this blog, I have had to spend a good amount of my time reading and updating my own knowledge base. Just last morning, I managed to read an interesting article on multi-level modeling. This took me to a website dedicated to multi-level modeling at the University of Bristol. And the lecture notes in turn made me aware of some of the ways I could tackle some ticklish problems at work. (Look at this really cool lecture. It is a video link and needs Internet Explorer as the browser.) I have become more aware of the latest problems and solution kits out there in the last month than I had in the past several years. A huge plus for me.

So all in all, I am hoping to learn something out of all this and make at least a fractional improvement to what I want to add to the world. And hopefully keep my audience engaged and interested in the stuff I write. My writings are clearly not meant for the masses; I don't have any such hopes! The people who are likely to like my writing are going to be similar to me: numbers-obsessed, math-loving, tech-savvy geeks. And if I can make a fraction of the difference for my readers that this exercise is making for me, I will be a happy blogger.

Monday, June 22, 2009

... the Great Escape? Or the Great Deception?

In an earlier post, I commented on the now famous "Green Shoots" of recovery and the very real long-term threats to continued economic growth. It turns out that the so-called economic recovery is really more of a financial market recovery. Conventional wisdom holds that a financial-market turnaround precedes the real-economy turnaround by about 6 months. Early signs did point to this phenomenon. Market indices in both emerging and developed markets posted smart gains of 30% or more in the last 3 months. Corporate bond offerings began to surge, and even below-investment-grade offerings jumped up (and were well subscribed) in June.

However, some temperance seems to have set in of late. Emerging market indices like the Sensex and the Hang Seng are down at least 10-15% from their early June peaks. Likewise with the DJIA. The steady upward trend seen for the best part of the last 8 weeks seems to have been interrupted. The yield on 10-year US treasuries had gone up to nearly 4% but has now trended back down to about 3.5%, basically signalling that everything is not as hunky-dory as we expected. There is still a high demand for quality (the irony of it all being that quality is denoted by US treasuries!). The Economist notes that economic indicators have not all magically turned positive, which is what one would expect if the markets and the media are to be believed. According to the Economist,

The June Empire State survey of manufacturing activity in New York showed a retreat. German export figures for April showed a 4.8% month-on-month fall. The latest figures for American and euro-zone industrial production showed similar dips. American raw domestic steel production is down 47% year on year; railway traffic in May was almost a quarter below its level of a year earlier. Bankers say that chief executives seem a lot less confident about the existence of “green shoots” than markets are.

We shouldn't be either. For a bunch of reasons.
1. Losses are nowhere close to bottoming out. Credit default rates among corporates are expected to be higher than 11% for 2009 and to remain there through 2010.
2. At the individual level, unemployment is showing no signs of abating. There was a good article in the Washington Post today on how the economic recovery seems to be taking place in the absence of jobs. Check this link out. Unemployment is expected to be north of 10% and to remain there for a good part of 2009 and into 2010. Unemployment is closely linked with consumer confidence, so any sluggishness in the job market is going to impact consumer spending and, in turn, the rate of recovery of the economy.
3. Emerging markets were the promised land for the world economy, but not any longer (the markets don't seem to think so, however). Indian economic growth is expected to be the slowest in the past 6 years. With much more fragile safety nets in the Asian economic tigers, these economies are going to be even more careful while navigating out of the downturn.

In short, a long haul seems clear. What also seems clear is a fundamental remaking of entire industries. Financial services, automobiles and potentially health care are industries where a new business model is ripe for discovery. This should create many more opportunities for the data scientist, the topic of my next post.

Saturday, June 20, 2009

Monte Carlo simulations gone bad

In my series on stress testing models, I concluded with Monte Carlo simulation as a way of understanding the set of outcomes a model can produce and of checking that it can handle a wide set of inputs without breaking down. However, Monte Carlo simulations can be done in ways that, at best, are totally useless and, at worst, produce highly misleading outcomes. I want to discuss some of these breakdown modes in this post.

So, (drumroll) the top Monte Carlo simulation fallacies I have come across:
1. Assuming all of the model drivers are normally distributed
Usually the biggest fallacy of them all. I have seen multiple situations where people have merrily assumed that all drivers are normally distributed and hence can be modeled as such. For most events in nature (the heights and weights of human beings, the sizes of stars), it is fair to expect and find distributions that are normal or close to normal. Not so with business data. Because of the influence of human beings, business data tends to get pretty severely attenuated in places and stretched out in others. Now, there are a number of other important distributions to consider (which will probably form part of another post sometime), but assuming all distributions are normal is pure bunkum. This, though, is usually a rookie mistake! Move on to ...
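To make this concrete, here is a minimal sketch (with invented parameters) of what goes wrong: a right-skewed driver is replaced by a normal with the same mean and standard deviation, and the percentiles of the resulting distribution come out badly distorted.

```python
# A minimal sketch (hypothetical numbers) of how a normal assumption distorts results.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Suppose the "true" revenue driver is right-skewed (lognormal), as business data often is.
true_driver = rng.lognormal(mean=3.0, sigma=0.8, size=n)

# A modeler who only looks at mean and standard deviation might assume a normal instead.
normal_driver = rng.normal(loc=true_driver.mean(), scale=true_driver.std(), size=n)

for label, driver in [("skewed (lognormal)", true_driver), ("assumed normal", normal_driver)]:
    p50, p99 = np.percentile(driver, [50, 99])
    print(f"{label:>20}: median = {p50:8.1f}, 99th percentile = {p99:8.1f}")
# The normal assumption overstates the median and understates the extreme right tail,
# so any simulation built on it inherits the same distortion.
```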

2. Ignoring the probabilities of extreme tail events
Another quirk of business data is how frequently tail events astound us with both their size and their frequency. Just when you thought Q4 08's GDP drop of close to 6% was a once-in-a-hundred-years event, it goes and repeats itself the next quarter. Ditto with 10% falls in market cap in a day: guess what you see the next trading day! The short advice is, be very afraid of things that happen in the tails. Because these events occur so infrequently, fitted distributions are usually misleading in this space. So if you are expecting your model to tell you when things go bump in the night, you will be in for a rude shock when they actually go bump. But why go to the tails when there are bigger things lurking in the main body of the distribution, such as...
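Here is a hedged illustration, with made-up parameters, of how different the tails look under a thin-tailed normal versus a fat-tailed Student's t scaled to the same volatility; the point is the comparison, not the specific numbers.

```python
# How often does a "10% down day" show up under a normal vs. a fat-tailed Student's t?
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
daily_vol = 0.02  # assume a 2% daily standard deviation

normal_returns = rng.normal(0, daily_vol, n)
# Student's t with 3 degrees of freedom, rescaled to the same standard deviation.
df = 3
t_returns = rng.standard_t(df, n) * daily_vol / np.sqrt(df / (df - 2))

for label, r in [("normal", normal_returns), ("fat-tailed t", t_returns)]:
    print(f"{label:>12}: P(one-day drop > 10%) ~ {np.mean(r < -0.10):.6f}")
# The normal says such a drop is essentially impossible (a 5-sigma event);
# the fat-tailed distribution makes it rare but very much on the radar.
```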

3. Assuming that model inputs are independent
Again, this is an example of a lazy assumption. People make these assumptions because they are obsessed with the tool at hand and its coolness-coefficient and cannot be bothered to use their heads and use the tool to solve the problem at hand. I am going to have a pretty big piece on lazy assumptions soon. (One of my favourite soap-box items!) When people run Monte Carlo simulations, the assumptions and inputs to the model are usually correlated with each other to different degrees. This means that the distribution of outcomes you get at the end is going to be crunched together (probability-density wise) in some places and sparse in others. Assuming a perfectly even spread on either side of the mean is really not the goal here. The goal is to get as close an approximation of real-life distributions as possible. But then, if only things were that simple! Now, you could be really smart and get all of the above just right and build a really cool tool. You could then fall into the fourth fallacy of thinking ...
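As a sketch of why this matters, the toy example below (assumed means, volatilities and a 0.7 correlation) runs the same two drivers first as independent and then as correlated draws, and compares the spread of the resulting profit distribution.

```python
# Independent vs. correlated inputs feeding a toy profit model (all figures invented).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
mu = [100.0, 80.0]          # mean revenue and mean cost (hypothetical)
sd = [15.0, 10.0]

# Independent draws
rev_i = rng.normal(mu[0], sd[0], n)
cost_i = rng.normal(mu[1], sd[1], n)

# Correlated draws: revenue and cost tend to rise and fall together (rho = 0.7)
rho = 0.7
cov = [[sd[0]**2, rho * sd[0] * sd[1]],
       [rho * sd[0] * sd[1], sd[1]**2]]
rev_c, cost_c = rng.multivariate_normal(mu, cov, n).T

for label, profit in [("independent", rev_i - cost_i), ("correlated", rev_c - cost_c)]:
    p5, p95 = np.percentile(profit, [5, 95])
    print(f"{label:>12}: 5th pct = {p5:6.1f}, 95th pct = {p95:6.1f}")
# Positive correlation between revenue and cost narrows the profit distribution;
# ignoring it would overstate the spread of outcomes (negative correlation works the other way).
```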

4. That it is about the distribution or the tool. It is NOT! It is about what you do with the results of the analysis
The Monte Carlo simulation tool is indeed just that, a tool. The distributions produced at the end of running the tool are not an end in themselves; they are an aid to decision making. In my experience, a well-thought-out decision-making framework needs to be created to make use of the distribution outputs. The framework could go something like this. Take a framework for evaluating investment decisions that uses NPV. One version could be: I will make the investment only if a) the mean NPV is positive, b) fewer than 20% of the outcomes have a negative NPV, and c) fewer than 5% of the outcomes have an NPV worse than -$50 million. There's really no great science in coming up with these frameworks, but it has to be something that the decision maker is comfortable with, and it should address the uncertainty in outcomes.
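As a rough sketch of such a framework in code (the NPV distribution below is entirely made up, and I am reading rule c as losses worse than $50 million), the three rules can be checked mechanically against the simulation output:

```python
# Applying the three investment rules to a simulated NPV distribution (illustrative only).
import numpy as np

rng = np.random.default_rng(7)
npv = rng.normal(loc=30.0, scale=60.0, size=50_000)  # NPVs in $ millions, purely hypothetical

rule_a = npv.mean() > 0                  # a) mean NPV is positive
rule_b = np.mean(npv < 0) < 0.20         # b) fewer than 20% of outcomes lose money
rule_c = np.mean(npv < -50.0) < 0.05     # c) fewer than 5% lose more than $50 million

print(f"mean NPV = {npv.mean():.1f}, P(NPV < 0) = {np.mean(npv < 0):.1%}, "
      f"P(NPV < -50) = {np.mean(npv < -50.0):.1%}")
print("Invest" if (rule_a and rule_b and rule_c) else "Pass")
```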

So, have you come across some of these fallacies in your work? How have you seen the Monte Carlo tool used and misused in your work? And what decision making frameworks (if any) were allied with this tool to drive good decisions?

Wednesday, June 17, 2009

Why sitting down and talking does not help - (and what it means for the data scientist)

What is the sane person's advice to two people who cannot agree on something? It is usually to sit down and resolve their differences. A set of recent studies seems to suggest that this doesn't really help.

Some recent studies have shown that when people with strongly opposed positions are put together to talk it out, the conversation makes them even more entrenched in their opinions. This is the point put forward by Cass Sunstein in his book Going to Extremes. Even when groups or people with opposing views are presented with objective evidence, they tend to "see" what they want to believe in the data and ignore the rest.

Another study cited by the Economist struck a similar message. The study was looking at self-help books which stress positive thinking, and their impacts on people. What the study found was that positive thinking only helps for people who are predisposed to thinking positively. The study can be credited to Joanne Wood of the University of Waterloo in Canada and her colleagues. The researchers report in Psychological Science journal that when people with high self-esteem are made to repeat positive reinforcing messages, they do tend to take more positive positions (on standardized tests) than people who do not repeat positive reinforcing messages.

So far so good. It sounds as though positive reinforcement helps. But when the test was done on people with low self-esteem, the results were quite the opposite. People who repeated the positive reinforcing message took less positive positions than the ones who did not repeat the message. This seems to imply that positive reinforcement actually hurts when applied to people who are inclined to believe otherwise. To me, this sounds like another example of people entrenching towards their own biases. When people with entrenched positions are forced to take a contrary position (or look at objective data), they tend to entrench even further in their original positions.

So what are the implications for the data scientist from all of this?
Mostly that predisposed positions produce a sort of "blindness" to objective data. We all suffer from confirmation bias; we like to believe what we like to believe. It therefore takes great effort to look at data objectively and take in what the data is telling us, rather than putting on whatever spin suits us. The data scientist needs to exercise tremendous discipline here. It takes a superhuman effort not to succumb to our biases, to resist believing what we want to believe, and to take a genuine interest in forming an objective opinion.

One of the bigger learnings for me in all of this is the importance of give-and-take in making progress on any issue. Because of the entrenchment bias, people seldom change their views (i.e., come around to your way of thinking) based on objective data and logical persuasion. They only come along when they have skin in the game, and for that to happen, there has to be an active element of give-and-take between the two parties. Which makes me even more admiring of GOOD politicians and diplomats. Their ability to keep moving forward on an issue in a bipartisan manner comes from their skill in give-and-take, which is what overcomes the entrenchment bias.

Sunday, June 14, 2009

Connecting to the data centers - Netbooks

An apt follow-up to the post on data centers should be one on the evolving tools to access those data centers. I went to Costco this morning and came across netbooks. These are just stripped-down laptops (or notebooks, if you will) that are perfect for accessing the Internet and getting your work done. Quite the rage with the college crowd, apparently.

This is an attempt by the computer hardware industry to break the price barrier on portable computers. Prices used to be about $1,500 and came down to about $1,000 four or five years back. But then the barrier stayed there for a while, with manufacturers adding feature on top of feature but refusing to reduce the price. Till netbooks came along. These devices are priced at about $200-$350 and are pretty minimalist in their design. They have a fairly robust processor and a good-sized keyboard and screen. No CD drive, for only neanderthals use a CD. But they are loaded when it comes to things like a webcam, WiMax, etc. Two parallel phenomena drove the netbook's evolution. One was the high-profile $100 laptop for third-world kids that never really went anywhere. The other was the increase in broadband penetration in the United States.

Another driver (probably) is the coming of age of the Millennial generation. When I grew up, the cool computer company of our times was Microsoft (or Apple, if you hated Microsoft). Both these companies had built their business models on paid products, products that needed upgrades and that cost money. We therefore had a certain reverence towards these companies and an implicit acceptance of their pay-for-use business model.

Today's generation has come of age in the age of Google, Linux, Napster and social networking sites, all of which are free. Today's kids feel less beholden to the idea of a computer company putting out formal products which you need to pay for and which get upgraded once every two years, for which you need to pay again. In today's age, the idea of freeware and of products that actively evolve with use is becoming more and more accepted. Ergo, the netbook.

Enough of my pop psychology for now. Anyway, netbooks are really cool gadgets and I am tempted to get one really soon. The Economist had a good article on the subject. Let me know if you are an early adopter of netbooks and what your experience has been so far.

Friday, June 12, 2009

Stress testing your model - Part 3/3

We discussed two techniques of ensuring the robustness of models in two previous posts. In the first post, we discussed out-of-sample validation. In the second post, we discussed sensitivity analysis. I find sensitivity analysis to be a really valuable technique for ensuring the robustness of model outputs and decisions driven by models - but only when it is done right.

Another, more computing-intensive technique for ensuring model output robustness is Monte Carlo simulation. Monte Carlo simulation basically involves running the model literally thousands of times, changing each of the inputs a little with every run. With advances in computing power, and with that power now within reach of most modelers and researchers, it has become fairly easy to set up and run such a simulation.

So let's say we have a model with 3 inputs, and let's assume that each input is varied in 10 steps over its entire valid range. The model will then produce 1000 different outputs for the various combinations of inputs (1000 = 10 x 10 x 10), each output having a theoretical probability of 0.001.
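A small sketch of that setup is below; the model, the input names and the ranges are all placeholders, but it shows how the 10 x 10 x 10 grid of equally weighted scenarios comes about.

```python
# Three hypothetical inputs, each swept over ten values, giving 1,000 equally weighted scenarios.
import itertools
import numpy as np

def model(price, volume, cost):
    """Toy model standing in for whatever the real model computes."""
    return (price - cost) * volume

price_grid  = np.linspace(8.0, 12.0, 10)      # assumed valid range for price
volume_grid = np.linspace(800, 1200, 10)      # assumed valid range for volume
cost_grid   = np.linspace(5.0, 9.0, 10)       # assumed valid range for unit cost

outputs = [model(p, v, c) for p, v, c in itertools.product(price_grid, volume_grid, cost_grid)]
print(len(outputs), "scenarios, each with weight", 1 / len(outputs))
print("output range:", min(outputs), "to", max(outputs))
```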

How are the inputs varied?
Typically by using a distribution that varies the inputs in a probabilistic manner. The input distribution is the most important assumption that goes into the Monte Carlo simulation. The typical approach is to assume that most events are normally distributed. But the reality is that the normal distribution is usually observed only in natural phenomena. In most business applications, distributions are usually skewed in one direction. (Take loan sizes on a financial services product, like a credit card. The distribution is always skewed towards the higher side, as balances cannot be less than zero but can take really large positive values.)
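As a hedged illustration of the credit card example (with invented parameters), compare a right-skewed lognormal for balances against a normal fitted to the same mean and standard deviation:

```python
# A skewed lognormal for balances vs. a normal with the same mean and standard deviation.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

balances = rng.lognormal(mean=7.5, sigma=0.9, size=n)     # skewed, never below zero
naive = rng.normal(balances.mean(), balances.std(), n)    # a "fit the mean and sd" normal

print(f"lognormal: mean = {balances.mean():,.0f}, median = {np.median(balances):,.0f}, "
      f"share below zero = {np.mean(balances < 0):.1%}")
print(f"normal   : mean = {naive.mean():,.0f}, median = {np.median(naive):,.0f}, "
      f"share below zero = {np.mean(naive < 0):.1%}")
# The lognormal's mean sits well above its median (the long right tail), while the
# normal happily generates impossible negative balances.
```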

Correlation or covariance of the inputs
In a typical business model, inputs are seldom independent; they have various degrees of correlation. It is important to keep this correlation in mind while running the scenarios. By factoring the covariance of the inputs explicitly into the simulation, the output distribution is weighted towards the combinations of inputs that actually tend to occur together.
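One common way to do this, sketched below with an invented correlation matrix, is to draw independent shocks and multiply them by the Cholesky factor of the target correlation matrix before feeding them into the model:

```python
# Imposing a target correlation on simulated inputs via a Cholesky factorization.
import numpy as np

rng = np.random.default_rng(11)
n = 50_000

corr = np.array([[1.0, 0.6, 0.3],
                 [0.6, 1.0, 0.5],
                 [0.3, 0.5, 1.0]])
L = np.linalg.cholesky(corr)

z = rng.standard_normal((n, 3))       # independent shocks
shocks = z @ L.T                      # now carry the target correlation

print(np.round(np.corrcoef(shocks, rowvar=False), 2))
# These correlated shocks can then be mapped onto whatever marginal distributions
# the inputs actually follow (e.g., via their inverse CDFs) before each model run.
```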

Of course, as with any piece of modeling, there are ways in which this technique can be misused. Some of my pet gripes about MC simulation will form the subject of a later post.

Wednesday, June 10, 2009

The massive cloud of 1s and 0s

I read an interesting article in the New York Times about the growth in data centers as we become an increasingly Internet-based world. Some of the numbers around data center capacities at places like Microsoft and Google, and at e-commerce or bidding sites like Amazon and eBay, were quite astounding. Not to mention the various electronic financial exchanges in the world.

Microsoft has more than 200,000 servers, and its massively bigger competitor has to have more. And this massive data center infrastructure is already beginning to have a serious impact on the power requirements of our new world. More power for the data centers that hold our uploaded pictures of the weekend do on Facebook, or an intact polar ice cap. You choose!

Tuesday, June 9, 2009

Green shoots ... or bust

Equity markets, both international and US, seem to have taken to heart the signs of the world economy bottoming out and the rebound seen in Asia. The Brazilian, Chinese and Indian stock markets are at least 50% higher than their Q4 2008 lows. The DJIA went down to the 6500 level for a while but has since rebounded, gyrating around the 8500 mark and, on occasion, flirting with the 9000 level.

The rate of job loss is falling in the US, and despite a glut in world oil supply, crude prices are nearing the $70 mark after crashing down to the $30s only recently. Are we past the worst then? Despite lingering weakness in Western Europe (of a severity arguably not seen since the start of WW II!), the signs of economic recovery seem unmistakable.

Is the US consumer then going to go back to his/her free-spending ways? While we seem to have come up a fair bit from the Q4 depths, at least from a consumer confidence standpoint, there could be a few big obstacles to growth.
1. The budget deficit. With the famous American aversion to taxes and the growing burden of entitlements (driven mainly by healthcare costs) as the baby-boomer generation retires, the deficit is only going to get worse.
2. The cost of borrowing to feed the deficit. The US Treasury's pride of place as the investment of the highest quality could come under threat as domestic debt grows as a percentage of GDP. The US government will need to borrow ever more and pay higher interest rates on that borrowing. The higher interest burden is going to crimp its ability to make productive investments.
3. Finally, with increasing protectionism and government involvement in the economy, the vitality of US business enterprise to identify and capitalize on opportunity looks to be suppressed for the next several years.

A number of prognosticators have made long-range predictions for the US economy in this article. An interesting read, as the forecasters have taken a 5-10 year view rather than the 6-12 month view typically taken by realtor types. Another interesting article, from the Economist, is about the long-term speed limit of the US economy. Summarizing the article,

According to Robert Gordon, a productivity guru at Northwestern University, America’s trend rate of growth in 2008 was only 2.5%, the lowest rate in its history, and well below the 3-3.5% that many took for granted a few years ago. Without factoring in the financial crisis, Mr Gordon expects potential growth to fall to 2.35% over the coming years.

Sunday, June 7, 2009

Stress testing your model - Part 2/3

Continuing on the topic of risk management for models. After building a model, how do you make sure it remains robust under working conditions? More importantly, how do you make sure it works well under extreme conditions? We discussed the importance of independent validation for empirical models in a previous post. In my experience, model failures have been frequent when the validation process has been superficial.

Now, I want to move on to sensitivity analysis. Sensitivity analysis involves understanding the variability of the model output as the inputs to the model are varied. The inputs are changed by plus or minus 10 to 50% and the output is recorded. The range of outputs gives a sense of the various outcomes that can be anticipated and that one needs to prepare for. Sensitivity analysis can also be used to stress test the other components of the system which the model drives. For example, let's say the output of the model is a financial forecast that goes into a system used to drive deposit generation. The sensitivity analysis output gives an opportunity to check the robustness of that downstream system. By knowing that one might occasionally need to generate deposits at 4-5 times the usual monthly volume, one can prepare accordingly.
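A minimal one-at-a-time sketch of this kind of analysis is below; the forecast function, the input names and the base values are placeholders, not a real deposit model.

```python
# Nudge each input by +/-10% and +/-50% while holding the others at their base values.
def forecast(volume, rate, attrition):
    """Toy stand-in for the deposit-generation forecast."""
    return volume * rate * (1 - attrition)

base = {"volume": 1_000_000, "rate": 0.04, "attrition": 0.10}

for name in base:
    for shock in (-0.5, -0.1, 0.1, 0.5):
        scenario = dict(base)
        scenario[name] = base[name] * (1 + shock)
        print(f"{name:>9} {shock:+.0%}: forecast = {forecast(**scenario):,.0f}")
```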

Now, sensitivity analysis is one piece of stress testing that is usually misdirected and incomplete. Good sensitivity analysis looks at both the structural components of the model and the inputs to the model. Most sensitivity analyses I have encountered stress only the structural components. What is the difference between the two?

Let's say, you have a model to project the performance of the balance sheet of a bank. One of the typical stresses that one would apply is to the expected level of losses on the loan portfolio of the bank. A stress of 20-50% and sometimes even 100% increase in losses is applied and the model outputs are assessed. When this is done consistently with all the other components of the balance sheet, you can get a sense of the sensitivity of the model to various components.

But that's not the same as sensitivity to inputs. Because inputs are rooted in real-world phenomena, their impact is usually spread across multiple components of the model. For example, if the 100% increase in losses were driven by a recession in the economy, there would be other impacts to worry about. A recession is usually accompanied by a flight to quality among investors, so under a recessionary outlook the value of equity holdings could crash as well, as investors move out of equities (selling) and into more stable instruments. A third impact could be that of higher capital requirements on the value of traded securities. As other banks face the same recessionary environment, their losses could increase to such an extent that a call to raise capital becomes inevitable. How does one raise capital? The easiest route is to liquidate existing holdings, driving a further fall in the market prices of traded securities and putting yet more stress on the balance sheet.
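A toy sketch of the difference (all figures invented): stressing loan losses in isolation versus a recession scenario that moves losses, equity values and securities prices together.

```python
# Losses-only stress vs. a recession scenario that hits several components at once.
def capital_impact(loan_losses, equity_value, securities_value):
    """Toy stand-in for the hit to the bank's capital position (in $ billions)."""
    return loan_losses + (100.0 - equity_value) + (200.0 - securities_value)

# Base case: expected losses, equity and securities at their current marks (hypothetical)
base = capital_impact(loan_losses=10.0, equity_value=100.0, securities_value=200.0)

# Naive stress: only loan losses double
naive = capital_impact(loan_losses=20.0, equity_value=100.0, securities_value=200.0)

# Recession scenario: the same doubling of losses, plus an equity sell-off and
# forced liquidations depressing traded securities
recession = capital_impact(loan_losses=20.0, equity_value=75.0, securities_value=180.0)

print(f"base: {base:.0f}  |  losses-only stress: {naive:.0f}  |  recession scenario: {recession:.0f}")
```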

So, running a 50% increase in loan losses in isolation is a purely illusory scenario. When loan losses increase, one has to consider what the fundamental driver could be and how that driver might impact other portions of the balance sheet.

The other place where sensitivity analysis is often incomplete is in ignoring the impact on upstream and downstream processes and strategies. A model is never a stand-alone entity. It has upstream sources of data and downstream uses of its output. So if the model has to face situations where there are extreme values of inputs, what could the implications be for upstream and downstream strategies? These are the questions any serious model builder should be asking.

This discussion on sensitivity analysis has hopefully been eye-opening for modeling practitioners. We will go on to a third technique, Monte Carlo simulation, in another post. But before we go there, what are other examples of sensitivity analyses that you have seen in your work? How has this analysis been used effectively (or otherwise)? What are good graphical ways of sharing the output?

Saturday, June 6, 2009

Stress testing your model - Part 1/3

So, you've built a model. You have been careful about understanding your data, transforming it appropriately, using the right modeling technique and doing an independent validation (if it is an empirical model), and now you are ready to use the model to make forecasts, drive decisions, etc.

Wait. Not so fast. Before the model is ready for prime time, you need to make sure that the model is robust. What defines a robust model?
- the inputs should cover not just the expected events but also extreme events
- the model should not break down (i.e., mispredict) when the inputs turn extreme. (Well, no model can be expected to perform superbly when the inputs turn extreme; if models could do that, the events wouldn't be termed extreme. But the worst thing a model can do is provide an illusion of normal, business-as-usual outputs when the inputs are extreme.)

I want to share some of the techniques that are used for understanding the robustness of the models, what I like about them and what I don't.

1. When it comes to empirical models, one of the most useful techniques is out-of-sample validation. This is done by building the model on one data set and validating the algorithm on another. For the truest validation, the validation dataset should be independent of the build sample and drawn from a different time period. "Check-the-box" validation is when you validate the model on a portion of the build sample itself. Such validation often holds, and just as often offers a false sense of security, because in real terms you have not validated anything.
Caveat: Out-of-sample validation is of no use if the future is going to look very different from the past. Validating a model that predicts the probability of mortgage default using data on conventional mortgages would have been of no use in a world where no-documentation mortgages and other exotic-term mortgages were being marketed.
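A minimal sketch of the two flavours of validation is below, assuming a pandas DataFrame of loans with a hypothetical 'origination_date' column; the point is the contrast between a random holdout and a genuinely out-of-time sample.

```python
# Random holdout ("check-the-box") vs. out-of-time validation split (illustrative sketch).
import pandas as pd
from sklearn.model_selection import train_test_split

def split_for_validation(df: pd.DataFrame, cutoff: str = "2006-12-31"):
    # Weaker, "check-the-box" validation: a random holdout from the same build sample.
    build, random_holdout = train_test_split(df, test_size=0.3, random_state=0)

    # Stronger validation: build on loans originated up to the cutoff,
    # validate on loans originated after it (a genuinely out-of-time sample).
    in_time = df[df["origination_date"] <= cutoff]
    out_of_time = df[df["origination_date"] > cutoff]
    return build, random_holdout, in_time, out_of_time
```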

The other two approaches I want to discuss are Sensitivity Analysis and Monte Carlo Simulation. I will cover them in subsequent posts.

Tuesday, June 2, 2009

Using your grey cells ...

Growing up as a singularly unathletic child, my favourite form of recreation was usually books. And a favourite amongst those books was Agatha Christie's murder mysteries featuring Hercule Poirot. Poirot fascinated me. I guess there was the element of vicariously living through the act of evil being punished by good (which is probably what attracts us to all mystery writers).

But another aspect that made Poirot more appealing than the more energetic specimens like Sherlock Holmes was his reliance on "ze little grey cells". The power of analytical reasoning practiced through the simple mechanism of question and answer being used to solve fiendishly difficult murders. How romantic an idea!

I recently came across an intriguing set of problems which require rigorous exercise of the grey cells. Called Fermi problems, these are just plain old estimation problems. Typical examples go like "estimate the number of piano tuners in New York City" or "estimate the number of Mustangs in the US". They require one to start out with some basic facts and figures and then get to the answer through a number of logical reasoning steps. For the piano tuner question, it is usually good to start off with some estimate of the population of New York City; starting with a ridiculous number (like 1 million or 100 million) will definitely lead you to the wrong answer. So what the estimation exercise really calls for is some general knowledge and some ability to think and reason logically.

The solution to the piano tuner problem typically goes as follows:
- Number of people in NYC -> Number of households
- Number of households -> Number of households with a piano
- Number of pianos -> Assumed tuning frequency -> Number of pianos that need tuning in a given period
- Assuming a certain number of pianos that can be tuned in a day and a certain number of working days, you get to the number of likely tuners
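Here is the same chain of reasoning as a few lines of arithmetic; every number is a rough assumption, which is exactly the point of a Fermi estimate.

```python
# One version of the piano tuner arithmetic; all inputs are rough assumptions.
people_in_nyc         = 8_000_000
people_per_household  = 2.5
households_with_piano = 1 / 20          # assume 1 in 20 households owns a piano
tunings_per_year      = 1               # assume each piano is tuned about once a year
tunings_per_day       = 4               # assume a tuner can do ~4 tunings a day
working_days_per_year = 250

pianos = people_in_nyc / people_per_household * households_with_piano
tunings_needed = pianos * tunings_per_year
tuners = tunings_needed / (tunings_per_day * working_days_per_year)
print(round(tuners))   # on the order of a couple of hundred tuners
```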

The exercise definitely teaches some ability to make logical connections. The other thing this type of thinking teaches is parsimony of assumptions. One could make the problem more complex by assuming a different population for each of NYC's boroughs, different estimates for the proportion of households with pianos in Manhattan versus the Bronx, and so on. In practice, however, these assumptions only introduce false precision into the answer. Just because you have thought through the solution in an enormous degree of detail doesn't necessarily make it right.

Some typical examples of Fermi problems can be found at this link. Enjoy the experience. And I would love to hear some of the thoughts that strike you as you try to solve these problems. Some of the "a-ha" moments for me were around the parsimony of assumptions and the need to find the point of greatest uncertainty and fix it at the least cost, so as to narrow down the range of answers.

The exercise overall taught me a fair bit about modeling: the way we approach modeling problems, ranges of uncertainty and how we deal with them, and parsimony of assumptions.
