Big Data Analytics: December 2009

Tuesday, December 29, 2009

A serious problem - but analytics may have some common-sense solutions

My family and I just got back from an India vacation. As always, we had a great time and, as always, the travel was painful, partly because of its length and partly because of all the documentation checks at various points in the journey. But in hindsight, I am thankful that we were back in the States before the latest terrorist attack, on the NWA jetliner to Detroit, took place. A Nigerian man, Umar Farouk Abdulmutallab, tried to set off an explosive device but thankfully did not succeed.

Now apparently, this individual had been on the anti-terrorism radar for a while. He was on the terrorist watch-list but not on the official no-fly list. Hence, he was allowed to board the flight from Amsterdam to Detroit, where he tried to perpetrate his misdeed. The events have raised a number of valid questions about the job the TSA (the agency in charge of ensuring safe air travel within and to/from the US) is doing in spotting these kinds of threats. There were a number of red flags in this case. A passenger who had visited Yemen - a place as bad as Pakistan when it comes to providing a safe haven for terrorists. A ticket paid for in cash. Just one carry-on bag and no bags checked in. A warning coming from this individual's family, no less. A visa denied by Britain - another country that has as much to fear from terrorism as the US. The question I have is: could more have been done? Could analytics have been deployed more effectively to identify and isolate the perpetrator? And how could all of this be achieved without giving a very overt impression of profiling? A few ideas come to mind.

First, a scoring system to constantly update the threat level of individuals and provide greater precision in understanding the threat posed by an individual at a certain point in time. A terror list of 555,000 is too bloated and is likely to contain a fair number of false positives. This model would use the latest information about the traveler, all of which can be gathered at the time of travel or before travel. Is the traveler a US citizen or a citizen of a friendly country? (US Citizen or Perm Resident = 1, Citizen of US ally = 2, Other countries = 3, Known terrorist nation = 5) Has the person bought the ticket in cash or by electronic payment? (Electronic payment = 1, Physical instrument such as a cheque = 2, Cash = 5) Does the person have a US contact? Is the contact a US citizen or a permanent resident? Is the person traveling to a valid residential address? What are the countries the individual has visited in the last 24 months? And so on. You get the idea. Now the weights that have been attached are quite arbitrary to start with, but they can always be adjusted as the perception of these risk factors changes and our understanding evolves.

Now what needs to be done is to update the parameters of this model every 3-6 months or so. Then every individual in the database, as well as every person traveling, needs to be scored using this model, and high scorers (high risk of either having connections to a terrorist network or traveling with some nefarious intent) can be identified for additional screening and scrutiny. These are the types of common-sense solutions that can be deployed to solve these types of ticklish problems. When the problem has been reduced from 555,000 people, on each of whom you need to spend the same amount of time, to one where the amount of scrutiny can be scaled to the propensity to cause trouble, it suddenly becomes a lot more tractable.
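To make the idea concrete, here is a minimal sketch of such a score in Python. It uses only the two factors for which I gave point values above (citizenship and payment method) and simply adds them up. The dictionary keys, function names, sample travelers and the screening threshold are all hypothetical, and the weights are the same arbitrary starting values, so treat this as an illustration rather than a real screening model.

```python
# Illustrative additive risk score. The point values are the arbitrary
# starting weights from the post; every name below is hypothetical.
CITIZENSHIP_POINTS = {
    "us_citizen_or_resident": 1,
    "us_ally": 2,
    "other": 3,
    "known_terrorist_nation": 5,
}
PAYMENT_POINTS = {
    "electronic": 1,
    "cheque": 2,
    "cash": 5,
}


def risk_score(traveler):
    """Sum the points for each factor recorded for a traveler."""
    return (CITIZENSHIP_POINTS[traveler["citizenship"]]
            + PAYMENT_POINTS[traveler["payment"]])


def rank_for_screening(travelers, threshold):
    """Return (name, score) pairs at or above the threshold, highest first."""
    scored = [(t["name"], risk_score(t)) for t in travelers]
    flagged = [pair for pair in scored if pair[1] >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Made-up travelers, purely for illustration.
travelers = [
    {"name": "A", "citizenship": "us_citizen_or_resident", "payment": "electronic"},
    {"name": "B", "citizenship": "other", "payment": "cash"},
    {"name": "C", "citizenship": "us_ally", "payment": "cheque"},
]

# The threshold of 6 is an assumption; in practice it would be tuned.
print(rank_for_screening(travelers, threshold=6))
```

Re-weighting every 3-6 months then amounts to editing the two point tables and re-running the scoring over the whole database. More factors (US contact, destination address, travel history) would be added as further tables in the same way.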

Monday, December 28, 2009

Some end of year reading

1. Krugman on America's own lost decade - link here
The usual Krugman rant on how things are going downhill and accelerating.

2. The Freakonomics blog on the practice of not inflation-adjusting stock returns - link here
Stock returns are seldom adjusted for inflation, transaction costs and taxes. While savvy investors usually do account for these factors, it is easy to be misled unless one reads the fine print.

3. How did buy and hold do in the last 10 years? link here
It is unfair to ask a stock picker to pick just one stock. It makes for good headlines but does not really allow stock pickers to demonstrate their skills. The probability that any one company could be impacted by freak events is usually pretty high.

4. Health Statistics and the mammogram controversy - link here (from the WSJ) and here from the Numbers Guy blog
Reading any kind of reporting coming from the US (except sports, maybe) has become a painful drag through the ideology of the author.

5. Happiness - State of mind or state of body? - link here
An interesting 'light' piece of reading. Turns out that they are one and the same thing.

Saturday, December 19, 2009

The place of Systems Modeling in Analytics

When one talks about predictive analytics, the typical thought process goes in the direction of regression, neural nets and data mining techniques. These are techniques that savvy marketers (consumer product companies, banks) have been using for close to two decades now to build insights about consumer behaviour. Systems modeling, or Systems Dynamics, is not something that immediately springs to mind.

So what is systems modeling all about? Systems modeling is creating a mathematical representation of a real-world phenomenon that covers as wide a range of inputs as feasible, along with the most valuable outputs. The systems model tries to explain how the inputs translate to outputs. Where it differs from a statistical predictive model is in its purpose: the systems model does not try to explain variance in the output. It instead tries to establish structural relationships between the inputs and the output. The model then stress-tests those structural relationships by varying the inputs and looking at the impact on the output.

A good example of a subject that can be systems-modeled (my verb!) is the problem of terrorism. The problem has different inputs: unhappy people, territorial disputes, foreign powers wanting to create trouble, funding, media coverage, etc. The immediate output is various acts of terrorism such as assassinations of leaders, suicide bombings, etc. It might be feasible to build a model that puts a structure on how these various inputs combine and interact with one another to cause the outputs. (If one goes back over the past 150 years, there should be plenty of data points.) Another way of looking at the output is a more holistic view that measures the damage done in terms of lives lost, economic damage incurred, etc.

What would be the purpose of this model? In my opinion, the value of such a model lies less in saying where the next terrorist strike is going to be, or how big it is going to be. (This is incidentally what a classic statistical model would try to do.) Rather, the model should try to explain what confluence of factors produces a large output event (lives lost, economic damage) and how some of those factors can be controlled ONCE an insurgency is already underway. The hypothetical model I am talking about does not try to predict, but rather to strengthen our understanding of the system dynamics. The model would have a PoV on which inputs can be controlled and to what extent.

The model would then be used to understand how a large impact event can be prevented or its impact minimized. So if the federal government had $100 billion to spend, how much should it spend on homeland security vs. promoting a positive image of the United States through foreign media? The model might tell us that it is pointless to spend more than, say, $500 million on installing sophisticated software to block large untraced wire transfers, as there are other ways in which funding can reach the perpetrators of a terrorist act. So trying to control the funding for an insurgency through sophisticated money-laundering and layering detection algorithms may be pointless if the actual money gets exchanged through a non-electronic channel.
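A toy version of this kind of reasoning can be sketched in a few lines of Python. Everything in it is an assumption made up for illustration: the three levers, the ceiling on how much each can reduce damage (`max_reduction`), the rate at which returns diminish (`scale`, in $ billions), and the exponential saturation form itself. The point is only to show how a structural model lets you vary an input and watch the output, not to suggest real numbers.

```python
import math

# Hypothetical structural model: each lever reduces expected damage, but only
# up to a ceiling ("max_reduction") and with diminishing returns governed by
# "scale" (in $ billions). All values are invented for illustration.
LEVERS = {
    "wire_transfer_monitoring": {"max_reduction": 0.05, "scale": 0.05},
    "homeland_security": {"max_reduction": 0.40, "scale": 30.0},
    "trauma_care": {"max_reduction": 0.30, "scale": 20.0},
}


def damage_fraction(spend):
    """Fraction of baseline damage remaining, given spend in $ billions."""
    remaining = 1.0
    for lever, params in LEVERS.items():
        dollars = spend.get(lever, 0.0)
        saturation = 1 - math.exp(-dollars / params["scale"])
        remaining *= 1 - params["max_reduction"] * saturation
    return remaining


def marginal_benefit(spend, lever, step=0.1):
    """Drop in damage fraction from adding `step` $ billions to one lever."""
    bumped = dict(spend)
    bumped[lever] = bumped.get(lever, 0.0) + step
    return damage_fraction(spend) - damage_fraction(bumped)


# One way of splitting a $100 billion budget (again, purely illustrative).
spend = {
    "wire_transfer_monitoring": 0.5,
    "homeland_security": 50.0,
    "trauma_care": 49.5,
}

for lever in LEVERS:
    print(lever, marginal_benefit(spend, lever))
```

With these made-up parameters, the wire-transfer lever is essentially saturated at $500 million, so the next dollar does more good elsewhere. A real model would need its structure and parameters argued from evidence, which is exactly the hard thinking the approach demands.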

So an agency interested in curbing terrorism might be better advised to, say, over-invest in trauma care facilities and emergency services in vulnerable areas. That way, when a strike does take place, medical help for the people affected is close at hand and casualties are minimized.

Why am I writing all this? Analytical problem solving is not just about fancy statistical algorithms or cool math; it is also about thinking hard about problems and creating their mathematical representations - and then being crystal clear about what those mathematical representations can and cannot do. This is where the systems modeling approach can be a very effective part of the arsenal of a business modeler.

I'll close out with a couple of links which prompted this wave of thinking on the subject. One is a paper in the journal Nature where the authors have presented a statistical model of insurgency events. The link is here. It's a gated article.

The following link has a very good critique on the article.

Monday, December 14, 2009

The new lean economy

I have commented earlier (link here) on the phenomenon of the jobless recovery that the US economy (and definitely many of the more 'open' economies) is likely to be facing in the next few years. Of course, this assumes no serious effort by the government to create jobs through stimulus-like efforts - though there is going to be a limit to that as well, given the relation of stimulus efforts to future debt creation.

In my opinion, the 2008-09 Great Recession has forced companies to seriously evaluate their cost structures. A lot of what passed muster before (in the fat bubble years) has been cut, and companies have begun to realize serious benefits from cutting out fat, leveraging efficiencies at the workplace by eliminating redundancies, moving their applications to open source platforms and so on. And my intuition is that many of these changes are not going to be just a temporary reaction to the downturn. Companies are seeing that the quality of output has not suffered significantly, either because fewer people are doing the work or because the work is no longer being done by expensive software. Thomas Friedman, in a piece in the New York Times, wrote that the Great Recession has also brought about a Great Inflection.

According to Friedman, the Great Inflection is "the mass diffusion of low-cost, high-powered innovation technologies — from hand-held computers to Web sites that offer any imaginable service — plus cheap connectivity. They are transforming how business is done." Friedman talks about two examples in his piece. One is a small, not-for-profit that needs to create an ad campaign. Given constrained budgets, the need of the hour is to be innovative, but with cost efficiencies firmly in mind. The ad creator uses a mix of collaboration tools (enabled by cheap and high throughput communication), online sourcing (through the availability of online marketplaces for media products) and multimedia editing (enabled by software and hardware improvements) to deliver a solution that is both innovative and appealing as well as one that fits in the client budget.

The second example, of the furniture manufacturer Ethan Allen, talks about the transformations the business has had to make to drive productivity improvements. Some of the transformations have been traditional: workforce reductions of over 25%, multiskilling of the remaining workers to make them more fungible, consolidation of manufacturing and process engineering. The company has also adopted non-conventional means to conserve cash and survive. These have included moving a lot of the advertising activity in-house, leveraging the multimedia desktop tools that are available today.

Finally, Friedman makes the point that the flow of credit (which is still very constrained) would make these companies create jobs. I disagree. I think it is going to take a lot for companies to start hiring again. And when they do, the hires are going to be the kind of multiskilled talent that now makes up the workforce at Ethan Allen. Companies across the spectrum have tasted blood - of keeping productivity and output high and costs low. They are not likely to go back to being fat again anytime soon.

In Indian banks, I am seeing a related trend in the growth of branches. Most big banks, like HDFC Bank, ICICI Bank and even the venerable State Bank of India, are expanding their branch networks. But the branches are increasingly being run with lean staffing. Staff are usually multi-skilled. Specialists are assigned across branches and are mobile. As a customer, if you need any specialized service, the representative at the branch contacts the specialist, who then makes an appointment within 24 hours. Instead of having committed staff in every branch, the staffing model comprises fungible generalists allotted to branches and shared, mobile specialists across branches.

Tuesday, December 8, 2009

Too Big To Fail - contd.

I have commented earlier on the TBTF doctrine.

Recently, I came across a couple of other references on the TBTF situation and what to do about it. The first, from the FT, has the author Willem Buiter presenting a slew of solutions for banks becoming TBTF. (Interesting how this abbreviation seems to have taken on a life of its own!) Definitely worth a more detailed read, as the author goes into a fair degree of detail on some of the probable solutions.

The second is a novel way of valuing the benefit that banks get from becoming TBTF. The approach (proposed by two economists, Elijah Brewer and Julapa Jagtiani, from the Philadelphia Fed) argues that the benefit banks expected to get can be measured by the acquisition premium these very banks paid along their journey to becoming TBTF. The estimate of this premium (looking at acquisitions from 1991 to 2004) is about $14 billion. This link references the actual paper written by the economists.

Given my obsession with getting to an optimal risk management framework for financial institutions, I thought I'd share a couple of these links.

Interesting reads from Dec 8

Interesting reads from the Net

1. Consumer credit declines for the 9th straight month - link

2. NY Fed remarks on lessons from the crisis - link

3. Credit/ Leverage and its role in creating financial crises over the years - link

I particularly liked this excerpt:
Long-run historical evidence therefore suggests that credit has an important role to play in central bank policy. Its exact role remains open to debate. After their recent misjudgements, central banks should clearly pay some attention to credit aggregates and not confine themselves simply to following targeting rules based on output and inflation.

4. Simpson's paradox always fascinates me. This example uses unemployment rate comparisons between today and the 1981 recession - link

4b. This response by Andrew Gelman talks about when each level of comparison is appropriate. The sub-group comparison is appropriate when the definitions of the sub-groups being compared between the two samples are robust and more apples-to-apples. The aggregate comparison is more appropriate when the definitions have not remained stable (which typically happens when the two samples are separated in time), so that a sub-group comparison is not necessarily apples-to-apples - link
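For anyone who has not seen Simpson's paradox in action, here is a small Python sketch of the effect described in item 4. The counts are entirely made up (they are not the actual unemployment figures from the linked piece) and the two education groups and period labels are hypothetical. They are chosen so that the unemployment rate is higher "now" in every sub-group, yet lower in aggregate, simply because the mix of the labor force has shifted toward the lower-unemployment group.

```python
# Hypothetical (labor_force, unemployed) counts, chosen only to show the
# effect. These are NOT the real figures from the linked article.
data = {
    "then": {"no_degree": (800, 96), "degree": (200, 8)},
    "now": {"no_degree": (400, 52), "degree": (600, 30)},
}


def subgroup_rates(period):
    """Unemployment rate for each sub-group in a period."""
    return {group: unemployed / labor_force
            for group, (labor_force, unemployed) in data[period].items()}


def aggregate_rate(period):
    """Unemployment rate for the whole labor force in a period."""
    total_force = sum(lf for lf, _ in data[period].values())
    total_unemployed = sum(u for _, u in data[period].values())
    return total_unemployed / total_force


for period in ("then", "now"):
    print(period, subgroup_rates(period), aggregate_rate(period))
```

Whether the sub-group view or the aggregate view is the "right" one is exactly the question Gelman's response addresses; the arithmetic alone cannot settle it.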

Holidaying in India and just beginning to recover from the sensory overload (of family, friends, food, the media, the general environment). Really looking forward to the remaining two weeks.