Wednesday, December 28, 2011

Car Insurance savings and too-clever marketing

A quick rant post.

I have been reflecting on GEICO, Progressive and others claiming that you can save a lot of money (15%, or some number of dollars) by switching to their company. It is a highly deceptive form of advertising, and here's why.

First, the marketing message taken at face value implies causation: switch to company X and you will save money. In reality, the sequence of events is the opposite. People typically shop for a quote, and when they find a quote that saves them money over what they currently have, they switch. So it is likely that for every person who switches, there are one or more people who don't switch, either because they don't save any money or because they save too little for it to be worth the hassle. To say "switch and you will save money" is therefore somewhat disingenuous. Only some people save money with company X, and they are the ones who switch.

The second part of the deception is the dollar amount of the savings. This information is typically gathered by surveying customers who have switched. Why is this deceptive? Because a number of behavioral economics studies have shown that we human beings tend to rationalize. We tend to give ourselves more credit than is justifiable. This manifests itself in a number of ways, such as most people thinking they are above-average drivers, people over-estimating the investment returns they make, and so on. So when a customer has made what they believe is an extremely smart decision to switch, they are also likely to over-estimate the savings they have realized, because they are proud of the switch decision they just made. It is therefore very likely that the reported savings number is inflated to some extent.

So "save x% by switching to GEICO" is actually a smart ploy to get people to ask for a GEICO quote. In an extremely crowded marketplace, it doesn't hurt at all to get one. But promising savings in the language these companies use doesn't seem very above board.

Monday, December 12, 2011

Great Recession - A new theory linked to productivity improvement

I wrote a couple of years back on what has come to be known as the Great Recession of the twenty-first century. I remarked that the recession appeared to show no signs of abating, and recent events seem to have borne that out. While GDP growth in the US is in positive territory, it barely is. And the problems in Europe, along with a couple of natural disasters affecting Asia (the earthquake in Japan and the flooding in Thailand), have put the brakes on the emerging-markets engine that has been pulling the world economy along for the last four years.

In the meantime, a number of well-argued articles and books have been written about the genesis of the crisis, and they have largely focused on the financial sector, the US mortgage market and the excesses there. The Nobel Prize-winning economist Joseph Stiglitz approaches the issue from a slightly different angle in a recent write-up in Vanity Fair. Stiglitz argues that the Great Recession has its roots in something more benign than mortgages gone toxic: the productivity increases of the last two decades, which have made job categories employing very large portions of the labor force essentially redundant in the economy. What is interesting about this theory is that, Stiglitz argues, this is exactly what happened in the lead-up to the Great Depression. The productivity improvements now are in manufacturing and services; the productivity improvement then was in agriculture. To quote: "In 1900, it took a large portion of the U.S. population to produce enough food for the country as a whole. Then came a revolution in agriculture that would gain pace throughout the century—better seeds, better fertilizer, better farming practices, along with widespread mechanization. Today, 2 percent of Americans produce more food than we can consume."

An extremely interesting article and a forcefully made argument on the cause of the crisis and what could be done to solve it.

Saturday, December 10, 2011

Computing based on the human brain - the answer to Big Data?

A slight detour from my usual subjects around predictive analytics. I came across a recent article that is prescient about the direction of modeling and predictive analytics in general: the move away from the current model of computer design, based on the famous von Neumann architecture, to something much more similar to what computing, modeling and decision making are ultimately meant to emulate, viz. the human brain.

IBM Watson - Super-computer or energy hog?
First, some background. Computer architecture has consistently followed the classic von Neumann design. Without getting into too many details, the architecture boils down to a separate processing unit (known variously as the CPU, ALU or microprocessor) and a separate memory unit, connected by a communication channel called a bus. This architecture has served computing well over the past 50 years and has brought the computer within reach of every single human being on Earth. The fact that 2-year-old toddlers are extremely adept with the Apple iPad is testimony to the success of the von Neumann model. After all, nothing succeeds like success. Even as processor chips have become more advanced and started incorporating their own internal memory module (called cache memory), the von Neumann architecture has been faithfully replicated. But successful doesn't mean ideal, or optimal, or even efficient. The burn of the laptop on my thigh as I type this post is an indication that the current computing model, while successful, is also extremely power-hungry. The IBM Watson machine, famous for playing and beating human opponents at Jeopardy, is also famous for consuming 4,000 times the power of its human competitors. The human brain functions on about 20 watts of power, while Watson consumes more than 85,000 watts. And all that Watson can do is play Jeopardy. The human brain can do a lot more, like writing, recognizing patterns, expressing and feeling emotion, negotiating traffic, even designing computers!

So what might a more efficient model look like? Well, it looks a little more like the human brain. The human brain handles both logical problem solving and thinking as well as memory through one element of computing infrastructure, so to speak: the neuron, interconnected through synapses. And that is the model being pursued by IBM in collaboration with Cornell, Columbia, the University of Wisconsin and the University of California, Merced. The project is also funded by DARPA, and more details can be found at the link at the start of this post. The big a-ha moment, according to the project director and IBM computer scientist Dharmendra Modha (in the middle of a vacation, no less), was to drive the brain-inspired computing project through the fundamental design of the processor chip, i.e. the hardware, rather than through software. To quote some details from the New York Times article by Steve Lohr,
The prototype chip has 256 neuron-like nodes, surrounded by more than 262,000 synaptic memory modules. That is impressive, until one considers that the human brain is estimated to house up to 100 billion neurons. In the Almaden research lab, a computer running the chip has learned to play the primitive video game Pong, correctly moving an on-screen paddle to hit a bouncing cursor. It can also recognize numbers 1 through 10 written by a person on a digital pad — most of the time.

Why is this relevant to predictive analytics?
What is a mention of this project doing in a predictive analytics blog? It has to do with Big Data. Online, mobile, geo-spatial and RFID technologies are creating streams of data in amounts that would have been impossible to conceptualize even a decade back. As the availability of data increases and the power of conventional computing and storage infrastructure gets overwhelmed, we will have to rely on a distributed memory and computing set-up that is more similar to the human brain. A space worth watching.

Thursday, December 8, 2011

Tesco Clubcard - Metrics and Success Factors

Getting back to this subject after a really long break. In the first part on this subject, we reviewed Tesco’s loyalty program and the types of business decisions aided by the Clubcard. The Tesco Crucible database maintains information about:
1. Customer demographics
2. Detailed shopping history
3. Purchase tastes, frequency, habits and behaviours
4. Other individual level indicators obtained from public sources

Tesco then uses this information for a number of business benefits such as:
1. Loyalty
2. Cross Sells
3. More optimal inventory and store network planning
4. Optimal targeting and marketing of manufacturer’s promotions
5. Generating customer insights and marketing those insights

The link to the previous article that details these points is here.

So what else goes into making this program successful?

Metrics
One important factor is the set of metrics Tesco uses to measure success. There are primarily two. The first is the change in share of wallet. Based on the demographic information collected, Tesco has an estimate of the total spend of a household, and from that estimate a share of wallet can be computed using Tesco sales. This is of course an estimated measure, but given the right assumptions, not a particularly ambitious estimate to make.

(The key here is to make sure the estimates are generated in an unbiased manner. An estimated metric is always prone to manipulation. For instance, a small increase in unit sales can be projected as a larger increase in share of wallet by manipulating the projected overall spend. This problem can be avoided if the estimation is done by an independent group that is incentivised to get its estimates right rather than to maximize reported sales volume. This is the role of the Decision Sciences groups found in many organizations.)
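
For concreteness, here is a minimal sketch of how such a share-of-wallet estimate could be computed. The column names and figures are hypothetical, not Tesco's actual data model; the point is simply that the observed spend sits in the numerator and the independently modelled spend in the denominator.

    import pandas as pd

    # Hypothetical inputs: Tesco spend observed at the till, and a total
    # household spend estimated independently from demographics.
    households = pd.DataFrame({
        "household_id": [101, 102, 103],
        "tesco_spend": [320.0, 80.0, 450.0],
        "estimated_total_spend": [800.0, 600.0, 500.0],
    })

    # Share of wallet = observed Tesco spend / estimated total spend,
    # capped at 1.0 to guard against an under-estimated denominator.
    households["share_of_wallet"] = (
        households["tesco_spend"] / households["estimated_total_spend"]
    ).clip(upper=1.0)

    print(households[["household_id", "share_of_wallet"]])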

A related measure of share of wallet is the number of purchase categories into which Tesco has penetrated. Remember that Tesco is present in many purchase categories such as groceries, apparel, durables, banking products, vacation packages, insurance, auto sales, pharmacy products, gas, etc. Effectiveness of the Tesco brand is realized when the customer begins to use Tesco for multiple product categories. So that is a useful metric to track, both as an indication of overall profitability as well as marketing and cross-sell effectiveness.

The second main metric is pure customer behaviour from a frequency standpoint. How is the company changing the frequency of customer visits, and what sorts of visits is it getting from them? With the wide use of smartphones and the tracking capabilities inherent in these phones, it is now possible to gather a lot of spatial and temporal information: Which store? How long was the visit? At what time of the day or week?
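
A minimal sketch of how such behavioural metrics might be summarised from a visit log (entirely made-up data and field names):

    import pandas as pd

    # Hypothetical visit log: one row per basket, with store and timestamp.
    visits = pd.DataFrame({
        "customer_id": [1, 1, 2, 2, 2],
        "store_id": ["A", "A", "B", "A", "B"],
        "visit_ts": pd.to_datetime([
            "2011-11-01 18:05", "2011-11-08 19:10",
            "2011-11-02 09:30", "2011-11-15 12:00", "2011-11-29 09:45",
        ]),
    })
    visits["hour"] = visits["visit_ts"].dt.hour

    # Per-customer frequency plus simple spatial and temporal summaries.
    summary = visits.groupby("customer_id").agg(
        visits_in_month=("visit_ts", "size"),
        distinct_stores=("store_id", "nunique"),
        typical_hour=("hour", "median"),
    )
    print(summary)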

Other Success factors
No company can maintain sustained growth and profitability purely on the strength of analytics without addressing the human face of the analytics - in other words, the customer service aspect. Tesco management was careful to convey to store staff that the Clubcard program was an important value-add for customers, and hence an inherent part of customer service - that it wasn't fundamentally manipulative. This was done through a communication program rolled out across all stores that involved every store employee of Tesco.

The other critical success factor was management vision. Many organizations tend to see these programs as cost drivers and strive to minimize cost while maximizing customer satisfaction - often conflicting goals. But Tesco management was clear about the ultimate goal of the Clubcard, which is to drive loyalty. What also helped was a breadth of vision that allowed for multiple revenue streams from the Clubcard program that were not directly related to the core idea of giving back to customers and loyalty benefits.

Another philosophy that the Tesco management employed fairly successfully was test-and-learn. Most of the major improvements and enhancements were first piloted in smaller stores. Extremely rigorous measurement mechanisms were then employed to make sure that the right inferences were drawn from the test.

Overall, the key realization was that the Clubcard program is not just an electronic sales promotion; rather, the entire business has to be re-engineered to be customer-insight led.

In my final piece, I will touch on the bottom-line impact and the top-line benefits that came from the Clubcard program.

Sunday, August 21, 2011

Analyzing Tesco - the analytics behind a top-notch loyalty program

My specific interest within predictive analytics has been as much about the technology and the data mining techniques that can be applied to the data as about the business value that can be extracted from it. With this second interest in mind, I am going to embark on a series of a different kind of blog post.

Instead of mostly talking about theory, I am going to share examples of how companies are using the power of analytics to know their customers better, anticipate their needs and ultimately become more profitable. One of the beacons in this space, about which many a volume has been written, is Tesco, the UK (and now increasingly international) retailing giant. What I am going to cover in this piece is the Tesco loyalty card: how it works and the different ways in which a retailer can take advantage of the information base created by the card to generate economic value.

First, some background. Tesco hired the marketing firm dunnhumby to develop a new loyalty program to enable it to grow in the UK market. The Clubcard launched in 1995 to nearly instant success, as Tesco enjoyed a large increase in customer loyalty and retention. Within the first five years, sales had risen over 50%.

The structure of the card, and how the data is collected
The data gathering for the loyalty program starts with a typical application form, which asks for some basic demographic information such as address, age, gender, the number of members in a household and their ages, and dietary habits. Against this basic information, purchase history is appended. This includes the goods shopped for, as well as information such as visit history, both to stores and online.

Next, a number of summary attributes are computed. These include share-of-wallet information and information on the frequency and duration of visits, as well as information on customer preferences and tastes, as determined by some clever cluster analysis based on the purchase history of specific fast-moving products. See this link for a review of a book describing the Tesco loyalty program called “Scoring Points”.
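
As a rough illustration of the kind of cluster analysis that could sit behind such taste attributes, here is a toy sketch - synthetic spend shares and arbitrary parameters, not the actual Clubcard methodology:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Hypothetical per-customer spend shares across a few telling categories
    # (e.g. organic, ready meals, baby products, premium wine).
    rng = np.random.default_rng(0)
    spend_shares = rng.dirichlet(alpha=[2, 2, 1, 1], size=500)

    # Standardise and cluster into a handful of "taste" segments.
    X = StandardScaler().fit_transform(spend_shares)
    segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

    # Each customer now carries a segment label that can be appended to the
    # demographic and purchase-history attributes described above.
    print(np.bincount(segments))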

Tesco realized that better information leads to better results and created Crucible—a massive database of not only applicant information and purchase history, but also information purchased and collected elsewhere about participating consumers. Credit reports, loan applications, magazine subscription lists, Office for National Statistics, and the Land Registry are all sources of additional information that is stored in Crucible.

To summarize, Tesco maintains information about:
1. Customer demographics
2. Detailed shopping history
3. Purchase tastes, frequency, habits and behaviours
4. Other individual level indicators obtained from public sources

Creating this database is an undertaking in itself. Many organizations realize the value of such detailed data and are able to spend the resources to get it; however, they do such a bad job of integrating the data and making it available to analysts that only a fraction of the power in the data is realized.

Technology challenges
What were some of the challenges faced from a technology standpoint? To start with, one of scale - specifically, how to scale up from an analytical lab setting to servicing 10 million customers. In the words of Clive Humby of dunnhumby, “we're very pragmatic, so to begin with, we worked on a sample of data. We'll find the patterns in a sample, and then look for that pattern amongst everybody, rather than just trying to find it in this huge data warehouse.”
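
In spirit, that "find the pattern in a sample, then look for it amongst everybody" workflow resembles the sketch below. The data and model are synthetic stand-ins, not dunnhumby's actual pipeline:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)

    # Stand-in for the full customer base (millions of rows in reality).
    X_full = rng.normal(size=(200_000, 5))
    y_full = (X_full[:, 0] + 0.5 * X_full[:, 1] + rng.normal(size=200_000) > 0).astype(int)

    # Step 1: learn the pattern on a manageable sample.
    sample_idx = rng.choice(len(X_full), size=10_000, replace=False)
    model = LogisticRegression().fit(X_full[sample_idx], y_full[sample_idx])

    # Step 2: apply (score) the learned pattern across the whole base.
    scores = model.predict_proba(X_full)[:, 1]
    print(scores[:5])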

Humby has revealed some interesting insights in this interview.

Tesco uses a hybrid mix of technologies: Oracle as the main data warehouse engine, SAS for the actual modeling, and White Cross and Sand Technology as the analytic engine for applying the learnings to larger volumes of data. Additionally, the technology group used a number of home-developed technologies and algorithms. This is a nice blog written by Nick Lansley about the technology choices made by Tesco - with some filtering, of course.


And finally, the business value or the economic benefits
1. Loyalty
The first clear benefit is customer loyalty and the increased spend that comes from a customer moving most of their purchases to Tesco. The loyalty program incentivizes customers to steer a greater share of their monthly grocery spend to Tesco, which in turn explains the increase in Tesco's share of the total UK market from about 15-20% to about 30% between 1995 and about 2005. This is a clear objective of any loyalty program, and Tesco delivers on it brilliantly. Tesco does this by offering vouchers on associated products - so if a family is buying infant formula, it is quite a straightforward decision to offer them discounts on diapers and get the customer to move that part of their purchases to Tesco as well.

2. Cross-sells
The most immediate extension beyond increasing spend within one product category is cross-selling across product families. An example (building on the previous one) would be marketing a college-fund financial product to a family that has recently started buying infant food and diapers. The way Tesco would do this, I would imagine, is to have a family- or customer-level flag for “has small children” or something of the sort. An alternative would be to sell Disney Cruises to a family with small children. In this case, Tesco would not only collect a channel fee from Disney for selling its cruises through the Tesco site but also a premium for being demonstrably targeted in its marketing.

3. Inventory, distribution and store network planning
The first two applications are about knowing consumer needs better and targeting available products and services more effectively. The next benefit from this data is in materials movement. By getting a precise handle on demand, and particularly by anticipating demand spikes in response to promotions, the company can do an effective job of demand planning and manage the distribution pipeline efficiently from the manufacturing points to the distribution centers.

Also, the demographic (customer self-reported) and public information appended to the customer-level database creates a basis for inventory planning. Say Tesco wants to open a store in a region where a large number of families with young children reside: it becomes possible to anticipate the demand for baby products at that branch and stock up accordingly.

4. Optimal targeting and use of manufacturer promotions
Another area of value for Tesco is the optimal use of manufacturers’ promotions, whether direct purchase discounts or one-for-many type schemes. At the outset, it might appear that retailers like Tesco would love manufacturers’ coupons and rebates. Who wouldn’t like greater foot traffic and purchase activity from a scheme whose cost is borne entirely by the manufacturer? In reality, though, things are never that simple. Retailers don't really want to run too many promotions, because managing promotions (displays, new labeling, frequent restocking, possible overstocking and the cost of damaged or expired inventory) is very labor intensive and also adds to supply-chain costs.

So one of the areas Tesco specializes in is promotion optimization: given the hundreds of promotions available at any given point in time, which 25-30 to pick and what price to negotiate with the manufacturer. The optimization is based on the factors below (a simplified sketch of this kind of selection follows the list):
- The cost of running the promotion including inventory costs and labor costs
- Local geography based factors - what kind of customers shop at a local store and what are their unique preferences
- Ensuring there’s something for everyone - ensuring every customer has a fair chance of getting a few promotional offers, given their typical purchase behaviour
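
The sketch below is a greatly simplified illustration of that kind of selection - a greedy pick over made-up candidates, not Tesco's actual optimisation:

    # Hypothetical promotion candidates with estimated uplift and handling cost.
    promotions = [
        {"name": "2-for-1 cereal", "uplift": 120_000, "handling_cost": 40_000, "segment": "families"},
        {"name": "premium wine",   "uplift": 90_000,  "handling_cost": 20_000, "segment": "upmarket"},
        {"name": "baby wipes",     "uplift": 60_000,  "handling_cost": 10_000, "segment": "families"},
        {"name": "ready meals",    "uplift": 70_000,  "handling_cost": 30_000, "segment": "singles"},
    ]

    MAX_PROMOS = 2      # shelf-space and labour constraint
    MIN_SEGMENTS = 2    # "something for everyone" constraint

    # Greedy pick by net value while trying to cover distinct customer segments.
    ranked = sorted(promotions, key=lambda p: p["uplift"] - p["handling_cost"], reverse=True)
    chosen, covered = [], set()
    for p in ranked:
        if len(chosen) == MAX_PROMOS:
            break
        if len(covered) < MIN_SEGMENTS and p["segment"] in covered:
            continue  # hold a slot open for an uncovered segment
        chosen.append(p)
        covered.add(p["segment"])

    print([p["name"] for p in chosen])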

5. Consumer insight generation and marketing those insights
A final area of economic value for Tesco is gleaning higher-level customer insights that other entities would be interested in. For example, Procter & Gamble would be EXTREMELY interested in knowing how households of different sizes and at different points on the economic spectrum buy and use laundry detergent, how that use changes with the seasons and over time, and what the propensity is for such customers to buy and use related products such as, say, fabric softeners.

Given Tesco’s vantage point and its detailed view of what a customer’s purchases really look like, it becomes relatively easy for Tesco to glean such insights from the data and sell the information to interested parties. This is another source of economic value for Tesco.

This post is getting really long, so let me stop here and summarize. We just discussed the types of data gathered by a top-notch loyalty program like Tesco’s, and the sources of economic value from that data. In my next posts, I will talk about the potential value of such a program for Tesco and its comparable costs. What have been some of the unique and, honestly, hard-to-replicate factors that have helped Tesco succeed in this space? What have been some of the competitive responses, and how is this area evolving in the emerging SOcial, LOcal, MObile (or SoLoMo) world?

A set of interesting links about Tesco's loyalty program.
http://www.guardian.co.uk/lifeandstyle/2003/jul/19/shopping.features
http://www.customerthink.com/interview/clive_humby_tesco_shines_at_loyalty
http://blog.ouseful.info/2008/11/06/the-tesco-data-business-notes-on-scoring-points/
http://techfortesco.blogspot.com/

Wednesday, August 3, 2011

Tips for data mining - part 4 out of 4

My labor of love which started nearly seven months back is finally drawing to a close. In previous pieces, I have talked about some of the lessons I have learned in the field of data mining. The first two pieces of advice which were covered in this post were
1. Define the problem and the design of the solution
2. Establish how the tool you are building is going to be used

The next pieces were covered in this post and they were
3. Frame the approach before jumping to the actual technical solution
4. Understand the data

In the third post in this epic story (and it has really started feeling like an epic, even though it has just been three medium length posts so far), I covered:
5. Beware the "hammer looking for a nail"
6. Validate your solution

Now, based on everything I have talked about so far, you actually go and get some data and build a predictive model. The model seems to be working exceptionally well and showing high goodness-of-fit with the data. And there you have the seventh lesson about data mining, which is

7. Beware the "smoking gun"
Or, when something seems too good to be true, it probably is not true. When the model is working so well that it seems to be answering every question being asked, there is usually something insidious going on - the model is not really predicting anything, but just transferring the input straight through to the output. It could be that a field that is another representation of the target variable has been used as a predictor. Let's take an example. Say we are trying to predict the likelihood that a person is going to close their cellphone plan - in business parlance, the likelihood that the customer is going to attrite. Also, say one of the predictors used is whether someone called the service cancellation queue through customer service. By using "called the service cancellation queue" as a predictor, we are in effect using the outcome of the call (service cancellation) as both a predictor and the target variable. Of course the model is going to slope extremely nicely and rank everyone who met the service-cancellation-queue condition as most likely to attrite. This is a spurious model - not merely a bad or overfit one. Not understanding the available predictors (or rather, not paying attention to the way the data is collected) and not justifying why each one is selected as a predictor is the most common reason spurious models get built. So when you see something too good to be true, watch out.
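
A minimal synthetic illustration of this "smoking gun" - the field names and numbers are invented, but the pattern (near-perfect performance once the leaky field enters the model) is the tell-tale sign to look for:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(42)
    n = 5_000

    # Genuine behavioural predictors (e.g. usage drop, complaints, tenure).
    X_behaviour = rng.normal(size=(n, 3))
    attrited = (X_behaviour[:, 0] + rng.normal(size=n) > 1.2).astype(int)

    # Leaky predictor: "called the cancellation queue" is essentially
    # another recording of the outcome itself.
    called_cancel_queue = (attrited & (rng.random(n) < 0.95)).astype(int)

    honest = cross_val_score(LogisticRegression(), X_behaviour, attrited,
                             scoring="roc_auc").mean()
    leaky = cross_val_score(LogisticRegression(),
                            np.column_stack([X_behaviour, called_cancel_queue]),
                            attrited, scoring="roc_auc").mean()

    print(f"honest model AUC: {honest:.3f}")   # decent but believable
    print(f"leaky model AUC:  {leaky:.3f}")    # suspiciously close to 1.0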

8. Establish the value upside and generate buy-in
Now let's say you manage to avoid the spurious-model trap and actually build a good model: one that is not overspecified, that has been independently validated, and that uses a set of predictors tested for quality and well understood by the modeler (you). Now the model should be translated into business value in order to get the support of the business stakeholders who are going to use the model or who will need to support its deployment. A good understanding of the economics of the underlying business is required to value the greater predictive capability the model affords. It is usually not too difficult to come up with this value estimate, although it might seem like an extra step at the end of a long and arduous model build. But it is a critically important step to get right. Hard-nosed business customers are not likely to be impressed by the technical strengths of the model - they will want to know how it adds business value, whether by increasing revenue, reducing costs or decreasing exposure to unexpected losses or risk.
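
A back-of-envelope sketch of such a value estimate, using an attrition model as the example - every number here is a hypothetical assumption that would come from the business in practice:

    # All inputs are illustrative assumptions, not real figures.
    customers_scored = 1_000_000
    baseline_attrition_rate = 0.04     # annual attrition
    share_caught_by_model = 0.30       # attriters flagged early enough to act on
    save_rate_of_intervention = 0.25   # flagged customers retained when contacted
    value_per_retained_customer = 300.0
    customers_contacted = 60_000
    cost_per_contact = 5.0

    retained = (customers_scored * baseline_attrition_rate
                * share_caught_by_model * save_rate_of_intervention)
    benefit = retained * value_per_retained_customer
    cost = customers_contacted * cost_per_contact

    print(f"customers retained: {retained:,.0f}")
    print(f"net annual value:   {benefit - cost:,.0f}")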

So, there. A summary of all that I have learned in the last 4-5 years of being very close to predictive analytics and data mining. It was enjoyable writing this down (even if it took seven months), and I hope the aspiring data scientist gets at least a fraction of the enjoyment from reading it that I had in writing it.

Monday, August 1, 2011

Good documentation about data - a must for credible analytics

One of the cardinal principles of predictive analytics is that you are only as good as the data that you use to build your analysis. However, another important principle is that the data handling processes also have to be well managed and generally free of error.

Recently, a set of incidents came to light that illustrates the damage that can be caused by indifferent data handling processes. This was in the field of cancer research, which points to the human cost of some of these mistakes. One of the popular recent techniques in cancer research for the analysis of gene-level data is microarray analysis. A primer on what this analysis involves can be found here.

Duke University cancer researchers promised some revolutionary new treatments for cancer. But when patients actually enrolled in trials, the results were disappointing. Then the truth came out: the analysis was done wrong, and the reported findings resulted from some elementary errors in data handling by the researchers. Two researchers, Baggerly and Coombes, who had to literally reverse-engineer the analytical approach used, concluded that some simple errors had led to the wrong conclusions.

A few takeaways for a data scientist:
1. Data handling scripts and processes need to be checked and double-checked. Dual validation is a well-known technique, also known as a parallel run. The idea is to have two independent sets of analysts or systems process the same input data and to confirm that the outputs match (a minimal sketch of such a comparison follows these takeaways).

2. Data handling needs to be well documented. The approach used to arrive at a set of significant findings can never be shrouded in mystery, whether intentionally or because of sloppy documentation. At best, it gives the appearance of slipshod and careless work; at worst, it looks like deliberate deception. Neither impression is a good one to make.
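
Here is a minimal sketch of the parallel-run idea from the first takeaway: two independently written pipelines process the same raw extract, and the run passes only if their outputs agree. The data and processing steps are invented for illustration.

    import pandas as pd

    def pipeline_a(raw: pd.DataFrame) -> pd.DataFrame:
        # Analyst/system A's implementation of the agreed processing spec.
        out = raw.dropna(subset=["expression"])
        return out.groupby("gene", as_index=False)["expression"].mean()

    def pipeline_b(raw: pd.DataFrame) -> pd.DataFrame:
        # Analyst/system B's independent implementation of the same spec.
        out = raw[raw["expression"].notna()]
        return out.groupby("gene", as_index=False).agg(expression=("expression", "mean"))

    raw = pd.DataFrame({
        "gene": ["TP53", "TP53", "BRCA1", "BRCA1", "EGFR"],
        "expression": [2.1, 2.3, 0.9, None, 1.4],
    })

    a = pipeline_a(raw).sort_values("gene").reset_index(drop=True)
    b = pipeline_b(raw).sort_values("gene").reset_index(drop=True)

    # The parallel run passes only if the two outputs match exactly.
    pd.testing.assert_frame_equal(a, b)
    print("parallel run passed: outputs match")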

A summary presentation from Baggerly and Coombes about this issue can be found here.

Monday, July 4, 2011

Errata: Simple vs Complex models

The comment I attributed to Schumpeter in my last post on simple vs. complex models actually belongs to E. F. Schumacher, the author of "Small is Beautiful". I have that book lined up on my library reading list.

To summarize why I like simpler models,
1. More interpretable: Particularly important when there is data overload happening and one isn't really sure what is signal and what is noise
2. Easier to maintain and update as part of a production system
3. Likely a truer representation of the world: Going back to good old Occam's Razor principles. At least in a way that lends itself to meaningful decision making


Sunday, June 26, 2011

Simple vs complex models

I came across the subject of simple models (or crude models - somehow I didn't like the word crude, it sounded... well, crude) versus more complicated models in this very interesting blog by John D Cook called "The Endeavour". The link to the article is here. There was a good discussion of the pros and cons of simple and complex models, so I thought I'd add some of my own thoughts on the matter.

First, in terms of definitions: we are talking about reduced-form predictive analytics models here. Simple or crude models are ones that use a small number of predictors and interactions between them; complex models use many more predictors and get closer to the line between a perfectly specified and an overspecified model. John Cook makes the article come alive with an interesting quote from Schumpeter: "… there is an awful temptation to squeeze the lemon until it is dry and to present a picture of the future which through its very precision and verisimilitude carries conviction."


Simple models have their benefits and uses. They are usually quicker to build, easier to implement, and easier to interpret and update. I particularly like the easier-to-implement and easier-to-interpret/update bits. I have seldom come across a model that was so good and so reliable that it needed no interpretation or updating. The fact of the matter is that any model captures some of the peculiarities of the training data set used to build it, and is therefore, by definition, somewhat over-specified for that dataset. There is never a sample that is a perfect microcosm of the world - if there were, it wouldn't be a sample at all; it would be almost the entire world it is supposed to represent. So any sample, and therefore any model built off it, is going to have biases. A model builder would be well served to understand and mitigate those biases and build an understanding that is more robust and less cute.
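
A small sketch of that point, with synthetic data and arbitrary model choices: the complex model chases the quirks of the sample and, out of sample, often does no better than the simple one.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(7)
    n = 200

    # One real (linear) signal plus plenty of noise.
    X = rng.uniform(-3, 3, size=(n, 1))
    y = 1.5 * X[:, 0] + rng.normal(scale=2.0, size=n)

    simple = LinearRegression()
    complex_model = make_pipeline(PolynomialFeatures(degree=12), LinearRegression())

    # Out-of-sample R^2 via cross-validation.
    for name, model in [("simple", simple), ("complex", complex_model)]:
        score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
        print(f"{name:8s} cv R^2: {score:.3f}")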

Also, the implementation of the model should be straightforward. Model complexity should not lead to implementation headaches whose resolution ends up costing a significant portion of the purported benefits from the model.

Another reason I prefer simpler models is their relative transparency when it comes to their ultimate use. Models invariably get applied in contexts very different from what they were designed for. They are frequently scored on different populations (i.e., different from the training set) and used to make predictions and decisions that are again far removed from their original intent. In those situations, I much prefer having the ability to understand what the model is saying and why, and then to apply the corrections that my experience and intuition suggest - versus relying on a model that is such a "black box" that it is impossible to understand, which leads to the very dangerous train of thought that says "if it is so complex, it must be right".

Friday, June 17, 2011

Tips for data mining - Part 3 out of 4


I had written two pieces early on in the year about data mining tips. These talked about the first four tips to keep in mind while undertaking any data-mining project.
1. Define the problem and the design of the solution
2. Establish how the tool you are building is going to be used
3. Frame the approach before jumping to the actual technical solution
4. Understand the data

The links for the first two parts can be found here and here. Now let me talk about the fifth and sixth tips I have learned.

5. Beware the "hammer looking for a nail"
This lesson basically recommends that you make sure you are using the appropriate complexity and sophistication of analytical solution for the problem at hand. It is very easy to get excited about one particular analytical technique and try to apply it to every single problem you come across. But approaching a business problem like a "hammer looking for a nail" creates a set of issues. One, application of the technique becomes more important than understanding the problem. When that happens, the desire to implement the technique successfully becomes more important than solving the problem itself. Two, the solution can reach a level of complexity that the problem does not really need - on many occasions, simple solutions work best.

6. Validate your solution
One of the most common mistakes a data miner can make when confronted with a problem is to produce an overfit model. An overfit model is an overspecified model - many of the relationships between the predictor and target variables implied by the model are not real; they are an artifact of the dataset used to build the model. The problem with overfit models is that they tend to fail spectacularly when applied to a different situation. It is therefore crucial to do out-of-sample validation of the model. If the model does not validate well on the validation sample, it usually means an overspecified model. (Holdouts from the original build dataset - the typical two-thirds/one-third split between the build and validation datasets - don't really amount to an independent validation.) The model might need to be simplified. One way to do this is to examine all the relationships between the predictor variables and the target variable and make sure they are sensible and believable. Another way is to keep only linear relationships in the model. To be clear, this is often a gross oversimplification - but it is sometimes better than an overspecified model that is unusable.
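
A minimal sketch of this kind of out-of-sample check, with synthetic data: a flexible model is built on two-thirds of the data and scored on the held-out third, and a large gap between build and holdout performance is the signature of an overspecified model (and, as the parenthetical above notes, an out-of-time or otherwise independent sample makes the check stronger still).

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(3)
    n = 4_000
    X = rng.normal(size=(n, 20))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

    # Two-thirds build, one-third holdout.
    X_build, X_hold, y_build, y_hold = train_test_split(
        X, y, test_size=1 / 3, random_state=0
    )

    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_build, y_build)

    auc_build = roc_auc_score(y_build, model.predict_proba(X_build)[:, 1])
    auc_hold = roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])

    # A large build-versus-holdout gap suggests the model needs simplifying.
    print(f"build AUC:   {auc_build:.3f}")
    print(f"holdout AUC: {auc_hold:.3f}")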

So those were tips #5 and #6. I will soon close out with the last two tips. Thanks for your patience with my slow pace of writing on this. This is the link to Tom Breur's tips on building predictive models. A lot of good ideas there as well.

Wednesday, June 1, 2011

Principal Components - the math behind it

A really delightful tutorial on the mathematical basis for Principal Components Analysis. It really clarified a lot of the basics for me.

Monday, May 30, 2011

Interesting uses of Principal Components Analysis

I had shared a link on Principal Components Analysis a while back and have had the opportunity to revisit this space, or rather visit it in a professional capacity recently.

As a part of my interest, I came across a few interesting links on this subject - one of the better tutorials is here. The primary purpose behind PCA is dimensionality reduction, to make analysis more efficient. Typical applications have been in the area of image processing, though of late there has been a lot of interest in applying these techniques to microarray data.

Some of the historical applications of PCA have been in the field of statistical process control, or SPC. The genesis of the application came from the chemical industry, and the early practitioners were, interestingly, known as chemometricians. The aim was to model plant yield as a function of its input parameters. The input parameters were typically the temperatures and pressures at different points in the reactor vessel, recorded at different points in time. Since the plant operator has control over some of these parameters, they can be varied in order to improve the plant yield.

One complication is the sheer dimensionality of the data involved. When processes have hundreds of inputs (temperature, pressure, gradients, energy released, moisture content - all captured by hundreds of sensors embedded within the reactor), it becomes difficult to build any meaningful models given the limited number of observations available. What helps is that many of these input variables are highly correlated: the temperature at the entry point of the reactor feed is obviously going to be correlated with the temperature at the center of the reactor vessel. PCA can be used to reduce the dimensionality of the inputs and to model the outputs as a function of the principal components rather than the raw input variables. PCA, by definition, reduces the hundreds of correlated inputs to a few principal components (typically 3 or 4) that are linear combinations of those raw inputs. The other application here is the monitoring of these reactions. When the operator runs different reactions with different input parameters, it is important to identify 'outliers' - runs where the inputs have been so far from the norm that the outputs need to be flagged or, in some cases, discarded altogether. Some more details on the application of these techniques can be found here; the link goes to a paper on principal component techniques by Robert Rodriguez of SAS.
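
A small sketch of this idea with simulated sensor data (all numbers and dimensions made up): a few latent process conditions drive many correlated sensor readings, the readings are compressed to a handful of principal components, yield is modelled on those components, and runs whose component scores sit far from the norm are flagged.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    n_runs, n_sensors = 300, 100

    # A few latent process conditions drive many correlated sensor readings.
    latent = rng.normal(size=(n_runs, 3))
    loadings = rng.normal(size=(3, n_sensors))
    sensors = latent @ loadings + 0.1 * rng.normal(size=(n_runs, n_sensors))
    yield_ = latent @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=n_runs)

    # Reduce 100 correlated inputs to a few components, then model yield on them.
    X = StandardScaler().fit_transform(sensors)
    pca = PCA(n_components=4).fit(X)
    scores = pca.transform(X)
    model = LinearRegression().fit(scores, yield_)
    print("R^2 of yield on components:", round(model.score(scores, yield_), 3))

    # Monitoring: runs whose standardised component scores are far from the
    # norm are candidate outliers to flag or discard.
    distance = np.sqrt(((scores / scores.std(axis=0)) ** 2).sum(axis=1))
    print("flagged runs:", np.where(distance > 3.5)[0])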

These applications can be extended to other areas as well. In consumer behaviour modeling, PCA can be used to reduce the hundreds of different inputs about a consumer to the essential principal components and these can then be used to simplify the modeling and monitoring processes.

Wednesday, May 18, 2011

The Heritage Prize

The latest in data mining competitions is the Heritage Prize. If you haven't heard about it before, it is a competition to bring predictive analytics to the health-care business. The reward: a cool $3 million. This is the next phase of predictive analytics - corporations willing to pay good money for great analytical work.

Saturday, March 5, 2011

Tips for data mining - Part 2 out of many

Writing after a long time on the blog. Blame it on regular writer's cramp - a marked reluctance and inertia of sorts to put pen to paper, or rather fingers to keyboard.

My last post introduced the idea of defining the problem as the first step for any data mining exercise aiming to achieve success. This is the ability to state the problem you are trying to solve in terms of business outcomes that are measurable. After that comes the step of envisioning the solution and expressing it in a really simple form. The aim should be to create a path from input to output - the output being a set of decisions that will ultimately result in the measurable business outcomes we mentioned above. The next step involves establishing how the developed solution would be used in the real world. Not doing this early enough or with enough clarity could result in the creation of a library curio. Defining how the solution will be used will also point to other needs such as training the users on the right way to use the solution, the expected skills from the end users and so on.

In this post, we will discuss the third and fourth steps. These are
3. Frame the approach before jumping to the actual technical solution
4. Understand the data

Frame the approach before jumping to the actual technical solution. Once the business problem has been defined, it is tempting to point the closest tool at hand at the data and start hacking away. But oftentimes the most obvious answer is not necessarily the right answer. It is valuable to construct the nuts and bolts of the approach on a whiteboard or a sheet of paper before getting started. Taking the example of some recent text-mining work I have been involved in, one of the important steps was to create an industry-specific lexicon or dictionary. While creating a comprehensive dictionary is often tedious and dull work, this step is an important building block for any such data mining effort and hence deserves upfront attention. We wouldn't have seen the value of this step but for the exercise of comprehensively thinking through the solution. This is also where prototyping with sandbox tools like Excel or JMP (the "lite" statistical software from the SAS stable) becomes extremely valuable. Framing the approach in detail allows the data miner to budget for all the small steps along the way that are critical to a successful solution. It also enables putting something tangible in front of decision makers and stakeholders, which can be invaluable in getting their buy-in and sponsorship for the solution.

Understand the data. This is such an obvious step that it has almost become a cliche; having said that, incomplete understanding of the data continues to be the most common reason data mining projects falter in fulfilling their potential and solving the business goal. Some of the data checks - distributions, variable attributes like means, standard deviations and missing rates - are quite obvious, but I want to call out a couple of critical steps that might be somewhat non-obvious. The first is to focus extensively on data visualization, or exploratory data analysis. I have written a few pieces on data visualization before, which can be found here. Another good example of this type of visualization is the Junk Charts blog. The second is to track data lineage - in other words, where did the data come from, how was it gathered, and is it going to be gathered in the same way going forward? This step is important for understanding whether there have been biases in the historical data. There could be coverage bias, or responder bias where people are invited or requested to provide information. In both cases, the analytical reads are usually specific to the data collected and cannot be easily extrapolated to non-responders or to people outside the coverage of the historical data.
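
A minimal sketch of the obvious checks plus a crude lineage/drift check - hypothetical fields and thresholds, meant only to show the shape of the step:

    import numpy as np
    import pandas as pd

    # Hypothetical extract to profile before any modelling.
    df = pd.DataFrame({
        "age": [34, 45, np.nan, 29, 61],
        "income": [52_000, 61_000, 48_000, np.nan, 75_000],
        "channel": ["web", "branch", "web", "web", "phone"],
    })

    # Distributions and missing rates.
    print(df.describe(include="all"))
    print("missing rates:\n", df.isna().mean())

    # Crude lineage check: compare a key field against the previous extract's
    # profile; a large shift suggests the data is being gathered differently.
    prior_mean_income = 58_000
    current_mean_income = df["income"].mean()
    if abs(current_mean_income - prior_mean_income) / prior_mean_income > 0.10:
        print("income distribution shifted - investigate how the data was gathered")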

This covers the background work that needs to take place before the solution build can be taken up in earnest. In the next few posts, I will share some thoughts on the things to keep in mind while building out the actual data mining solution.

Saturday, January 15, 2011

Tips for data mining - part 1 out of many!

Having spent a good part of the last four years in mining data and working in general in the business/ predictive analytics areas, I thought I'd take a step back and summarize some of my lessons learnt through data mining. I was inspired to do this based on a very revealing article by Tom Breur, principal at XLT Consulting. More on Tom and his writings later.

So what have I worked on these past years? As a management consultant in my previous life, working with data - and tons of it - was a given. Using data extensively, building business analytics models that aim to replicate real-world processes, and establishing objective criteria for decisions are all bread-and-butter ways of problem solving in the consulting world. In fact, I'd go as far as to say that consultants seriously suffer from a lack of self-confidence when they consider out-of-the-box solutions that do not include any or all of the above. My consulting experience was mainly in the area of consumer goods distribution and marketing. I then moved to retail financial services, where the main areas of experience have been credit risk modeling and consumer cash flow modeling - modeling how much of these events (credit risk, cash flow) are driven by internal factors and how much by exogenous occurrences - as well as modeling consumer response to marketing products. A recent foray has been into text mining unstructured responses from applicants.

So what have I learnt? I will try to summarize it in a few posts, with potential reader fatigue in due consideration. At a summary level, these are the eight steps to data mining salvation.
1. Define the problem and the design of the solution
2. Establish how the tool you are building is going to be used
3. Frame the approach before jumping to the actual technical solution
4. Understand the data
5. Beware the "hammer looking for a nail"
6. Validate your solution
7. Beware the "smoking gun"
8. Establish the value upside and generate buy-in

I will tackle each of these steps in some detail now.

1. Define the problem and the design of the solution
This is the first step: define the problem that you are really trying to solve. The key to defining the problem well is to frame it in terms of measurable business outcomes.

Complete the following sentence: "Solving this problem will lead to an x% increase in sales, a y% decrease in costs, an improvement in efficiency and speed by a factor of z," etc. If this sentence does not flow easily, then I am afraid you have not spent enough time defining the overall problem.

Understand the context surrounding the problem: why has it been difficult to crack over the years, where is the data going to come from, where has the data come from in the past, and are there going to be changes in how it will be available in the future? Speak to others who have taken a crack at the problem and get their views on where the constraints lie.

Once you have defined the problem well enough, envision what the solution is going to look like. Which parts of the solution are pure process excellence, and where do you need advanced analytics? That way, the place for the analytical piece of the solution (the part the reader of this blog is primarily interested in) is clearly established. The aim should be to create a simple block diagram of how one goes from input (usually data) to output (ideally, a set of decisions), and all the pieces that come in between.

2. Establish how the tool you are building is going to be used
In the previous step, the problem was defined and the solution scoped at a high level. The next step is to put some detail into how the analytical solution is actually going to be used. Will the model be used mainly for understanding or exploratory analysis, with the results implemented through simple decisioning rules or heuristics? Or is the plan to use the model in a "live" production environment for decision making? It is important to get good answers, or at least good likely answers, to these questions because they play a very important role in determining the actual tools that will be used to build the solution, the checks and audits that need to be put into the overall system of decision making, the process for overrides, and the infrastructure and technology needed to make the solution effective. The people aspect also needs to be considered at this step: the conditions of use will determine the type of user training that needs to be provided, the skills that need to be ensured among end users, and how much of those skills can be imparted by on-the-job training versus being entry conditions for the job.

More about all of this in subsequent posts.

Thursday, January 13, 2011

The Joy of Stats is finally online

My first post of 2011. I have been writing this blog for nearly two years now and am happy to still have the energy and enthusiasm to keep at it. As I have said before about why I blog, this is a way for me to keep abreast of the latest developments in the fields of data mining, analytics and visualization.

The Joy of Stats program aired on BBC4 in December 2010, and the video is now available on Hans Rosling's Gapminder website. This is the program that highlighted the data visualization examples used for mapping San Francisco crimes, the graphics made by Florence Nightingale, and the Gapminder visualization of the economic and demographic statistics of different countries over the last 200 years.

Another example of nifty graphics. The New York Times has been a trendsetter in putting up very clever and informative graphics supporting its news stories. Amanda Cox of the NYT graphics department did a presentation recently on some of the examples that the NYT has used in its print as well as its online media. This is a long presentation but worth sitting through.

Hopefully you will enjoy both these presentations!
