Friday, December 24, 2010

Data mining and the inevitable conflict with privacy issues

The explosion in the availability of data over the past decade, together with the explosion in analytical techniques to interpret that data and find patterns in it, has been a huge benefit for businesses and governments as well as for individual customers. Amongst businesses, companies like Amazon.com, Harrah's, Target, Netflix and FedEx have made the analysis of large datasets central to their business models. These companies have come up with increasingly sophisticated and intricate ways of capturing data about customer behaviour and offering targeted products based on that behaviour.

Big government has been somewhat late to the game but is making big strides in the field of data mining. Increasingly, agencies in areas like law enforcement, counter-terrorism, anti-money laundering and tax enforcement (the IRS) have leveraged cutting-edge techniques to become more effective at what they do, which is usually to detect the needle of criminal activity in the haystack of normal, law-abiding activity, and to take the appropriate preventive or punitive action.

But as the saying goes, there are two sides to every coin. While the explosion of data and its analysis has mostly been driven by good intentions, the consequences of some of this work are beginning to look increasingly murky. For example, if an individual's emails are placed under surveillance to identify money-laundering trails, where is the bright line between legitimate monitoring of criminal activity and unwanted intrusion into the activities of law-abiding citizens? The defence from those who do the monitoring has always been that only suspicious activities are targeted, and that they use sophisticated analytics to model these criminal activities. But as any model builder worth his salt knows, an effective model is one that maximizes true positives AND minimizes false positives. The false positives in this case are people who display similar so-called 'suspicious' behaviour but turn out to be innocent. How, then, can one build an effective model while being very exclusive about the data points used to build it (i.e. by only including behaviour that is understood to be suspicious)? In order to truly understand the false positives and attempt to reduce them, one HAS to include points in the model-build sample that are very likely to be false positives. And therein lies the paradox. To build a really good predictive system, the sample needs to be randomized to include good and bad outcomes, highly suspicious and borderline innocent behaviours.
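To make that paradox concrete, here is a minimal sketch, using purely synthetic numbers and scikit-learn as an assumed tool (not anything used by an actual agency), of why the build sample has to contain the innocent-but-suspicious-looking cases: they are the only way to measure, and then reduce, the false positive rate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# One hypothetical "suspicion score" feature per person.
criminals  = rng.normal(loc=3.0, scale=1.0, size=200)    # the needles we want to catch
innocents  = rng.normal(loc=0.0, scale=1.0, size=9800)   # the law-abiding haystack
borderline = rng.normal(loc=2.5, scale=1.0, size=500)    # innocent, but 'suspicious-looking'

X = np.concatenate([criminals, innocents, borderline]).reshape(-1, 1)
y = np.concatenate([np.ones(200), np.zeros(9800), np.zeros(500)])

model = LogisticRegression().fit(X, y)
flagged = model.predict_proba(X)[:, 1] > 0.2   # a deliberately aggressive threshold

true_positive_rate  = np.sum(flagged & (y == 1)) / np.sum(y == 1)
false_positive_rate = np.sum(flagged & (y == 0)) / np.sum(y == 0)
print(f"true positive rate:  {true_positive_rate:.2f}")
print(f"false positive rate: {false_positive_rate:.4f}")

# Drop the 'borderline' group from the build sample and refit, and the
# false positive rate the model reports will look deceptively low;
# that is exactly the paradox described above.
```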

I want to share two different perspectives on this issue. The first is from the MIT Technology Review, which extols the virtues of a data-driven law enforcement system as practiced by the police department of Memphis, TN. The link is here. An excerpt from this article:
The predictive software, which is called Blue CRUSH (for "criminal reduction utilizing statistical history"), works by crunching crime and arrest data, then combining it with weather forecasts, economic indicators, and information on events such as paydays and concerts. The result is a series of crime patterns that indicate when and where trouble may be on the way. "It opens your eyes within the precinct," says Godwin. "You can literally know where to put officers on a street in a given time." The city's crime rate has dropped 30 percent since the department began using the software in 2005.

Memphis is one of a small but growing number of U.S. and U.K. police units that are turning to crime analytics software from IBM, SAS Institute, and other vendors. So far, they are reporting similar results. In Richmond, Virginia, the homicide rate dropped 32 percent in one year after the city installed its software in 2006.

Now read this other piece, which paints a slightly different picture of what is going on.
Suspicious Activity Report N03821 says a local law enforcement officer observed "a suspicious subject . . . taking photographs of the Orange County Sheriff Department Fire Boat and the Balboa Ferry with a cellular phone camera." ... noted that the subject next made a phone call, walked to his car and returned five minutes later to take more pictures. He was then met by another person, both of whom stood and "observed the boat traffic in the harbor." Next another adult with two small children joined them, and then they all boarded the ferry and crossed the channel.

All of this information was forwarded to the Los Angeles fusion center for further investigation after the local officer ran information about the vehicle and its owner through several crime databases and found nothing ... there are several paths a suspicious activity report can take:
The FBI could collect more information, find no connection to terrorism and mark the file closed, though leaving it in the database.
It could find a possible connection and turn it into a full-fledged case.
Or, as most often happens, it could make no specific determination, which would mean that Suspicious Activity Report N03821 would sit in limbo for as long as five years, during which time many other pieces of information could be added to the file ... employment, financial and residential histories; multiple phone numbers; audio files; video from the dashboard-mounted camera in the police cruiser at the harbor where he took pictures; anything else in government or commercial databases "that adds value".

This is from an insightful piece in the Washington Post titled "Monitoring America". The Post article goes on to describe the very same Memphis PD and asks some pointed questions about some of the data gathering techniques used.


This is where the whole concept of capturing information at the individual level and using it for specific targeting enters unstable ground: when it takes place in an intrusive manner and without due consent from the individuals. When organizations do it, it can definitely be irritating and borderline creepy. When governments do it, it reminds one of George Orwell's Big Brother. It will be interesting to see how the field of predictive analytics survives the privacy backlash that is just beginning.

Saturday, December 18, 2010

Visualization of the data and animation - part II

I had written a piece earlier about Hans Rosling's animation of country-level data using the Gapminder tool. Here are some more extremely cool examples of data animation.

At the start of this series, there is more animation from the Joy of Stats program that Rosling hosted on the BBC. The landing page links to a clip that shows crime data being plotted for downtown San Francisco, and how this visual overlay on the city's topography provides some valuable insights on where one might expect to find crime. This is a valuable tool for police departments (to try and prevent crime that is local to an area and has some element of predictability), residents (to research neighbourhoods before they buy property, for example) and tourists (who might want to double-check a part of the city before deciding on a really attractive Priceline.com hotel deal). The researchers who created this tool, which maps the crime data onto the city's geography, talk in the clip about how tools such as this can be used to improve citizen power and government accountability. Another good example of crime data, this time reported by police departments across the US, can be found here. Finally, towards the end of the clip, the researchers go on to mention what could be the Holy Grail of this kind of visualization: real-time data put up on social media and networking sites like Facebook and Twitter (geo-tagged, perhaps) providing a live feed into these maps. This would certainly have been in the realm of science fiction only a few years back, but suddenly it doesn't seem as impossible.
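For readers who want to experiment, here is a rough sketch of that kind of crime-map overlay. The file name incidents.csv and its columns (lat, lon) are hypothetical stand-ins for whatever geo-tagged incident feed a city publishes; this is not the researchers' actual tool.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical input: one row per reported incident, with coordinates.
incidents = pd.read_csv("incidents.csv")   # expected columns: lat, lon

fig, ax = plt.subplots(figsize=(8, 8))
# A 2-D density (hexbin) makes the hot spots stand out far more clearly
# than a raw scatter of thousands of individual points.
hb = ax.hexbin(incidents["lon"], incidents["lat"], gridsize=60, cmap="Reds")
fig.colorbar(hb, label="incident count")
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
ax.set_title("Reported incidents, density by location")
plt.show()
```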

The San Francisco crime mapping link has a few other really impressive videos as you scroll further down. I really like the one on Florence Nightingale, whose graphs during the Crimean War helped reveal important insights into how injuries and deaths were occurring in hospitals. It is interesting to learn that the Lady with the Lamp was renowned not just for tending to the sick, but was also a keen student of statistics. Her graphs separating deaths that were accidental, those caused by war wounds, and finally those that were preventable (caused by the poor hygiene that was quite prevalent at the time) created very powerful imagery of the high incidence of preventable deaths and the need to address this area with the right focus.

Why are visualization and animation of data so helpful, and such critical tools in the arsenal of any serious data scientist? For a few reasons.

For one, it helps tell a story far better than equations or tables of data do. That is essential for conveying the message to people who are not necessarily experts with insight into the tables, but who are nevertheless important influencers and stakeholders who need to be educated on the subject. Think of how an advertisement (either a picture or a moving image) is more powerful at conveying the strength of a brand than boring old text.
The other reason, in my opinion, is that graphical depiction and visualization of the data allows the human brain (which is far more powerful than any computer at pattern recognition) to take over the part of data analysis that it is really good at and computers generally are not: forming hypotheses on the fly about the data being displayed, reaching conclusions based on visual patterns, and hooking into remote memory banks within our brains to form linkages. While machine learning and AI are admirable goals, there is still some way to go before computers can match the sheer ingenuity and flexibility of thought that the human brain possesses.

Sunday, December 12, 2010

Thinking statistically – and why that’s so difficult

I came across this piece from a few months back by the Wired magazine writer Clive Thompson, on "Why we should learn the language of data". The article is one of a stream of recent articles in the popular media about how data-driven applications are changing our world. The New York Times has had quite a few pieces on this topic recently.

Clive Thompson calls out how the language of data and statistics is going to be transformational for the world going forward, and why it needs to be a core part of general education. Thompson also calls out why thinking about data trends or statistics is hard. It is hard because it is not something that the intuitive wiring of the human brain readily recognizes or appreciates. The human psyche, with its fight-or-flight instincts, reacts well to big, dramatic events and badly to subtle trends. We are not fundamentally good at a number of things that good decision making calls for, such as being open to both supporting and refuting evidence, not confusing correlation with causality, factoring in uncertainty, and estimating rare events.
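Estimating rare events is a good example of where intuition breaks down. A small worked example using Bayes' rule (the numbers are made up for illustration) shows that even a detector that is 99% accurate in both directions is wrong about 99 times out of 100 when the thing it looks for occurs once in ten thousand cases:

```python
# Illustrative numbers only: a rare event screened by a very accurate test.
prevalence  = 1 / 10_000   # how common the rare event actually is
sensitivity = 0.99         # P(flagged | event)        - true positive rate
specificity = 0.99         # P(not flagged | no event) - true negative rate

# Total probability of being flagged, then Bayes' rule.
p_flag = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_event_given_flag = sensitivity * prevalence / p_flag

print(f"P(event | flagged) = {p_event_given_flag:.3%}")   # roughly 1%
```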

Most of the applications where a data-driven insight has changed the world in any meaningful way have been driven by private enterprise, and these changes have been somewhat incremental in nature. Data-driven insight has allowed companies to recommend movies to interested subscribers, position goods in stores more effectively, distribute at lower cost, price tickets to ensure maximum returns, and so on. In other words, these changes may have been game-changing for specific industries but not necessarily for the human race at large.

Numbers can have greater power than just impacting a few industries at a time, one would think. Given the sheer amount of data being produced in the world today, and the rate at which both computing power and bandwidth continue to grow, we ought to have seen a much more wide-ranging impact from data-driven analysis. We should have been firmly down the road to making progress on combating global warming and diseases like heart disease, diabetes and cancer. Government agencies, which are a really big part of the modern economy, have not been as successful at driving this form of data-driven innovation. Why is that?

This probably has to do with a fundamental lack of understanding of numbers and statistics amongst the population at large. The places in the world where a lot of the data gathering and processing is happening, i.e. the Western world, are also the places where an education in science and math is somewhat undervalued relative to studies like liberal arts, media and law. That is where the emerging economies of the world have an edge. The study of math, science and engineering has always been appropriately valued in countries like India, China and the other emerging Asian giants. Now, as these countries also begin to generate, process and store data, their math- and science-educated talent will be champing at the bit to get into the data and harness its potential. Data has rightly been called another factor of production, like labour, capital and land. It is an irony of the world today that those who have data within easy reach are less inclined to use it.

Friday, December 10, 2010

Swarm Intelligence, Ant Colony Optimizations – advances in analytic computing

Advances in computing have led to some new and interesting developments in modeling techniques. This post gives some examples of these kinds of techniques, but before that, a small primer on basic modeling. Most of the commonly used models are generalized linear models. As the name suggests, these models try to establish a more-or-less linear relationship between the quantity being predicted and the inputs. Ultimately, the model-fitting problem is an optimization problem: an attempt to use a generalized curve to represent the data while minimizing the gap between the actual data and the approximate representation of it produced by the model.
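As a toy illustration of that idea, here is a short sketch (with synthetic data) of ordinary least squares, the workhorse behind many generalized linear models, posed exactly as that optimization: pick the coefficients that minimize the squared gap between the observed data and the fitted line.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 1.0 + rng.normal(scale=2.0, size=100)   # a true line plus noise

# Design matrix with an intercept column; np.linalg.lstsq solves
# argmin over beta of ||X @ beta - y||^2 in closed form.
X = np.column_stack([np.ones_like(x), x])
beta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)

print(f"fitted intercept: {beta[0]:.2f}, fitted slope: {beta[1]:.2f}")
print(f"sum of squared errors: {residuals[0]:.1f}")
```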

Of course, optimization problems present themselves in a number of areas. One is model fitting, but others arise in areas like planning and logistics, an example being the ever-popular travelling salesman problem. One of the more recent and interesting techniques for solving optimization problems is Ant Colony Optimization (ACO). ACO is part of a family of more generic AI/machine learning tools called swarm intelligence. Wikipedia defines swarm intelligence as follows:
Swarm intelligence (SI) is the collective behaviour of decentralized, self-organized systems, natural or artificial…. SI systems are typically made up of a population of simple agents or bodies interacting locally with one another and with their environment. The agents follow very simple rules, and although there is no centralized control structure dictating how individual agents should behave, local, and to a certain degree random, interactions between such agents lead to the emergence of "intelligent" global behavior, unknown to the individual agents.

The ACO algorithm tries to mimic the behaviour of ants in search of food. When ants forage, every ant involved in the foraging process moves out of the colony in a random direction to search for food. When a food source is located, the ant lays down a scent trail of pheromones as it brings the food back to the colony. Other ants then begin to use the trail left behind by the first ant to make further excursions to the food source and bring back more food. Also, by the very nature of the pheromone trail (which is a volatile chemical and therefore evaporates after a certain amount of time), the tendency of later ants is to follow the more recent and fresher trails, which, logically speaking, should also be the shortest ones.
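Here is a compact, illustrative sketch of the idea applied to a tiny travelling salesman instance. The city coordinates and all parameters (number of ants, evaporation rate, the alpha/beta weights) are arbitrary choices for demonstration, not values from any of the applications mentioned below.

```python
import numpy as np

rng = np.random.default_rng(7)
cities = rng.uniform(0, 100, size=(12, 2))            # 12 random city locations
n = len(cities)
dist = np.linalg.norm(cities[:, None, :] - cities[None, :, :], axis=2)
np.fill_diagonal(dist, np.inf)                        # never "travel" to the same city

pheromone = np.ones((n, n))
alpha, beta = 1.0, 3.0        # weight of pheromone vs. closeness ("visibility")
evaporation = 0.5
n_ants, n_iters = 20, 100
best_tour, best_len = None, np.inf

def tour_length(tour):
    return sum(dist[tour[i], tour[(i + 1) % n]] for i in range(n))

for _ in range(n_iters):
    tours = []
    for _ant in range(n_ants):
        tour = [rng.integers(n)]                      # each ant starts at a random city
        unvisited = set(range(n)) - {tour[0]}
        while unvisited:
            current = tour[-1]
            choices = np.array(sorted(unvisited))
            # Desirability of each next city combines trail strength and closeness.
            weights = (pheromone[current, choices] ** alpha) * ((1.0 / dist[current, choices]) ** beta)
            probs = weights / weights.sum()
            nxt = int(rng.choice(choices, p=probs))
            tour.append(nxt)
            unvisited.remove(nxt)
        tours.append(tour)

    # Evaporation: old trails fade, so fresher (and typically shorter) routes dominate.
    pheromone *= (1 - evaporation)
    for tour in tours:
        length = tour_length(tour)
        if length < best_len:
            best_tour, best_len = tour, length
        for i in range(n):
            a, b = tour[i], tour[(i + 1) % n]
            pheromone[a, b] += 1.0 / length           # shorter tours deposit more pheromone
            pheromone[b, a] += 1.0 / length

print(f"best tour length found: {best_len:.1f}")
```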

One of the more interesting business applications has indeed been in the area of material movement, i.e. logistics. The Italian pasta maker Barilla, as well as Migros, the Swiss supermarket chain, have been using these techniques to optimize their distribution networks and routes. A more technical paper about the technique is available here. A more layman-friendly treatment appeared recently in the Economist and was also an interesting read.
