Friday, December 24, 2010

Data mining and the inevitable conflict with privacy issues

The explosion in the availability of data over the past decade, and the parallel explosion in analytical techniques to interpret this data and find patterns in it, has been a huge benefit for businesses and governments as well as for individual customers. Among businesses, the likes of Amazon.com, Harrah's, Target, Netflix and FedEx have made the analysis of large data sets their business model. These companies have come up with increasingly sophisticated and intricate ways of capturing data about customer behaviour and offering targeted products based on that behaviour.

Big government has been somewhat late to the game but is making big strides in the field of data mining. Increasingly, areas like law enforcement, counter-terrorism and anti-money laundering, and agencies like the IRS, have leveraged cutting-edge techniques to become more effective at what they do. Which is usually to detect the needle of criminal activity in the haystack of normal, law-abiding activity, and to take the appropriate preventive or retributive action.

But as the saying goes, there are two sides to every coin. While the explosion of data and its analysis has mostly been driven by good intentions, the consequences of some of this work are beginning to look increasingly murky. For example, if an individual's emails are monitored to identify money-laundering trails, where is the bright line between legitimate monitoring of criminal activity and unwanted intrusion into the activities of law-abiding citizens? The defense from those who do the monitoring has always been that only suspicious activities are targeted, and that they use sophisticated analytics to model these criminal activities.

But as any model builder worth his salt knows, an effective model is one that maximizes true positives AND minimizes false positives. The false positives in this case are the people who display similar so-called 'suspicious' behaviour but turn out to be innocent. How, then, can one build an effective model while being very exclusive about the data points that go into it (i.e., by only including behaviour that is understood to be suspicious)? In order to truly understand the false positives and attempt to reduce them, one HAS to include points in the model-build sample that are very likely to be false positives. And therein lies the paradox: to build a really good predictive system, the sample needs to be randomized to include good and bad outcomes, highly suspicious and borderline-innocent behaviours alike.
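To make the sampling paradox concrete, here is a minimal sketch on synthetic data. Everything in it is invented for illustration (the features, labels, sampling rates and a scikit-learn-style workflow are my assumptions, not any real surveillance system): a model trained only on "case files" never sees ordinary behaviour, so it cannot calibrate its false positives.

    # Toy illustration of the sampling point above: if the training sample
    # contains almost no ordinary, innocent behaviour, the model cannot
    # calibrate its false positives. All data is synthetic; this is a
    # hedged sketch, not anyone's real system.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix

    rng = np.random.default_rng(0)

    # Population: 5% "criminal" (label 1); behaviours overlap heavily.
    n = 20_000
    y = rng.binomial(1, 0.05, size=n)
    X = rng.normal(loc=y[:, None] * 0.8, size=(n, 3))

    # Sample A: case files only -- every known criminal, but just a
    # sliver of the law-abiding population (borderline cases missing).
    keep_a = (y == 1) | (rng.random(n) < 0.03)

    # Sample B: a randomized sample of everyone, suspicious or not.
    keep_b = rng.random(n) < 0.25

    for name, keep in [("case-files-only", keep_a), ("randomized", keep_b)]:
        model = LogisticRegression().fit(X[keep], y[keep])
        tn, fp, fn, tp = confusion_matrix(y, model.predict(X)).ravel()
        print(f"{name:16s} true positives: {tp:4d}  false positives: {fp:5d}")

The case-files-only model is trained on a sample where criminals are the majority, so when it is turned loose on the real population it typically flags far more innocent people; the randomized sample keeps the base rate, and hence the false positives, honest.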

I want to share two different perspectives on this issue. The first is from the MIT Technology Review, which extols the virtues of a data-driven law-enforcement system as practiced by the police department of Memphis, TN. The link is here. An excerpt from this article:
The predictive software, which is called Blue CRUSH (for "criminal reduction utilizing statistical history"), works by crunching crime and arrest data, then combining it with weather forecasts, economic indicators, and information on events such as paydays and concerts. The result is a series of crime patterns that indicate when and where trouble may be on the way. "It opens your eyes within the precinct," says Godwin. "You can literally know where to put officers on a street in a given time." The city's crime rate has dropped 30 percent since the department began using the software in 2005.

Memphis is one of a small but growing number of U.S. and U.K. police units that are turning to crime analytics software from IBM, SAS Institute, and other vendors. So far, they are reporting similar results. In Richmond, Virginia, the homicide rate dropped 32 percent in one year after the city installed its software in 2006.
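For a flavour of what this kind of pattern-crunching might look like in miniature, here is a hedged sketch on synthetic data. The feature set (precinct, hour, weather, payday and event flags) mirrors the article's description, but the model, names and numbers are purely illustrative assumptions, not Blue CRUSH's actual method or inputs.

    # Illustrative toy of the pattern described in the excerpt: regress
    # historical incident counts on place and time plus contextual
    # signals, then staff the highest-risk slots. Synthetic data only.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(1)

    # One row per (precinct, hour): precinct id, hour of day,
    # temperature, payday flag, nearby-event flag.
    n = 5_000
    X = np.column_stack([
        rng.integers(0, 10, n),      # precinct
        rng.integers(0, 24, n),      # hour of day
        rng.normal(60, 15, n),       # temperature (F)
        rng.binomial(1, 0.07, n),    # payday flag
        rng.binomial(1, 0.03, n),    # concert/event flag
    ])
    # Synthetic incident counts that loosely depend on the features.
    rate = 0.5 + 0.1 * (X[:, 1] > 18) + 0.8 * X[:, 3] + 1.2 * X[:, 4]
    y = rng.poisson(rate)

    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

    # Score a batch of upcoming precinct-hours; patrol the top five.
    upcoming = X[rng.choice(n, 200, replace=False)]
    risk = model.predict(upcoming)
    hot = upcoming[np.argsort(risk)[-5:]]
    print("Highest-risk precinct/hour slots:", hot[:, :2].tolist())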

Now read this other piece, which paints a slightly different picture of what is going on.
Suspicious Activity Report N03821 says a local law enforcement officer observed "a suspicious subject . . . taking photographs of the Orange County Sheriff Department Fire Boat and the Balboa Ferry with a cellular phone camera." ... noted that the subject next made a phone call, walked to his car and returned five minutes later to take more pictures. He was then met by another person, both of whom stood and "observed the boat traffic in the harbor." Next another adult with two small children joined them, and then they all boarded the ferry and crossed the channel.

All of this information was forwarded to the Los Angeles fusion center for further investigation after the local officer ran information about the vehicle and its owner through several crime databases and found nothing ... there are several paths a suspicious activity report can take:
The FBI could collect more information, find no connection to terrorism and mark the file closed, though leaving it in the database.
It could find a possible connection and turn it into a full-fledged case.
Or, as most often happens, it could make no specific determination, which would mean that Suspicious Activity Report N03821 would sit in limbo for as long as five years, during which time many other pieces of information could be added to the file ... employment, financial and residential histories; multiple phone numbers; audio files; video from the dashboard-mounted camera in the police cruiser at the harbor where he took pictures; anything else in government or commercial databases "that adds value".

This is from an insightful piece in the Washington Post titled "Monitoring America". The Post article goes on to describe the very same Memphis PD and asks some pointed questions about some of the data-gathering techniques it uses.
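The three paths the Post describes amount to a simple triage-and-retention rule. The sketch below is purely illustrative: the three outcomes and the five-year window come from the article, while the function and type names are my own hypothetical construction.

    # A purely illustrative rendering of the three dispositions the Post
    # describes for a suspicious activity report. Outcomes and the
    # five-year retention window come from the article; the names and
    # structure here are hypothetical, for clarity only.
    from dataclasses import dataclass, field
    from enum import Enum

    class Disposition(Enum):
        CLOSED_NO_NEXUS = "closed, but left in the database"
        FULL_CASE = "opened as a full-fledged case"
        UNDETERMINED = "no determination; retained up to 5 years"

    @dataclass
    class SuspiciousActivityReport:
        report_id: str
        attachments: list = field(default_factory=list)  # grows while in limbo

    def triage(report, terrorism_nexus):
        """terrorism_nexus: True, False, or None (no determination)."""
        if terrorism_nexus is False:
            return Disposition.CLOSED_NO_NEXUS
        if terrorism_nexus is True:
            return Disposition.FULL_CASE
        # Most common outcome per the article: the file sits in limbo,
        # accumulating records from government and commercial databases.
        return Disposition.UNDETERMINED

    print(triage(SuspiciousActivityReport("N03821"), terrorism_nexus=None))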


This is where the whole concept of capturing information at the individual level and using it for specific targeting enters unstable ground: when it takes place in an intrusive manner and without due consent from the individuals involved. When organizations do it, it can definitely be irritating and borderline creepy. When governments do it, it reminds one of George Orwell's "Big Brother". It will be interesting to see how the field of predictive analytics survives the privacy backlash that is just beginning.
