Saturday, February 2, 2013

Applications of Big Data Part 3: Creating summary tables from detailed transaction data


This is the next part in an ongoing series about the value of big data. The first post tried to draw a distinction between the value delivered by better analytics in general and by big data specifically. We talked about six specific applications of big data:
1. Reducing storage costs for historical data
2. Where significant batch processing is needed to create a single summarized record
3. When different types of data need to be combined, to create business insight
4. Where there are significant parallel processing needs
5. Where there is a need to have capital expenditure on hardware scale with requirements
6. Where there are significantly large data capture and storage needs, such as data being captured through automated sensors and transducers

In the previous post, I talked about how one of the applications of Big Data is better management of historical data. The low-cost storage and ready accessibility mean that historical data can be stored in a Hadoop cluster, eschewing legacy storage solutions like tape drives, which are less reliable and take longer to retrieve from.

In this post, I am going to talk about the business need to create a summarized data table from very large volumes of underlying event-level data. There are a large number of use cases where the need for a summarized record exists. The most basic is reporting and aiding management decision making. Reporting requires information to be summarized across multiple business segments at a weekly or monthly frequency. A good example is Personal Financial Management (PFM) systems, which classify credit card transactions and provide summaries by merchant category and by month. The individual credit card transactions would be stored as individual records across multiple storage units, and a MapReduce job would run as a batch program that summarizes this information.
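To make this concrete, here is a minimal sketch of what that batch summarization could look like as a Hadoop Streaming style mapper and reducer written in Python. The tab-separated input layout, the field names and the script names are illustrative assumptions, not a description of any particular PFM system.

```python
#!/usr/bin/env python
# mapper.py -- emits one (customer, merchant category, month) key per transaction.
# Assumes tab-separated transaction records: customer_id, timestamp, category, amount.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        continue  # skip malformed records
    customer_id, timestamp, category, amount = fields[0], fields[1], fields[2], fields[3]
    month = timestamp[:7]  # e.g. "2013-01" from an ISO date
    print("%s|%s|%s\t%s" % (customer_id, category, month, amount))
```

```python
#!/usr/bin/env python
# reducer.py -- sums spend for each (customer, merchant category, month) key.
# Hadoop Streaming delivers the mapper output sorted by key, so a running total works.
import sys

current_key, total = None, 0.0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print("%s\t%.2f" % (current_key, total))
        current_key, total = key, 0.0
    total += float(value)
if current_key is not None:
    print("%s\t%.2f" % (current_key, total))
```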

Another application is as segmentation or targeting variables in marketing and personalization campaigns. A third application, particularly relevant in the digital marketing and e-commerce world, is in recommender systems. Recommender systems make product suggestions based on a customer's profile and past behavior. Given that these recommendations need to be made in mere milliseconds in the middle of a marketing campaign, running through all available records to extract the relevant piece of profiling information is not technically feasible. What works better is a batch job running overnight that summarizes the information and creates a single record (with several hundred fields) for each customer. Read the IEEE Spectrum article "Deconstructing Recommender Systems" for a particularly good exposition on recommender systems.
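As an illustration of that overnight step, the fragment below (plain Python with toy data; the "spend_<category>_<month>" field naming is a made-up convention) pivots the per-category summaries into one wide record per customer, which is the shape a recommender can read in milliseconds at campaign time.

```python
# A minimal sketch of pivoting summary rows into a single wide record per customer.
from collections import defaultdict

# Output of the summarization job: (customer_id, merchant_category, month, total spend)
category_summaries = [
    ("cust_001", "groceries",   "2013-01", 412.50),
    ("cust_001", "restaurants", "2013-01", 180.25),
    ("cust_002", "travel",      "2013-01", 950.00),
]

customer_profiles = defaultdict(dict)  # one record per customer, many fields
for customer_id, category, month, total in category_summaries:
    customer_profiles[customer_id]["spend_%s_%s" % (category, month)] = total

# At campaign time the recommender reads a single precomputed record
# instead of scanning raw transactions.
print(customer_profiles["cust_001"])
# {'spend_groceries_2013-01': 412.5, 'spend_restaurants_2013-01': 180.25}
```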

Architecture of a data summarization application


So what would the data architecture of such a solution look like? Clearly, the Big Data portion of the overall stack, the transaction-level data, would reside in a Hadoop cluster. This gives the unbeatable combination of cheap storage and extremely fast processing (by virtue of the MapReduce capabilities). The relative shortcoming of this system, its inability to provide random access to an outside application reading the data, is irrelevant here, because the sole objective of the Hadoop cluster would be to ingest transaction-level data and convert it into summary attributes through a batch program.
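Launching that batch program nightly could look something like the sketch below, which kicks off the mapper and reducer from earlier as a Hadoop Streaming job. The jar location, HDFS paths and script names are placeholders that vary by cluster and Hadoop version, so treat this as an outline of the nightly job rather than a command for any specific installation.

```python
# Sketch of the nightly batch step: run the summarization as a Hadoop Streaming job.
import subprocess

subprocess.check_call([
    "hadoop", "jar", "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar",
    "-input",   "/data/transactions/2013/01/",  # raw transaction records in HDFS
    "-output",  "/data/summaries/2013/01/",     # summarized output
    "-mapper",  "mapper.py",
    "-reducer", "reducer.py",
    "-file",    "mapper.py",                    # ship the scripts with the job
    "-file",    "reducer.py",
])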

The summary table itself would need to be built on a traditional RDBMS platform, though NoSQL alternatives like MongoDB could also do the job. The need here is for fast random access for marketing applications, recommender systems and other consumers of the summary information.
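A minimal sketch of that serving side is below: the summary records land in an indexed relational table so downstream applications get fast keyed lookups by customer. sqlite3 stands in here for whatever RDBMS the summary table actually lives on, and the schema and column names are illustrative assumptions.

```python
# Load batch-produced summary rows into an indexed table and query by customer.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_summary (
        customer_id       TEXT,
        merchant_category TEXT,
        month             TEXT,
        total_spend       REAL
    )
""")
conn.execute("CREATE INDEX idx_customer ON customer_summary (customer_id)")

# Rows produced by the overnight batch job
conn.executemany(
    "INSERT INTO customer_summary VALUES (?, ?, ?, ?)",
    [
        ("cust_001", "groceries",   "2013-01", 412.50),
        ("cust_001", "restaurants", "2013-01", 180.25),
        ("cust_002", "travel",      "2013-01", 950.00),
    ],
)

# A marketing application or recommender hits the index, not the raw transactions
rows = conn.execute(
    "SELECT merchant_category, total_spend FROM customer_summary WHERE customer_id = ?",
    ("cust_001",),
).fetchall()
print(rows)  # [('groceries', 412.5), ('restaurants', 180.25)]
```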

So to summarize, Big Data lends itself extremely nicely to creating data tables that aggregate transaction-level data into entity-level information (the most common entity being the customer). The work itself is done through a batch job that can take advantage of MapReduce.

In my next post, I will start to touch upon how Big Data is ideally suited to processing different types of data, both structured and unstructured.
