Saturday, February 2, 2013

Applications of Big Data Part 3: Creating summary tables from detailed transaction data


This is the next part in an ongoing series about the value of big data. The first post tried to distinguish the value delivered by better analytics in general from the value delivered by big data specifically. We talked about six specific applications of big data:
1. Reducing storage costs for historical data
2. Where significant batch processing is needed to create a single summarized record
3. When different types of data need to be combined to create business insight
4. Where there are significant parallel processing needs
5. Where there is a need to have capital expenditure on hardware scale with requirements
6. Where there are significantly large data capture and storage needs, such as data being captured through automated sensors and transducers

In the previous post, I talked about how one application of Big Data is better management of historical data. Low-cost storage and ready accessibility mean that historical data can be stored in a Hadoop cluster, avoiding legacy storage solutions such as tape drives, which are less reliable and take longer to retrieve data from.

In this post, I am going to talk about the business need to create a summarized data table from a very large number of underlying event-level records. There are many use cases for such a summarized record. The most basic is reporting to aid management decision making. Reporting requires information to be summarized across multiple business segments at a weekly or monthly frequency. A good example is Personal Financial Management (PFM) systems, which classify credit card transactions and provide monthly summaries by merchant category. The individual credit card transactions would be stored as individual records spread across multiple storage units, and a MapReduce job would run as a batch program to summarize this information.
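To make the PFM example a little more concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps in plain Python. It is only an illustration of the pattern, not Hadoop code: the record layout, the sample transactions and the function names (`map_transaction`, `shuffle`, `reduce_amounts`, `summarize`) are all my own assumptions.

```python
from collections import defaultdict
from datetime import date

# Hypothetical event-level records:
# (customer_id, transaction_date, merchant_category, amount_in_cents)
TRANSACTIONS = [
    ("c1", date(2013, 1, 5), "grocery", 4250),
    ("c1", date(2013, 1, 19), "grocery", 1775),
    ("c1", date(2013, 1, 20), "fuel", 3000),
    ("c2", date(2013, 1, 7), "dining", 2500),
    ("c1", date(2013, 2, 1), "grocery", 1000),
]


def map_transaction(txn):
    """Map step: emit ((customer, month, category), amount) for one record."""
    customer_id, txn_date, category, amount_cents = txn
    month = txn_date.strftime("%Y-%m")
    yield (customer_id, month, category), amount_cents


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_amounts(key, amounts):
    """Reduce step: collapse all amounts for one key into a single summary row."""
    return key, {"txn_count": len(amounts), "total_cents": sum(amounts)}


def summarize(transactions):
    """Run map, shuffle and reduce in sequence over an iterable of records."""
    pairs = (pair for txn in transactions for pair in map_transaction(txn))
    return dict(reduce_amounts(key, values) for key, values in shuffle(pairs).items())


summary = summarize(TRANSACTIONS)
```

In a real cluster the map and reduce functions would run in parallel on many machines and the framework would handle the grouping, but the shape of the computation is the same: one output row per customer, month and merchant category.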

Another application is building segmentation or targeting variables for marketing and personalization campaigns. A third application, particularly relevant in digital marketing and e-commerce, is recommender systems. Recommender systems make product suggestions based on a customer's profile and past behavior. Because these recommendations need to be made in mere milliseconds in the middle of a marketing campaign, scanning all available records at request time to extract the relevant profiling information is not practical. It is better to have a batch job run overnight that summarizes the information and creates a single record (with several hundred fields) for each customer. The IEEE Spectrum article "Deconstructing recommender systems" gives a particularly good exposition of recommender systems.
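As a rough sketch of what "a single record for each customer" might look like, the Python below collapses event-level rows into one profile per customer. The `CustomerProfile` fields, the `build_profiles` function and the sample events are hypothetical and chosen only for illustration; a real profile would carry far more fields.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class CustomerProfile:
    """Hypothetical summary record; a real one could have hundreds of fields."""
    customer_id: str
    txn_count: int
    total_cents: int
    last_purchase: date
    top_category: str


def build_profiles(events):
    """Collapse event-level rows into exactly one profile per customer."""
    by_customer = defaultdict(list)
    for customer_id, txn_date, category, amount_cents in events:
        by_customer[customer_id].append((txn_date, category, amount_cents))

    profiles = {}
    for customer_id, rows in by_customer.items():
        spend_by_category = Counter()
        for _, category, amount_cents in rows:
            spend_by_category[category] += amount_cents
        # Highest spend wins; ties are broken alphabetically so the result is stable.
        top_category = min(
            spend_by_category, key=lambda c: (-spend_by_category[c], c)
        )
        profiles[customer_id] = CustomerProfile(
            customer_id=customer_id,
            txn_count=len(rows),
            total_cents=sum(amount for _, _, amount in rows),
            last_purchase=max(txn_date for txn_date, _, _ in rows),
            top_category=top_category,
        )
    return profiles


# Hypothetical events: (customer_id, date, merchant_category, amount_in_cents)
EVENTS = [
    ("c1", date(2013, 1, 5), "grocery", 4250),
    ("c1", date(2013, 1, 20), "fuel", 3000),
    ("c1", date(2013, 2, 1), "grocery", 1000),
    ("c2", date(2013, 1, 7), "dining", 2500),
]

profiles = build_profiles(EVENTS)
```

The point of the design is that the expensive pass over every event happens once, in the batch job, so that the recommender only ever has to fetch one small record per customer.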

Architecture of a data summarization application


So what would the data architecture of such a solution look like? The Big Data portion of the overall stack, the transaction-level data, would reside in a Hadoop cluster. This gives a combination of cheap storage and high batch-processing throughput (by virtue of the parallelism that MapReduce provides). The relative shortcoming of this system, its inability to provide fast random access to an outside application that reads the data, is irrelevant here, because the sole objective of the Hadoop cluster is to ingest transaction-level data and convert it into summary attributes through a batch program.

The summary table would need to be built on a traditional RDBMS platform, though NoSQL stores such as MongoDB could also do the job. The need here is for fast random access by marketing applications, recommender systems and other users of the summary information.
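The serving side can be sketched with Python's built-in `sqlite3` module standing in for the RDBMS. This is an assumption made purely to keep the example self-contained; the table name `customer_summary`, its columns and the sample rows are hypothetical, and a production system would use whatever database the organization already runs.

```python
import sqlite3

# Hypothetical output of the nightly batch job: one row per customer
# (customer_id, txn_count, total_cents, top_category).
SUMMARY_ROWS = [
    ("c1", 3, 8250, "grocery"),
    ("c2", 1, 2500, "dining"),
]


def load_summary(conn, rows):
    """Create the summary table if needed and load the batch output into it."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS customer_summary (
            customer_id  TEXT PRIMARY KEY,
            txn_count    INTEGER NOT NULL,
            total_cents  INTEGER NOT NULL,
            top_category TEXT NOT NULL
        )
        """
    )
    # INSERT OR REPLACE lets each nightly run overwrite the previous row.
    conn.executemany(
        "INSERT OR REPLACE INTO customer_summary VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()


def lookup(conn, customer_id):
    """Fetch one customer's summary by primary key; returns None if absent."""
    return conn.execute(
        "SELECT txn_count, total_cents, top_category "
        "FROM customer_summary WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()


conn = sqlite3.connect(":memory:")
load_summary(conn, SUMMARY_ROWS)
```

Because `customer_id` is the primary key, `lookup(conn, "c1")` is an indexed single-row read rather than a scan, which is the kind of random access the marketing and recommender applications need.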

So to summarize, Big Data lends itself very well to creating data tables that aggregate transaction-level data into entity-level information (the most common entity being a customer). The work itself would be done through a batch job that can take advantage of MapReduce.

In my next post, I will start to touch upon how Big Data is ideally suited to processing different types of data, both structured and unstructured.
