This is the next part in an ongoing series about the value of big data. The first post tried to draw a distinction between the value delivered by better analytics generally and by big data specifically. We talked about six specific applications of big data:
1. Reducing storage costs for historical data
2. Where significant batch processing is needed to create a single summarized record
3. When different types of data need to be combined to create business insight
4. Where there are significant parallel processing needs
5. Where there is a need to have capital expenditure on hardware scale with requirements
6. Where there are significantly large data capture and storage needs, such as data captured through automated sensors and transducers
In the previous post, I talked about how one of the applications of Big Data is better management of historical data. Low-cost storage and ready accessibility mean that historical data can be stored in a Hadoop cluster, eschewing legacy storage solutions like tape drives, which are less reliable and take longer to retrieve.
In this post, I am going to talk about the business need to create a summarized data table from very large volumes of underlying event-level data. There are a large number of use cases where a business needs to create summarized records. The most basic is reporting and aiding management decision making. Reporting requires information to be summarized across multiple business segments, typically at a weekly or monthly frequency. A good example is Personal Financial Management (PFM) systems, which classify credit card transactions and provide summaries by merchant category and by month. The individual credit card transactions would be stored as individual records across multiple storage units, and a MapReduce algorithm would run as a batch program to summarize this information, along the lines of the sketch below.
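As a minimal sketch of what that batch program might look like, here is a Hadoop Streaming style mapper and reducer in Python. The input schema (tab-separated customer id, merchant category, date and amount) is an assumption made for illustration, not a prescribed format.

```python
#!/usr/bin/env python3
# mapper.py -- one transaction record in, one key/value pair out.
# Assumed input: tab-separated customer_id, merchant_category,
# date (YYYY-MM-DD) and amount fields; this schema is illustrative.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        continue  # skip malformed records
    customer_id, category, date, amount = fields
    month = date[:7]  # keep YYYY-MM so summaries roll up monthly
    # The composite key makes Hadoop route every transaction for one
    # customer/category/month combination to the same reducer.
    print(f"{customer_id}|{category}|{month}\t{amount}")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums the amounts for each key. Hadoop Streaming hands
# the reducer its input sorted by key, so each key's values arrive as
# one contiguous run of lines.
import sys

current_key, total = None, 0.0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key and current_key is not None:
        print(f"{current_key}\t{total:.2f}")  # flush the finished key
        total = 0.0
    current_key = key
    total += float(value)
if current_key is not None:
    print(f"{current_key}\t{total:.2f}")
```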
Another application is in creating segmentation or targeting variables for marketing and personalization campaigns. A third application, particularly relevant in the digital marketing and e-commerce world, is in recommender systems. Recommender systems make product suggestions based on a customer's profile and past behavior. Given that these recommendations need to be made in mere milliseconds in the middle of a marketing campaign, running through all available records to extract the relevant piece of profiling information is not technically feasible. It is better to have a batch job run overnight that summarizes the information and creates a single record (with several hundred fields) for each customer, as sketched below. Read the IEEE Spectrum magazine article "Deconstructing recommender systems" for a particularly good exposition on recommender systems.
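To sketch that idea, the overnight batch job can pivot the per-category monthly summaries produced above into one wide row per customer. The derived field names (spend_grocery_2014_06 and so on) are made up for illustration.

```python
# wide_record.py -- pivots (customer|category|month -> total) summary
# rows into a single wide record per customer, ready for fast lookup
# at recommendation time. Field names are illustrative assumptions.
import csv
import sys
from collections import defaultdict

profiles = defaultdict(dict)  # customer_id -> {field_name: value}

for line in sys.stdin:  # reads the reducer output from the summary job
    key, total = line.rstrip("\n").split("\t")
    customer_id, category, month = key.split("|")
    # e.g. "spend_grocery_2014_06": one field per category/month pair
    field = f"spend_{category}_{month.replace('-', '_')}"
    profiles[customer_id][field] = float(total)

# Take the union of field names so every row has the same columns.
field_names = sorted({f for fields in profiles.values() for f in fields})
writer = csv.DictWriter(sys.stdout,
                        fieldnames=["customer_id"] + field_names,
                        restval=0.0)  # zero-fill categories a customer never used
writer.writeheader()
for customer_id, fields in profiles.items():
    writer.writerow({"customer_id": customer_id, **fields})
```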
Architecture of a data summarization application
So what would the data architecture of such a solution look like? Clearly, the Big Data portion of the overall stack, the transaction-level data, would reside in a Hadoop cluster. This gives the unbeatable combination of cheap storage and extremely fast processing time (by virtue of the MapReduce capabilities). The relative shortcoming of this system, the inability to provide random access to an outside application that reads the data, is irrelevant here, because the sole objective of the Hadoop cluster is to ingest transaction-level data and convert it into summary attributes through a batch program, for example along the lines below.
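As a hedged sketch of how that nightly batch step might be launched, here is a small driver that submits the mapper and reducer from earlier via Hadoop Streaming. The streaming jar location and HDFS paths are assumptions that vary by installation.

```python
# run_summary_job.py -- nightly driver that launches the summarization
# batch job on the Hadoop cluster via Hadoop Streaming.
# The jar path and HDFS directories below are assumptions; they differ
# between installations.
import subprocess

STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"  # assumed

subprocess.check_call([
    "hadoop", "jar", STREAMING_JAR,
    "-files", "mapper.py,reducer.py",    # ship the scripts to the cluster
    "-mapper", "mapper.py",
    "-reducer", "reducer.py",
    "-input", "/data/transactions",      # event-level records in HDFS
    "-output", "/data/customer_summary", # must not already exist
])
```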
The summary table itself would need to be built on a traditional RDBMS platform, though there are Big Data era alternatives, like MongoDB, that could do the job. The need here is for fast random access for marketing applications, recommender systems and other consumers of the summary information.
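To illustrate the serving side, here is a small sketch using SQLite (standing in for whichever relational store is chosen) that loads wide customer records and retrieves one customer's profile with a single indexed lookup. The table and column names are assumptions carried over from the earlier sketches.

```python
# serve_profiles.py -- loads the wide per-customer summary table into an
# RDBMS (SQLite here, standing in for any relational store) and serves
# single-customer lookups. Schema and names are illustrative.
import sqlite3

conn = sqlite3.connect("summary.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS customer_summary (
        customer_id TEXT PRIMARY KEY,  -- primary key gives indexed random access
        spend_grocery_2014_06 REAL,
        spend_travel_2014_06 REAL
        -- ...in practice, several hundred summary fields
    )
""")
conn.execute(
    "INSERT OR REPLACE INTO customer_summary VALUES (?, ?, ?)",
    ("cust_001", 412.50, 1280.00),  # sample row for illustration
)
conn.commit()

def get_profile(customer_id):
    """Millisecond-scale lookup of one customer's summary record."""
    cur = conn.execute(
        "SELECT * FROM customer_summary WHERE customer_id = ?",
        (customer_id,),
    )
    return cur.fetchone()

print(get_profile("cust_001"))
```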
So to summarize, Big Data lends itself extremely nicely to creating data tables that aggregate transaction-level data into entity-level information, the most common entity being the customer. The work itself would be done through a batch job that can take advantage of MapReduce.
In my next post, I will start to touch upon how Big Data is ideally suited to processing different types of data, both structured and unstructured.