Big Data Analytics: 2013

Saturday, February 2, 2013

Applications of Big Data Part 3: Creating summary tables from detailed transaction data

This is the next part in an ongoing series about the value of big data. The first post tried to draw a distinction between the value delivered by better analytics generally and the value delivered by big data specifically. We talked about 6 specific applications of big data:

1. Reducing storage costs for historical data

2. Where significant batch processing is needed to create a single summarized record

3. When different types of data need to be combined, to create business insight

4. Where there are significant parallel processing needs

5. Where there is a need to have capital expenditure on hardware scale with requirements

6. Where there are significantly large data capture and storage needs, such as data being captured through automated sensors and transducers

In the previous post, I talked about how one of the applications of Big Data is better management of historical data. Low-cost storage and ready accessibility mean that historical data can be stored in a Hadoop cluster, eschewing legacy storage solutions like tape drives, which are less reliable and take longer to retrieve data from.

In this post, I am going to talk about cases where there is a business need to create a summarized data table from a very large number of underlying event-level records. There are many use cases where this need exists. The most basic is reporting and aiding management decision making. Reporting requires information to be summarized across multiple business segments, at a weekly or monthly frequency. A good example is Personal Financial Management (PFM) systems, which classify credit card transactions and provide monthly summaries across merchant categories. The individual credit card transactions would be stored as individual records across multiple storage units, and a MapReduce algorithm would run as a batch program that summarizes this information.
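The PFM summarization described above can be sketched in a few lines. This is only a single-machine illustration of the map, shuffle and reduce steps, not actual Hadoop code; the record layout, the sample transactions and the function names (map_transaction, shuffle, reduce_amounts, summarize) are all assumptions made for the example. Amounts are kept in cents to avoid floating-point rounding.

```python
from collections import defaultdict
from datetime import date

# Hypothetical event-level records:
# (customer_id, transaction_date, merchant_category, amount_in_cents)
TRANSACTIONS = [
    ("c1", date(2013, 1, 3), "groceries", 5420),
    ("c1", date(2013, 1, 17), "groceries", 3180),
    ("c1", date(2013, 1, 21), "travel", 12000),
    ("c2", date(2013, 1, 9), "dining", 4550),
    ("c1", date(2013, 2, 2), "groceries", 6000),
]


def map_transaction(txn):
    """Map step: emit one (key, value) pair per transaction."""
    customer_id, txn_date, category, cents = txn
    month = txn_date.strftime("%Y-%m")
    yield (customer_id, month, category), cents


def shuffle(pairs):
    """Shuffle step: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_amounts(key, amounts):
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(amounts)


def summarize(transactions):
    """Run map, shuffle and reduce in sequence over all transactions."""
    mapped = (pair for txn in transactions for pair in map_transaction(txn))
    grouped = shuffle(mapped)
    return dict(reduce_amounts(key, values) for key, values in grouped.items())


summary = summarize(TRANSACTIONS)
for key in sorted(summary):
    print(key, summary[key])
```

In a real cluster the map and reduce functions would run in parallel on many machines and the framework would handle the shuffle; the shape of the computation is the same.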

Another application is creating segmentation or targeting variables for any kind of marketing and personalization campaign. A third application, particularly relevant in the digital marketing and e-commerce world, is recommender systems. Recommender systems make product suggestions based on a customer's profile and past behavior. Given that these recommendations need to be made in mere milliseconds in the middle of a marketing campaign, running through all available records to extract the relevant piece of profiling information is not technically feasible. It is better to have a batch job running overnight that summarizes the information and creates a single record (with several hundred fields) for each customer. The IEEE Spectrum magazine article "Deconstructing recommender systems" is a particularly good exposition on recommender systems.
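To make the batch-versus-lookup split concrete, here is a rough sketch. It is not a real recommender algorithm: the purchase records, the three profile fields and the names build_profiles and recommend_category are all hypothetical, and a production record would carry hundreds of fields rather than three. The point is only that the expensive scan happens once in the batch step, while the serving step is a single keyed lookup.

```python
from collections import defaultdict

# Hypothetical event-level purchases: (customer_id, product_category, amount_in_cents)
PURCHASES = [
    ("c1", "books", 1200),
    ("c1", "books", 800),
    ("c1", "music", 500),
    ("c2", "music", 900),
]


def build_profiles(purchases):
    """Batch step: collapse all events into one summary record per customer."""
    totals = defaultdict(lambda: defaultdict(int))
    counts = defaultdict(int)
    for customer_id, category, cents in purchases:
        totals[customer_id][category] += cents
        counts[customer_id] += 1

    profiles = {}
    for customer_id, by_category in totals.items():
        profiles[customer_id] = {
            "purchase_count": counts[customer_id],
            "total_spend_cents": sum(by_category.values()),
            # Category with the highest spend for this customer.
            "top_category": max(by_category, key=by_category.get),
        }
    return profiles


def recommend_category(profiles, customer_id, default="bestsellers"):
    """Serving step: one dictionary lookup, no scan of the raw events."""
    profile = profiles.get(customer_id)
    return profile["top_category"] if profile else default


PROFILES = build_profiles(PURCHASES)  # would run overnight as a batch job
print(recommend_category(PROFILES, "c1"))
print(recommend_category(PROFILES, "new-visitor"))
```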

Architecture of a data summarization application

So what would the data architecture of such a solution look like? Clearly, the Big Data portion of the overall stack, the transaction-level data, would reside in a Hadoop cluster. This would give the unbeatable combination of cheap storage and fast batch processing (by virtue of the MapReduce capabilities). The relative shortcoming of this system, its inability to provide random access to an outside application that reads the database, is irrelevant here. That is because the sole objective of the Hadoop cluster would be to ingest transaction-level data and convert it into summary attributes through a batch program.

The summary table would need to be built on a traditional RDBMS platform, though there are Big Data variants like MongoDB that could do the job as well. The need here is for fast random access for marketing applications, recommender systems and other users of the summary information.
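As a minimal sketch of the serving side, the snippet below loads batch output into a relational table keyed by customer and reads one record back by key. SQLite from the Python standard library stands in for whatever RDBMS would actually be used, and the table name, columns and sample rows are assumptions made for illustration.

```python
import sqlite3

# Hypothetical output of the nightly batch job: one row per customer.
SUMMARY_ROWS = [
    ("c1", 3, 2500, "books"),
    ("c2", 1, 900, "music"),
]


def load_summary(conn, rows):
    """Create the summary table if needed and load the latest batch output."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS customer_summary (
            customer_id       TEXT PRIMARY KEY,
            purchase_count    INTEGER NOT NULL,
            total_spend_cents INTEGER NOT NULL,
            top_category      TEXT NOT NULL
        )
        """
    )
    # INSERT OR REPLACE lets each batch run overwrite the previous summary row.
    conn.executemany(
        "INSERT OR REPLACE INTO customer_summary VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()


def fetch_summary(conn, customer_id):
    """Random access by primary key: returns one row, or None if absent."""
    cursor = conn.execute(
        "SELECT purchase_count, total_spend_cents, top_category "
        "FROM customer_summary WHERE customer_id = ?",
        (customer_id,),
    )
    return cursor.fetchone()


conn = sqlite3.connect(":memory:")
load_summary(conn, SUMMARY_ROWS)
print(fetch_summary(conn, "c1"))
```

The primary key on customer_id is what gives downstream applications the fast keyed lookup that the Hadoop side does not provide.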

So to summarize, Big Data lends itself extremely nicely to creating data tables that aggregate transaction-level data into entity-level (the most common entity being a customer) information. The work itself would be done through a batch job that can take advantage of MapReduce.

In my next post, I will start to touch upon how BigData is ideally suited to process different types of data, both structured and unstructured.

Friday, January 11, 2013

Applications of Big Data Part 2: Reduced storage costs for historical data

Coming back to my topic: the business rationale for big data, or “Why does big data make sense?” In a previous post on the Business Applications of Big Data, I mentioned six specific applications and benefits of big data in the modern large organization. While doing so, I have been specific about benefits from big data technology and tried to draw a distinction from what are generic benefits of data-driven analytics.

One of my pet peeves when I read about the benefits of big data is that they often relate to the benefits of data and analytics more generically or (if the author is trying to be at least somewhat intellectually honest) to unstructured data and text analytics. Take for example this excerpt from McKinsey’s report on big data. The reason I am picking on McKinsey here is that they consider themselves (and are considered, in some circles) to be the smartest guys in the room. And I’d have expected them to be a little more discerning when it comes to differentiating between data/analytics-driven business insights and the somewhat narrow technical area that is big data.

McKinsey says in its report:

There are five broad ways in which using big data can create value. First, big data can unlock significant value by making information transparent and usable at much higher frequency. Second, as organizations create and store more transactional data in digital form, they can collect more accurate and detailed performance information … and therefore expose variability and boost performance. Leading companies are using data collection and analysis to conduct controlled experiments to make better management decisions …Third, big data allows ever-narrower segmentation of customers and therefore much more precisely tailored products or services … Fourth, sophisticated analytics can substantially improve decision-making. Finally, big data can be used to improve the development of the next generation of products and services. For instance, manufacturers are using data obtained from sensors embedded in products to create innovative after-sales service offerings such as proactive maintenance (preventive measures that take place before a failure occurs or is even noticed).

Now, ALL of the points made are either too generic (‘can substantially improve decision making’, ‘create the next generation of products and services’) or are things that apply more broadly to good data/analytics-based business models (‘use of data at higher frequency’, ‘more transactional data’, ‘narrower segmentation and precisely tailored products and services’). And so for someone who is trying to understand specifically whether to stay with a traditional RDBMS or embrace big data, this kind of commentary is useless. What I am going to try to do is call out some of the benefits that are uniquely driven by the specific big data technologies.

The business reasons why big data is a useful idea for organizations to embrace and implement come down to a few specific things. All of these have to do with the fundamental technology innovation that drove big data’s growth in the first place. As David Williams, CEO of Merkle, explains, if big data were merely an explosion in the amount of data suddenly available, it would be called ‘lots of data’. There's clearly more to this phenomenon, particularly since lots of data has always existed. So what are these technology innovations that typify big data?

These are parallel storage and computation on commodity hardware, using open source software. Often, the hardware is centrally located and managed, and connected to the user community through high-speed internet connections. The data is therefore not local, but rather resides in a ‘cloud’. These innovations in turn translate to a number of benefits:

- Lower cost of storage (as compared to traditional technologies like database storage or tape storage)

- Lower latency in getting access to really old data

- Faster computing in situations where batch computation suffices (the operative words here are ‘batch’ and ‘suffices’). Random update and retrieval of individual records, and computation in real time, are not strengths traditionally associated with big data, though there are some hybrid providers that are now able to straddle real-time processing and batch processing somewhat.

- Flexible database schema, which makes the data infrastructure scalable in the columnar dimension (now, I am sure I made up that phrase). This has not been a direct technology innovation from the original big data architecture as envisaged by Yahoo! and Google, but rather can be considered part of the overall big data ecosystem.

It is the first of these technological innovations that leads to the first big business rationale for big data: better access to, and eventually better use of, historical data. The availability of a large amount of historical data translates to better analytics and better predictive models, all else being equal. Decisions can be based on actual empirical data rather than on educated guesses.

Before big data, organizations did one of several things to manage the large amount of historical data they invariably built up over time. Some of them just threw the data away after a set retention period, typically 24-48 months. Others retained portions of the data and threw the rest away. So if a certain business operation generated 100 elements of performance data, organizations would retain the ‘important’ ones (typically those used for financial reporting and planning) and discard the rest. The third strategy was to keep the data but do so in an off-line medium like storage tapes. The problem with storage tapes is that they tend to degrade physically, and the data is often lost. Even when it is not lost, the data is simply too difficult to retrieve and bring back online, so analysts seldom take the trouble of chasing after it.

With the advent of big data, it is now possible to put historical data away in low-cost, commodity storage. This storage is (a) low-cost, (b) retrievable relatively quickly (not necessarily on demand, as one would get from an operational data store, but not with a latency of several days either), and (c) not subject to physical degradation the way tapes are. This is one big advantage of big data.

So, if your organization generates a lot of performance data, and the default strategy for managing this data load has been simply to throw the data away, then big data helps in creating an easily accessible storage mechanism for this data. The easy accessibility means that analysts and decision-makers in the organization can use the historical data to delve deep into the data and come up with patterns. This in turn is an enabler of smarter decisions. Big Data therefore enables smarter decisions indirectly - it is not a direct contributor. The analytics that result out of long and reliable historical data drive the smarter decision making.

Saturday, January 5, 2013

Applications of Big Data for Big companies - Part 1

In this post, I am going to start to elaborate on why big data makes sense. Now, this clearly doesn’t sound like ground-breaking insight. You can google “Big data” and come up with literally hundreds of articles that invariably say how the amount of data generated in the world exceeds the storage capacity available. That customers are generating petabytes of data through their interactions, feedback, etc. That the cost of computing and storage is a fraction of what it was even ten years ago. That Google, Amazon and Facebook have invested in big data infrastructure by setting up commodity servers.

But what I have personally found missing in all of this megatrend information is a clear articulation of why a big company should embrace big data. There are a number of good reports and industry studies on the subject, and the McKinsey report on big data is an exceptional read (the graphic above is derived from the McKinsey Global Institute’s study on big data). But all of them spend an extensive amount of time making the case for big data technologies, and not enough time, in my opinion, on the business rationale that makes it inevitable for an organization to invest in big data.

So in my understanding of the space, what are some of these elements of business rationale that support investment in big data? (I have to qualify my statement: these would apply to a typical large organization that already has a well-established RDBMS or traditional data infrastructure. For a start-up, using big data technologies for one’s data infrastructure is a no-brainer decision. The question of rationale comes up when an organization has already invested considerably in traditional data infrastructure and where the adjustment to introduce big data technologies into the overall ecosystem is not going to be trivial.)

There are 6 specific areas where I have been able to find a sound business rationale for investing in big data. These are:

1. Reducing storage costs for historical data and allowing data to be retained for extended periods and making it readily accessible

2. Where significant batch processing is needed to create a single summarized record (for different downstream business decisions)

3. When different types of data need to be combined, to create business insight – or rather to get slightly more specific, to create a single summarized customer-level record

4. Where there are significant parallel processing needs

5. Where there is a need to have capital expenditure on hardware scale with requirements

6. Where there are significant data capture and storage needs

In subsequent posts, I will make these different elements of business rationale tangible through specific business situations.

Tuesday, January 1, 2013

Back to the blog and de-hyping Big Data

Getting back to writing this blog after a really long time. What happened in the middle? Well, I got lazy and I got somewhat busy.

But obviously the world has not stood still over this period. The topic of my previous set of posts, Tesco, came to the US and has now declared its intention to leave. Big Data has become, well, really big, though my skepticism is still quite intact. And then there was the small matter of a presidential election where analytics and data modeling really came into their own.

So there’s plenty to catch up on, for the inactivity of my past several months. But hey, New Year resolutions are there for a reason and so it is my commitment to be a lot more regular and disciplined about my posts.

My first post is on big data, as this is clearly going to be an important part of analytics and related infrastructure for the next several years. As you readers probably know, I started out from a place of a little bit of skepticism. My understanding has evolved a little over the last few months, and I think I am in a much better place to articulate, primarily for myself, why big data does make sense, mostly in a business sense and less in a purely tech-geeky sense. So I will try to do that over the next few posts.

But let me first start off with a reference to a post by Bill Franks, Big Data evangelist and Chief Analytics Officer for Teradata Alliances. Bill has spoken about big data extensively and most recently, has mused whether big data is all hype.

His take is interesting, in that he does not think the big data story is built on an empty premise. There are genuine underlying business problems that need solving and genuine underlying technologies that provide a set of viable options to solve the problems. But he does believe that there is a multitude of technology options coming out almost on a daily basis, and that a shakeout amongst the players is imminent. Also, organizations will realize that just installing a Hadoop cluster is not the Big Data destination. The destination is an analytics and data infrastructure solution that is fast, cheap and scalable, which does exist today, but which is “potential that can only be extracted with a concerted, focused, and intelligent effort”.

My own quest has been to define for myself why big data makes sense from a business standpoint, especially for a big Fortune 500 company, with the underlying assumption that big organizations and start-ups have different sets of economic motivators. I have been trying to educate myself by building up a detailed understanding of the underlying technologies, speaking to industry experts and practitioners, and attending industry seminars. I will share my findings over the next few posts.