Big Data Analytics: January 2013

Friday, January 11, 2013

Applications of Big Data Part 2: Reduced storage costs for historical data

Coming back to my topic on the business rationale for big data, or “Why does big data make sense?” In a previous post on the Business Applications of Big Data, I mentioned six specific applications and benefits of big data in the modern large organization. While doing so, I tried to be specific about the benefits that come from big data technology, and to draw a distinction between those and the generic benefits of data-driven analytics.

One of my pet peeves when I read about the benefits of big data is that they often relate to the benefits of data and analytics more generically, or (if the author is trying to be at least somewhat intellectually honest) to unstructured data and text analytics. Take for example this excerpt from McKinsey’s report on big data. I am picking on McKinsey here because they consider themselves (and are considered, in some circles) to be the smartest guys in the room. And I’d have expected them to be a little more discerning when it comes to differentiating between data/analytics-driven business insights and the somewhat narrow technical area that is big data.

McKinsey says in its report:

There are five broad ways in which using big data can create value. First, big data can unlock significant value by making information transparent and usable at much higher frequency. Second, as organizations create and store more transactional data in digital form, they can collect more accurate and detailed performance information … and therefore expose variability and boost performance. Leading companies are using data collection and analysis to conduct controlled experiments to make better management decisions … Third, big data allows ever-narrower segmentation of customers and therefore much more precisely tailored products or services … Fourth, sophisticated analytics can substantially improve decision-making. Finally, big data can be used to improve the development of the next generation of products and services. For instance, manufacturers are using data obtained from sensors embedded in products to create innovative after-sales service offerings such as proactive maintenance (preventive measures that take place before a failure occurs or is even noticed).

Now, ALL of the points made here are either too generic (‘can substantially improve decision making’, ‘create the next generation of products and services’) or are things that apply more generally to good data/analytics-based business models (‘use of data at higher frequency’, ‘more transactional data’, ‘narrower segmentation and precisely tailored products and services’). And so for someone who is trying to understand specifically whether to stay with a traditional RDBMS or embrace big data, this kind of commentary is useless. What I am going to try and do is call out some of the benefits of big data that are uniquely driven by the specific big data technologies.

The business reasons why big data is a useful idea for organizations to embrace and implement come down to a few specific things. All of these have to do with the fundamental technology innovation that drove big data’s growth in the first place. As David Williams, CEO of Merkle, explains, if big data were merely an explosion in the amount of data suddenly available, it would be called ‘lots of data’. There's clearly more to this phenomenon, particularly since lots of data has always existed. So what are these technology innovations that typify big data?

These are parallel storage and computation, on commodity hardware, using open source software. Often, the hardware is centrally located and managed, and connected to the user community through high-speed internet connections. The data is therefore not local, but rather resides in a ‘cloud’. These innovations in turn translate to a number of benefits:

- Lower cost of storage (as compared to traditional technologies like database storage or tape storage)

- Lower latency in getting access to really old data

- Faster computing in situations where batch computation suffices (the operative words here are ‘batch’ and ‘suffices’). Random update and retrieval of individual records, and computation in real time, are not strengths traditionally associated with big data, though there are some hybrid providers that are now able to straddle real-time processing and batch processing somewhat.

- Flexible database schema, which makes the data infrastructure scalable in the columnar dimension (now, I am sure I made up that phrase). This was not a direct technology innovation of the original big data architecture as envisaged by Yahoo! and Google, but it can be considered part of the overall big data ecosystem.

It is the first of these technological innovations that leads to the first big business rationale for big data, which is better access to, and eventually better use of, historical data. The availability of a large amount of historical data translates to better analytics and better predictive models, all else being equal. Decisions can be based on actual empirical data rather than on educated guesses.

Before big data, organizations did one of several things to manage the large amount of historical data they invariably built up over time. Some of them just threw the data away after a set retention period, typically 24-48 months. Others retained portions of the data and threw the rest of it away. So if a certain business operation generated 100 elements of performance data, organizations would retain the ‘important’ ones (the ones typically used for financial reporting and planning) and would throw the rest away. The third strategy was to keep the data, but in an off-line medium like storage tapes. The problem with storage tapes is that they tend to degrade physically, and the data is often lost. Even when it is not lost, the data is simply too difficult to retrieve and bring back online, and so analysts seldom take the trouble of chasing after it.

With the advent of big data, it is now possible to put historical data away in low-cost commodity storage. This storage a. is low-cost, b. allows the data to be retrieved relatively quickly (not necessarily on demand, as one would get from an operational data store, but not with a latency of several days either), and c. does not degrade like tapes do. This is one big advantage of big data.

So, if your organization generates a lot of performance data, and the default strategy for managing this data load has been simply to throw the data away, then big data helps in creating an easily accessible storage mechanism for this data. The easy accessibility means that analysts and decision-makers in the organization can delve deep into the historical data and come up with patterns. This in turn is an enabler of smarter decisions. Big data therefore enables smarter decisions indirectly; it is not a direct contributor. It is the analytics built on long and reliable historical data that drive the smarter decision making.

Saturday, January 5, 2013

Applications of Big Data for Big companies - Part 1

In this post, I am going to start to elaborate on why big data makes sense. Now, this clearly doesn’t sound like ground-breaking insight. You can google “Big data” and come up with literally hundreds of articles that will invariably say that the amount of data generated in the world exceeds the storage capacity available. That customers are generating petabytes of data through their interactions, feedback, etc. That the cost of computing and storage is a fraction of what it used to be even ten years back. That Google, Amazon and Facebook have invested in big data infrastructure by setting up commodity servers.

But what I have personally found missing in all of this megatrend information is a clear articulation of why a big company should embrace big data. There are a number of good reports and industry studies on the subject, and the McKinsey report on big data is an exceptional read (the graphic above is derived from the McKinsey Global Institute’s study on big data). But all of them spend an extensive amount of time making the case for big data technologies, and not enough time, in my opinion, on the business rationale that makes it inevitable for an organization to invest in big data.

So in my understanding of the space, what are some of these elements of business rationale that support investment in big data? (I have to qualify my statement: these would apply to a typical large organization that already has a well-established RDBMS or traditional data infrastructure. For a start-up, using big data technologies for one’s data infrastructure is a no-brainer decision. The question of rationale comes up when an organization has already invested considerably in traditional data infrastructure and where the adjustment to introduce big data technologies into the overall ecosystem is not going to be trivial.)

There are 6 specific areas where I have been able to find a sound business rationale for investing in big data. These are:

1. Reducing storage costs for historical data, so that data can be retained for extended periods and kept readily accessible

2. Where significant batch processing is needed to create a single summarized record (for different downstream business decisions)

3. When different types of data need to be combined, to create business insight – or rather to get slightly more specific, to create a single summarized customer-level record

4. Where there are significant parallel processing needs

5. Where there is a need to have capital expenditure on hardware scale with requirements

6. Where there are significant data capture and storage needs

In subsequent posts, I will make these different elements of business rationale tangible through specific business situations.
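To make items 2 and 3 in the list above slightly more concrete, here is a minimal sketch of what "batch processing to create a single summarized customer-level record" could look like. It is written in plain Python rather than against any real big data framework, and everything in it (the three source datasets, the field names, and the `map_phase` and `reduce_phase` helpers) is a made-up assumption for illustration. The idea is only that records of different types are first mapped to a common per-customer shape, and then reduced into one summary row per customer.

```python
from collections import defaultdict

# Hypothetical raw records from three different source systems.
transactions = [
    {"customer_id": "c1", "amount": 120.0},
    {"customer_id": "c2", "amount": 35.5},
    {"customer_id": "c1", "amount": 80.0},
]
web_visits = [
    {"customer_id": "c1", "page": "/pricing"},
    {"customer_id": "c2", "page": "/home"},
    {"customer_id": "c1", "page": "/support"},
]
service_calls = [
    {"customer_id": "c2", "minutes": 12},
]


def map_phase(transactions, web_visits, service_calls):
    """Emit (customer_id, partial_summary) pairs, one per raw record."""
    for t in transactions:
        yield t["customer_id"], {"total_spend": t["amount"], "visits": 0, "call_minutes": 0}
    for v in web_visits:
        yield v["customer_id"], {"total_spend": 0.0, "visits": 1, "call_minutes": 0}
    for c in service_calls:
        yield c["customer_id"], {"total_spend": 0.0, "visits": 0, "call_minutes": c["minutes"]}


def reduce_phase(pairs):
    """Add up the partial summaries so each customer ends with one record."""
    summaries = defaultdict(lambda: {"total_spend": 0.0, "visits": 0, "call_minutes": 0})
    for customer_id, partial in pairs:
        for field, value in partial.items():
            summaries[customer_id][field] += value
    return dict(summaries)


customer_summary = reduce_phase(map_phase(transactions, web_visits, service_calls))
for customer_id in sorted(customer_summary):
    print(customer_id, customer_summary[customer_id])
```

In a real deployment the map step would run in parallel across many machines, each holding a slice of the data, and that is where the parallel storage and computation would come in. The single-machine version here only shows the shape of the computation.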

Tuesday, January 1, 2013

Back to the blog and de-hyping Big Data

Getting back to writing this blog after a really long time. What happened in the middle? Well, I got lazy and I got somewhat busy.

But obviously the world has not stood still over this period. The topic of my previous set of posts, Tesco, came to the US and has now declared its intention to leave. Big Data has become, well, really big, though my skepticism is still quite intact. And then there was the small matter of a presidential election where analytics and data modeling really came into their own.

So there’s plenty to catch up on after my inactivity of the past several months. But hey, New Year resolutions are there for a reason, and so it is my commitment to be a lot more regular and disciplined about my posts.

My first post is on big data, as this is clearly going to be an important part of analytics and related infrastructure for the next several years. As you readers probably know, I started out from a place of a little bit of skepticism. My understanding has evolved a little bit over the last few months, and I think I am in a much better place to articulate, primarily for myself, why big data does make sense, mostly in a business sense and less in a purely tech-geeky sense. So I will try and do that over the next few posts.

But let me first start off with a reference to a post by Bill Franks, Big Data evangelist and Chief Analytics Officer for Teradata Alliances. Bill has spoken about big data extensively and most recently, has mused whether big data is all hype.

His take is interesting, in that he does not think the big data story is built on an empty premise. There are genuine underlying business problems that need solving and genuine underlying technologies that provide a set of viable options to solve the problems. But he does believe that there is a multitude of technology options coming out, almost on a daily basis, and that a shakeout amongst the players is imminent. Also, organizations will realize that just installing a Hadoop cluster is not the Big Data destination. The destination is an analytics and data infrastructure solution that is fast, cheap and scalable, which does exist today, but which is “potential that can only be extracted with a concerted, focused, and intelligent effort”.

My own quest has been to define for myself why big data makes sense from a business standpoint, especially for a big Fortune 500 company, with the underlying assumption that there are different sets of economic motivators for big organizations vs. start-ups. I have been trying to educate myself by building up a detailed understanding of the underlying technologies, speaking to industry experts and practitioners, and attending industry seminars. I will share my findings over the next few posts.