Tuesday, September 30, 2014

Real-time analytics infrastructures

As the BigData (Hadoop) explosion has taken hold, architectures that provide analytic access to ever larger data sets for ever deeper insights are becoming the norm in many top data-science-driven organizations. LinkedIn is one example - its analytics blog is typically rich with ideas on how to approach analytics.

What are some of the challenges these architectures are trying to tackle?

- The need to bring data from different sources together around one single entity. In most cases, this entity is an individual. In the case of LinkedIn, it is the professional member; for Facebook, one of its 1.2 billion subscribers; for a bank, one of its customers. While the actual integration of the data might sound trivial, the effort involved is highly manual. Especially in legacy organizations like banks, where different source systems (managed by different vendors) perform billing, transaction processing, bill-pay, etc., the complexity of the data that comes into the bank in the form of raw files can be truly mind-boggling. Think multiple data transfer formats (I recently came across the good old EBCDIC format), files with keys and identities specific to the system that produced them, and so on. These files need to be converted into a common format that is readable by all internal applications and organized around one internal key.

- Next, the need for the data to be value-enhanced in a consistent manner. Raw data is seldom useful to all users without some form of value addition. This value addition could be something simple, e.g. converting the relationship opening date into a length-of-relationship indicator: if the relationship opening date is 1/1/2003, the length of relationship is 11 years. Or it could be a more complex synthetic attribute that combines multiple raw data elements. An example is credit card utilization, which is the balance divided by the available credit limit. The problem with this kind of value enhancement is that different people could choose to compute it in different ways, creating multiple versions of the same synthetic attribute in the data ecosystem - which is confusing to the user. A data architecture that allows these synthetic attributes to be defined once and then used many times is a useful solution to this problem.

- The need to respond to queries within an acceptable time interval. Whatever the service level agreement (SLA) an application demands, the data product needs to meet business or user needs in terms of the number of concurrent users and query latency. The raw HDFS infrastructure was designed for batch processing, not for real-time access patterns. Enabling those patterns requires the data to be pre-organized and processed through some kind of batch pipeline, so as to make it consumption-ready. That, combined with the need to maintain close to real-time data freshness, means the architecture needs elements beyond the basic Hadoop infrastructure.
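As a toy illustration of the first challenge - merging records that each source system keys differently into one common format under one internal key - here is a minimal Python sketch. All system and field names ("billing", "bill-pay", acct_no, payer_id) are made up for illustration; in practice the hard part is the cross-system key mapping, which this sketch glosses over by assuming the identifiers line up.

```python
# Sketch: normalize records from two hypothetical source systems
# into one common record format, keyed by a single internal
# customer key. Field names are illustrative, not from any real system.

def normalize_billing(rec):
    # The billing system identifies customers by "acct_no"
    return {"customer_key": "C-" + rec["acct_no"],
            "source": "billing",
            "opened": rec["open_dt"]}

def normalize_billpay(rec):
    # The bill-pay system uses its own "payer_id"
    return {"customer_key": "C-" + rec["payer_id"],
            "source": "billpay",
            "opened": rec["enrolled_on"]}

billing = [{"acct_no": "1001", "open_dt": "2003-01-01"}]
billpay = [{"payer_id": "1001", "enrolled_on": "2010-06-15"}]

# Every downstream application now sees one shape of record,
# organized around customer_key.
unified = ([normalize_billing(r) for r in billing] +
           [normalize_billpay(r) for r in billpay])
print(unified[0]["customer_key"])  # C-1001
```

Real pipelines would of course add format conversion (e.g. EBCDIC to ASCII) and a lookup table that maps each system's native identity to the internal key.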
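The "define once, use many times" idea from the second challenge can be sketched as a small registry of synthetic attributes. This is only one possible design, in Python; the attribute names and the sample row are invented for the example, but the two derivations (length of relationship, credit card utilization) are the ones described above.

```python
# Sketch: a registry where each synthetic attribute is defined
# exactly once, so every consumer computes it the same way.
from datetime import date

SYNTHETIC = {}

def synthetic(name):
    # Decorator that records the attribute definition under its name
    def register(fn):
        SYNTHETIC[name] = fn
        return fn
    return register

@synthetic("relationship_years")
def relationship_years(row, as_of=date(2014, 9, 30)):
    opened = row["relationship_open_date"]
    # Whole years elapsed, adjusting if the anniversary hasn't passed
    return (as_of.year - opened.year
            - ((as_of.month, as_of.day) < (opened.month, opened.day)))

@synthetic("utilization")
def utilization(row):
    # Credit card utilization: balance divided by available credit limit
    return row["balance"] / row["credit_limit"]

row = {"relationship_open_date": date(2003, 1, 1),
       "balance": 2500.0, "credit_limit": 10000.0}
print(SYNTHETIC["relationship_years"](row))  # 11
print(SYNTHETIC["utilization"](row))         # 0.25
```

Because every consumer pulls the definition from the same registry, there is only ever one version of "utilization" in the ecosystem.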
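The third challenge - meeting latency SLAs on an infrastructure built for batch - usually comes down to precomputing query results offline and serving them from a fast lookup store. The sketch below uses a plain dict as a stand-in for a real serving store (LinkedIn, for instance, has used its Voldemort key-value store for this); the transaction data is invented.

```python
# Sketch: a batch job rolls raw transactions up per customer;
# queries then hit the precomputed store instead of scanning raw data.
from collections import defaultdict

raw_transactions = [
    ("cust-1", 120.00),
    ("cust-1", 80.00),
    ("cust-2", 15.50),
]

def batch_rollup(transactions):
    # Runs offline (e.g. nightly), not at query time
    totals = defaultdict(float)
    for customer, amount in transactions:
        totals[customer] += amount
    return dict(totals)

serving_store = batch_rollup(raw_transactions)

def query_spend(customer):
    # O(1) lookup meets the latency SLA; data freshness is bounded
    # by how often the batch rollup runs
    return serving_store.get(customer, 0.0)

print(query_spend("cust-1"))  # 200.0
```

The trade-off is exactly the one noted above: the serving store is only as fresh as the last batch run, which is why architectures that need near-real-time relevance layer additional components on top of basic Hadoop.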

These are just some of the reasons why BigData applications and implementations need to pay special attention to the architecture and the choice of the different component systems.
