Tuesday, September 30, 2014

Real-time analytics infrastructures

As the BigData (Hadoop) explosion has taken hold, architectures that provide analytic access to ever larger data for ever deeper insights have started becoming the norm in many top data-science-driven organizations. One of them is the approach used at LinkedIn - the LinkedIn analytic blog is typically rich with ideas on how to approach analytics.

What are some of the challenges that some of these architectures are trying to tackle?

- The need to organize data from different data sources around one single entity. In most cases, this is an individual. In the case of LinkedIn, it is the professional member. For Facebook, one of its 1.2 billion subscribers. For a bank, one of its customers. While the actual integration of the data might sound trivial, the effort involved is highly manual. This is especially true in legacy organizations like banks, which have different source systems (managed by different vendors) performing their billing, transaction processing, bill-pay, etc. The complexity of the data that comes into the bank in the form of raw files can be truly mind-boggling. Think multiple different data transfer formats (I recently came across the good old EBCDIC format), files with specific keys and identities relevant only to that system, and so on. These files need to be converted into a common format that is readable by all internal applications and organized around one internal key.

- Next, the need for the data to be value-enhanced in a consistent manner. Raw data is seldom useful to all users without some form of value addition. This value addition could be something simple, e.g. taking the relationship opening date and converting it into a length-of-relationship indicator. So, say the relationship opening date is 1/1/2003, the length of relationship (as of this writing) is 11 years. Or it could be a more complex synthetic attribute that combines multiple raw data elements. An example is credit card utilization, which is the balance divided by the available credit limit. The problem with this kind of value enhancement is that different people could choose to do it in different ways, creating multiple versions of the same synthetic attribute in the data ecosystem, which can be confusing to the user. Creating a data architecture which allows these kinds of synthetic attributes to be defined once and then used multiple times can be a useful solution to the problem I just described.

- The need to respond to queries to the data environment within an acceptable time interval. This is also known as the service level or SLA that an application demands: any data product needs to meet business or user needs in terms of the number of concurrent users and query latency. The raw HDFS infrastructure was designed for batch processing and not for real-time access patterns. Enabling these patterns requires the data to be pre-organized and processed through some kind of batch approach, so as to make it consumption-ready. That, combined with the need to keep the data close to real-time, means that the architecture needs to use elements beyond just the basic Hadoop infrastructure.
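To make the first two challenges a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how any particular bank does it: the key mapping table, the source system names and the record fields are all hypothetical, and I am assuming the EBCDIC files use code page 037 (Python's built-in cp037 codec). The idea is to decode a source-system key, map it to one internal key, and define each synthetic attribute exactly once in a shared registry.

```python
from datetime import date

# Hypothetical mapping from (source system, source key) to one internal key.
KEY_MAP = {
    ("billing", "B-0042"): "CUST-1",
    ("cards", "C-9001"): "CUST-1",
}


def decode_ebcdic(raw: bytes) -> str:
    """Decode an EBCDIC field. Code page 037 is an assumption for this sketch."""
    return raw.decode("cp037").strip()


def internal_key(source: str, source_key: str) -> str:
    """Translate a system-specific key into the single internal key."""
    return KEY_MAP[(source, source_key)]


def relationship_years(opened: date, as_of: date) -> int:
    """Whole years between the relationship opening date and as_of."""
    years = as_of.year - opened.year
    if (as_of.month, as_of.day) < (opened.month, opened.day):
        years -= 1
    return years


def card_utilization(balance: float, credit_limit: float) -> float:
    """Balance divided by credit limit; 0.0 when there is no limit."""
    return balance / credit_limit if credit_limit else 0.0


# Each synthetic attribute is defined once here and reused by every consumer.
DERIVED_ATTRIBUTES = {
    "relationship_years": lambda r, as_of: relationship_years(r["opened"], as_of),
    "card_utilization": lambda r, as_of: card_utilization(r["balance"], r["credit_limit"]),
}


def enrich(record: dict, as_of: date) -> dict:
    """Return a copy of the record with all registered attributes added."""
    out = dict(record)
    for name, fn in DERIVED_ATTRIBUTES.items():
        out[name] = fn(record, as_of)
    return out


# Simulate a padded EBCDIC key field arriving from the billing system.
raw_key = "B-0042  ".encode("cp037")
record = {
    "customer": internal_key("billing", decode_ebcdic(raw_key)),
    "opened": date(2003, 1, 1),
    "balance": 2500.0,
    "credit_limit": 10000.0,
}
enriched = enrich(record, as_of=date(2014, 9, 30))
print(enriched["customer"], enriched["relationship_years"], enriched["card_utilization"])
```

Because every consumer calls enrich rather than computing utilization on its own, there is only one definition to argue about.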

These are just some of the reasons why BigData applications and implementations need to pay special attention to the architecture and the choice of the different component systems.

Tuesday, September 23, 2014

A/B Testing - ensuring organizational readiness

In my previous post on the subject of A/B testing, I had talked about the need for operational and technical readiness assessment before one embarks on A/B testing. It is essential to make sure that data flows in the overall system are designed well enough to make sure user behavior can be tracked. Also the measurement system needs to be robust enough to not break when changes to the client (browser or mobile) are introduced. In reality, this is achieved by a combination of a robust platform as well as disciplined coding practices while introducing new content.

But equally important as operational/technical readiness is organizational readiness to embrace Test and Learn. There are a few reasons why the organization might not be ready (despite mouthing all the right platitudes).

First, an inability to recognize the need for unbiased testing in the "wild". A lot of digital product managers tend to treat usability studies, consumer research/empathy interviews and A/B testing as somewhat interchangeable ideas. Each of these techniques has a distinct use and they need to complement each other. Specifically, A/B testing achieves the goal of evaluating a product or a feature in an environment that is most like what a consumer is likely to experience. There is no one-way mirror, no interviewer putting words in your mouth - it is all about how the product works in the context of people's lives and whether it proves itself to be useful.

To remedy this, we have had to undertake extensive education sessions with product managers and developers around the value of A/B testing and building testing capability into the product from the get-go. While a lot of people deep in analytics tend to find testing and experimentation a natural way to go, this approach is not obvious to everyone.

The second reason why A/B testing and experimentation are not embraced as they need to be is risk aversion. There are fears (sometimes justified) that doing something different from the norm is going to be perceived by customers as disruptive to the experience they are used to. Again, this is something that needs constant education. Also, instead of doing a 50/50 split, exposing the new experience only to a small number of visitors or users (running into several thousands but in all, less than a few percentage points of the total traffic a site would see) is the way to go.
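One common way to expose only a small slice of traffic is to hash each visitor into a bucket and send the lowest buckets to the new experience. The Python sketch below is just an illustration of that idea, not the mechanism of any specific testing tool: the function name, the experiment label and the 2% default are my own assumptions.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, exposure_pct: float = 2.0) -> str:
    """Deterministically place a user in 'treatment' or 'control'.

    exposure_pct is the share of traffic (0 to 100) that sees the new
    experience. The 2% default is an assumption for this sketch.
    """
    # Hash the experiment name together with the user id so that the same
    # user can land in different buckets for different experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10000  # a bucket between 0 and 9999
    return "treatment" if bucket < exposure_pct * 100 else "control"


# The same visitor always gets the same answer, so the experience is stable
# across visits without storing any assignment state.
print(assign_variant("visitor-12345", "new-checkout-flow"))
```

Since the assignment depends only on the visitor id and the experiment name, the exposure can later be raised from 2% to 5% without reshuffling the visitors who were already in the treatment group.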

Additionally, having a specific "testing" budget agreed upfront and ensuring transparency around how the budget is getting used is an excellent way to mitigate a lot of these mostly well-meaning but also mostly unnecessary concerns.

What do you think about organizational and technical readiness? How have you addressed it in your organization while getting A/B testing off the ground? Please share in the comments area.

Friday, September 12, 2014

A/B Testing - some recent lessons learned - First part out of many

We have been making a slow and steady journey towards A/B testing in my organization. No need to really explain the value or the need for A/B testing. Experimentation is quite simply the only way to tell causation from correlation - and also the only real way to measure whether any of the product features we build even matter.

In the past 12 months, we have learned some important lessons about testing and, importantly, the organizational readiness required before you go and buy an MVT (multivariate testing) tool and start running A/B tests at large. These are:

1. Ensuring organizational and infrastructural readiness
2. Building a culture of testing and establishing proofs of concept
3. Continuous improvement from A/B testing at scale

In my opinion, the first and most important step is creating the baseline in terms of organizational and infrastructural readiness. Despite the best intentions of learning from testing, there can be a number of important reasons why testing just does not get off the ground. 

A poor measurement framework is one such big reason. An online performance measurement solution such as Adobe SiteCatalyst is only as good as the attention to detail and the robustness of its implementation. In our case, though the overall implementation was useful in giving some good online behavior measurement, the attention to detail in ensuring every single action on the site was measurable was just not there. As a result, a few initial attempts at testing proved to be failures - meaning the test was not readable and needed to be abandoned. Not only was this wasted testing effort, it was also a meaningful setback that reinforced another belief in the organization: that testing is both risky and unnecessary for an organization that gets customer research and usability right. This brings me to the next part of readiness, organizational readiness, which will be the subject of my next post.

Saturday, February 2, 2013

Applications of Big Data Part 3: Creating summary tables from detailed transaction data

The next part in an ongoing series about the value of big data. The first post tried to draw a difference between the value delivered by better analytics generically speaking and by big data, specifically. We talked about 6 specific applications of big data:
1. Reducing storage costs for historical data
2. Where significant batch processing is needed to create a single summarized record
3. When different types of data need to be combined, to create business insight
4. Where there are significant parallel processing needs
5. Where there is a need to have capital expenditure on hardware scale with requirements
6. Where there are significantly large data capture and storage needs, such as data being captured through automated sensors and transducers

In the previous post, I talked about how one of the applications of Big Data is better management of historical data. The low-cost storage and the ready accessibility mean that historical data can be stored in a Hadoop cluster, eschewing some of the legacy storage solutions like tape drives, which are less reliable and also take longer to retrieve data from.

In this post, I am going to talk about when there is a business need to create a summarized data table on lots and lots of records of underlying data which are event level. There are a large number of use-cases where the business need to create a summarized record exists. The most basic is in reporting and aiding management decision making. Reporting requires information to be summarized across multiple business segments and at a weekly or a monthly frequency. A good example is Personal Financial Management or PFM systems, which classify credit card transactions and provide summaries across merchant categories and for a month. The individual credit card transactions would be stored as individual records across multiple different storage units and a Map-Reduce algorithm would run as a batch program that summarizes this information.
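To show the shape of such a batch job, here is a toy Python sketch of the map, shuffle and reduce steps for the PFM example. It runs in a single process rather than on a Hadoop cluster, and the transaction fields and sample values are made up for illustration; amounts are kept in cents to avoid floating-point rounding.

```python
from collections import defaultdict

# Made-up event-level records; on a real cluster these would be spread
# across many storage nodes.
transactions = [
    {"customer": "CUST-1", "date": "2013-01-05", "category": "grocery", "amount_cents": 5420},
    {"customer": "CUST-1", "date": "2013-01-19", "category": "grocery", "amount_cents": 3180},
    {"customer": "CUST-1", "date": "2013-01-20", "category": "travel", "amount_cents": 21000},
    {"customer": "CUST-1", "date": "2013-02-02", "category": "grocery", "amount_cents": 1200},
    {"customer": "CUST-2", "date": "2013-01-11", "category": "travel", "amount_cents": 9900},
]


def map_phase(txn):
    """Emit one (key, value) pair per transaction.

    The key is (customer, year-month, merchant category).
    """
    key = (txn["customer"], txn["date"][:7], txn["category"])
    yield key, txn["amount_cents"]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, amounts):
    """Collapse all amounts for one key into a single monthly total."""
    return key, sum(amounts)


def summarize(txns):
    pairs = (pair for txn in txns for pair in map_phase(txn))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


summary = summarize(transactions)
for key in sorted(summary):
    print(key, summary[key])
```

The map step never needs to see more than one transaction at a time, which is what lets the real thing run in parallel across the storage units holding the data.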

Another application is as segmentation or targeting variables in any kind of marketing and personalization campaign. A third application that is particularly relevant in the Digital marketing and e-commerce world is for use in recommender systems. Recommender systems make product suggestions based on customer profile and their past behaviors – given that these recommendations need to be made in mere milliseconds in the middle of a marketing campaign, running through all available records to extract the relevant piece of profiling information is not technically feasible. What is better is to have a batch job running overnight that summarizes information and creates a single record (with several hundreds of fields) for each customer. Read this link from the IEEE Spectrum magazine "Deconstructing recommender systems" for a particularly good exposition on recommender systems.

Architecture of a data summarization application

So what would the data architecture of such a solution look like? Clearly, the Big Data portion of the overall stack, the transaction level data, would reside in a Hadoop cluster. This would give the unbeatable combination of cheap storage and extremely fast processing time (by virtue of the MapReduce capabilities). The relative shortcoming of this system, the inability to provide random access capabilities to an outside application that reads the database, is irrelevant here. That is because the sole objective of the Hadoop cluster would be to ingest transaction level data and convert it into summary attributes through a batch program.

The summary table would need to be built on a traditional RDBMS platform, though there are BigData variants as well like MongoDB that could do the job. The need here is for fast random access for marketing applications, recommender systems and other users of the summary information.
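As a rough sketch of the serving side, the Python example below uses the standard library's sqlite3 module as a stand-in for the RDBMS. The table name, columns and sample rows are hypothetical; the point is only that the batch job writes one row per customer and the consuming application reads it back by primary key.

```python
import sqlite3


def load_summary(conn, rows):
    """Write the output of the overnight batch job: one row per customer."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS customer_summary (
               customer_id       TEXT PRIMARY KEY,
               txn_count         INTEGER,
               total_spend_cents INTEGER,
               top_category      TEXT
           )"""
    )
    # INSERT OR REPLACE lets each nightly run overwrite the previous summary.
    conn.executemany(
        "INSERT OR REPLACE INTO customer_summary VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()


def lookup(conn, customer_id):
    """Random access by primary key, the read pattern a recommender needs."""
    cur = conn.execute(
        "SELECT txn_count, total_spend_cents, top_category "
        "FROM customer_summary WHERE customer_id = ?",
        (customer_id,),
    )
    return cur.fetchone()


conn = sqlite3.connect(":memory:")
load_summary(conn, [
    ("CUST-1", 4, 30800, "travel"),
    ("CUST-2", 1, 9900, "travel"),
])
print(lookup(conn, "CUST-1"))
```

A real summary record would carry several hundred fields rather than three, but the access pattern is the same: a single keyed read instead of a scan over every transaction.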

So to summarize, Big Data lends itself extremely nicely to creating data tables that aggregate transaction-level data into entity-level (the most common entity being a customer) information. The work itself would be done through a batch job that can take advantage of MapReduce.

In my next post, I will start to touch upon how BigData is ideally suited to process different types of data, both structured and unstructured.

Friday, January 11, 2013

Applications of Big Data Part 2: Reduced storage costs for historical data

Coming back to my topic on the business rationale for big data or “Why does Big data make sense?” In a previous post on the Business Applications of Big Data, I mentioned six specific applications and benefits of big data in the modern large organization. While doing so, I have been specific about benefits from big data technology and tried to draw a distinction from what are generic benefits from data-driven analytics.

One of my pet peeves when I read about the benefits of big data is that they often relate to the benefits of data and analytics more generically, or (if the author is trying to be at least somewhat intellectually honest) unstructured data and text analytics. Take for example this excerpt from McKinsey’s report on big data. The reason why I am picking on McKinsey here is that they consider themselves (and are considered, in some circles) to be the smartest guys in the room. And I’d have expected them to be a little more discerning when it comes to differentiating between data/analytics-driven business insights and the somewhat narrow technical area which is big data.

McKinsey says in its report:
There are five broad ways in which using big data can create value. First, big data can unlock significant value by making information transparent and usable at much higher frequency. Second, as organizations create and store more transactional data in digital form, they can collect more accurate and detailed performance information … and therefore expose variability and boost performance. Leading companies are using data collection and analysis to conduct controlled experiments to make better management decisions …Third, big data allows ever-narrower segmentation of customers and therefore much more precisely tailored products or services … Fourth, sophisticated analytics can substantially improve decision-making. Finally, big data can be used to improve the development of the next generation of products and services. For instance, manufacturers are using data obtained from sensors embedded in products to create innovative after-sales service offerings such as proactive maintenance (preventive measures that take place before a failure occurs or is even noticed).

Now, ALL of the points made are either too generic (‘can substantially improve decision making’, ‘create the next generation of products and services’) or are things that apply more generically to good data/analytics-based business models (‘use of data at higher frequency’, ‘more transactional data’, ‘narrower segmentation and precisely tailored products and services’). And so for someone who is trying to understand specifically whether to stay with a traditional RDBMS or embrace big data, this kind of commentary is useless. What I am going to try and do is to call out some of the benefits of big data that are uniquely driven by the specific big data technologies.

The business reasons why big data is a useful idea for organizations to embrace and implement come down to a few specific things. All of these have to do with the fundamental technology innovation that drove big data’s growth in the first place. As David Williams, CEO of Merkle explains, if big data was merely an explosion in the amount of data suddenly available, it would be called ‘lots of data’. There's clearly more to this phenomenon, particularly since lots of data has always existed. So what are these technology innovations that typify big data?

These are parallel storage and computation, on commodity hardware using open source software. Often, the hardware is centrally located and managed and connected to the user community through high-speed internet connections. And therefore, the data is not local, but rather resides in a ‘cloud’. These innovations in turn translate to a number of benefits:
- Lower cost of storage (as compared to traditional technologies like database storage or tape storage)
- Lower latency in getting access to really old data
- Faster computing in situations where batch computation suffices (the operative words here are ‘batch’ and ‘suffices’). Random update and retrieval of individual records, and computation in real-time, are not strengths traditionally associated with big data, though there are some hybrid providers that are now able to straddle real-time processing and batch processing somewhat.
- Flexible database schema, which makes the data infrastructure scalable in the columnar dimension (now, I am sure I made up that phrase). This has not been a direct technology innovation from the original big data architecture as envisaged by Yahoo! and Google, but rather can be considered part of the overall big data ecosystem

It is the first of these benefits that leads to the first big business rationale for big data – which is better access to and eventually better use of historical data. The availability of a large amount of historical data translates to better analytics and better predictive models, all else being equal. There is actual empirical data on which decisions can be based, as against taking educated guesses.

Before big data, organizations did one of several things to manage the large amount of historical data they invariably built up over time. Some of them just threw the data away, after establishing a certain retention period for the data – this would typically be 24-48 months. Others retained portions of the data and threw the rest of it away. So if a certain business operation generated 100 elements of performance data, organizations would retain the ‘important’ ones (the ones typically used for financial reporting and planning) and would throw the rest away. The third strategy was to keep the data but do so in an off-line medium like storage tapes. The problem with storage tapes is that they tend to degrade physically and the data is often lost. Even when it is not, the data is simply too difficult to retrieve and bring back online, and so analysts seldom take the trouble of chasing after it.

With the advent of big data, it is now possible to put historical data away in low-cost, commodity storage. This storage a. is low-cost, b. can be retrieved from relatively quickly (not necessarily on demand, as one would get from an operational data store, but not with a latency of several days either), and c. does not degrade like tapes do. This is one big advantage of big data.

So, if your organization generates a lot of performance data, and the default strategy for managing this data load has been simply to throw the data away, then big data helps in creating an easily accessible storage mechanism for this data. The easy accessibility means that analysts and decision-makers in the organization can use the historical data to delve deep into the data and come up with patterns. This in turn is an enabler of smarter decisions. Big Data therefore enables smarter decisions indirectly - it is not a direct contributor. The analytics that result out of long and reliable historical data drive the smarter decision making.

Saturday, January 5, 2013

Applications of Big Data for Big companies - Part 1

In this post, I am going to start to elaborate on why big data makes sense. Now, this clearly doesn’t sound like ground-breaking insight. You can google “Big data” and you can come up with literally hundreds of articles that will invariably say how the amount of data generated in the world exceeds the storage capacity available. That customers are generating petabytes of data through their interactions, feedback, etc. That the cost of computing and storage is a fraction of what it used to be even ten years back. That Google, Amazon, Facebook have invested in big data infrastructure by setting up commodity servers.

But what I have personally found missing in all of this megatrend information is a clear articulation of why a big company should embrace big data. There are a number of good reports and industry studies on the subject, and the McKinsey report on big data is an exceptional read (the graphic above is derived from the McKinsey Global Institute’s study on big data) – but all of them spend an extensive amount of time making the case for big data technologies, and not enough time, in my opinion, on the business rationale that makes it inevitable for an organization to invest in big data.

So in my understanding of the space, what are some of these elements of business rationale that support investment into big data? (I have to qualify my statement: these would apply to a typical large organization that already has a well-established RDBMS or traditional-data-based infrastructure. For a start-up, using big data technologies for one’s data infrastructure is a no-brainer decision. The question of rationale comes up when an organization has already invested considerably in traditional data and where the adjustment to introduce big data technologies into the overall ecosystem is not going to be trivial.)

There are 6 specific areas where I have been able to find a sound business rationale for investing in big data. These are:
1. Reducing storage costs for historical data and allowing data to be retained for extended periods and making it readily accessible
2. Where significant batch processing is needed to create a single summarized record (for different downstream business decisions)
3. When different types of data need to be combined, to create business insight – or rather to get slightly more specific, to create a single summarized customer-level record
4. Where there are significant parallel processing needs
5. Where there is a need to have capital expenditure on hardware scale with requirements
6. Where there are significant data capture and storage needs

In subsequent posts, I will make these different elements of business rationale tangible through specific business situations.

Tuesday, January 1, 2013

Back to the blog and de-hyping Big Data

Getting back to writing this blog after a really long time. What happened in the middle? Well, I got lazy and I got somewhat busy.

But obviously the world has not stood still over this period. The topic of my previous set of posts, Tesco, came to the US and has now declared its intention to leave. Big Data has become, well, really big, though my skepticism is still quite intact. And then there was the small matter of a presidential election where analytics and data modeling really came into their own.

So there’s plenty to catch up on after my inactivity of the past several months. But hey, New Year resolutions are there for a reason and so it is my commitment to be a lot more regular and disciplined about my posts.

My first post is on big data, as this is clearly going to be an important part of analytics and related infrastructure for the next several years. As you readers probably know, I started out from a place of a little bit of skepticism. My understanding has evolved a little bit over the last few months and I think I am in a much better place to articulate, primarily for myself, why big data does make sense – mostly in a business sense and less in a purely tech-geeky sense. So I will try and do that over the next few posts.

But let me first start off with a reference to a post by Bill Franks, Big Data evangelist and Chief Analytics Officer for Teradata Alliances. Bill has spoken about big data extensively and most recently, has mused whether big data is all hype.

His take is interesting, in that he does not think the big data story is built on an empty premise. There are genuine underlying business problems that need solving and genuine underlying technologies that provide a set of viable options to solve the problems. But he does believe that there is a multitude of technology options coming out – almost on a daily basis – and that a shakeout amongst the players is imminent. Also, organizations will realize that just installing a Hadoop cluster is not the Big Data destination. The destination is an analytics and data infrastructure solution that is fast, cheap and scalable, which does exist today, but which is “potential that can only be extracted with a concerted, focused, and intelligent effort”.

My own quest has been to define for myself why big data makes sense from a business standpoint, especially for a big Fortune 500 company, with the underlying assumption that there are different sets of economic motivators for big organizations vs. start-ups. I have been trying to educate myself through building up a detailed understanding of the underlying technologies, speaking to industry experts and practitioners and attending industry seminars. I will share my findings over the next few posts.