Big Data Analytics

Tuesday, June 9, 2015

Why your predictive analytics solution might just not work

I got a fair bit of interest and comments with my previous post on how the predictive analytics journey can feel a bit like a roller coaster.

To summarize, the four main phases in building a predictive solution are
- Data munging => Sorrow
- Getting the first model output => Joy
- Realizing that the model still needs a lot of work => Sorrow
- Applying a seemingly imperfect model and seeing impressive customer or business impact => Joy

I also want to add a fifth phase to this - which is deploying the solution. Which in itself can be even more challenging and frustrating than actually building the solution.

In many organizations, the systems supporting the development of the predictive solution and the systems required to implement or deploy the solution are quite different. Analytical data warehouses (mostly the foundation for analytic model builds) are often optimized to generating insights and capturing additional insights for others to leverage. Therefore they contain lots of transformed variables and data. When these analytic stores are used to build models, these transformed variables invariably make their way into the model. Operational systems are optimized for speed of transaction processing and are therefore quite removed from the analytical data. The picture looks somewhat like the one below.

Fig 1: How analytic and operational data can diverge from source systems

This disconnect between analytic systems and operational systems can be a BIG problem when it comes to monetizing the analysis. When you go and try deploy the cool analytical model you just built, it is either really difficult to productionalize - or even worse, it cannot be implemented without considerable watering down.

This could be because:
- The transformations in the analytic data need to be re-done using the operational data, and that introduces delays in the deployment process
- Some of the data elements available in the analytic store might be altogether missing in the operational store

So that's delay in either getting to market or a fraction of the true value gets realized.

As a person who passionately believes in how data and analytics can change consumers life for the better, this is the most frustrating trough in the overall journey. Interestingly this last trough is a massive blind spot in the cognition of several well meaning predictive modelers. Unless you carry the battle scars of trying to deploy these models in production, it is often to even conceptualize that this gap can exist.

Got to remember this though. Unless you are able to deploy your solution to a place where it touches real people, things really don't matter a whole lot. My next post will cover how to close some of these gaps.

Monday, June 1, 2015

Journey through building a Predictive Analytics solution

I have now spent nearly 10 years building predictive models. These have ranged from some detailed segment or population level models using Excel to cutting-edge individual level models for credit risk, response propensity, life-events predictions and so on using statistical packages in R and Python, and back in the days, using SAS. At some point in the last ten years, I also did a foray into text analytics and trying to make sense out of unstructured data.

Building predictive models is tough and takes tme and energy. It is also emotionally consuming and exhausting. Over the years, I have been able to identify four distinct phases through the build-out of any model where I have gone through alternate cycles of depression and joy. Thought I’d share some of that and see whether these cycles resonated with others in the same line of work. And it starts to look somewhat like the picture below.

Phase 1: “Getting the data is really REALLY hard!!!!”

The first phase of depression happens roughly for the first third of any project. You are all excited about this really cool customer problem or business challenge you have in front of you and you have just discovered the perfect predictive modeling solution for the problem. Now all you need to do is to get the data and you will be off to Predictive Modeling Superstardom.

Except that it never happens that way. Data is always messy and needs careful and laborious curation. And if you take shortcuts and are messy about how you handle and manage the data, it almost always comes back to extract its pound of flesh. I am sure there is that phase in the life of every model that you are just frustrated at the amount of time and effort it takes to get data right and even then you are not entirely sure whether you got the data right.

Phase 2: “WOW! I actually built a model and managed to create some predictions!”

The first light at the end of the tunnel happens when you have managed to get the data right, set up the modeling equation correctly (after several attempts at hacking away at the code) and actually ran the model to produce some predictions. And the predictions by-and-large look good and seem to actually make sense when you compare some of the standard metrics such as precision and recall, as well as when create deciles of twenty-tiles of the prediction range and are able to see some decently good predictions! That feeling of excitement and joy is amazingly good.

Phase 3: “Well, actually my predictions aren’t that good!”

The next phase of low happens when you try and examine your predictions at an individual level and discover that - by and large - your predictions are not very accurate at all. Overall, the model seems to work well but at an individual level, the predictions are really off. There are a few cases where the model absolutely nails the prediction but in nearly 60-70% of the cases, the prediction is off in either direction. Not catastrophically off but then off enough to cause some anxiety to the perfectionist in you.

Phase 4: “Phew! Actually the model didn’t do too badly”

Then you actually take the model and apply it to the problem you are trying to solve. So maybe you are looking at customer calls transcripts and trying to predict the likelihood of a follow-up call. Or you are looking at thousands and thousands of loan data and trying to predict the probability of default. Or what a digital experience adoption propensity is likely to look like around thousands of visitors. (Of course, I am assuming all along that the overall problem was structured correctly and the problem you are solving is indeed something worth solving, etc.)

Based on the model, you are hoping to take one or the other action. And then you find that the model actually makes a difference, that you have been able to create happier customers or more profitable loans or an overall better adoption of a digital experience. This feeling - that the predictive modeling solution you built - made an impact in the real world is absolutely spectacular.

I have had these moments of great satisfaction at the end of several such initiatives and there have been certainly situations when things have not exactly followed that script. In my next post, I will talk about the steps that you can take as a predictive modeler, that lead to great outcomes and certain other things that lead to less satisfactory outcomes.

Saturday, October 18, 2014

Learnings from the Strata 2014 Conference

After three hectic days at the Strata conference trying to appreciate the poetry, I am on my way back on the Acela from New York to DC. There was tons to learn from the conference and words can only do so much justice but there is a set of learnings I want to share from my perspective. Caveat: These are all colored by my knowledge, my personal context, my organizational context but a lot of learnings are things that I am sure are going to resonate with a lot of people. Also another caveat that there is no neat structuring of what I am going to share, so treat it as such. So here goes:

1. Map Reduce as we know it is already behind us

MapReduce as a specific set of technologies written in Java (not as an overall philosophy, as indeed, MapReduce has become a philosophy very similar to Agile) is already behind us. Now we had MapReduce 2.0 come out late last year and it has been an improvement definitely on MapReduce 1.0. But when it comes to large-scale ingestion of data and making it usable, the mainstream has shifted to Apache Spark. What is surprising is that Spark as a technology is fairly new and not very stable. But the pace of technology evaluation is such that people are finding use for Spark in a number of really relevant and creative ways. And in 3 years, technologies like Spark will replace what MapReduce almost entirely. (Even though some people are going to argue there is a place for both)

2. Using BigData tools vs investing in custom development on agile technologies is an important decision

With the emergence of the open source software movement and also the ability to easily share software, learning, approaches using a number of internet based platforms , it is no surprise that a lot of startups see open source as an easy way to bootstrap their product development. Over the years, open source software is becoming the norm for driving product development and data infrastructure creation within almost all tech and digital industry leaders.

With the Cambrian explosion of product development in the data space, a lot of the products being released are tools or building blocks that then allow efficiencies around data processing and data pipeline. So an organization that needs to harness and use BigData for its day to day needs has this very important decision in front of them. Should they be doing custom development on the generic open source technologies and therefore allow their solutions to evolve along with the underlying generic technology, or should they bring in third party tools for important parts of their data processing? (This is a variant of the classic Build vs Buy question, but has some nuances because of the open source explosion.)

Each decision comes with its pros and cons. Working with tools improves speed to market, but then forces the buying organization to use a set of constraints that a tool is likely to impose on them. Working on generic technologies removes this dependency and allows for natural product evolution, but this then comes at the cost of development time and lower speed to market, potentially higher costs. And these are not easy decisions. My specific observation here was around how my organization has chosen to ingest data into its HDFS environment. Should we be doing custom development using some of the open source data ingestion frameworks such as Apache Flume or Storm, or should we use a product that comes with a number of desirable features out-of-the-box like Informatica? These are not easy decisions and I think the whole Build vs Buy decision on BigData needs its own blogpost.

3. Open source is here to stay

I think I might have said this before but open source is here to stay and going through a Cambrian explosion. Enough said on that!

4. Innovation to new and dynamic technologies needs to be multi-threaded

As relative late adopters on to the BigData platform, my organization has been following a linear and established path to BigData adoption. The goal here has been being able to get to low-hanging fruit with BigData here around cost savings – by taking spend away from investing in RDBMS platforms. It is a perfectly legitimate goal to have and I think we are going about this goal in a very structured manner. But in a world of fast evolving technologies, this focus creates the risk that we end up having a blind spot within the overall ecosystem around other use-cases of the technology. In our case, real-time data use-cases and streaming analytics is a big blind-spot from my vantage point. The risk here is that by the time we achieve the low-hanging fruit by being systematic and focused, we end up losing a lot of ground in other areas and are similarly behind when the next technology wave happens.

So my view here is that we need to be multi-threaded in our technology adoption. We need to have specific goals and be focused on them to make these new technologies mainstream – but at the same time, we need to be aware of other applications of the technology and make sure there are investments in place to build our capabilities on these areas which are not immediate focus. Also, to have a SWAT team working on even newer technologies and ideas that are likely to become mainstream 12 months from now.

Just a smattering of my immediate thoughts from Strata. Like I promised, not very organized but did want to share some of my unvarnished opinions.

Wednesday, October 15, 2014

Third time at Strata and why that's like reading poetry

I am heading up to New York to the 2014 Strata conference. It has become a ritual of sorts to go to Strata in October as I have been going for the last three years.

My first trip was in 2012 where admittedly, I was going as a BigData skeptic. Or put more accurately, I was going in with an open mind about the possibilities of BigData but definitely under-exposed in terms of its capabilities and what organizations and professionals could stand to gain. I walked away with some appreciation of the BigData case studies but many of the examples seemed like applying technology for technology sake.

When I went in 2013, my perspective was certainly more informed. I was going to understand the big data ecosystem in more detail, having spent a considerable amount of time both reading up on the technologies as well as working closely with practitioners in the area. I left the conference with a much wider understanding of the entire HDFS and BigData ecosystem. Props here for Silicon Valley Data Sciences whose presentation was both detailed in both its breadth and depth, a difficult thing to accomplish.

Also, I broadened my own perspective in terms of the kinds of problems that can be solved by BigData. Previously, my field of vision was narrowly focused on business problems. The problem with that lens is that business problems are a reflection of the past - how the business operated yesterday and the challenges that were created in a pure economic sense. What this approach is blind to is the huge world of consumer and human problems that need exploring and solving using BigData. Opening my mind to a whole host of consumer/ human challenges made me aware of the need to harness and harmonize different types of data, the mash-ups and insights created and the different world one could vision for customers. I also had the opportunity within my org to work closely on a classic BigData problem - which was building a holistic view of the customer across functional and product silos. And so working with and talking to people with vastly greater experience and hands-on knowledge made me more informed, allowed me to appreciate even finer nuances in the space and form even more bold customer and business value propositions.

So as I am headed up to New York to attend Strata for the third year, my mind goes to poetry. I have always felt that you need to pass the hump of understanding language to understand the intricacies and beauty within the language. Understand the rhythm and the poetry of the language. And that requires deep study. Countless hours of reading, multiple hundreds of tweets. And hard hard hours of whiteboard sketching, debating ideas, learning from the experts.

And so here's listening to some good poetry over the next three days at Strata. Should be fun.

Monday, October 6, 2014

A/B Testing Part 3 - Building a culture of experimentation

In my previous posts, I have talked about the organizational readiness and the technical preparedness to do online A/B testing effectively. These baseline elements are foundational and need to be in place for any testing and experimentation approach to be successful.

The next stage is actually building a culture of experimentation and testing amongst product creators. There are a number of mental barriers to overcome. I have talked about the need for product managers to start appreciating the need for "testing in the wild" as a useful addition to any prototype or usability testing. Another mental barrier that comes in the way is the fear that a test might amount to nothing and therefore one shouldn't waste valuable dev cycles testing minor improvements and that testing should be "reserved" only for really big changes.

This is a place where it is important to have a schooling in developing product hypotheses and ways in which those hypotheses can be proved (or disproved). In my organization, we spent (and continue to invest) a considerable amount of time discussing the principles of testing and experimentation, and making sure that product managers walk away with a pretty good understanding of the overall "scientific method" - i.e. the need to develop and validate hypotheses through a systematic process.

We spent a good amount of time on the following questions

- What KPI or metric are you trying to influence?

So specifically, which important customer related or business related metric you are trying to impact? This is an important step to focus the experimentation effort on the things that really matter from a product standpoint. So to actually walk through an example, let us say one of the metrics we are going to impact is the "bounce rate" on a website.

- What is your hypotheses on how you can influence the metric?

What are the different ideas that can be employed to lower the bounce rate? What underlying consumer behavior are we trying to change here? By the way, a lot of these hypotheses need to be generated either from data analysis (so, something that shows that repeat visitors have high bounce rates) or from a detailed understanding of customer needs (through techniques like design thinking and empathy interviews). So one could hypothesize that one of the reasons why bounce rate is high is because our website does not effectively recognize repeat visitors on the site. Or that the reason why bounce rates are high is because of too much content on the page. Or that the call-to-action button needs to be of a different color and font to stand out from the page.

One other quick thing to point out. There might be situations where the purpose of the test is basically to generate new behavioral hypotheses and not to necessarily prove existing ones. So take a typical sales funnel with lots of form fields and a few call to action buttons. One could just come up with variants of the font used, the color of the button, the shape and size and test all of them to see which combination of form field size + font + button shape + button color is the most optimal. The results of the test could create a body of learning around what is preferred by customers, which in turn could influence other such funnels in future. This approach is also useful when there is some kind of new technology or feature to be introduced. So imagine doing a banking transaction on a wearable device like Google Glass. Given the newness of the technology, there isn't typically one proven answer and we need to get to the proven answer through experimentation.

- Finally, what is a test one could run in order to test out these hypotheses?

So specifically, what will the control experience look like and what would be the treatment? And by offering the treatment, what is the metric that we are expecting to move or impact? So in the bounce case example, the test could be

It is typical to spend a few weeks or even months just on this part of the journey - which is to inculcate a testing or experimentation culture within the organization. I do want to emphasize the need to get this culture piece right throughout the conversation. It is a known psychological quirk that human beings tend to be far more sure of things that are inherently uncertain. We are "sure" that a customer is just going to fall in love with a product we have built in a certain way, just because it was out idea. It is important to challenge this internal wiring problem, that can get in the way of true knowledge seeking.

Tuesday, September 30, 2014

Real-time analytics infrastructures

As the BigData (Hadoop) explosion has taken hold, architectures that provide analytic access to even larger data for even deeper insights have started becoming the norm in many top data science driven organizations. One of them is the approach used in LinkedIn - the LinkedIn analytic blog is typically rich with ideas on how to approach analytics.

What are some of the challenges that some of these architectures are trying to tackle?

- The need to organize data together from different data sources into one single entity. In most cases, this is typically an individual. In the case of LinkedIn, it is the professional member. For Facebook, one of its 1.2 billion subscribers. For a bank, one of its customers. While the actual integration of the data might sound trivial, the effort involved is highly manual. Especially in legacy organizations like banks that have different source systems (managed by different vendors) performing their billing, transactions processing, bill-pay, etc. the complexity of the data that comes into the bank in the form of raw files can be truly mind-boggling. Think multiple different data transfer formats (I recently came across the good old EBCDIC format), files with specific keys and identities relevant to that system, and so on. These files need to be converted into a common format that is readable by all internal applications, also organized around one internal key.

- Next, the need for the data to be value-enhanced in a consistent manner. Raw data is seldom useful to all users without any form of value-addition. This value addition could be something simple i.e. taking the relationship opening date and converting this into a length of relationship indicator. So, say the relationship opening date is 1/1/2003, the length of relationship is 11 years. Or it could be a more complex synthetic attribute that uses multiple raw data elements and combines them together. An example is credit card utilization, which is the balance divided by available credit limit. The problem with this kind of value enhancement is that different people could choose to do this in different ways. So creating multiple such synthetic attributes in the data ecosystem - which can be confusing to the user. Creating a data architecture which allows these kinds of synthetic attributes to be defined once and then used multiple times can be a useful solution to the problem I just described.

- The need to respond to queries to the data environment within an acceptable time interval. Also known as the service level or SLA that an application demands, any data product needs to meet business or user needs in terms of the number of concurrent users, query latency times. The raw HDFS infrastructure was always designed for batch and not for any real-time access patterns. Enabling these patterns requires the data to be pre-organized and processed through some kind of batch approach - so as to make it consumption ready. That when combined with the need to maintain close to real-time data relevance, means that the architecture needs to use elements beyond just the basic Hadoop infrastructure.

These are just some of the reasons why BigData applications and implementations need to be pay special attention to the architecture and the choice of the different component systems.

Tuesday, September 23, 2014

A/B Testing - ensuring organizational readiness

In my previous post on the subject of A/B testing, I had talked about the need for operational and technical readiness assessment before one embarks on A/B testing. It is essential to make sure that data flows in the overall system are designed well enough to make sure user behavior can be tracked. Also the measurement system needs to be robust enough to not break when changes to the client (browser or mobile) are introduced. In reality, this is achieved by a combination of a robust platform as well as disciplined coding practices while introducing new content.

But equally important as operational/ technical readiness is organizational readiness to embrace Test and Learn. A few reasons why the organization might not be ready (despite mouthing all the right platitudes).

First, inability to recognize the need for unbiased testing in the "wild". A lot of digital product managers tend to treat usability studies, consumer research/ empathy interviews and A/B testing as somewhat interchangeable ideas. Each of these techniques have a distinct use and they need to complement each other. Specifically, A/B testing achieves the goal of evaluating a product or a feature in an environment that is most like what a consumer is likely to experience. There is no one-way mirror, no interviewer putting words in your mouth - it is all about how the product works in the context of people's lives and whether it proves itself out to be useful.

To remedy this, we have had to undertake extensive education sessions with product managers and developers around the value of A/B testing and building testing capability into the product from the get-go. While for a lot of people deep in analytics tend to find testing and experimentation a natural way to go, this approach is not obvious to everyone.

Second, the reason why A/B testing and experimentation is not embraced as it needs to be is risk aversion. There are fears (sometimes justified) that doing something different from the norm is going to be perceived by customers as disruptive to the experience they are used to. Again, this is something that needs constant education. Also, instead of doing a 50/50 split, exposing the new experience only to a small number of visitors or users (running into several thousands but in all, less that a few percentage points of the total traffic a site would see) is the way to go.

Additionally, having a specific "testing" budget agreed upfront and ensuring transparency around how the budget is getting used is an excellent way to mitigate a lot of these mostly well-meaning but also mostly unnecessary concerns.

What do you think about organizational and technical readiness? How have you addressed it in your organization while getting A/B testing off the ground? Please share in the comments area.