It is rare to find a piece in the media nowadays that doesn't have a certain "view" on the important social and economic issues of today. Underlying every opinion piece is ideology of some sort. Slavish commitment to the ideology results in the writer typically producing such a biased view that it only appeals to those who are already prejudiced with the same view. It has become extremely difficult (especially with the evolution of the Internet and with blogs) to reach an informed and balanced view on a subject by referencing an authoritative piece on the subject.
Which is why I was pleasantly surprised to come across this piece in the Washington Post today. The piece, by Gregory Clark, a professor of economics at UC Davis, presents a view that many of us are afraid to admit: that the US will very soon be forced to confront a reality where the technological advances in the economy today create their own Haves and Have-Nots. And the chasm between the Haves and Have-Nots would be so huge and so impossible to bridge that the government will be forced to play an equalizing role, so that the social order remains more or less intact. So how is new technology creating this chasm? More importantly, for me and my readers, what are the kinds of knowledge-worker jobs that are going to be valued in the twenty-first century?
The last fifty years of the second millennium have been marked by the emergence of the computer. A machine designed to do millions of logical and mathematical operations in a fraction of a second, the computer has now started to take over a vast majority of the computing and logical thinking that human beings would usually perform. With the ability of the computer (through programming languages) to execute long sequences of operations at high speed, the end result is a powerful "proxy" intelligence that can be harnessed to do both good and harm. And this proxy intelligence is taking the place of traditional intelligence, that is, the role performed by human beings in society. And this intelligence comes without moods or expectations of recognition or praise; in fact, without any of the emotional inconsistencies and quirks shown by human beings. No surprise that many of the front-end business processes involving the delivery of basic and transactional services to consumers are being taken over by the computer (the ATM is one example). With the computer becoming an increasingly integral part of the economy, I see two kinds of jobs that knowledge-workers can embrace in this economy. I am going to cover one of these roles in this post and the second in the next post.
The first role is that of the accelerator towards an increasing automation of simple business processes. The cost benefit of the computer over human beings is obvious; however, in order for the computer to perform even in a limited way like human beings, detailed instruction sets with logical end-points at each node need to be created. It requires the imagination and creativity of the human mind to do this programming in a really effective manner - i.e. the computer actually being able to do what the human being in the same position would have been able to do. Also, it requires human ingenuity to engineer the machine to do this efficiently - within the desired speed and operating cost constraints. This role of an accelerator or an enabler of the "outsourcing" of hitherto human-performed activity to machines will be increasingly in demand over the next 10-15 years.
This role will require a unique mix of skills. First and foremost, the role requires a detailed understanding of business processes, the roles played by the various players, and the inputs and outputs at various stages. The business process understanding needs to span multiple companies and industries. Let's take something that Clark refers to in his article: a change to a flight reservation. The business process calls for not just access to the reservations database and the flights database, but also things like changing meal options, providing seating information (with information about the aircraft seating chart), reconfirming the frequent flyer account number, etc. It also calls for providing payment options if a fee is involved.
Second, the role requires the ability to understand the capabilities of IT platforms and packages to be able to perform the desired function. This skill actually has two components. One is the mapping of human actions into the logic understood by a computer system. The second is the system architecture/engineering side, which is the configuration of the various building blocks (composed of different IT "boxes" delivering different functionality) to create an end-to-end process delivery capability. Given the lack of standards for these types of solutions, deep skills in this area involve understanding the peculiarities of specific solutions in great detail.
I'd love to hear more from readers on this. Have you seen these roles emerging in your industry? What other types of skills does the enabler or accelerator role need?
A place where models, software engineering and the science of decision making come together to make the future a better place
Sunday, August 9, 2009
Monday, August 3, 2009
Why individual level data analysis is difficult
I recently completed a piece of data analysis using individual level data. The project was one of the more challenging pieces of analysis I have undertaken at work, and I was (happily, for myself and everyone else who worked on it) able to turn it into something useful. And there were some good lessons learned at the end of it all, which I want to share in today's post.
So, what is unique and interesting about individual level data and why is it valuable?
- With any dataset that you want to derive insights from, there are a number of attributes about the unit being analyzed (individual, group, state, country) and one or more attributes that you are trying to explain. Let's call the first group predictors and the second group target variables. Individual level data has a much wider range of values for both predictor and target variables. There is also a much wider range of interactions between these various predictors. For example, while on average older people tend to be wealthier, individual level data reveals that there are older people who are broke and younger people who are millionaires. As a result of these wide ranges of data and the different combinations of high, medium and low values across variables (H-L, M-M, M-H, L-H ... you get the picture), it is possible to understand fundamental relationships between the predictors and the targets, as well as interactions between the predictors. Digging a little deeper into the age vs wealth data, what this might tell you is that what really matters for your level of wealth is your education level, the type of job you do, etc. This level of variation is not available with group level data. In other words, group level data is just not as rich.
- Now, along with the upside comes the downside. The richness of the individual level predictors means that the data is occasionally messy. What is messy? Messy means having wrong values at an individual level, and sometimes missing or null values at an individual level. At a group level, many of the mistakes average themselves out, especially if the errors are distributed evenly around zero. But at the individual level, the likelihood of errors has to be managed as part of the analysis work. With missing data, the challenge is magnified. Is missing data truly missing? Or did it get dropped during some data gathering step? Is there something systematic to missing data, or is it random? Should missing data be treated as missing or should it be imputed to some average value? Or should it be imputed to a most likely value? These are all things that can materially impact the analysis and therefore should be given due consideration.
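To make the missing-data questions above concrete, here is a minimal sketch in Python. It is an illustration only, not the approach used in the project: the records, the field names and the helpers `missing_rate` and `impute` are all hypothetical. It measures how much of a field is missing, then fills the gaps with either the mean or the median of the observed values while flagging which rows were imputed.

```python
from statistics import mean, median

# Hypothetical individual-level records; None marks a missing income.
records = [
    {"age": 25, "income": 40000},
    {"age": 32, "income": None},
    {"age": 47, "income": 90000},
    {"age": 51, "income": None},
    {"age": 63, "income": 50000},
]

def missing_rate(rows, field):
    """Share of rows where `field` is missing."""
    return sum(1 for r in rows if r[field] is None) / len(rows)

def impute(rows, field, strategy="mean"):
    """Return copies of the rows with missing `field` values filled in.

    strategy is "mean" or "median" of the observed values. A flag column
    records which rows were imputed, so the choice can be revisited later.
    """
    observed = [r[field] for r in rows if r[field] is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    filled = []
    for r in rows:
        was_missing = r[field] is None
        new_row = dict(r)
        new_row[field] = fill if was_missing else r[field]
        new_row[field + "_was_missing"] = was_missing
        filled.append(new_row)
    return filled

rate = missing_rate(records, "income")
by_mean = impute(records, "income", "mean")
by_median = impute(records, "income", "median")
```

Keeping the flag column is a deliberate choice in this sketch: it lets you check later whether the imputed rows behave differently from the observed ones, which is one way to probe whether the missingness is systematic or random.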
Now to the analysis itself. What were some of my important lessons?
- Problem formulation has to be crystal clear and that in turn should drive the technique.
Problem formulation is the most important step of the problem solving exercise. What are we trying to do with the data? Are we trying to build a predictive model with all the data? Are we examining interactions between predictors? Are we studying the relationship between predictors one at a time and the target? All of these outcomes require different analytical approaches. Sometimes, analysts learn a technique and then look around for a nail to hit. But judgment is needed to make sure the appropriate technique is used. The analyst needs to have the desire to learn a technique that he/she is not yet familiar with and, by the same token, the discipline to use a simpler technique where appropriate.
- Spending time to understand the data is a big investment that is completely worth it.
You cannot spend too much time understanding the data. Let me repeat that for effect. You cannot spend too much time understanding the data. And I have come to realize that far from being drudgery, understanding the data is one of the most fulfilling and value-adding pieces of any type of analysis. The most interesting part of understanding data (for me) is the sheer number of data points that are located far away from the mean or median of the sample. So if you are looking at people with mortgages and the average mortgage amount is $150,000, the number of cases where the mortgage amount exceeds $1,000,000 lends a completely new perspective on the type of people in your sample.
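As a rough illustration of the mortgage example, the sketch below compares the mean and median of a sample and counts the cases above a threshold. The amounts are made up for illustration and `tail_summary` is a hypothetical helper, so treat it as a starting point rather than a recipe.

```python
from statistics import mean, median

# Hypothetical mortgage amounts in dollars; the last two are the large cases.
mortgages = [90_000, 120_000, 135_000, 150_000, 160_000, 175_000,
             1_200_000, 2_500_000]

def tail_summary(values, threshold):
    """Summarize the center of a sample and how much of it lies above threshold."""
    above = [v for v in values if v > threshold]
    return {
        "mean": mean(values),
        "median": median(values),
        "count_above": len(above),
        "share_above": len(above) / len(values),
    }

summary = tail_summary(mortgages, 1_000_000)
```

In this made-up sample the two large mortgages pull the mean far above the median, which is exactly the kind of gap that is worth investigating before any modeling starts.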
- Explaining the results in a well-rounded manner is a critical close-out at the end.
The result of a statistical analysis is usually a set of predictors which have met the criteria for significance. Or it could be a simple two-variable correlation that is above a certain threshold. But whatever the results of the analysis, it is important to ground them in real-life insights that can be understood by the audience. So, if the analysis reveals that people with large mortgages have a higher propensity to pay off their loans, further clarification will be useful around the income level of these people, their education levels, the types of jobs they hold, etc. All these ancillary data points are ways of closing out the profile of the "thing" that has been revealed by the analysis.
- Speed is critical to get the results in front of the right people before they lose interest.
And finally, if you are going to spend a lot of everyone's (and your) precious time doing a lot of the above, the results need to be delivered in extra-short time for people to keep their interest in what you are doing. In today's information-saturated world, it only takes the next headline in the WSJ for people to start talking about something else. So the analysis needs to be done in a smart manner, and it also needs to be super-fast. Delivered yesterday, as the cliche goes.
In hindsight, it gives me an appreciation of why data analysis or statistical analysis using individual level data is one of the more challenging analytical exercises. And why it is so difficult to get it right.