One of the cardinal principles of predictive analytics is that your analysis is only as good as the data you build it on. An equally important principle is that the data handling processes themselves must be well managed and largely free of error.
Recently a set of incidents came to light that showed the damage careless data handling processes can cause. They occurred in the field of cancer research, which makes the human cost of such mistakes painfully clear. One of the popular recent techniques for analyzing gene-level data in cancer research is microarray analysis. A primer on what this analysis involves can be found here.
Duke University cancer researchers promised revolutionary new treatments for cancer. But when patients actually enrolled in trials, the results were disappointing. Then the truth came out: the analysis was done wrong, and the reported findings stemmed from elementary errors in data handling by the researchers. Two statisticians, Baggerly and Coombes, who had to reverse engineer the analytical approach used, concluded that a handful of simple errors had produced the wrong conclusions.
A few takeaways for a data scientist:
1. Data handling scripts and processes need to be checked and double-checked. Dual validation, also known as a parallel run, is a well-known technique: two independent analysts or systems process the same input data, and the outputs are compared to confirm they match.
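The parallel-run idea above can be sketched in a few lines of code. This is a minimal illustration, not a reference to any pipeline from the Duke case: it assumes a hypothetical data-handling step (mean-centering per-sample expression values), implements it twice independently, and flags any sample where the two implementations disagree.

```python
import statistics

def center_v1(values):
    """First implementation: subtract the mean, computed explicitly."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def center_v2(values):
    """Second, independently written implementation of the same step."""
    mu = statistics.mean(values)
    return [v - mu for v in values]

def parallel_run_check(data, tol=1e-9):
    """Run both implementations on the same input; return IDs that disagree."""
    mismatches = []
    for sample_id, values in data.items():
        a, b = center_v1(values), center_v2(values)
        if any(abs(x - y) > tol for x, y in zip(a, b)):
            mismatches.append(sample_id)
    return mismatches

# Hypothetical sample data keyed by patient ID.
samples = {"patient_01": [2.0, 4.0, 6.0], "patient_02": [1.5, 1.5, 3.0]}
print(parallel_run_check(samples))  # an empty list means the two runs agree
```

The value of the check comes from the independence of the two implementations: a bug (an off-by-one row shift, a mislabeled column) is unlikely to be reproduced identically in both.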
2. Data handling needs to be well documented. The approach used to arrive at a set of significant findings can never be shrouded in mystery, whether intentionally or through sloppy documentation. At best, opacity gives the appearance of slipshod, careless work; at worst, it looks like deliberate deception. Neither impression is a good one to make.
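One lightweight way to make data handling auditable is to log each processing step with fingerprints of its input and output, so the chain of transformations can be retraced later. The sketch below is a hypothetical illustration of that idea using only the standard library; the step names and data are invented for the example.

```python
import hashlib
import json

def digest(obj):
    """Stable short fingerprint of a data object via its JSON form."""
    blob = json.dumps(obj, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

class ProvenanceLog:
    """Record each data-handling step with input/output fingerprints."""

    def __init__(self):
        self.steps = []

    def apply(self, name, func, data):
        before = digest(data)
        result = func(data)
        self.steps.append({"step": name, "input": before, "output": digest(result)})
        return result

# Hypothetical usage: each transformation goes through the log.
log = ProvenanceLog()
raw = [3.0, 1.0, 2.0]
cleaned = log.apply("sort_values", sorted, raw)
print(cleaned)            # [1.0, 2.0, 3.0]
print(len(log.steps))     # 1
```

Even a record this simple answers the question Baggerly and Coombes had to reconstruct by hand: exactly what was done to the data, and in what order.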
A summary presentation from Baggerly and Coombes about this issue can be found here.