Big Data Analytics: May 2011

I shared a link on Principal Components Analysis (PCA) a while back, and recently had the opportunity to revisit this space, or rather to visit it in a professional capacity.

While reading up on the subject, I came across a few interesting links - one of the better tutorials is here. The primary purpose of PCA is dimensionality reduction, which makes analysis more efficient. Typical applications have been in the area of image processing, though of late there has been a lot of interest in applying these techniques to microarray data.

Some of the historical applications of PCA have been in the field of statistical process control (SPC). The genesis of the application came from the chemical industry, and the early practitioners were, interestingly, known as chemometricians. The aim was to model plant yield as a function of its input parameters. The input parameters were typically temperatures and pressures recorded at different points in the reactor vessel and at different points in time. Since the plant operator has control over some of these parameters, they can be varied in order to improve the plant yield.

One complication is the sheer complexity of the data involved. When a process has hundreds of inputs (temperature, pressure, gradients, energy released, moisture content - all captured by hundreds of sensors embedded within the reactor), it becomes difficult to build any meaningful model given the limited number of observations available. What helps is that many of these input variables are highly correlated. The temperature at the entry point of the reactor feed is obviously going to be correlated with the temperature at the center of the reactor vessel. PCA can be used to reduce the dimensionality of the inputs, so that the outputs are modeled as a function of the principal components rather than the raw input variables. In essence, PCA reduces the hundreds of correlated inputs to a few principal components (typically 3 or 4), each of which is a linear combination of the raw inputs.

The other application here is the monitoring of these reactions. When the operator runs different reactions with different input parameters, it is important to identify 'outliers': runs where the inputs have been so far away from the norm that the outputs need to be flagged or, in some cases, discarded entirely.

Some more details on the application of these techniques can be found here. The link goes to a paper on Principal Component Techniques by Robert Rodriguez from SAS.
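To make the two uses above a little more concrete, here is a rough sketch in Python using only NumPy. Everything in it is an assumption made for illustration: the data is synthetic (60 runs, 100 correlated "sensors" driven by 3 hidden factors), the helper names `fit_pca` and `hotelling_t2` are made up, and the outlier limit is simply the 99th percentile of the T-squared values of past runs rather than a proper control limit. It is not taken from the paper mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for reactor data (invented for illustration):
# 60 runs, 100 sensors driven by 3 hidden factors, so readings are correlated.
n_runs, n_sensors, n_factors = 60, 100, 3
mixing = rng.normal(size=(n_factors, n_sensors))
factors = rng.normal(size=(n_runs, n_factors))
X = factors @ mixing + 0.1 * rng.normal(size=(n_runs, n_sensors))
y = factors @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n_runs)


def fit_pca(X, k):
    """Return column means, loadings (n_features x k), score variances,
    and the fraction of total variance captured by the first k components."""
    mean = X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions.
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    loadings = Vt[:k].T
    variances = s[:k] ** 2 / (X.shape[0] - 1)
    explained = np.sum(s[:k] ** 2) / np.sum(s ** 2)
    return mean, loadings, variances, explained


def hotelling_t2(X, mean, loadings, variances):
    """Distance of each run from the centre, measured in the component space."""
    scores = (X - mean) @ loadings
    return np.sum(scores ** 2 / variances, axis=1)


k = 3
mean, loadings, variances, explained = fit_pca(X, k)

# Model the yield on the k component scores instead of the 100 raw inputs.
scores = (X - mean) @ loadings
design = np.column_stack([np.ones(n_runs), scores])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
residual = y - design @ coef
r2 = 1.0 - np.sum(residual ** 2) / np.sum((y - y.mean()) ** 2)

# Monitoring: flag new runs whose T^2 exceeds a limit taken from past runs.
# The 99th percentile is an arbitrary choice for this sketch.
limit = np.percentile(hotelling_t2(X, mean, loadings, variances), 99)
normal_run = np.array([[0.2, -0.1, 0.3]]) @ mixing
abnormal_run = np.array([[10.0, 0.0, 0.0]]) @ mixing
new_runs = np.vstack([normal_run, abnormal_run])
flags = hotelling_t2(new_runs, mean, loadings, variances) > limit

print(f"variance explained by {k} components: {explained:.3f}")
print(f"R^2 of yield regressed on scores: {r2:.3f}")
print("flagged as outliers:", flags)
```

With real plant data the number of components would be chosen by looking at the explained variance, and the monitoring limit would more likely come from a statistical distribution than from a raw percentile; both choices here are just the simplest thing that shows the idea.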

These applications can be extended to other areas as well. In consumer behaviour modeling, PCA can be used to reduce the hundreds of different inputs about a consumer to the essential principal components and these can then be used to simplify the modeling and monitoring processes.

Monday, May 30, 2011

Interesting uses of Principal Components Analysis

Wednesday, May 18, 2011

The Heritage Prize
