One widely used data exploration technique is Principal Component Analysis (PCA), a statistical method that re-expresses the correlated variables in a data set as a smaller set of uncorrelated components. Social scientists use it to find questionnaire answers that share a common pattern and to eliminate skewed answers. At iHub Research's Data Science Lab, we primarily use the technique for outlier detection.
During an analysis exercise, we noticed that one variable appeared as an outlier when we ran PCA on the whole dataset, but clustered with the other attributes when we used a sample of the data. This meant the conclusion that the variable is an outlier does not hold when the data is split. This worried us, so we set out to find a way of determining whether a conclusion is likely to change as the amount of data increases or decreases.
Luckily, someone faced a similar problem back in the 1960s. Nathan Mantel, a biostatistician at the National Cancer Institute in the US, wanted to know whether reported cases of leukemia were related. In a 1967 paper he described a method, now known as the Mantel test, for checking whether cases that were close together in space also tended to be close together in time. The test does this by measuring the correlation between two distance matrices. That gave us the idea of applying the same test to the Principal Component Analysis results.
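To make the idea concrete, here is a minimal sketch of a permutation-based Mantel test. It is written in Python using only the standard library, not the R we actually used, and the function names (`upper_triangle`, `pearson`, `mantel_test`), the one-sided comparison and the default of 999 permutations are all assumptions made for illustration.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def mantel_test(dist_a, dist_b, permutations=999, seed=0):
    """Return (correlation, one-sided p-value) for two distance matrices.

    Null hypothesis: the two matrices are unrelated. A small p-value
    therefore suggests that they describe the same structure.
    """
    n = len(dist_a)
    rng = random.Random(seed)
    flat_a = upper_triangle(dist_a)
    observed = pearson(flat_a, upper_triangle(dist_b))

    hits = 1  # the observed arrangement counts as one permutation
    order = list(range(n))
    for _ in range(permutations):
        rng.shuffle(order)
        # Relabel the rows and columns of dist_b with the same permutation.
        permuted = [[dist_b[order[i]][order[j]] for j in range(n)]
                    for i in range(n)]
        if pearson(flat_a, upper_triangle(permuted)) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)
```

In this sketch the smallest p-value that can be reported with 999 permutations is 1/1000 = 0.001, because the observed arrangement is always counted once.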
We wrote a script in the R statistical programming language that splits the dataset into several parts, performs a PCA on each, then applies the Mantel test to check whether the results remain consistent across the subsets. The diagram above shows the p-value for the null hypothesis (the two sets of results are unrelated) against the number of data points used.
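The split-and-PCA part of such a script could look roughly like the sketch below. It is not our R code: it is a dependency-free Python illustration, and every name in it (`split_rows`, `correlation_matrix`, `principal_axes`, `loading_distances`) is hypothetical. It also assumes that the "result" of a PCA is summarized as the distances between variables in the space of their first two loadings, and that the data is split into consecutive blocks of rows. Each subset then yields one distance matrix, and any two of those matrices can be compared with a Mantel test.

```python
import math
import random


def split_rows(rows, parts):
    """Split rows into consecutive, nearly equal subsets."""
    size = math.ceil(len(rows) / parts)
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def correlation_matrix(rows):
    """Correlations between the columns (variables) of equal-length rows."""
    n, p = len(rows), len(rows[0])
    cols = [[row[j] for row in rows] for j in range(p)]
    means = [sum(col) / n for col in cols]
    centered = [[x - m for x in col] for col, m in zip(cols, means)]
    norms = [math.sqrt(sum(x * x for x in col)) for col in centered]
    return [[sum(a * b for a, b in zip(centered[i], centered[j]))
             / (norms[i] * norms[j])
             for j in range(p)] for i in range(p)]


def principal_axes(matrix, k, iterations=500, seed=0):
    """Top-k (eigenvalue, eigenvector) pairs of a symmetric matrix,
    found by power iteration with deflation."""
    rng = random.Random(seed)
    p = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        v = [rng.random() for _ in range(p)]
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(work[i][j] * v[j] for j in range(p))
                         for i in range(p))
        pairs.append((eigenvalue, v))
        # Remove the component just found so the next one can emerge.
        work = [[work[i][j] - eigenvalue * v[i] * v[j] for j in range(p)]
                for i in range(p)]
    return pairs


def loading_distances(rows, k=2):
    """Distances between variables in the space of their first k loadings."""
    pairs = principal_axes(correlation_matrix(rows), k)
    p = len(rows[0])
    # Loading of a variable on a component: eigenvector entry * sqrt(eigenvalue).
    coords = [[vec[i] * math.sqrt(max(val, 0.0)) for val, vec in pairs]
              for i in range(p)]
    return [[math.dist(coords[i], coords[j]) for j in range(p)]
            for i in range(p)]
```

A variable that behaves as an outlier sits far from the others in this loading space, so a change in its distances from one subset to the next is what the Mantel comparison would pick up. Distances are used rather than the raw loadings because they do not change when a component flips sign between runs.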
When subsets with close data points are compared, the p-value stands at 0.001, which means we reject the null hypothesis that the two sets of results are unrelated and adopt the alternative hypothesis: the conclusions are the same. However, when the subsets are far apart, the p-value rises exponentially, indicating that the conclusions are no longer the same. At this point, we were happy that we could at least pinpoint whether the conclusion is bound to change, and at which point. Next, we applied the same code and technique to several currency pairs traded on the foreign exchange market. Here's the graph produced.
As observed, the p-value does not change, which means the conclusion drawn from the PCA does not change as more data points are added. This was particularly interesting, and we concluded that there are two types of datasets: those with dynamic variation and those with static variation.