Question

I am running an experiment on Azure ML. While preprocessing my data, I see an option to clean missing data using either PCA or MICE.

Please provide me an example of how I can decide on which option to choose.


Solution

I don't know about Azure ML. But:

PCA is principal components analysis. It takes a dataset and "rotates" it, taking the original axes defined by the original variables and creating new axes that are linear combinations of those variables. The precise linear combinations are chosen such that each successive component captures as much of the remaining variance as possible along its new dimension, while staying orthogonal to the earlier components. A quick Google search turns up lots of tutorials.
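To make the "rotation" concrete, here is a minimal sketch of PCA using only NumPy's SVD. The toy data and all variable names are made up for illustration, and this is not what Azure ML does internally; it only shows the idea described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): 200 rows, 3 variables,
# where the first two are strongly correlated and the third is noise.
z = rng.normal(size=(200, 1))
X = np.hstack([
    z + 0.1 * rng.normal(size=(200, 1)),
    2 * z + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Center each variable, then take the SVD of the centered data.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt                           # each row is one new axis
scores = Xc @ Vt.T                        # the data expressed on the new axes
explained_variance = s**2 / (len(X) - 1)  # variance along each new axis

print(components)
print(explained_variance)
```

Each row of `components` is a linear combination of the original variables, and `explained_variance` comes out sorted from largest to smallest, which is the "each successive component maximizes variance" property.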

Here is a snippet of Hastie & Tibshirani's lecture on PCA: https://www.youtube.com/watch?v=ipyxSYXgzjQ

MICE is "multiple imputation by chained equations". Basically, missing data is predicted from observed data, using a sequential algorithm that is allowed to proceed to convergence. (1) Start by filling in the missing data with plausible guesses at what the values might be. (2) For each variable in turn, predict its missing values from a model fitted to its observed values as a function of the other variables. At each step, update the predictions of the missing values. There are many tricky details, and many online tutorials. Here is an article aimed at biostat practitioners:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/
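The two MICE steps described above can be sketched roughly as follows. This is a simplified, deterministic illustration and not full MICE: it uses plain linear regression for every variable, produces a single completed dataset rather than multiple imputations, and adds no random draws. The function name `chained_imputation` and the toy data are hypothetical, and the sketch assumes every column has at least some observed values.

```python
import numpy as np

def chained_imputation(X, n_iter=10):
    """Simplified chained-equations imputation (hypothetical helper, not full MICE)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Step 1: initial guesses, here the mean of each column's observed values.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.nonzero(missing)[1])

    # Step 2: cycle through the variables, re-predicting the missing entries.
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            # Design matrix: intercept plus all other (currently filled) columns.
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            # Fit on rows where column j was observed, predict where it was missing.
            coef, *_ = np.linalg.lstsq(A[~miss_j], filled[~miss_j, j], rcond=None)
            filled[miss_j, j] = A[miss_j] @ coef
    return filled

# Toy data (made up): the second column is roughly 2 * first + 1.
rng = np.random.default_rng(0)
x0 = rng.normal(size=50)
data = np.column_stack([x0, 2 * x0 + 1 + 0.05 * rng.normal(size=50)])
data_missing = data.copy()
data_missing[:5, 1] = np.nan    # knock out some values in column 1
data_missing[10:15, 0] = np.nan # and some in column 0

completed = chained_imputation(data_missing)
print(completed[:5])
```

A real MICE implementation would draw imputed values from a predictive distribution and repeat the whole procedure to produce several completed datasets, which is where the "multiple" in the name comes from.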

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange