Question

I'm dealing with a data set that has some obvious errors in it (e.g. a kid under 1 year old with a $50,000 credit card balance). I can't go through it line by line, as the set has more than 100k lines. Is there any formal work on how to search for these kinds of obvious problems in data sets or, even better, any packages in R? Or should I just start plotting histograms?

Solution

As far as I know there is no such package. What you're asking for seems very specialized; I think you're really looking for anomalies or outliers. It would be cool, though, to have something that regressed each variable on the others and searched for potential extreme outliers (probably not that hard to make).

3 thoughts:

1) Make a scatterplot of variables you'd expect to be connected, such as age and income. Even with 100k lines, that one point (a 1-year-old making 50K) would pop up way away from all the others.

2) Run a regression and look at the diagnostic plots of the model (plot(model) in R). There's some pretty good outlier detection there.

3) Search through the standardized residuals and look for values more than 2, or more likely 3, sd's from zero, using a which() statement that indexes the observation numbers of the data.

Something like: dataframe[which(abs(rstandard(model)) > 3), ]
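The residual check above can be tried end to end on made-up data. This is only a minimal sketch: the data frame dat, its columns age and balance, the simulated relationship between them, and the planted bad record are all assumptions for illustration, not anything from the original data set.

```r
set.seed(42)

# Simulated, hypothetical data: balance grows roughly linearly with age.
n <- 1000
dat <- data.frame(age = runif(n, 18, 80))
dat$balance <- 100 * dat$age + rnorm(n, sd = 500)

# Plant one obviously wrong record, like the one in the question.
dat <- rbind(dat, data.frame(age = 0.5, balance = 50000))

# plot(balance ~ age, data = dat)  # uncomment to see the point sit far from the rest

# Fit a simple linear model and flag rows whose standardized residual
# is more than 3 standard deviations from zero, in either direction.
model <- lm(balance ~ age, data = dat)
flagged <- dat[which(abs(rstandard(model)) > 3), ]
print(flagged)
```

On real data the cutoff of 3 is a judgment call, and a single extreme record can distort the fit itself, so it may be worth refitting after removing the flagged rows and checking again.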

Other tips

There was a session on this at the UseR2011 conference this year. I remember it well because I chaired it :)

http://www.warwick.ac.uk/statsdept/user-2011/schedule/thursday.html

The 'deducorrect' and 'editrules' packages might help you, and some of the other talks in that session might have some pointers too.

Data Management, MS.01, Chair: Barry Rowlingson

Susan Ranney: It's a Boy! An Analysis of Tens of Millions of Birth Records Using R

Joanne Demmler: Challenges of working with a large database of routinely collected health data: Combining SQL and R

John Bryant: Demographic: Classes and Methods for Data about Populations

Mark van der Loo: Correcting data violating linear restrictions using the deducorrect and editrules packages
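The general idea of checking records against explicit rules, which is what the last talk is about, can be sketched in base R without any extra package. This is not the editrules or deducorrect API, just a hand-rolled illustration; the toy data and the three rules are hypothetical.

```r
# Toy records; the second one mimics the error from the question.
dat <- data.frame(
  age     = c(34, 0.5, 52, 17, 45),
  balance = c(1200, 50000, 0, 300, -20)
)

# Hypothetical rules: each is an R expression that should be TRUE for a valid record.
rules <- c(
  age_nonnegative     = "age >= 0",
  balance_nonnegative = "balance >= 0",
  minors_no_balance   = "age >= 18 | balance == 0"
)

# Evaluate every rule against every record.
# Result: one row per record, one column per rule, TRUE where the rule is violated.
violations <- sapply(rules, function(r) !eval(parse(text = r), envir = dat))

# Records that break at least one rule, and how often each rule is broken.
bad_rows <- dat[rowSums(violations) > 0, ]
print(bad_rows)
print(colSums(violations))
```

Writing the rules down as data rather than as scattered if-statements makes it easy to add new ones as more kinds of errors turn up.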

There are methods for outlier detection, such as LOF (Local Outlier Factor). This method tries to detect objects that deviate significantly from similar objects, so it goes beyond simple global histograms. A value of $50,000 may not be unusual globally, but among similar records either the age or the balance deviates strongly. This is what is called a "local" outlier.

I don't know if there is an R package for it. Maybe, maybe not. It depends on your use case: since age and balance are very different domains, a naive implementation with Euclidean distance will probably not do anyway.
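To make the LOF idea concrete, here is a minimal base R sketch of the standard definition (k-distance, reachability distance, local reachability density). The function name lof_scores, the choice k = 10 and the simulated data are assumptions for illustration. Standardizing the columns is just one simple way to put age and balance on comparable scales, not a substitute for a properly chosen distance function, and the full distance matrix makes this O(n^2), so it would not scale to 100k rows as written.

```r
# Minimal Local Outlier Factor; ties in the k-distance are ignored for simplicity,
# and exact duplicate rows (zero distances) are not handled.
lof_scores <- function(x, k = 10) {
  x <- scale(as.matrix(x))        # standardize so no column dominates the distance
  d <- as.matrix(dist(x))         # all pairwise Euclidean distances
  n <- nrow(d)
  diag(d) <- Inf                  # a point is not its own neighbour

  # Column i holds the indices of the k nearest neighbours of point i.
  nn <- matrix(apply(d, 1, function(row) order(row)[seq_len(k)]), nrow = k)

  # k-distance: distance from each point to its k-th nearest neighbour.
  kdist <- vapply(seq_len(n), function(i) d[i, nn[k, i]], numeric(1))

  # Local reachability density: inverse of the mean reachability distance
  # from point i to its neighbours o, where reach(i, o) = max(kdist(o), d(i, o)).
  lrd <- vapply(seq_len(n), function(i) {
    o <- nn[, i]
    1 / mean(pmax(kdist[o], d[i, o]))
  }, numeric(1))

  # LOF: average density of the neighbours relative to the point's own density.
  vapply(seq_len(n), function(i) mean(lrd[nn[, i]]) / lrd[i], numeric(1))
}

set.seed(1)
# Simulated, hypothetical data with one planted bad record in the last row.
dat <- data.frame(age = runif(500, 18, 80))
dat$balance <- 200 * dat$age + rnorm(500, sd = 1000)
dat <- rbind(dat, data.frame(age = 0.5, balance = 50000))

scores <- lof_scores(dat, k = 10)
head(dat[order(scores, decreasing = TRUE), ])  # most outlying records first
```

Scores near 1 mean a point is about as dense as its neighbours; values well above 1 mark candidates worth inspecting by hand.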

For this kind of task, I like using ELKI. It is very customizable: you can implement custom distance functions, as explained in its distance function tutorial. And since it uses index structures, it is rather fast. I don't think there are any good data index structures available for R.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow