Question

Okay, so the scenario is this:

I ask a question "How much do you spend on food per week?"

After a specified number of contributions, let's say 100, I would like to scan the results and find obviously bogus values. So say the average is £80, but some numpty has put in a value of 1 and someone else has put in 10,000.

The requirement is NOT to validate the data on entry, but rather to analyse the data dynamically, determine the valid range, and trim outliers from the results during a statistical update of the database.

What is the best method to achieve this using Rails 3.2, ActiveRecord and PostgreSQL?


Solution

A good way to eliminate the erroneous results is to work out the standard deviation. You can do this using PostgreSQL:

SELECT stddev(amount) FROM answers

You can then check whether an answer lies more than a chosen number of standard deviations from the mean (two, for example) and remove it if required.

Note that this will always remove some answers, so if you're not expecting any numpties then don't do it this way.
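As a rough illustration of the standard-deviation approach, here is a minimal Ruby sketch. The method names, the cutoff of two standard deviations, and the sample amounts are assumptions made for the example, not part of the original answer. In a Rails app the amounts would come from the database (for instance an `amount` column on an `answers` table, as the query above suggests); a plain array is used here so the sketch runs on its own.

```ruby
# Arithmetic mean of a list of numbers.
def mean(values)
  values.inject(0.0) { |sum, v| sum + v } / values.size
end

# Sample standard deviation (n - 1 denominator), which is what
# PostgreSQL's stddev() returns.
def sample_stddev(values)
  m = mean(values)
  sum_sq = values.inject(0.0) { |sum, v| sum + (v - m)**2 }
  Math.sqrt(sum_sq / (values.size - 1))
end

# Keep only the values within k standard deviations of the mean.
# k = 2.0 is an assumed cutoff; choose one that suits the data.
def within_stddev(values, k = 2.0)
  m = mean(values)
  sd = sample_stddev(values)
  values.select { |v| (v - m).abs <= k * sd }
end

# Hypothetical weekly food spend, including two bogus entries.
amounts = [75, 80, 85, 78, 82, 90, 70, 1, 10_000]
kept = within_stddev(amounts)
```

With this sample the 10,000 entry is dropped but the 1 is kept, because the extreme value inflates both the mean and the standard deviation. That is worth bearing in mind when picking a cutoff.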

Other tips

"So say the average is £80 but some numpty has put in a value of 1 and someone else has put in 10,000."

Say someone has put in £117. Is that an outlier? What about £127? £137?

Identifying outliers is a statistical job, not really a database job. You can only do the job well when the database returns all the relevant data. If you're writing statistical software in Ruby, then I'd say it's Ruby's job (the Ruby programmer's job) to help you decide which values are outliers and which are not.

Having determined which values are outliers, it's simple to eliminate them from calculations, either by excluding them at run time or by running the query again with a range like amt_spent >= 53 and amt_spent <= 117. But consider using more robust statistical techniques instead, ones that aren't much affected by outliers.
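One example of a robust technique is to measure distance from the median in units of the median absolute deviation (MAD), since neither is pulled far by a few extreme values. The Ruby sketch below is only an illustration: the method names, the cutoff of 3.5, and the sample data are assumptions, and the constant 1.4826 is the usual factor that makes the MAD comparable to a standard deviation for normally distributed data.

```ruby
# Median of a list of numbers.
def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid].to_f : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Median absolute deviation: the median distance from the median.
def mad(values)
  med = median(values)
  median(values.map { |v| (v - med).abs })
end

# Split values into [kept, outliers]. A value counts as an outlier
# when its distance from the median exceeds `cutoff` scaled MADs.
# cutoff = 3.5 is an assumed convention, not a fixed rule.
def split_outliers(values, cutoff = 3.5)
  med = median(values)
  scale = 1.4826 * mad(values)
  # If at least half the values equal the median, MAD is zero and
  # this test cannot rank anything, so keep everything.
  return [values, []] if scale.zero?
  values.partition { |v| (v - med).abs / scale <= cutoff }
end

# Hypothetical weekly food spend, including two bogus entries.
amounts = [75, 80, 85, 78, 82, 90, 70, 1, 10_000]
kept, outliers = split_outliers(amounts)
```

The minimum and maximum of `kept` could then serve as the bounds of a range condition like the one above, leaving the flagged rows in the database rather than deleting them.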

You can also delete those rows from the database, but that can be misleading. I never do that, myself.

Detection of outliers

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow