Question

I am working with a financial dataset whose size is around 3000. I have attempted supervised-learning regression techniques and have not been able to get beyond 70% accuracy.

  1. Features: 10
  2. Data size: 3700
  3. Models attempted: Decision Trees, Random Forest, Lasso Regression, Ridge Regression, Linear Regression

I am of the opinion that the dataset size is too small to expect any good results beyond 65%. That seems obvious because machine learning algorithms are data-hungry in nature. However, in a generic sense, is there a lower bound on the dataset size that has been found to achieve 90% accuracy?

Such a theory would also help me to gather data until I reach that point and then do some productive work.

Any help is appreciated.


Solution

There is no theory or general result that sets the size of dataset required to reach any target accuracy. Everything depends on the underlying, and usually unknown, statistics of your problem.

Here are some trivial examples to illustrate this. Say you want to predict the sex of a species of frog:

  • It turns out that skin colour is a strong predictor for the species Rana determistica, where all males are yellow and all females are blue. The minimal dataset to get 100% accuracy on the prediction task is data for two frogs, one of each sex.

  • It turns out that skin colour is uncorrelated with sex for the species Rana stochastica, where 50% of each sex are yellow and the other 50% are blue. There is no size of dataset of frog colours labelled with sex that will get you better than 50% accuracy on the task.

  • However, Rana stochastica does have eye colouration with an almost deterministic relationship to the sex of the creature. It turns out that 95% of males have orange eyes and 95% of females have green eyes (with only those two eye colours possible). That is a predictive variable strong enough that you can get 95% accuracy, if you can discover the relationship.
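
To make that last example concrete, here is a tiny simulation sketch of the hypothetical Rana stochastica relationship (the species and the 95% figure are of course invented for illustration). Because the noise is built into the data-generating process, no amount of extra data will push accuracy past roughly 95%:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Sex is 50/50; eye colour matches sex 95% of the time (orange = male, green = female)
sex = rng.integers(0, 2, size=n)                      # 0 = female, 1 = male
eyes = np.where(rng.random(n) < 0.95, sex, 1 - sex)   # 1 = orange, 0 = green

# The best possible rule: predict "male" for orange eyes, "female" for green eyes
accuracy = (eyes == sex).mean()
print(f"accuracy of the optimal rule: {accuracy:.3f}")  # ~0.95, the ceiling set by the Bayes error rate
```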

Some related theory worth reading on the limitations of statistical models is the Bayes error rate.

In the last case, simply predicting "male" for orange eyes and "female" for green eyes will give you 95% accuracy. So the question is: what size of dataset would guarantee that a model both makes those predictions and gives you the confidence that you have beaten your 90% accuracy goal? It can be worked out, assuming you collect labelled sample data at random. Note that there is a good chance that a model trained on very little data would reach 95% accuracy, but that it could take a lot more data in the test set before you could be confident that you really had a good enough result.

The maths to demonstrate even this simple case is long-winded and complex (if I were to outline the theory being used), and it would not actually help you, so I am not going to reproduce it here. Also, I chose 95% for the example; if the eye-colour relationship were only 85% predictive of sex, you would never achieve 90% accuracy. In a real project you have many more variables and, at best, only a rough idea in advance of how they might correlate with the target variable or with each other, so you cannot do the calculation.

I think instead it is more productive to look at your reason for wanting a theory to choose your dataset size:

"Such a theory would also help me to gather data until I reach that point and then do some productive work."

Sadly you cannot do this theoretically. However, you can do a few useful things:

Plot a learning curve against data set size

I'd recommend this as your main approach here. The driving question behind your post is: will collecting more data improve my existing model?

Using the same cross-validation set each time, train your model with increasing amounts of data from your training set. Plot the cross-validated accuracy against the number of training samples, up to the whole training set you have so far.

  • If the graph has an upward slope all the way to the end, this implies that collecting more data will improve accuracy for your current model.

  • If the graph is nearly flat, with accuracy not improving towards the end, then it is unlikely that collecting more data will help you.

This does not tell you how much more data you need. An optimistic interpretation would take the trend line over the last section of the graph and project it to where it crosses your target accuracy. However, the returns from more data normally diminish: the curve will asymptotically approach some maximum possible accuracy for the given dataset and model. What plotting the curve with the data you already have does is show you where you are on that curve; perhaps you are still in the early part of it, in which case adding more data will be a good investment of your time.
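
A minimal sketch of that learning curve using scikit-learn's learning_curve is below. The RandomForestRegressor, the R² scoring and the make_regression stand-in data are all placeholders; swap in your own model, metric and X, y:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

# Stand-in data shaped like the question (3700 rows, 10 features); replace with your own X, y
X, y = make_regression(n_samples=3700, n_features=10, noise=10.0, random_state=0)

# Cross-validated score at increasing fractions of the training data
train_sizes, train_scores, cv_scores = learning_curve(
    RandomForestRegressor(n_estimators=200, random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="r2",  # use whatever metric you are reporting as "accuracy"
)

plt.plot(train_sizes, train_scores.mean(axis=1), marker="o", label="training score")
plt.plot(train_sizes, cv_scores.mean(axis=1), marker="o", label="cross-validation score")
plt.xlabel("number of training samples")
plt.ylabel("score")
plt.legend()
plt.show()
```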

Reassess your features and model

If your learning curve is not promising, then you need to look in more detail. Here are some questions you can ask yourself, and maybe test, to try and progress.

  • Your features:

    • Are there more or different features that you could collect, instead of focussing on collecting more of the same?
    • Would some feature engineering help - e.g. is there any theory or domain expert knowledge from the problem that you can turn into a formula and express as a new feature?
  • Your model:

    • Are there any hyper-parameters you can tune to either get more out of the existing data, or improve the learning curve so that it is worth going back to get more data? A tuning sketch follows below.
    • Would an entirely different model help? Deep learning models are often top performers only when there is a lot of data, so you might consider switching to a deep neural network and plotting a learning curve for it. Even if the accuracy on your current dataset is worse, a learning curve that shows a different model type has the capacity to go further might make the switch worthwhile.

Do note, however, that you could just end up with the same maximum accuracy as before after a lot of hope and effort. Unfortunately, this is hard to predict, and you will need to make careful decisions about how much of your time is worth sinking into the original problem.
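
As a concrete starting point for the hyper-parameter tuning mentioned in the list above, here is a hedged sketch with GridSearchCV; the grid values are placeholders and the useful ranges depend entirely on your data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Stand-in data again; replace with your own X, y
X, y = make_regression(n_samples=3700, n_features=10, noise=10.0, random_state=0)

# Hypothetical grid - adjust the values to your problem
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```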

Check confidence limits to choose a minimum test dataset size

Caveat: this is a guide to thinking about dataset sizes, especially test set sizes. I have never known anyone use it to actually select an ideal dataset size. Usually it happens the other way around: you have some size of test dataset available to you, and you want to understand what that tells you about your accuracy measurements.

You could determine a test set size that gives you reasonable confidence bounds on accuracy. That would mean that, when you measure your 90% accuracy (or better), you can be reasonably certain that the true accuracy is close to it. You can do this using confidence intervals on the accuracy measure.

As an example from the above link, suppose you measure 92% accuracy on your test set and want to know whether you can be confident in that result. Let's say you want to be 95% certain that you really do have accuracy > 0.9. How should you choose N, the size of your test set?

You know that your measurement is 0.02 over the desired accuracy, and you want to know whether this is enough to claim with confidence that you have at least 90% accuracy:

$$0.02 > 2 \sqrt{\frac{0.92 \times 0.08}{N}}$$

Squaring both sides and rearranging, you need

$$N > \frac{0.92 \times 0.08}{0.0001}$$

$$N > 736$$

This is the minimum test data set size that would give you confidence that you have met your target of 90% accuracy, provided that

  • you have actually measured 92% or higher accuracy
  • you have selected test data at random from the target population
  • that you have not used the test data set to select a model (by e.g. doing this test multiple times until you got a good result)

Typically you don't work backwards like this to figure out N for a specific accuracy, but it is useful to understand the limits of your testing. You should generally consider how the size of the test dataset limits how precisely you can confirm your model's performance.
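
If you do want to replay that working-backwards calculation for other measured and target accuracies, a small helper like this (a sketch of the same normal-approximation bound, with the factor of 2 standing in for the ~95% confidence z-value) does it:

```python
import math

def min_test_set_size(measured_acc, target_acc, z=2.0):
    """Smallest N such that measured_acc - z * sqrt(measured_acc * (1 - measured_acc) / N)
    stays above target_acc, i.e. the normal-approximation bound used in the worked example."""
    margin = measured_acc - target_acc
    if margin <= 0:
        raise ValueError("measured accuracy must be above the target accuracy")
    return math.ceil(z ** 2 * measured_acc * (1 - measured_acc) / margin ** 2)

print(min_test_set_size(0.92, 0.90))  # 736, matching the worked example above
```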

The formula above also has limitations when measuring close to 100% accuracy, because the assumptions behind it (a normal approximation to the binomial distribution) break down. You would need to switch to more complex methods, perhaps a Bayesian approach, to get a better feel for what such a result was telling you, especially if the test sample size was small.
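
To give a flavour of what such a Bayesian approach could look like, here is a sketch using a conjugate Beta-Binomial model with a uniform prior; the 99-correct-out-of-100 figures are invented purely for illustration:

```python
from scipy.stats import beta

k, n = 99, 100                        # e.g. 99 correct predictions on a 100-sample test set
posterior = beta(1 + k, 1 + n - k)    # Beta(1, 1) prior updated with the test results
low, high = posterior.interval(0.95)  # 95% credible interval for the true accuracy
print(f"95% credible interval for accuracy: {low:.3f} to {high:.3f}")
```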

After you have established a minimum test dataset size, you could use it to guide data collection. For instance, a typical train/cv/test split might be 60/20/20, so with the result above you could choose an overall dataset size of 5 times 736; let's round up and call it 4000. In general this only sets a lower bound on the size of the dataset, as it says nothing about how hard it will be to learn to any specific accuracy.

Other tips

There are no general rules of the form "n features and m observations with a learner of type X give accuracy q".

Your predictive power depends on the features in your model.

Let's say you want to predict the volume of trade between two countries, and let's say (hypothetically) that this trade is regulated and regulation is the only thing that matters. If you add features that can explain regulation changes, your model will be quite good with only a few features. However, if you don't add those regulation-based features and instead add hundreds of other financial variables, your model won't be much good even though you have lots of predictors.

I hope you get the point: your model's performance depends on the relevance of your features and on a training sample that reflects what you are really trying to predict.

License: CC-BY-SA with attribution