Question

Imagine I'm collecting some training data. Let's say I collect a 30-minute time series from 1000 people, so I have 1000 observations (rows) in my feature matrix. I train some model (say, a neural net for this example) and find that my AUC is really poor, and I believe the problem is that I only have 1000 observations - I simply don't have enough data.

However, I am now unable to collect any more data. One thing I could do is take each 30-minute time series and slice it into 30 one-minute sections, then use those one-minute series as rows in my data. I would end up with 30 observations per person across 1000 people, giving me 30,000 rows in my feature matrix. I've now increased the size of my training set by 30x.
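Concretely, the slicing step would look something like this (a minimal sketch; the array shapes, sampling rate, and variable names are purely illustrative and not part of my actual pipeline):

```python
import numpy as np

# Hypothetical raw data: 1000 people, a 30-minute series sampled once per second.
rng = np.random.default_rng(0)
n_people, n_seconds = 1000, 30 * 60
raw = rng.standard_normal((n_people, n_seconds))    # placeholder signals
labels = rng.integers(0, 2, size=n_people)          # one binary label per person

# Slice each 30-minute series into 30 one-minute windows.
n_windows, window_len = 30, 60
X = raw.reshape(n_people * n_windows, window_len)   # (30000, 60) feature matrix
y = np.repeat(labels, n_windows)                    # each person's label repeated 30x
groups = np.repeat(np.arange(n_people), n_windows)  # person ID for every row

print(X.shape, y.shape, groups.shape)               # (30000, 60) (30000,) (30000,)
```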

If I were to do this in a statistical/inferential setting, I would be violating the assumption of independence and would have to correct for it with some kind of multilevel model.

Is this the same as the IID assumption in machine learning, where each of your rows must be independent of each other?

For inferential tests, one of the reasons this assumption matters is that violating it affects the Type I error of your inference. In machine learning, however, we're not running inferential tests, so what effect does violating IID actually have on the results? In other words, why do rows need to be independent of each other, especially in a case like the above where I can drastically increase the size of my training set by reusing different parts of one person's data?


Solution

Suppose you are investigating whether heart rate can predict whether a person smokes. You measure bpm over 30 consecutive one-minute intervals and ask whether the person smokes, in order to build your training data set.

What would contribute to a better predictor: 30 observations from one person who smokes, or 1 observation each from 30 people who smoke? Given that one person's heart rate won't change much over 30 minutes, it seems clear that you'd rather have 1 sample from each of 30 people than 30 samples from one person. 30 samples from one person are not worth as much as 1 sample from 30 people.

Because the samples from one person are not independent.

I think if you put non-independent samples into a neural net it won't affect the predictive power too much, as long as the non-independence is similar across all your training data. At one extreme, if every smoker's and non-smoker's heart rate is constant over the 30-minute period, then all you've done is repeat each input row precisely 30 times and nothing will change (except that training will take 30x as long...).

However, if smokers' heart rates are constant and non-smokers' vary, then you add 30 identical measurements of each smoker's heart rate to your model, plus a bunch of varying measurements correlating other rates with non-smoking. Your NN is then very likely to predict that anyone with one of those smokers' heart rates is a smoker. This is clearly wrong - those 30 measurements from each smoker are only worth one measurement, and feeding them all in will train the network wrongly.
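One way to act on "those 30 measurements are only worth one" is to down-weight each window so that every person contributes the same total weight to training. This is just one possible remedy, sketched below with scikit-learn's `sample_weight`; the data layout mirrors the slicing sketch in the question and is purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data shaped like the slicing sketch in the question:
# 30 one-minute windows per person, one binary label per person.
rng = np.random.default_rng(0)
n_people, n_windows, window_len = 1000, 30, 60
X = rng.standard_normal((n_people * n_windows, window_len))
y = np.repeat(rng.integers(0, 2, size=n_people), n_windows)
groups = np.repeat(np.arange(n_people), n_windows)

# Down-weight each window so every person contributes a total weight of 1:
# 30 correlated windows then count roughly like one independent observation.
_, counts = np.unique(groups, return_counts=True)
weights = 1.0 / counts[groups]            # works because group IDs are 0..n_people-1

clf = LogisticRegression(max_iter=1000)   # stand-in for the neural net
clf.fit(X, y, sample_weight=weights)
```

Most neural-net frameworks accept per-sample weights in much the same way, so the same idea carries over from this placeholder classifier.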

OTHER TIPS

I don't feel confident enough to give a definitive answer, but the situation you describe also arises in phoneme classification, where the data is split up into arbitrarily small parts. There, even though it is the same kind of dependence, it does not cause any problems that I know of.

So I would just try violating this assumption and see if it works; that pragmatic approach is common in machine learning. For example, Naive Bayes is sometimes used even when the training data does not behave like a diagonal-covariance Gaussian, etc.
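If you do try it, one diagnostic for whether the violation matters is to compare a naive random split with a split that keeps each person's windows together: a large gap suggests the model is partly memorising individuals rather than learning the smoker/non-smoker signal. A rough sketch, again using the same hypothetical layout as above (the classifier is a stand-in, not your neural net):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

# Hypothetical layout as in the earlier sketches: 30 windows per person.
rng = np.random.default_rng(0)
n_people, n_windows, window_len = 1000, 30, 60
X = rng.standard_normal((n_people * n_windows, window_len))
y = np.repeat(rng.integers(0, 2, size=n_people), n_windows)
groups = np.repeat(np.arange(n_people), n_windows)

clf = LogisticRegression(max_iter=1000)

# Naive split: windows from the same person can land in both train and test.
naive = cross_val_score(clf, X, y, scoring="roc_auc",
                        cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Group-aware split: all of a person's windows stay in the same fold.
grouped = cross_val_score(clf, X, y, groups=groups, scoring="roc_auc",
                          cv=GroupKFold(n_splits=5))

print(f"random-split AUC:    {naive.mean():.3f}")
print(f"person-held-out AUC: {grouped.mean():.3f}")
```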

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange