Example for non-iid data

https://stackoverflow.com/questions/13058379

14-07-2021
|

Pergunta

I have read some papers regarding to non-iid data. Based on Wikipedia, I know what iid (independent and identical distributed) data is but am still confused about non-iid. I did some research but could not find a clear definition and example of it. Can someone help me on this?

Solução

From wikipedia iid:

"Independent and identically distributed" implies an element in the sequence is independent of the random variables that came before it. In this way, an IID sequence is different from a Markov sequence, where the probability distribution for the nth random variable is a function of the previous random variable in the sequence (for a first order Markov sequence).

As a simple synthetic example, assume you have a special dice with 6 faces. If the last time the face value is 1, next time you throw it, you will still get a face value of 1 with 0.5 probability and a face value of 2,3,4,5,6 each with 0.1 probability. However, if the last time the face value is not 1, you get equal probability of each face. E.g.,

p(face(0) = k) = 1/6, k = 1,2,3,4,5,6  -- > initial probability at time 0. 
p(face(t) = 1| face(t-1) = 1) = 0.5, p(face(t) = 1| face(t-1) != 1) = 1/6
p(face(t) = 2| face(t-1) = 1) = 0.1, p(face(t) = 1| face(t-1) != 1) = 1/6
p(face(t) = 3| face(t-1) = 1) = 0.1, p(face(t) = 1| face(t-1) != 1) = 1/6
p(face(t) = 4| face(t-1) = 1) = 0.1, p(face(t) = 1| face(t-1) != 1) = 1/6
p(face(t) = 5| face(t-1) = 1) = 0.1, p(face(t) = 1| face(t-1) != 1) = 1/6
p(face(t) = 6| face(t-1) = 1) = 0.1, p(face(t) = 1| face(t-1) != 1) = 1/6
face(t) stands for the face value of t-th throw.

This is an example when the probability distribution for the nth random variable (the result of the nth throw) is a function of the previous random variable in the sequence.

I see Non-identical and Non-independent (e.g, Markovian) data in some machine learning scenarios, which can be thought of as non-iid examples.

Online learning with streaming data, when the distribution of the incoming examples changes over time: the examples are not identically distributed. Suppose you have a learning module for predicting the click-thru-rate of online-ads, the distribution of query terms coming from the users are changing during the year dependent on seasonal trending. The query terms in summer and in Christmas season should have different distribution.
Active learning, where labels for specific data are requested by the learner: the independence assumption is also violated.
Learning / making inference with graphical models. Variables are connected thru dependence relations.

Outras dicas

In a very hand-wavy way (since I assume you've read the technical definition), i.i.d. means if you have a bunch of values, then all permutations of those values have equal probability. So if I have 3,6,7 then the probability of this is equal to the probability of 7,6,3 is equal to 6,7,3 etc. That is, each value has no dependence on other values in the sequence.

As a counter example, imagine the sequence x where each element x_i is either one higher or one lower than the preceding element, with a 50-50 chance as to which of these happens. Then one possible sequence is 1,2,3,2,3,4,3,2. It should be clear that there are some permutations of this sequence that are not equiprobable: in particular, sequences starting 1,4,... have probability zero. You can instead consider pairs of the form x_i | x_i-1 to be iid if you wish.

Here's an example of a problem that is not independent. Problem definition: Suppose you have a 2D image a blob in it. You want to build a patch classifer that operates with 5X5 image patches as input and classifies the center pixel as "boundary" or "not boundary." Your requirement is that the resulting classifications of every patch define a continuous contour (one pixel thick) that traces around the border of the blob accurately. Essentially, an edge detector. Also assume that a slight error of misplacing the boundary by just a few pixels does not matter, however the continuity of the boundary contour does matter (it should not have any breaks).

How this is not independent: Example1: Suppose you have a good solution contour A. Another valid solution, B, which is just A shifted over to the right by 2 pixels, note that most of the classifications at the pixel level are different but the solution is still valid. Example2: Suppose you get valid solution A except that only one output pixel is shifted right by 2 pixels to create output C. This time you have a broken contour and the solution is not valid. This demonstrates how the classifier needs to know about the answers to other nearby pixel examples in order to determine if a particular pixel should be classified as boundary or not.

Literally, non iid should be the opposite of iid in either way, independent or identical.

So for example, if a coin is flipped, let X is the random variable of event that result is tail, Y is the random variable of event the result is head, then X and Y are definitely dependent. They can be decided by each other.

As to the non identical, once the distributions of two random variables are not the same, they can be called non-identical.

Therefore, either of the situations happen, you may get an example of non iid case.

"iid" is actually not a property of real data but an assumption that the observer has about this data. If you replace every mention of "iid data" and "non-iid data" in greeness' answer with "the assumption of [...] data", then I fully agree with them.

As the question specifically asks for an example of non-iid data, however, it must be added that there is no such data because you can take literally ANY data and assume that it is iid or not iid. This assumption is but a helpful limitation of options that makes statistic modelling easier or even just possible in many cases.

This is not made very clear in the literature where iid is often presented as a property of real-world data. This is understandable but still a bit dangerous as it implicitly assumes that we, as the observers, can determine information about the source (i.e., generating process) of data where in fact, we cannot.

Of course, this only applies to real-world data. Anyone can generate artificial data according to some distribution which necessarily leads to data that is iid. But this cannot be determined without knowledge about the generating process beyond the factual data. Given only data, there is no way to say whether it is iid or not.

I know there are methods that assume that data is non-iid and try to find different distributions accordingly. In many cases, this helps because the data is actually generated by a nonstationary stochastic process. But the data only gives you samples and the inductive reasoning necessary to determine whether it actually is iid or not is always prone to error.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a StackOverflow