Question

My hypothesis h depends on multiple categorical variables (a, b, c), each with its corresponding set of possible values (A, B, C). Each of my data points exists in this space, and I have no control over the values (observational data).

For example, a hypothesis to predict user shopping probability might depend on (Age, Country, Gender, Device type, etc.).

How could I sample the above data set so that it gives me a good representation? The techniques I have learned from books apply well to one dimension, but that is a rare case in practice. If I sample along one dimension, my other dimensions will be heavily skewed towards some values. Is there a standard algorithm that gives good sampling?


Solution

Let me give you some pointers (assuming that I'm right on this, which might not necessarily be true, so proceed with caution :-). First, I'd figure out the applicable terminology. It seems to me that your case can be categorized as multivariate sampling from a categorical distribution (see this section on categorical distribution sampling). Perhaps the simplest approach to it is to use the R ecosystem's rich functionality. In particular, the standard stats package contains the rmultinom function (link).
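
If it helps to see the idea outside of R, here is a minimal Python sketch of the same kind of draw using numpy's multinomial sampler; the category combinations and their proportions below are made up purely for illustration.

# Minimal Python analogue of R's rmultinom(n = 1, size = 1000, prob = probs):
# draw counts for each joint (age, country, device) combination.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical joint category combinations and their observed proportions.
combos = [("18-25", "US", "mobile"), ("18-25", "US", "desktop"),
          ("26-40", "UK", "mobile"), ("26-40", "UK", "desktop")]
probs = [0.40, 0.10, 0.35, 0.15]

# One draw of total size 1000, split across the combinations.
counts = rng.multinomial(1000, probs)

for combo, count in zip(combos, counts):
    print(combo, count)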

If you need more complex types of sampling, there are other packages that might be worth exploring, for example sampling (link) and miscF (link), which offers the rMultinom function (link). If your complex sampling is focused on survey data, consider reading the interesting paper "Complex Sampling and R" by Thomas Lumley.

If you use languages other than R, check the multinomial function from Python's numpy package and, for Stata, this blog post. Finally, if you are interested in Bayesian statistics, the following two documents seem to be relevant: this blog post and this survey paper. Hope this helps.

OTHER TIPS

To clarify, you have at least one observation in every possible category combination, but you only want to perform analysis on a subset of the total data, and are trying to decide how to choose which points to keep and which points to throw away?

I think the right approach here will depend strongly on what your hypothesis h is, what sort of statistical tests you want to run, and what your loss function is. If you're trying to answer a question which can be answered by the number of datapoints in each combination, for example, or by the mean and stdev of some continuous variable for each combination, reducing the size of your data by sampling will only hurt your analysis.

If you're trying to learn a classifier, for example, a classic question is whether to train on a set with equal numbers of all possible classes or with the underlying class distribution found in the wild. By most reasonable loss functions, the first will train a "superior" classifier, especially if its prior on class membership is later reset to the actual distribution in the wild. But is your loss function one of the ones where this is better?
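
As a rough illustration of that prior-reset step (my own sketch, not necessarily what was meant here), one common recipe is to rescale the probabilities predicted by the balanced-training classifier by the ratio of the wild prior to the training prior and renormalize; the numbers below are invented.

# Rescale class probabilities from the training prior to the wild prior.
import numpy as np

def adjust_prior(p_balanced, prior_train, prior_wild):
    # Multiply each class probability by (wild prior / training prior),
    # then renormalize so the probabilities sum to one.
    p = np.asarray(p_balanced) * (np.asarray(prior_wild) / np.asarray(prior_train))
    return p / p.sum()

# A classifier trained on 50/50 classes predicts P(buy) = 0.7 for one user,
# but in the wild only 5% of users actually buy.
print(adjust_prior([0.7, 0.3], prior_train=[0.5, 0.5], prior_wild=[0.05, 0.95]))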

You might also want to look into design of experiments, combinatorial design in particular, which tries to solve the symmetrical problem: starting with no data but being able to choose the various values, what set of points should we test to get as much information as possible about the underlying function?
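
For a flavour of that direction (a sketch under my own assumptions, with hypothetical factor levels), the simplest baseline is a full-factorial design that enumerates every combination of the chosen values; combinatorial designs then aim to cover the important interactions with far fewer runs.

# Enumerate a full-factorial design over hypothetical factors.
from itertools import product

factors = {
    "age_band": ["18-25", "26-40", "40+"],
    "country": ["US", "UK"],
    "device": ["mobile", "desktop"],
}

design = list(product(*factors.values()))
print(len(design), "runs, e.g.", design[:3])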

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange