Question

Does anyone have an idea of how to do stratified sampling in Pig? (wikipedia)

For the moment, I do something like:

relation2 = SAMPLE relation1 0.05;

but my dataset contains a label column with a few distinct values, some of them rare (0.5%, for example), and I would like my random downsampling not to miss them entirely.

Thanks a lot.


Solution

You could implement your own sampling with RANDOM(): generate a random value for each row and filter out rows whose value falls below, say, 0.95, which keeps about 5% of them. To stratify, compute what fraction of your rows carry each label value, then scale the threshold (or the random value) per label so that rare values are sampled at a higher rate than common ones.
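
As a rough sketch of that idea, something like the following might work. It assumes the label field in relation1 is called label and aims for roughly 1000 rows per label value; both the field name and the target are assumptions for illustration, not something from the question. Labels with fewer rows than the target are kept whole.

```pig
-- Plain 5% sample, equivalent in spirit to SAMPLE relation1 0.05
uniform_sample = FILTER relation1 BY RANDOM() >= 0.95;

-- Stratified version: count the rows for each label value
-- ("label" is an assumed field name)
by_label     = GROUP relation1 BY label;
label_counts = FOREACH by_label GENERATE group AS label, COUNT(relation1) AS n;

-- Per-label sampling rate: about 1000 rows per label (assumed target),
-- capped at 1.0 so rare labels are kept entirely
rates = FOREACH label_counts GENERATE label,
            (n <= 1000L ? 1.0 : 1000.0 / (double)n) AS rate;

-- Attach the rate to every row; the small rates relation is listed last
-- so it can be held in memory by the replicated join
joined = JOIN relation1 BY label, rates BY label USING 'replicated';

-- Keep each row with probability equal to its label's rate
relation2 = FILTER joined BY RANDOM() < rates::rate;
```

After the join, the original fields are prefixed with relation1::, so a final FOREACH ... GENERATE can project them back to the original schema. Since the sampling is random, the number of rows kept per label will only be approximately the target.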

Licensed under: CC-BY-SA with attribution