Stratified sampling in Pig?

https://stackoverflow.com/questions/20909755

hadoop
downsampling
apache-pig
sampling

24-09-2022

Question

Does anyone know how to do stratified sampling in Pig? (see Wikipedia)

At the moment, I do something like:

relation2 = SAMPLE relation1 0.05;

but my dataset contains a label column with only a few distinct values, some of which are rare (0.5%, for example), and I would like my random downsampling not to drop all of them.

Thanks a lot.

Solution

You could implement your own sampling with RANDOM(), which returns a value between 0 and 1: draw a random number for each row and filter out the rows whose value is below, say, 0.95, which keeps roughly 5% of them. To stratify this sampling, compute what fraction of your rows contain each label value, and then scale the random value (or, equivalently, the threshold) per label so that different values get sampled at different rates. Rare labels can then be kept at a higher rate than common ones.
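
The idea above could be sketched in Pig Latin roughly as follows. This is only an illustration under assumptions: the field name label and the target of about 1000 rows per label are hypothetical and do not come from the question, so adjust them to your schema and data volume.

```pig
-- Sketch only: assumes relation1 has a field named label (hypothetical name).
-- 1. Count the rows for each label value.
by_label = GROUP relation1 BY label;
counts   = FOREACH by_label GENERATE group AS label, COUNT(relation1) AS n;

-- 2. Derive a per-label keep rate: keep every row of labels with at most
--    1000 rows (hypothetical target), and about 1000 rows of larger labels.
rates = FOREACH counts GENERATE label,
            (n <= 1000L ? 1.0 : 1000.0 / (double)n) AS rate;

-- 3. Attach the rate to every row; the small rates relation is replicated.
with_rate = JOIN relation1 BY label, rates BY label USING 'replicated';

-- 4. Keep a row when its random draw falls below its label's rate.
relation2 = FILTER with_rate BY RANDOM() < rates::rate;
```

Because each row is kept independently, the sample size per label is only approximately the target. relation2 also still carries the joined rates fields, which can be projected away with a final FOREACH, and rows whose label is null are dropped by the join.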

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow