Sampling 1000 lines from a bunch of gzipped files with PIG

https://stackoverflow.com/questions/22796967

25-06-2023
Question

I'm very new to Pig, so I may be going about this the wrong way. I have a bunch of gzipped files in a directory in Hadoop, and I'm trying to sample around 1000 lines from all of these files put together. It doesn't have to be exact, so I wanted to use SAMPLE. SAMPLE takes the probability of sampling a line rather than the number of lines I need, so I thought I should count the lines across all these files, then divide 1000 by that count and use the result as the probability. This will work, since I don't need exactly 1000 lines at the end. Here is what I have so far:

raw = LOAD '/data_dir';
cnt = FOREACH (GROUP raw ALL) GENERATE COUNT_STAR(raw);
cntdiv = FOREACH cnt GENERATE (float)1000/cnt.$0;

Now I'm not sure how to use the value in cntdiv in SAMPLE. I tried SAMPLE raw cntdiv and SAMPLE raw cntdiv.$0, but they don't work. Can I even use that value in the call to SAMPLE? Maybe there is a much better way of accomplishing what I'm trying to do?

Solution

Check out the description in the ticket originally requesting this feature: https://issues.apache.org/jira/browse/PIG-1926

I haven't tested this, but it looks like this should work:

raw = LOAD '/data_dir';
samplerate = FOREACH (GROUP raw ALL) GENERATE 1000.0/COUNT_STAR(raw) AS rate;
thousand = SAMPLE raw samplerate.rate;

The important thing is to refer to your scalar by name (rate), not by position ($0).
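
To see why this gives roughly, not exactly, 1000 lines, here is a small Python simulation of the arithmetic. It is only a sketch: it assumes SAMPLE keeps each line independently with probability rate, and the helper names (sample_rate, bernoulli_sample) are made up for illustration and are not part of Pig.

```python
import random


def sample_rate(target, total):
    """Probability that makes the expected sample size equal to target.

    Capped at 1.0 so an input smaller than the target keeps every line.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return min(1.0, target / total)


def bernoulli_sample(lines, rate, rng):
    """Keep each line independently with probability rate (assumed SAMPLE behaviour)."""
    return [line for line in lines if rng.random() < rate]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    lines = [f"line {i}" for i in range(200_000)]  # stand-in for the gzipped input
    rate = sample_rate(1000, len(lines))
    sampled = bernoulli_sample(lines, rate, rng)
    print(rate, len(sampled))
```

Under that assumption the sample size varies from run to run around 1000. If you need a hard upper bound, a LIMIT after the SAMPLE is one way to cap it.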

Licensed under: CC-BY-SA with attribution