Question

This two-stage Pig script works:

my_out = foreach (group my_in by id) {
  grouped = BagGroup(my_in.(keyword,weight),my_in.keyword);
  generate
    group as id,
    CountEach(my_in.domain) as domains,
    grouped as grouped;
};
my_out1 = foreach my_out {
  keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight;
  generate id, domains, keywords;
};

However, when I combine them:

my_out = foreach (foreach (group my_in by id) {
  grouped = BagGroup(my_in.(keyword,weight),my_in.keyword);
  generate
    group as id,
    CountEach(my_in.domain) as domains,
    grouped as grouped;
  }) {
    keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight;
    generate id, domains, keywords;
  };

I get an error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " <IDENTIFIER> "generate "" at line 1, column 5.

My questions are:

  1. How do I avoid this error?
  2. Does what I am trying to do even make sense? Even if I get it to work, will it save me an MR pass?

Solution

In general, Pig's parser handles complicated nested expressions unreliably. Another common error when the nesting gets too deep is:

ERROR 1000: Error during parsing. Lexical error at line XXXX, column 0. Encountered: <EOF> after : ""

I often nest statements like this myself, to avoid inventing names for aliases that mean nothing except as intermediate steps in a computation. But sometimes it's not possible, as you have found out. My guess is that Pig cannot take a nested foreach (one with a { ... } block) as the input to another nested foreach. In your case, though, the first foreach does not need a block at all. Try this:

my_out = foreach (foreach (group my_in by id)
  generate
    group as id,
    CountEach(my_in.domain) as domains,
    BagGroup(my_in.(keyword,weight),my_in.keyword) as grouped
  ) {
    keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight;
    generate id, domains, keywords;
  };

As for your second question, no, this will make no difference to the eventual MR plan. This is purely a matter of Pig parsing your script; the map-reduce logic is unchanged by grouping the commands in this way.
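
One way to check this yourself is Pig's EXPLAIN operator, which prints the logical, physical, and map-reduce plans for an alias. The following is only a rough sketch: it assumes my_in is already loaded and that the BagGroup and CountEach UDFs are defined as in the question. Run each version of the script in its own session and compare the map-reduce plan sections of the output.

```pig
-- After running the two-stage version of the script:
explain my_out1;

-- After running the combined version (in a separate session,
-- because both versions define the alias my_out):
explain my_out;
```

If both map-reduce plans show the same number of jobs, combining the statements has not saved a pass.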

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow