Question

This two-stage Pig processing works:

my_out = foreach (group my_in by id) {
  grouped = BagGroup(my_in.(keyword,weight),my_in.keyword);
  generate
    group as id,
    CountEach(my_in.domain) as domains,
    grouped as grouped;
};
my_out1 = foreach my_out {
  keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight;
  generate id, domains, keywords;
};

However, when I combine them into one statement:

my_out = foreach (foreach (group my_in by id) {
  grouped = BagGroup(my_in.(keyword,weight),my_in.keyword);
  generate
    group as id,
    CountEach(my_in.domain) as domains,
    grouped as grouped;
  }) {
    keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight;
    generate id, domains, keywords;
  };

I get an error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " <IDENTIFIER> "generate "" at line 1, column 5.

My questions are:

  1. How do I avoid this error?
  2. Does what I am trying to do even make sense? Even if I manage to accomplish it, will it save me an MR pass?

Solution

In general, Pig's ability to parse complicated nested expressions is unreliable. Another common error when the nesting gets to be too much to handle is ERROR 1000: Error during parsing. Lexical error at line XXXX, column 0. Encountered: <EOF> after : ""

I often try to do this to avoid having to come up with a bunch of names for aliases that have no meaning except as intermediate steps in a computation. But sometimes it's not possible, as you have found out. My guess is that nesting a nested foreach is a no-go. But in your case, it looks like the first nested foreach is not necessary. Try this:

my_out = foreach (foreach (group my_in by id)
  generate
    group as id,
    CountEach(my_in.domain) as domains,
    BagGroup(my_in.(keyword,weight),my_in.keyword) as grouped
  ) {
    keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight;
    generate id, domains, keywords;
  };

As for your second question: no, this will make no difference to the eventual MapReduce plan. It is purely a matter of how Pig parses your script; the MapReduce logic is unchanged by combining the statements in this way.
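
If you want to check this yourself, Pig's explain operator prints the logical, physical, and MapReduce plans for an alias. A minimal sketch, assuming my_in is already loaded and the CountEach/BagGroup UDFs are registered: run it on the final alias of each version and compare the MapReduce plans, which should come out the same.

-- In the two-stage script:
explain my_out1;

-- In the combined script (using the rewrite above):
explain my_out;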

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow