Question

I'm working with some data in Pig that includes strings of interest, optionally separated by semicolons and in random order, e.g.

test=12345;foo=bar
test=12345
foo=bar;test=12345

The following code should extract the value for the 'test' key:

blah =
  FOREACH
    data
  GENERATE
    FLATTEN (
      EXTRACT (
        str_of_interest,
        'test=(\\S+);?'
      )
    )
    AS (
      test: chararray
    )
  ;

However, when running the code, I encounter the following error:

<line 46, column 0>  mismatched character '<EOF>' expecting '''
2013-04-16 04:46:05,245 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 46, column 0>  mismatched character '<EOF>' expecting '''

I thought I had my regex escape syntax off at first, but that doesn't appear to be the problem. The only information I get from a Google search is a bug report that appears to have been recently fixed, but it's still an issue on the Amazon EMR cluster I'm running (spun up ad hoc, just now, for this analysis).

As in the bug report and as suggested elsewhere, replacing the semicolon with its Unicode equivalent (\u003B) yields the same error.

I could be crazy and this could be a syntax issue, so I'm hoping someone might be able to point me in the right direction or confirm that this is an existing problem. If the latter, are there any workarounds (either in Pig, or for matching the string I want)?

Cheers


Solution

This is a known Pig bug that is slated to be fixed in 0.12 (see http://issues.apache.org/jira/browse/PIG-2507).

If you can't change the delimiter or wait for the new version to be released (on EMR this can take longer than the actual Apache release), I'd implement my own UDF and hardcode the regular expression in it. You can use RegexExtract as a starting point.
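As a rough illustration of the UDF idea, here is a minimal sketch of the matching logic with the regular expression hardcoded in Java, so the semicolon never passes through the Pig parser. The class name TestValueExtractor and the method extract are hypothetical. Note also that the question's pattern test=(\S+);? would let \S+ swallow the semicolon and everything after it, so this sketch assumes a character class that stops at the delimiter instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class TestValueExtractor {
    // The semicolon lives in Java source, so the Pig parser never sees it.
    // (?:^|;) anchors the key at the start of the string or right after a
    // delimiter; [^;\s]+ stops at the next semicolon or whitespace.
    private static final Pattern TEST_VALUE =
        Pattern.compile("(?:^|;)test=([^;\\s]+)");

    // Returns the value of the 'test' key, or null if the key is absent.
    static String extract(String input) {
        if (input == null) {
            return null;
        }
        Matcher m = TEST_VALUE.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // In an actual UDF this logic would sit inside a class extending
    // org.apache.pig.EvalFunc<String>, whose exec(Tuple input) would pass
    // (String) input.get(0) to extract(). That wrapper is omitted here so
    // the sketch compiles without the Pig jar on the classpath.
}
```

Once wrapped in an EvalFunc and packaged into a jar, the function would be made available with REGISTER and called on str_of_interest in place of EXTRACT; the Pig script then contains no semicolon inside a string literal.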

You could also build your own version of Pig with the patch applied, but I guess that's a bit more complicated.

OTHER TIPS

It looks like you are using Amazon's String Manipulation and DateTime Functions For Pig, since EXTRACT() isn't a built-in function.

Try switching to the built-in function REGEX_EXTRACT_ALL() instead.

Licensed under: CC-BY-SA with attribution