Pig cannot read its own intermediate data

https://stackoverflow.com/questions/23653762

22-07-2023

Question

First things first: according to the cluster, I'm running Apache Pig version 0.11.0-cdh4.3.0 (rexported). My build, however, uses 0.11.0-cdh4.5.0. I know that isn't a smart decision, but I don't think it is related to the issue I'm experiencing here, since both are Pig v0.11.0.

I have a script which structurally looks like this (both custom UDFs return the DataByteArray type, which is a valid Pig type as far as I know):

LOAD USING parquet.pig.ParquetLoader();

FOREACH GENERATE some of the fields

GROUP BY (a,b,c)

FOREACH GENERATE FLATTEN(group) AS (a,b,c), CustomUDF1(some_value) AS d

FOREACH GENERATE FLATTEN(CubeDimensions(a,b,c)) AS (a,b,c) , d

GROUP BY (a,b,c)

FOREACH GENERATE FLATTEN(group) AS (a,b,c), SUM(some_value), CustomUDF2(some_value)

STORE USING parquet.pig.ParquetStorer();

Pig splits this up into two MapReduce jobs. I'm not sure whether CubeDimensions happens in the first or in the second, but I suspect it happens in the reduce stage of the first job.

So the map stage of the second job does nothing more than read the intermediate data, and that's where this happens:

"Unexpected data type 49 found in stream." @ org.apache.pig.data.BinInterSedes:422

I've seen the number be both 48 and 49, and neither exists in the BinInterSedes class:

http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.pig/pig/0.11.0-cdh4.3.0/org/apache/pig/data/BinInterSedes.java?av=f

But since this is Pig's own intermediate output, I don't quite get where it could have gone wrong. Both my custom UDFs return a valid type, and I would expect Pig to store only types it knows.

Any help would be greatly appreciated.

Solution

It appears that, by coincidence, the byte sequence that Pig's intermediate storage uses for line splitting also occurs in one of the byte arrays returned by the custom UDFs. This causes Pig to break up the line somewhere in the middle and start looking for a data type indicator at that position. Since that position is in the middle of the line, the byte found there is not a valid data type indicator, hence the error.
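The collision can be illustrated with a simplified model. This is a sketch, not Pig's actual reader: the marker bytes are the CTRL-A, CTRL-B, CTRL-C sequence named in this answer, while the class name, the helper methods, and the tuple indicator value are hypothetical and chosen only for the demonstration. The second record's payload happens to contain the marker bytes followed by the byte 49, so a scanner that searches for the marker reports a third, false record start.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MarkerCollisionDemo {
    // Marker bytes as described in the answer: CTRL-A, CTRL-B, CTRL-C.
    static final byte[] MARKER = {0x01, 0x02, 0x03};
    // Hypothetical placeholder for the tuple indicator; not taken from Pig's source.
    static final byte TUPLE_INDICATOR = 0x15;

    /** Returns the offset of the byte following every occurrence of the marker. */
    static List<Integer> recordStarts(byte[] stream) {
        List<Integer> starts = new ArrayList<>();
        // Stop early enough that a byte after the marker always exists.
        for (int i = 0; i + MARKER.length < stream.length; i++) {
            if (stream[i] == MARKER[0]
                    && stream[i + 1] == MARKER[1]
                    && stream[i + 2] == MARKER[2]) {
                starts.add(i + MARKER.length);
            }
        }
        return starts;
    }

    /** Serializes one record as marker + tuple indicator + raw payload. */
    static byte[] record(byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(MARKER, 0, MARKER.length);
        out.write(TUPLE_INDICATOR);
        out.write(payload, 0, payload.length);
        return out.toByteArray();
    }

    /** Concatenates two byte arrays. */
    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] clean = record(new byte[] {10, 20, 30});
        // This payload contains the marker bytes by coincidence, followed by 49.
        byte[] colliding = record(new byte[] {10, 0x01, 0x02, 0x03, 49, 30});
        byte[] stream = concat(clean, colliding);

        for (int start : recordStarts(stream)) {
            System.out.println("candidate record at offset " + start
                    + ", next byte = " + stream[start]);
        }
    }
}
```

Only two records were written, but the scan yields three candidates, and the byte after the false one is payload data rather than a valid indicator. The value 49 was picked to mirror the error message; which bytes actually followed the marker in the real data is not known from the question.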

I'm not entirely sure yet how I am going to fix this. @WinnieNicklaus already provided a good solution: split the script in two and store the data in between. Another option would be to have the UDF return a Base64-encoded byte array. That way there can never be a conflict with Pig's intermediate storage, since it uses CTRL-A, CTRL-B, CTRL-C, TUPLE-INDICATOR, none of which are alphanumeric characters.
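A minimal sketch of the Base64 idea follows, assuming a Java 8+ runtime so that java.util.Base64 is available. A CDH4-era cluster may well run an older JVM, in which case a library such as Apache Commons Codec would have to take its place. The class and method names are hypothetical; in a real UDF the encoded bytes would be wrapped in the DataByteArray that the function returns.

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64Payload {
    /** Encodes raw bytes so that only Base64 characters (A-Z, a-z, 0-9, '+', '/', '=') remain. */
    static byte[] encode(byte[] raw) {
        return Base64.getEncoder().encode(raw);
    }

    /** Restores the original bytes; the consuming UDF would call this first. */
    static byte[] decode(byte[] encoded) {
        return Base64.getDecoder().decode(encoded);
    }

    /** True if any byte is an ASCII control character (0x00 to 0x1F). */
    static boolean containsControlBytes(byte[] data) {
        for (byte b : data) {
            if (b >= 0 && b < 0x20) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Payload that contains CTRL-A, CTRL-B, CTRL-C by coincidence.
        byte[] raw = {10, 0x01, 0x02, 0x03, 49, 30};
        byte[] encoded = encode(raw);

        System.out.println("raw has control bytes:     " + containsControlBytes(raw));
        System.out.println("encoded has control bytes: " + containsControlBytes(encoded));
        System.out.println("round trip intact:         " + Arrays.equals(raw, decode(encoded)));
    }
}
```

The trade-off is size and an extra step: Base64 output is roughly a third larger than its input, and any UDF that consumes the value (CustomUDF2 in the script above, if it reads the output of CustomUDF1) has to decode it before use.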

Licensed under: CC-BY-SA with attribution
