First things first: according to the cluster, I'm running Apache Pig version 0.11.0-cdh4.3.0 (rexported). My build, however, uses 0.11.0-cdh4.5.0. I know that mixing versions isn't a smart decision, but I don't think it is related to the issue I'm experiencing here, since both are Pig v0.11.0.
I have a script that is structured like this (both custom UDFs return DataByteArray, which is a valid Pig type as far as I know):
LOAD USING parquet.pig.ParquetLoader();
FOREACH GENERATE some of the fields
GROUP BY (a,b,c)
FOREACH GENERATE FLATTEN(group) AS (a,b,c), CustomUDF1(some_value) AS d
FOREACH GENERATE FLATTEN(CubeDimensions(a,b,c)) AS (a,b,c) , d
GROUP BY (a,b,c)
FOREACH GENERATE FLATTEN(group) AS (a,b,c), SUM(some_value), CustomUDF2(some_value)
STORE USING parquet.pig.ParquetStorer();
Pig splits this into two MapReduce jobs. I'm not sure whether CubeDimensions runs in the first job or the second, but I suspect it runs in the reduce stage of the first.
If so, the map stage of the second job does nothing more than read the intermediate data, and that is where this error occurs:
"Unexpected data type 49 found in stream." @ org.apache.pig.data.BinInterSedes:422
I've seen the number be both 48 and 49, and neither is defined in the BinInterSedes class:
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.pig/pig/0.11.0-cdh4.3.0/org/apache/pig/data/BinInterSedes.java?av=f
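To make clear how I read the error, here is a minimal sketch of a type-tagged reader. It is only my mental model, not Pig's actual BinInterSedes code: the class name, the marker values, and the two supported types are all made up for illustration. The reader consumes one marker byte, dispatches on it, and fails on any byte it does not recognise.

```java
import java.nio.ByteBuffer;

// Illustration only: this is NOT Pig's BinInterSedes.
// The marker values below are hypothetical.
class TaggedStreamSketch {
    static final byte INT_MARKER = 1;   // hypothetical marker for a 4-byte int
    static final byte LONG_MARKER = 2;  // hypothetical marker for an 8-byte long

    // Reads one value: a type-marker byte followed by its payload.
    static Object readDatum(ByteBuffer in) {
        byte type = in.get();
        switch (type) {
            case INT_MARKER:
                return in.getInt();
            case LONG_MARKER:
                return in.getLong();
            default:
                // Reached when the byte is not a known marker, for example
                // because the writer used a marker this reader does not know,
                // or because the reader is misaligned in the stream.
                throw new IllegalStateException(
                        "Unexpected data type " + type + " found in stream.");
        }
    }
}
```

With this sketch, `readDatum(ByteBuffer.wrap(new byte[] {1, 0, 0, 0, 42}))` returns the integer 42, while a buffer whose first byte is 49 ends up in the default branch. In other words, my understanding is that the reader in the second job met a marker byte it has no case for.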
But since this is Pig's own intermediate output, I don't see where it could have gone wrong. Both of my custom UDFs return a valid type, and I would expect Pig to write its intermediate data using only types it knows.
Any help would be greatly appreciated.