Question

I am using Apache PIG to reduce data originally stored in CSV format and want to write the output as Avro. Part of my PIG script calls a Java UDF that appends a few fields to the input Tuple and passes the modified Tuple back. To match, I modify the PIG output schema using:

Schema outSchema = new Schema(input).getField(1).schema;
Schema recSchema = outSchema.getField(0).schema;
recSchema.add(new FieldSchema("aircrafttype", DataType.CHARARRAY));

This code is inside the public Schema outputSchema(Schema input) method of my UDF.

Within the exec method, I append java.lang.String values to the input Tuple and return the edited Tuple to the PIG script. This and all subsequent operations work fine, and if I output to CSV format using PigStorage(',') there are no problems. When I attempt to output using

STORE records INTO '$out_dir' USING org.apache.pig.piggybank.storage.avro.AvroStorage('
{
"schema":{ 
  "type":"record", "name":"my new data",
  "fields": [
    {"name":"fld1", "type":"long"},
    {"name":"fld2", "type":"string"}
  ]}
}');

I get the following error:

java.io.IOException: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.avro.util.Utf8

I have attempted appending the character fields to the Tuple (within my UDF) as char[] and Utf8 types, but that makes PIG angry before I even get to trying to write out data. I have also attempted modifying my Avro schema to allow for null types in every field.

I'm using PIG v0.11.1 and Avro v1.7.5. Any help is much appreciated.


Solution

This was a PIG version issue. My UDF was built into a jar-with-dependencies that included PIG v0.8.1, while the script itself ran on PIG v0.11.1. The mix of PIG versions 0.8.1 and 0.11.1 caused the problem; Avro had nothing to do with it.
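
A version clash like this can be hard to spot, because the stale classes are hidden inside the UDF jar. One way to check is to ask the JVM where a class was actually loaded from. The following is only a minimal sketch, not part of the original fix: the JarOrigin class and its locate methods are hypothetical helper names, and org.apache.pig.data.Tuple is used as the default probe simply because it is a core Pig class.

```java
import java.security.CodeSource;

/** Hypothetical helper: reports which jar or directory a class was loaded from. */
final class JarOrigin {

    /** Returns the code-source location of a class, or a marker if none is recorded. */
    static String locate(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // Classes from the bootstrap loader (e.g. java.lang.String) have no code source.
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    /** Looks the class up by name, so this file compiles without Pig on the classpath. */
    static String locate(String className) {
        try {
            return locate(Class.forName(className));
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Default probe is a core Pig class; pass other class names as arguments.
        String[] names = args.length > 0 ? args : new String[] {"org.apache.pig.data.Tuple"};
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

Run with the same classpath as the job. If the printed location for a Pig class is the UDF's own jar rather than the installed Pig jar, the UDF jar is shadowing the runtime. If the jar is built with Maven (an assumption, the question does not say), giving the Pig dependency the provided scope is the usual way to keep it out of a jar-with-dependencies.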

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow