Question

I am using AvroStorage like this:

STORE alias INTO '$OUTPUT'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage('{
    "index" : 1,
    "schema_uri": "file://path/schema.avsc"}');

So the schema.avsc is explicitly taken from the local file system, not HDFS.

It works in a pseudo-distributed cluster, but fails on a normal cluster with a java.io.FileNotFoundException for the schema file. It looks like this happens in the backend.

I assume this is because the backend invocation of AvroStorage runs on a node other than the one I launch the Pig script from, so it cannot find the file in its local file system. Why can't it use the schema file from the front-end invocation? Does this mean I am limited to either HDFS locations for schema_uri or embedding the schema string in the AvroStorage parameters?


Solution

It turned out to be a limitation of the AvroStorage from piggybank: http://www.mail-archive.com/user%40pig.apache.org/msg09000.html
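The HDFS option raised in the question could look roughly like the sketch below. It assumes the schema file has already been uploaded to a location every node can read; the path /schemas/schema.avsc is hypothetical and only there for illustration.

```
-- Sketch only: assumes the schema was uploaded beforehand, for example with
--   hadoop fs -put schema.avsc /schemas/schema.avsc
-- The HDFS path below is a hypothetical placeholder.
STORE alias INTO '$OUTPUT'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage('{
    "index" : 1,
    "schema_uri": "hdfs:///schemas/schema.avsc"}');
```

Because the backend tasks resolve the URI themselves, the file has to live on a file system that is visible from every node, which a local file:// path is not.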

For now I am using this workaround:

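-- Resolve the launch directory, then read the schema on the front end
-- so that its contents (not a path) are embedded in the script.
%declare WORK_DIR `pwd`
%declare SCHEMA_LITERAL `cat $WORK_DIR/schema.avsc`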

...

STORE inputs INTO 'output'
    USING com.magnetic.org.apache.pig.piggybank.storage.avro.AvroStorage('{
    "index" : 1,
    "schema": $SCHEMA_LITERAL}');
Licensed under: CC-BY-SA with attribution