I'm trying to wrap the Apache avro-tools (avro-tools-1.7.5.jar) concat utility so it does globbing properly (as written, every file to be concatenated must be spelled out in full). This utility is needed because a regular cat (or, in my case, hadoop fs -cat) operation produces data with headers in the middle, which causes errors in Avro parsing utilities. I wrote a simple shell script named catavro.sh:
avrohdfs='hadoop jar /path/to/jar/avro-tools-1.7.5.jar'
DIRS=`hadoop fs -ls $1 | egrep '.avro' | awk '{print $8}'`
echo `$avrohdfs concat $DIRS -` # dash (-) tells utility to print to stdout
This runs, but if I invoke it as bash catavro.sh [path to avro data on hdfs] > tmp.avro, the result is a corrupted Avro file: errors are thrown when trying to read the schema, or do anything else with it. If I replace the last line of the script with:
echo $avrohdfs concat $DIRS -
to print the command only, and then run the printed command manually in my terminal, redirecting the result into the same tmp.avro file, everything works nicely. That version of tmp.avro is also slightly larger.
I'm pretty sure the echo command is to blame here: it appears to be corrupting the binary data produced by the avro-tools concat utility. Replacing the last line of the script with:
`$avrohdfs concat $DIRS - > tmp.avro`
yields the same (successful) result as printing the command and running it in the terminal. However, while writing to a file works, it would be much better for me if I could send the data to stdout so that it can be piped into a filtering utility I have.
What are the alternatives to echo? I have tried replacing the last line of the script with each of the following, without success:
echo -E `$avrohdfs concat $DIRS -`
`$avrohdfs concat $DIRS -`
$($avrohdfs concat $DIRS -)
cat `$avrohdfs concat $DIRS -`
trap `$avrohdfs concat $DIRS -` 0 # sort of works, but the data bypasses the ">tmp.avro" redirection and spills into the terminal
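I suspect this is a general property of echo plus command substitution rather than anything Avro-specific; a minimal reproduction along those lines (no Hadoop involved, just a small file standing in for the binary Avro output) seems to confirm it:

```shell
# Create a small "binary" file containing a NUL byte and a blank line.
printf 'a\0b\n\nc' > orig.bin

# Round-trip it through the same echo-plus-backticks pattern the script uses:
# command substitution drops NUL bytes, and the unquoted expansion then
# undergoes word splitting, so echo reassembles the pieces with single spaces.
echo `cat orig.bin` > mangled.bin

if cmp -s orig.bin mangled.bin; then
    echo "identical"
else
    echo "files differ"   # prints "files differ"
fi
```

So any byte-for-byte fidelity is lost as soon as the output passes through backticks, which would explain the corrupted tmp.avro.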
I have checked the PATH variable; it is the same in my terminal and in the shell script. Any help is much appreciated, thanks.