Question

I'm trying to wrap the concat utility from Apache avro-tools (avro-tools-1.7.5.jar) so that it handles globbing properly (as written, every file to be concatenated has to be spelled out in full). The utility is needed because a regular cat (or in my case hadoop fs -cat) produces data with headers in the middle, which causes errors in Avro parsing utilities. I wrote a simple shell script named catavro.sh:

avrohdfs='hadoop jar /path/to/jar/avro-tools-1.7.5.jar'
DIRS=`hadoop fs -ls $1 | egrep '.avro' | awk '{print $8}'`
echo `$avrohdfs concat $DIRS -`  # dash (-) tells utility to print to stdout

This runs, but if I run bash catavro.sh [path to avro data on hdfs] > tmp.avro the result is a corrupted Avro file. Errors are thrown when trying to read the schema or do anything else with it. If I replace the last line in the shell script with:

echo $avrohdfs concat $DIRS -

to print the command only, and then run the resulting command in my terminal, storing the result in the same tmp.avro file, things work nicely. The resulting tmp.avro is also slightly larger than the corrupted one.

I'm pretty sure that the echo command is to blame here; it appears to be corrupting the binary data produced by the avro-tools concat utility. Replacing the last line of the script with:

`$avrohdfs concat $DIRS - > tmp.avro`

yields the same (successful) result as returning the command and running it in the terminal. However, though I can write to files, it would be much better for me if I could return this to stdout so that it can be piped into a filtering utility I have.

What are the alternatives to echo? I have tried replacing the last line in the script with all of the following without success:

echo -E `$avrohdfs concat $DIRS -`
`$avrohdfs concat $DIRS -`
$($avrohdfs concat $DIRS -)
cat `$avrohdfs concat $DIRS -`
trap `$avrohdfs concat $DIRS -` 0 #sort of works, but data misses the ">tmp.avro" and spits into the terminal

I have checked the PATH variable; it is the same in my terminal and in the shell script. Any help is much appreciated, thanks.

Solution

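The problem is the unquoted command substitution, not echo itself. The shell splits the substituted output into words on whitespace, and echo then prints those words joined by single spaces, so newlines become spaces and runs of whitespace collapse. Double quotes prevent the splitting: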

echo "`$avrohdfs concat $DIRS -`"

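or, better, run the command directly, with no echo and no command substitution. Even the quoted form is not byte-exact: command substitution strips trailing newlines and cannot carry NUL bytes, and echo appends a newline of its own, any of which can corrupt binary data: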

$avrohdfs concat $DIRS -
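
The effect is easy to reproduce without Hadoop. The following is a minimal sketch, not part of the original script: emit is a hypothetical stand-in for $avrohdfs concat $DIRS - that prints ten bytes containing newlines and a run of spaces, and count is a small helper that reports how many bytes reach stdout for each way of forwarding the output.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `$avrohdfs concat $DIRS -`:
# 10 bytes with embedded newlines, a run of spaces, and trailing newlines.
emit() { printf 'a\n\nb   c\n\n'; }

# Helper (assumption): byte count of stdin, with wc's padding removed.
count() { wc -c | tr -d ' '; }

direct=$(emit | count)            # run directly: bytes pass through untouched
unquoted=$(echo `emit` | count)   # word-split, then rejoined with single spaces
quoted=$(echo "`emit`" | count)   # inner whitespace kept, trailing newlines stripped

echo "direct=$direct unquoted=$unquoted quoted=$quoted"
# prints: direct=10 unquoted=6 quoted=9
```

Only the direct form preserves all ten bytes. With the last line of catavro.sh changed to the plain command, bash catavro.sh [path to avro data on hdfs] writes the unmodified stream to stdout, so it can be redirected to a file or piped into another utility.
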
Licensed under: CC-BY-SA with attribution