Question

I'm trying to wrap the Apache avro-tools (avro-tools-1.7.5.jar) concat utility so that it handles globbing properly (as written, every file to be concatenated has to be spelled out in full). The utility is needed because a regular cat (or, in my case, hadoop fs -cat) produces data with headers in the middle, which causes errors in Avro parsing utilities. I wrote a simple shell script named catavro.sh:

avrohdfs='hadoop jar /path/to/jar/avro-tools-1.7.5.jar'
DIRS=`hadoop fs -ls $1 | egrep '.avro' | awk '{print $8}'`
echo `$avrohdfs concat $DIRS -`  # dash (-) tells utility to print to stdout

This runs, but if I run bash catavro.sh [path to avro data on hdfs] > tmp.avro the result is a corrupted Avro file. Errors are thrown when trying to read the schema or do anything else with it. If I replace the last line in the shell script with:

echo $avrohdfs concat $DIRS -

to print the command only, and then run the resulting command on my terminal, storing the result into the same tmp.avro file, things work nicely. The filesize of tmp.avro is a bit bigger.

I'm pretty sure that the echo command is to blame here: it appears to be corrupting the binary data produced by the avro-tools concat utility. Replacing the last line of the sh file with:

`$avrohdfs concat $DIRS - > tmp.avro`

yields the same (successful) result as returning the command and running it in the terminal. However, though I can write to files, it would be much better for me if I could return this to stdout so that it can be piped into a filtering utility I have.

What are the alternatives to echo? I have tried replacing the last line in the script with all of the following without success:

echo -E `$avrohdfs concat $DIRS -`
`$avrohdfs concat $DIRS -`
$($avrohdfs concat $DIRS -)
cat `$avrohdfs concat $DIRS -`
trap `$avrohdfs concat $DIRS -` 0 #sort of works, but data misses the ">tmp.avro" and spits into the terminal

I have checked the PATH variable, it is the same in my terminal and in the shell script. Any help is much appreciated, thanks.


Solution

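The problem is that the shell performs word splitting on the unquoted command substitution before echo ever sees it, so newlines and runs of spaces in the output collapse into single spaces. You can either use double quotes to prevent the splitting, although command substitution still strips trailing newlines and bash drops NUL bytes, so binary output may still come out altered: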

echo "`$avrohdfs concat $DIRS -`"

or, better, just run the command directly, without echo or command substitution, so that its output goes straight to the script's stdout:

$avrohdfs concat $DIRS -
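
To see the effect without a Hadoop cluster, here is a minimal sketch that compares the three forms by byte count. The function emit_bytes is a hypothetical stand-in for the avro-tools concat call (it is not part of avro-tools); it just prints a few bytes containing repeated spaces, blank lines, and trailing newlines.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for "$avrohdfs concat $DIRS -": emits 15 bytes that
# include a control byte, double spaces, and embedded/trailing newlines.
emit_bytes() {
  printf 'Obj\001  a\n\nb  c\n\n'
}

# Count bytes; tr strips the padding some wc implementations add.
count() { wc -c | tr -d ' '; }

direct_size=$(emit_bytes | count)              # command run directly: stream untouched
unquoted_size=$(echo `emit_bytes` | count)     # unquoted substitution: whitespace collapsed
quoted_size=$(echo "`emit_bytes`" | count)     # quoted: inner whitespace kept, trailing newlines altered

echo "direct=$direct_size unquoted=$unquoted_size quoted=$quoted_size"
```

If the three counts differ, the forms that go through echo have altered the stream. Only the direct form hands the bytes to stdout unchanged, which is why the script can then be piped into another tool.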
Licensed under: CC-BY-SA with attribution