Question

I need to import a file every day that contains yesterday's snapshot of a database. To import it, I use the following shell command:

./bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    '-Dimporttsv.separator=|' \
    -Dimporttsv.columns=HBASE_ROW_KEY,info:date,info:author,info:text \
    tableName \
    inputFile.tsv

The problem is that each line contains all the values, not just the updated ones, so every column ends up with several versions that all hold the same value.

Is there another way to import this daily snapshot that ignores the duplicate values? Or any suggestions for working around this?

Thank you!

Solution

I guess that if you really want to ignore existing values, you'd need to write your own map/reduce job instead of using the import program.
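
If yesterday's input file is still around, a lighter-weight option than a custom map/reduce job may be to filter the new file before importing it. The sketch below is not the map/reduce approach itself but a swapped-in alternative: it prints only the lines of the new snapshot that do not appear verbatim in the previous one. The function name changed_lines and the file names in the usage comment are hypothetical. It works on whole lines, so a row with one changed column is still re-imported with all of its columns, and rows deleted from the database are not handled.

```shell
# Print the lines that are in the current snapshot but not in the previous one.
# Assumes both files use the same separator and column order.
changed_lines() {
    prev="$1"
    curr="$2"
    prev_sorted=$(mktemp)
    curr_sorted=$(mktemp)
    # comm requires sorted input; LC_ALL=C keeps sort and comm consistent.
    LC_ALL=C sort "$prev" > "$prev_sorted"
    LC_ALL=C sort "$curr" > "$curr_sorted"
    # -13 hides lines unique to the previous file and lines common to both.
    LC_ALL=C comm -13 "$prev_sorted" "$curr_sorted"
    rm -f "$prev_sorted" "$curr_sorted"
}

# Usage (hypothetical file names):
#   changed_lines yesterday.tsv today.tsv > changed.tsv
# and then pass changed.tsv to ImportTsv instead of the full snapshot.
```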

However, what's the problem with multiple versions? First, you can set the number of versions HBase keeps when you define a column family. Second, when you read, you can read just the latest version. Lastly, if you are worried about storage, you can set up HBase to use compression.
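
As a rough illustration of the three options above, the HBase shell can be driven from the same command line. This is a sketch under assumptions: the table name tableName and the family info come from the question, someRowKey is a placeholder, and GZ is just one possible codec (others such as SNAPPY may need native libraries). Older HBase releases may require disabling the table before alter, and cells beyond the version limit are typically only removed at the next major compaction.

```shell
./bin/hbase shell <<'EOF'
# Keep only one version per cell in the 'info' column family.
alter 'tableName', NAME => 'info', VERSIONS => 1

# Compress the store files of the 'info' column family.
alter 'tableName', NAME => 'info', COMPRESSION => 'GZ'

# Read only the latest version of a single column.
get 'tableName', 'someRowKey', {COLUMN => 'info:text', VERSIONS => 1}
EOF
```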

Licensed under: CC-BY-SA with attribution