bulk-load: don't import duplicate values

https://stackoverflow.com/questions/12008201

26-06-2021

Question

Every day I need to import a file containing yesterday's snapshot of a database. To import it, I use the following command in the shell:

./bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    '-Dimporttsv.separator=|' \
    -Dimporttsv.columns=HBASE_ROW_KEY,info:date,info:author,info:text \
    tableName \
    inputFile.tsv

The problem is that each line contains all the values, not just the updated ones, so every column ends up with several versions that all hold the same value.

Is there another way to import this daily snapshot that ignores the duplicate values? Or any suggestion for a workaround?

Thank you!

Solution

I guess that if you really want to ignore existing values, you'd need to write your own MapReduce job instead of using the ImportTsv program.
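If the previous day's snapshot file is still around, a lighter alternative to a custom job is to drop the unchanged lines before the import. This is not the map/reduce approach mentioned above, just a rough sketch of the same idea applied to the input file. The function name `changed_rows` and the file names are made up for illustration, and it assumes each row sits on one line and that unchanged rows are byte-for-byte identical between the two files.

```shell
# Print the lines of today's snapshot that are new or changed compared with
# yesterday's snapshot. Rows deleted since yesterday are not reported.
#   $1 = previous snapshot file, $2 = current snapshot file
changed_rows() {
    prev_sorted=$(mktemp)
    # comm needs both inputs sorted the same way, so fix the locale.
    LC_ALL=C sort "$1" > "$prev_sorted"
    # -13 suppresses lines unique to the first file and lines common to both,
    # leaving only the lines that appear in the current snapshot alone.
    LC_ALL=C sort "$2" | LC_ALL=C comm -13 "$prev_sorted" -
    rm -f "$prev_sorted"
}

# Hypothetical usage:
#   changed_rows yesterday.tsv today.tsv > changed.tsv
```

The resulting `changed.tsv` could then be put wherever inputFile.tsv normally lives and passed to the same ImportTsv command, so only rows that actually differ get a new version.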

However, what's the problem with multiple versions? First, you can set the number of versions HBase keeps (when you define a column family). Second, when you read, you can read just the latest version. Lastly, if you are worried about storage, you can set up HBase to use compression.
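As a rough sketch of those three settings, the HBase shell commands below keep one version for the `info` family, turn on compression, and read a single cell. The script file name `limit_versions.hbase` and the row key `someRowKey` are placeholders, `GZ` is just one of the available codecs, and depending on the HBase version the table may need to be disabled before `alter` and enabled again afterwards.

```shell
# Write the HBase shell commands to a script file (the file name is arbitrary).
cat > limit_versions.hbase <<'EOF'
alter 'tableName', {NAME => 'info', VERSIONS => 1, COMPRESSION => 'GZ'}
get 'tableName', 'someRowKey', {COLUMN => 'info:text', VERSIONS => 1}
exit
EOF

# Then run the script non-interactively from the HBase directory:
#   ./bin/hbase shell limit_versions.hbase
```

A `get` returns only the newest version by default, so `VERSIONS => 1` on the read just makes that explicit.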

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow