Question

We are copying data from various sources such as Oracle and Teradata to HDFS using Sqoop. We use the incremental-import feature to 'import' new data and then 'merge' it with the existing data. The data first lands in a temporary directory; we then 'remove' the old directory and 'rename' the new one into its place.
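For illustration, a minimal sketch of the kind of swap described above; the connection string, table name, columns, and paths are all hypothetical:

    sqoop import --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
        --table ORDERS \
        --incremental lastmodified --check-column UPDATED_AT \
        --last-value "2014-01-01 00:00:00" \
        --target-dir /data/orders_delta
    sqoop merge --new-data /data/orders_delta --onto /data/orders \
        --target-dir /data/orders_merged \
        --merge-key ORDER_ID --jar-file ORDERS.jar --class-name ORDERS

    # The swap: any query reading /data/orders at this instant
    # loses its input files mid-run
    hdfs dfs -rm -r /data/orders
    hdfs dfs -mv /data/orders_merged /data/orders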

The problem is that if a user is running a query against the data on HDFS with a tool such as Hive while we swap the directories, the query terminates abnormally.

Is there a better way to handle the updates on HDFS?

(Please note that even though HBase keeps different versions, it doesn't work for us because we want to query by any column. HBase is very slow when you don't search by the row key.)


Solution

Hadoop is not designed to work like that: HDFS is good at storing data, not editing it in place. I would simply add the new data beside the old data and, while it is being written (by a copy or any other import), give the file a .tmp suffix. I haven't used Hive much (I'm a Pig user), but in Pig I can write A = LOAD '/some/path/to/hdfs/*.log' and that loads all the completed files while ignoring the .tmp files that are still importing. Once an import finishes, rename the file to drop the suffix. Since nothing is ever deleted or moved out from under a running query, the problem goes away.
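To make the pattern concrete, here is a minimal sketch (the paths and file names are hypothetical). The new batch is written under a name the load glob will not match, then renamed once complete; an HDFS rename within the same directory is atomic, so a running query either sees the finished file or doesn't see it at all:

    # Write the new batch under a .tmp suffix; the *.log glob ignores it
    hdfs dfs -put new_batch.log /data/logs/batch-0042.log.tmp

    # When the import completes, rename to make it visible to readers
    hdfs dfs -mv /data/logs/batch-0042.log.tmp /data/logs/batch-0042.log

For Hive and plain MapReduce, note that Hadoop's default input filter also skips files whose names begin with _ or ., which gives the same effect without relying on the load glob.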

Licensed under: CC-BY-SA with attribution