Question

I come from a SQL Server background, so it can be a bit difficult to picture exactly what happens to data when it goes into Hadoop.

My understanding is that if you have a book in text format of around 200 KB, you can simply copy the data into Hadoop and it becomes searchable. But does this data become part of a larger block so that HDFS can be more efficient, or does it remain a 200 KB file in HDFS, hurting performance?

Also, is a block what is often called a tablet in Bigtable?

Thanks a lot for your help. FlyMario


Solution

A file smaller than the HDFS block size (64 MB by default in older releases, 128 MB in Hadoop 2.x and later) is stored in a single block, yes, and that block only occupies as much disk space as the file actually needs; blocks are not padded out to their full size. But many small files like these can still hurt performance: each file's metadata is held in the NameNode's memory, and a MapReduce job over them will typically launch one map task per file, so the per-file overhead adds up quickly.
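If you want to verify this yourself, here is a minimal sketch using the HDFS Java API; the path /books/moby-dick.txt is just a hypothetical example, so substitute a file you've copied in. It prints the file's length, its block size, and how many blocks the file actually spans. A 200 KB file should report a single block.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockCheck {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS and friends from core-site.xml on the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path -- replace with your own file
            Path file = new Path("/books/moby-dick.txt");
            FileStatus status = fs.getFileStatus(file);

            // Block size is recorded per file when the file is written
            System.out.println("File length : " + status.getLen() + " bytes");
            System.out.println("Block size  : " + status.getBlockSize() + " bytes");

            // One BlockLocation per block the file occupies
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            System.out.println("Blocks used : " + blocks.length);

            fs.close();
        }
    }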

Vanilla Hadoop has nothing to do with Bigtable, and HDFS blocks aren't really comparable to tablets. An HDFS block has no knowledge of the data it holds, whereas a Bigtable tablet is data-aware: it covers a contiguous range of rows. The closest analogue to a tablet in the Hadoop ecosystem is a region in HBase, which is modeled on Bigtable.

Licensed under: CC-BY-SA with attribution