Question

In Section 2.3 of the paper The Google File System, it says:

Files are divided into fixed-size chunks.

But it doesn't say why. What are the advantages of that?


Solution

As far as I know, there are several reasons:

  1. Files stored in GFS are very large, up to petabytes; no single disk is big enough to hold such a file, so it has to be split into pieces that can be spread across machines.
  2. Compared with variable-size chunks, fixed-size chunks make indexing and lookup simple: the chunk holding any byte offset is found by integer division (see the sketch after this list).
  3. The chunk size itself is not small (around 64 MB), which greatly reduces the amount of metadata the GFS master has to keep.
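
To make point 2 concrete, here is a minimal Python sketch (the function names and the CHUNK_SIZE constant are my own illustrations, not GFS's API) showing how a fixed chunk size turns offset-to-chunk lookup into plain integer arithmetic:

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size used in the GFS paper

    def chunk_index(byte_offset: int) -> int:
        # With a fixed chunk size, the chunk holding any offset is one division away.
        return byte_offset // CHUNK_SIZE

    def chunks_needed(file_size: int) -> int:
        # Number of chunk entries the master must track for a file of this size.
        return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE

    print(chunk_index(200 * 1024 * 1024))  # offset 200 MB falls in chunk 3
    print(chunks_needed(1024 ** 4))        # a 1 TB file needs only 16384 entries

With variable-size chunks, the master would instead need a per-file lookup structure to answer the same question.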

OTHER TIPS

Ease of replication. It is easier to replicate individual chunks than an entire file. If an error occurs during replication, only the failed chunk needs to be copied again.

Balanced server load. Both read and write operations can be spread across all chunkservers.

Higher throughput for both reading and writing. Throughput improves on both paths because hundreds of chunkservers can serve requests simultaneously: an application gets the chunk metadata for a file from the master and then reads or writes those chunks directly on the chunkservers.
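
A rough sketch of that two-step flow, assuming hypothetical master.lookup and chunkserver read_chunk interfaces (the real GFS RPCs are not named this way):

    CHUNK_SIZE = 64 * 1024 * 1024

    def gfs_read(master, filename: str, offset: int, length: int) -> bytes:
        data = b""
        while length > 0:
            index = offset // CHUNK_SIZE
            # Step 1: ask the master for metadata only (chunk handle + replicas).
            handle, replicas = master.lookup(filename, index)
            # Step 2: fetch the bytes directly from one of the chunkservers,
            # keeping the master out of the data path.
            start = offset % CHUNK_SIZE
            n = min(length, CHUNK_SIZE - start)
            data += replicas[0].read_chunk(handle, start, n)
            offset += n
            length -= n
        return data

Because the master only answers small metadata requests, it does not become a bottleneck even when many clients stream data at once.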

Better disk utilization. If your files tend to be larger than a chunk and your disks have little free space left, it is easier to find enough room for a single chunk than for the entire file.

Easier integrity checks. Computing the checksum of a chunk is faster than computing it over an entire file, and when a corrupted chunk is detected, it is also easier to repair just that chunk instead of the entire file.
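
For example, a minimal Python sketch of per-chunk verification (using zlib.crc32 as a stand-in checksum; the real GFS actually checksums 64 KB blocks inside each chunk, but the principle is the same):

    import zlib

    CHUNK_SIZE = 64 * 1024 * 1024

    def chunk_checksums(data: bytes) -> list[int]:
        # One checksum per fixed-size chunk.
        return [zlib.crc32(data[i:i + CHUNK_SIZE])
                for i in range(0, len(data), CHUNK_SIZE)]

    def corrupted_chunks(data: bytes, expected: list[int]) -> list[int]:
        # Only the chunks listed here need to be re-fetched from a healthy
        # replica; the rest of the file is left untouched.
        return [i for i, c in enumerate(chunk_checksums(data))
                if c != expected[i]]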

This concept mirrors what the underlying OS and DBMSs do: both use fixed-size pages/blocks for virtual memory and for data placement on disk. Fixed-size blocks help with fragmentation, the problem where space freed by removing a file becomes hard to reuse, which is why those systems keep their block sizes small. GFS, on the other hand, is used mainly for post-processing workloads where deletes are rare, so it can afford much larger blocks. Fixed-size blocks also make it very easy to run MapReduce jobs over the data, since each chunk is a natural unit of work.

This way the client can ask for particular chunks, knowing that each one is at most 64 MB in size, and can also make better use of caching.
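
A small sketch of that kind of client-side cache (the master.lookup interface is again a hypothetical stand-in): because chunk boundaries never move, (filename, chunk index) is a stable cache key.

    CHUNK_SIZE = 64 * 1024 * 1024

    class ChunkLocationCache:
        def __init__(self, master):
            self.master = master  # hypothetical handle to the GFS master
            self._cache = {}      # (filename, chunk index) -> replica locations

        def locate(self, filename: str, offset: int):
            # Fixed-size chunks make the key deterministic for any offset.
            key = (filename, offset // CHUNK_SIZE)
            if key not in self._cache:
                self._cache[key] = self.master.lookup(*key)
            return self._cache[key]

The GFS paper notes that clients cache chunk locations in exactly this spirit, so most reads and writes skip the master entirely.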
