What's the difference between two concatenated bz2 files and one bz2 file made from two concatenated files?

StackOverflow https://stackoverflow.com/questions/14720004

06-03-2022

Question

If I have two text files, one and two, what's the difference between:

bzip2 one two -c >out.bz2

...and...

cat one two | bzip2 -c >out.bz2

?

Specifically, I'm generating bz2 files using pbzip2, putting them on HDFS, then reading them from Pig, and I'm hitting MAPREDUCE-477. I can't upgrade my Hadoop cluster from version 0.20, a non-parallel bzip2 implementation is too slow, and I don't want to switch to a non-block compression algorithm.

Is there any way I can convert a concatenated bz2 file into a non-concatenated one? Or even, how would I modify pbzip2 so it generates non-concatenated bz2 files?

Thanks -


Solution

Often compression works by replacing patterns with something shorter. For example, if you have "Hello there, goodbye there" then you might replace the second "there" with a reference to the first (where the reference is smaller than the original 5 bytes).

Now imagine you have two files, one containing "Hello there" and the other containing "Goodbye there". If you concatenate and then compress, the compressor has more data to work with and can replace the second "there" with a reference to the first. If you compress both files separately and then concatenate, this can't happen.
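As a rough illustration of this (a minimal sketch, assuming GNU bzip2 and coreutils; the file names and sizes are made up for the demo, and since bzip2 only shares patterns within a ~900 kB block, the gap can be small), you can compare the two pipelines from the question on overlapping data:

# Build two files with heavily overlapping content.
seq 1 100000 > one
cp one two

# One stream over all the data: the compressor can reuse patterns
# across the file boundary (within a block).
cat one two | bzip2 -c | wc -c

# Two independent streams written back to back: no sharing between files.
bzip2 -c one two | wc -c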

Now imagine you concatenate then compress, such that the second "there" (from the second file) is replaced with a reference to the first "there" (from the first file); and then try to split the compressed data back into two compressed files. You'd end up with two files where the second file contains a reference to something that doesn't exist in that file, so it can't be decompressed.
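You can see both halves of this in practice (again a sketch; the file names are arbitrary and the byte count passed to head is just an illustrative cut point):

seq 1 100000 > one
seq 1 100000 > two

# Two complete streams back to back: the bzip2 decompressor accepts
# this and decompresses the streams one after the other.
bzip2 -c one > streams.bz2
bzip2 -c two >> streams.bz2
bzcat streams.bz2 | wc -l        # 200000 lines: both files came back

# Cutting a single stream at an arbitrary byte offset does not yield
# two valid bz2 files; the pieces reference state that is no longer there.
cat one two | bzip2 -c > single.bz2
head -c 1000 single.bz2 | bzcat  # fails: a truncated stream can't be decoded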

Note: Modern compression techniques are a lot more complex than what I described above - I oversimplified a lot to illustrate.

If you need to compress and decompress a large amount of data in parallel, it can't be done as one monolithic stream. Instead you need to split the data into small pieces, so that each piece can be compressed/decompressed separately and many pieces can be processed in parallel.
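For the conversion asked about in the question, one workable (if slow) sketch is to decompress the multi-stream file and recompress it as a single stream; bzip2's decompressor handles concatenated streams, so the round trip is lossless. The file names here are placeholders:

# Decompress the concatenated (multi-stream) file and recompress it
# serially, producing a single bzip2 stream. This runs single-threaded,
# so it pays exactly the speed cost the question is trying to avoid.
bzcat concatenated.bz2 | bzip2 -c > single-stream.bz2

That makes it usable as a one-off fix-up step, but not as a replacement for pbzip2 inside the pipeline.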

License: CC-BY-SA with attribution