Is it possible to recover corrupted Zlib data beyond the corrupted section?

https://stackoverflow.com/questions/9350725

30-04-2021
|

Question

I have a Zip archive with a large (important) file that will not extract. All Zip utilities that I've tried, including those that claim to recover/fix broken Zip archives are unable to extract the file containing the corrupted zlib compressed data. They get all the files in the archive except for the damaged entry, which gets skipped.

I've written a small utility app in C# that parses the zip archive, identifies each entry and parses out the fields, decrypts the data sections, and then decompresses them using a DeflateStream (from a .Net implementation of zlib). Everything works fine until I get to the damaged entry. The damaged entry decrypts successfully and fully (using AES in CTR mode), but the DeflateStream reader is only able to get through about 40MB into the decrypted data before throwing "Bad state (oversubscribed dynamic bit lengths tree)".

Is it possible to somehow 'seek' past the damaged section and continue decompressing the data? I'd like to recover as much of the file as possible, even if there are some holes. The DeflateStream doesn't implement a Seek method, and if I attempt to create a new DeflateStream with the underlying FileStream positioned to the last read position, it throws the same "Bad State" exception.

Solution

The deflate protocol adapts the tables based on a sliding window and a forward-stream.

This isn't a block -- each section isn't standalone unit of data so there is no way to just "skip" a bad block -- but rather a moving "view" of the streams used to compute/restore the table information. That being said, my simple conclusion: not realistically possible :(

See An Explanation of the "Deflate" Protocol.

Less-than-grumpy data restorations.

The deflate protocol actually has 'blocks' which allow switching between the compressor used. However, I would doubt they are usable for any sort of recovery. It is a far cry from MPEG-4, which is actually rather recoverable.

From RFC 1951, which shows how complicated it would be:

Note that the header bits do not necessarily begin on a byte boundary, since a block does not necessarily occupy an integral number of bytes.

And talking about the LZ77 compressor able to span the individual blocks: (It is the entirety of the stream within a window that is used to build the compression information).

Note that a duplicated string reference may refer to a string in a previous block; i.e., the backward distance may cross one or more block boundaries.

However, there is a hint of hope:

The compressor terminates a block when it determines that starting a new block with fresh trees would be useful, or when the block size fills up the compressor's block buffer.

Not a ray of light I would chase :-/ Perhaps individual bit-seeking (not all bit sequences are valid block starts) and then running the algorithm until it failed (provably "invalid" deflate), then backing off and trying again until such a starting-bit didn't generate invalid states within some 'acceptable' period.

This method would require modifications to a deflate protocol engine -- it is not a task a normal deflate stream processor could handle.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow