Question

I have a file which contains plaintext mixed in with some compressed text, for example:

Version 01
Maker SomeCompany

l 73
mark
h�22V0P���w�/�+Q0���L)�66□ // This line was compressed using DeflateZLib
endmark

It seems that Microsoft has a solution, the DeflateStream class, but their examples show how to use it on an entire file, whereas I can't figure out how to just use it on one line in my file.

So far I have the following:

bool isDeflate = false;

using (var fs = new FileStream(@"C:\Temp\MyFile.dat", FileMode.Open))
using (var reader = new StreamReader(fs))
{
     string line;
     while ((line = reader.ReadLine()) != null)
     {
         if (isDeflate)
         {
             if (line == "endmark")
             {
                 isDeflate = false;
             }
             else
             {
                 line = DeflateSomehow(line);
             }
         }

         if (line == "mark")
         {
             isDeflate = true;
         }

         Console.WriteLine(line);
     }
}

public string DeflateSomehow(string line)
{
    // How do I deflate just that string?
}

Since the file is not created by me (we're only reading it in), we have no control over its structure. I'm not tied to the code I have right now, though: if I need to change more of it than just implementing the DeflateSomehow method, then I'm fine with that as well.


Solution

A deflate stream works on binary data. An arbitrary binary chunk in the middle of a text file is also known as: a corrupt text file. There is no sane way of decoding this:

  • you can't read "lines", because there is no definition of a "line" in binary data; any combination of CR, LF or CRLF can occur purely by chance inside the compressed bytes
  • you can't read a "string line", because that implies running the data through an Encoding; since this isn't text data, decoding it gives you gibberish that cannot be processed (bytes that are not valid in the encoding are lost when reading)
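
To make the second point concrete, here is a minimal sketch of what happens when non-text bytes pass through an Encoding and back. The class name LossyDecodeDemo and the sample bytes are made up for illustration; they are not taken from the asker's file.

```csharp
using System.Text;

static class LossyDecodeDemo
{
    // Decodes the bytes as ASCII text, then encodes that text back to bytes.
    // Bytes above 0x7F are not valid ASCII, so the decoder substitutes a
    // replacement character and the original values cannot be recovered.
    public static byte[] RoundTripThroughAscii(byte[] data)
    {
        string asText = Encoding.ASCII.GetString(data);
        return Encoding.ASCII.GetBytes(asText);
    }
}
```

Calling RoundTripThroughAscii(new byte[] { 0x68, 0xDE, 0x32, 0xF0 }) returns an array of the same length in which 0xDE and 0xF0 have been replaced, so a decompressor fed those bytes would see damaged input. This is what StreamReader.ReadLine does to the compressed line in the question.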

Now, the second of these two problems is solvable by reading via the Stream API rather than the StreamReader API, so that you are only ever reading binary; you would then need to look for the line endings yourself, using an Encoding to probe what you can (noting that this isn't as simple as it sounds if you are using multi/variable-byte encodings such as UTF-8).

However, the first of these two problems is inherently not solvable on its own. To do this reliably, you would need some kind of binary framing protocol, which a text file does not have. It looks like the example uses "mark" and "endmark" as sentinels; technically these could also occur by chance inside the compressed data, but you'll probably get away with it in the 99.999% case. The trick, then, is to read the entire file manually using Stream and Encoding, looking for "mark" and "endmark", and to separate the parts that are encoded as text from the parts that are compressed data. Then run only the encoded-as-text parts through the correct Encoding.
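
As a rough sketch of that sentinel approach, the following works purely on bytes. It assumes the whole file has already been read into a byte array (for example with File.ReadAllBytes), that lines end in CRLF, and that "mark" is not the first line of the file. SentinelScan, IndexOf and ExtractPayload are hypothetical names, not framework APIs.

```csharp
using System;
using System.Text;

static class SentinelScan
{
    // Returns the index of the first occurrence of 'pattern' in 'data'
    // at or after 'start', or -1 if it does not occur.
    public static int IndexOf(byte[] data, byte[] pattern, int start = 0)
    {
        for (int i = start; i <= data.Length - pattern.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }

    // Returns the raw bytes between a "mark" line and the following
    // "endmark" line, or null if either sentinel is missing.
    // Assumption: CRLF line endings, and "mark" is preceded by another line.
    public static byte[] ExtractPayload(byte[] file)
    {
        byte[] open = Encoding.ASCII.GetBytes("\r\nmark\r\n");
        byte[] close = Encoding.ASCII.GetBytes("\r\nendmark");

        int openAt = IndexOf(file, open);
        if (openAt < 0) return null;
        int payloadStart = openAt + open.Length;

        // The first "\r\nendmark" after the payload start ends the block;
        // a chance occurrence inside the compressed bytes would cut it short.
        int closeAt = IndexOf(file, close, payloadStart);
        if (closeAt < 0) return null;

        byte[] payload = new byte[closeAt - payloadStart];
        Array.Copy(file, payloadStart, payload, 0, payload.Length);
        return payload;
    }
}
```

The payload never passes through an Encoding, so CR or LF bytes inside it are preserved. The weakness described above remains: nothing stops the closing sentinel from appearing by chance inside the compressed data.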

However! Once you are reading binary, it is simple: you buffer the right number of bytes (using whatever framing/sentinel protocol the data is written in), and use something like:

using(var ms = new MemoryStream(bytes))
using(var inflate = new DeflateStream(ms, CompressionMode.Decompress))
{
    // now read from 'inflate'
}
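
For a self-contained illustration of that pattern, here is a sketch that compresses a string to raw deflate bytes in memory and inflates it again with DeflateStream. It assumes raw deflate data and ASCII text; the class and method names are made up for the example, and whether the asker's file uses exactly this framing is not something the post confirms.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

static class DeflateRoundTrip
{
    // Compresses ASCII text into raw deflate bytes.
    public static byte[] Deflate(string text)
    {
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress, true))
            {
                byte[] raw = Encoding.ASCII.GetBytes(text);
                deflate.Write(raw, 0, raw.Length);
            } // disposing the DeflateStream writes the final compressed block
            return output.ToArray();
        }
    }

    // Inflates raw deflate bytes and decodes the result as ASCII text.
    public static string Inflate(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var inflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(inflate, Encoding.ASCII))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Only the decompressed output goes through an Encoding here; the compressed bytes stay in a byte array from start to finish.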

With the addition of the l 73 marker (which the code below treats as the byte length of the compressed block), and the information that the text is ASCII, it becomes a little more viable.

I can't test this, because the data as posted here on SO is already corrupted (posting binary as text does that), but basically something like:

using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Text.RegularExpressions;
class Program
{
    static void Main()
    {
        using (var file = File.OpenRead("my.txt"))
        using (var buffer = new MemoryStream())
        {
            List<string> lines = new List<string>();
            string line;
            while ((line = ReadToCRLF(file, buffer)) != null)
            {
                lines.Add(line);
                Console.WriteLine(line);
                if (line == "mark" && lines.Count >= 2)
                {
                    var match = Regex.Match(lines[lines.Count - 2], "^l ([0-9]+)$");
                    int bytes;
                    if (match.Success && int.TryParse(match.Groups[1].Value, out bytes))
                    {
                        ReadBytes(file, buffer, bytes);
                        string inflated = Inflate(buffer);
                        lines.Add(inflated); // or something similar
                        Console.WriteLine(inflated);
                    }
                }
            }
        }

    }
    static string Inflate(Stream source)
    {
        using (var deflate = new DeflateStream(source, CompressionMode.Decompress, true))
        using (var reader = new StreamReader(deflate, Encoding.ASCII))
        {
            return reader.ReadToEnd();
        }
    }
    static void ReadBytes(Stream source, MemoryStream buffer, int count)
    {
        buffer.SetLength(count);
        int read, offset = 0;
        while (count > 0 && (read = source.Read(buffer.GetBuffer(), offset, count)) > 0)
        {
            count -= read;
            offset += read;
        }
        if (count != 0) throw new EndOfStreamException();
        buffer.Position = 0;
    }
    static string ReadToCRLF(Stream source, MemoryStream buffer)
    {
        buffer.SetLength(0);
        int next;
        bool wasCr = false;
        while ((next = source.ReadByte()) >= 0)
        {
            if(next == 10 && wasCr) { // CRLF
                // end of line (minus the CR)
                return Encoding.ASCII.GetString(
                     buffer.GetBuffer(), 0, (int)buffer.Length - 1);
            }
            buffer.WriteByte((byte)next);
            wasCr = next == 13;
        }
        // end of file
        if (buffer.Length == 0) return null;
        return Encoding.ASCII.GetString(buffer.GetBuffer(), 0, (int)buffer.Length);

    }
}
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow