C# parsing of Freebase RDF dump yields only 11.5 million N-Triples instead of 1.9 billion

StackOverflow https://stackoverflow.com/questions/21868658


Question

I'm building a C# program to read the RDF data in the Google Freebase data dump. To start out, I've written a simple loop that reads the file and counts the triples. However, instead of the roughly 1.9 billion triples stated on the documentation page, my program counts only about 11.5 million and then exits. The relevant portion of the source code is given below (it takes about 30 seconds to run).

What am I missing here?

// Simple read-through of the gz file, counting one line (one triple) at a time.
// Namespaces used: System, System.IO, System.IO.Compression.
try
{
    using (FileStream fileToDecompress = File.Open(@"C:\Users\Krishna\Downloads\freebase-rdf-2014-02-16-00-00.gz", FileMode.Open))
    {
        int tupleCount = 0;
        string readLine = "";

        using (GZipStream decompressionStream = new GZipStream(fileToDecompress, CompressionMode.Decompress))
        using (StreamReader sr = new StreamReader(decompressionStream, detectEncodingFromByteOrderMarks: true))
        {
            while ((readLine = sr.ReadLine()) != null)
            {
                tupleCount++;
                if (tupleCount % 1000000 == 0)
                { Console.WriteLine(DateTime.Now.ToShortTimeString() + ": " + tupleCount.ToString()); }
            }
            Console.WriteLine("Tuples: " + tupleCount.ToString());
        }
    }
}
catch (Exception ex)
{ Console.WriteLine(ex.Message); }

(I tried using GZippedNTriplesParser in dotNetRDF to read the data, building on this recommendation, but it seems to choke on an RdfParseException right at the start (tab delimiters? UTF-8?). So, for the moment, I'm trying to roll my own.)


Solution

The Freebase RDF dumps are built by a map/reduce job that outputs 200 individual Gzip files. Those 200 files are then concatenated into one final Gzip file. Per the Gzip spec, concatenating the raw bytes of multiple Gzip files produces a valid Gzip file, and a decompressor that adheres to the spec should emit the concatenated content of all of the input files when it uncompresses that file.
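To illustrate that property of the format, here is a minimal sketch (file name and payloads made up) that writes two independent gzip members back to back into one file, mimicking how the 200 map/reduce outputs are concatenated. A spec-compliant tool such as zcat will print both payloads; whether a given library reads past the first member is exactly the issue in the question.

// ConcatDemo.cs - demonstrates that back-to-back gzip members form a valid gzip file.
using System.IO;
using System.IO.Compression;
using System.Text;

class ConcatDemo
{
    static void Main()
    {
        using (var output = File.Create("concatenated.gz"))
        {
            foreach (var text in new[] { "first member\n", "second member\n" })
            {
                // Each GZipStream writes one complete gzip member (header + data + trailer)
                // and leaves the underlying file open for the next member.
                using (var gz = new GZipStream(output, CompressionMode.Compress, leaveOpen: true))
                {
                    var bytes = Encoding.UTF8.GetBytes(text);
                    gz.Write(bytes, 0, bytes.Length);
                }
            }
        }
        // Per RFC 1952, "zcat concatenated.gz" should print both lines.
    }
}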

Based on the number of triples that you're seeing, I'm guessing that your code is only uncompressing the first gzip member of the file and ignoring the other 199. I'm not much of a C# programmer, but from reading another Stack Overflow answer it seems that switching to DotNetZip will solve this problem.
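If you take that route, a minimal sketch of the counting loop with DotNetZip's decompressor swapped in might look like the following (this assumes the DotNetZip NuGet package and its Ionic.Zlib namespace, and reuses the question's file path; whether it reads through all 200 members on your dump is worth verifying, and the decorator in the tip below is one user's workaround for when a reader still stops at the first member).

// Counting loop using DotNetZip's GZipStream instead of System.IO.Compression's.
using System;
using System.IO;
using Ionic.Zlib;

class CountTriples
{
    static void Main()
    {
        long tripleCount = 0;
        using (var file = File.OpenRead(@"C:\Users\Krishna\Downloads\freebase-rdf-2014-02-16-00-00.gz"))
        using (var gzip = new GZipStream(file, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip))
        {
            while (reader.ReadLine() != null)
            {
                tripleCount++;
                if (tripleCount % 1000000 == 0)
                { Console.WriteLine(DateTime.Now.ToShortTimeString() + ": " + tripleCount); }
            }
        }
        Console.WriteLine("Triples: " + tripleCount);
    }
}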

OTHER TIPS

I used DotNetZip and wrote a decorator class, GzipDecorator, as a workaround for the concatenated gzip chunks:

// DotNetZip's GZipStream (from the Ionic.Zlib namespace) is used here because it
// exposes TotalIn/TotalOut; System.IO.Compression's GZipStream does not.
using System;
using System.IO;
using Ionic.Zlib;

sealed class GzipDecorator : Stream
{
    private readonly Stream _readStream;
    private GZipStream _gzip;
    private long _totalIn;
    private long _totalOut;

    public GzipDecorator(Stream readStream)
    {
        if (readStream == null) { throw new ArgumentNullException("readStream"); }
        _readStream = readStream;
        _gzip = new GZipStream(_readStream, CompressionMode.Decompress, true);
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        var bytesRead = _gzip.Read(buffer, offset, count);

        // Zero bytes read means the current gzip member is exhausted. If the underlying
        // file still has data, seek just past the finished member (TotalIn plus the
        // 10-byte gzip header and 8-byte trailer it leaves out) and open the next member.
        // This assumes plain member headers with no optional fields such as FNAME.
        if (bytesRead <= 0 && _readStream.Position < _readStream.Length)
        {
            _totalIn += _gzip.TotalIn + 18;
            _totalOut += _gzip.TotalOut;
            _gzip.Dispose();
            _readStream.Position = _totalIn;
            _gzip = new GZipStream(_readStream, CompressionMode.Decompress, true);
            bytesRead = _gzip.Read(buffer, offset, count);
        }
        return bytesRead;
    }

    // Minimal overrides required by the abstract Stream base class.
    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position { get { throw new NotSupportedException(); } set { throw new NotSupportedException(); } }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}
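With the decorator in place, the counting loop from the question only needs to wrap its FileStream; a short usage sketch (reusing the question's file path):

using (var fileToDecompress = File.OpenRead(@"C:\Users\Krishna\Downloads\freebase-rdf-2014-02-16-00-00.gz"))
using (var reader = new StreamReader(new GzipDecorator(fileToDecompress)))
{
    // Each line of the decompressed dump is one N-Triple.
    long tripleCount = 0;
    while (reader.ReadLine() != null) { tripleCount++; }
    Console.WriteLine("Triples: " + tripleCount);
}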

I managed to solve the problem by repacking the dump with the 7-Zip archiver. Maybe that helps you too.

Licensed under: CC-BY-SA with attribution