C#.net identify zip file

https://stackoverflow.com/questions/11996299

26-06-2021

Question

I am currently using the SharpZipLib API to handle my zip file entries. It works splendidly for zipping and unzipping, but I am having trouble identifying whether a file is a zip or not. I need to know if there is a way to detect whether a file stream can be decompressed. Originally I used

FileStream lFileStreamIn = File.OpenRead(mSourceFile);
lZipFile = new ZipFile(lFileStreamIn);
ZipInputStream lZipStreamTester = new ZipInputStream(lFileStreamIn, mBufferSize);// not working
lZipStreamTester.Read(lBuffer, 0, 0);
if (lZipStreamTester.CanDecompressEntry)
{

The lZipStreamTester becomes null every time and the if statement fails. I tried it with and without a buffer. Can anybody give any insight as to why? I am aware that I can check the file extension, but I need something more definitive than that. I am also aware that zip has a magic number (PK something), but it isn't guaranteed to always be there because it isn't a requirement of the format.

Also, I read about .NET 4.5 having native zip support, so my project may migrate to that instead of SharpZipLib, but I still didn't see a method or parameter similar to CanDecompressEntry here: http://msdn.microsoft.com/en-us/library/3z72378a%28v=vs.110%29

My last resort will be to use a try catch and attempt an unzip on the file.

Solution

This is a base class for a component that needs to handle data that is either uncompressed, PKZIP compressed (SharpZipLib) or GZip compressed (built into .NET). Perhaps a bit more than you need, but it should get you going. This is an example of using @PhonicUK's suggestion to parse the header of the data stream. The derived classes you see in the little factory method handle the specifics of PKZip and GZip decompression.

using System;
using System.Diagnostics;
using System.IO;

abstract class Expander
{
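    // Signatures as BitConverter reads them on a little-endian machine:
    // on disk the bytes are 50 4B 03 04 (ZIP) and 1F 8B (GZIP).
    private const int ZIP_LEAD_BYTES = 0x04034b50;
    private const ushort GZIP_LEAD_BYTES = 0x8b1f;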

    public abstract MemoryStream Expand(Stream stream); 
    
    internal static bool IsPkZipCompressedData(byte[] data)
    {
        Debug.Assert(data != null && data.Length >= 4);
        // if the first 4 bytes of the array are the ZIP signature then it is compressed data
        return (BitConverter.ToInt32(data, 0) == ZIP_LEAD_BYTES);
    }

    internal static bool IsGZipCompressedData(byte[] data)
    {
        Debug.Assert(data != null && data.Length >= 2);
        // if the first 2 bytes of the array are the GZIP signature then it is compressed data
        return (BitConverter.ToUInt16(data, 0) == GZIP_LEAD_BYTES);
    }

    public static bool IsCompressedData(byte[] data)
    {
        return IsPkZipCompressedData(data) || IsGZipCompressedData(data);
    }

    public static Expander GetExpander(Stream stream)
    {
        Debug.Assert(stream != null);
        Debug.Assert(stream.CanSeek);
        stream.Seek(0, 0);

        try
        {
            byte[] bytes = new byte[4];

            stream.Read(bytes, 0, 4);

            if (IsGZipCompressedData(bytes))
                return new GZipExpander();

            if (IsPkZipCompressedData(bytes))
                return new ZipExpander();

            return new NullExpander();
        }
        finally
        {
            stream.Seek(0, 0);  // set the stream back to the beginning
        }
    }
}
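
The BitConverter checks above compare against constants laid out for a little-endian machine, and a single Stream.Read call is allowed to return fewer bytes than requested. A minimal sketch of a byte-wise alternative that avoids both assumptions follows. The SignatureSniffer class and its member names are hypothetical and not part of the original answer.

```csharp
using System;
using System.IO;

// Hypothetical helper, not from the original answer.
static class SignatureSniffer
{
    // Signatures exactly as they appear on disk, byte by byte.
    private static readonly byte[] ZipLocalHeader = { 0x50, 0x4B, 0x03, 0x04 };
    private static readonly byte[] GZipHeader = { 0x1F, 0x8B };

    // Reads up to 'count' bytes from the start of a seekable stream and
    // restores the caller's position afterwards. The result is shorter
    // than 'count' when the stream ends early.
    public static byte[] ReadLeadBytes(Stream stream, int count)
    {
        long start = stream.Position;
        try
        {
            stream.Seek(0, SeekOrigin.Begin);
            byte[] buffer = new byte[count];
            int total = 0;
            while (total < count)
            {
                int read = stream.Read(buffer, total, count - total);
                if (read == 0) break; // end of stream
                total += read;
            }
            Array.Resize(ref buffer, total);
            return buffer;
        }
        finally
        {
            stream.Seek(start, SeekOrigin.Begin);
        }
    }

    // True when 'data' begins with every byte of 'signature'.
    private static bool StartsWith(byte[] data, byte[] signature)
    {
        if (data.Length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (data[i] != signature[i]) return false;
        }
        return true;
    }

    public static bool LooksLikeZip(Stream stream)
    {
        return StartsWith(ReadLeadBytes(stream, 4), ZipLocalHeader);
    }

    public static bool LooksLikeGZip(Stream stream)
    {
        return StartsWith(ReadLeadBytes(stream, 2), GZipHeader);
    }
}
```

A factory like GetExpander could call LooksLikeGZip and LooksLikeZip in place of the BitConverter comparisons. A stream shorter than the signature simply yields false.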

OTHER TIPS

See https://stackoverflow.com/a/16587134/206730 for reference.

Check the links below:

icsharpcode-sharpziplib-validate-zip-file

How-to-check-if-a-file-is-compressed-in-c#

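ZIP files normally start with the local file header signature 0x04034b50, stored on disk as the 4 bytes 50 4B 03 04 (an empty archive starts with 50 4B 05 06 instead).
More details: http://en.wikipedia.org/wiki/Zip_(file_format)#File_headers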

Sample usage:

        bool isPKZip = IOHelper.CheckSignature(pkg, 4, IOHelper.SignatureZip);
        Assert.IsTrue(isPKZip, "Not ZIP the package : " + pkg);

// http://blog.somecreativity.com/2008/04/08/how-to-check-if-a-file-is-compressed-in-c/
    public static partial class IOHelper
    {
        public const string SignatureGzip = "1F-8B-08";
        public const string SignatureZip = "50-4B-03-04";

        public static bool CheckSignature(string filepath, int signatureSize, string expectedSignature)
        {
            if (String.IsNullOrEmpty(filepath)) throw new ArgumentException("Must specify a filepath");
            if (String.IsNullOrEmpty(expectedSignature)) throw new ArgumentException("Must specify a value for the expected file signature");
            using (FileStream fs = new FileStream(filepath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            {
                if (fs.Length < signatureSize)
                    return false;
                byte[] signature = new byte[signatureSize];
                int bytesRequired = signatureSize;
                int index = 0;
                while (bytesRequired > 0)
                {
                    int bytesRead = fs.Read(signature, index, bytesRequired);
                    bytesRequired -= bytesRead;
                    index += bytesRead;
                }
                string actualSignature = BitConverter.ToString(signature);
                if (actualSignature == expectedSignature) return true;
                return false;
            }
        }

    }

You can either:

Use a try-catch structure and try to read the structure of a potential zip file
Parse the file header to see if it is a zip file

ZIP files normally start with 0x04034b50 as their first 4 bytes, stored on disk as 50 4B 03 04 ( http://en.wikipedia.org/wiki/Zip_(file_format)#File_headers )
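
The try-catch option can be sketched with the built-in System.IO.Compression.ZipArchive class (.NET 4.5 and later), which throws InvalidDataException when the stream does not contain a readable ZIP structure. This is a minimal sketch, not a definitive implementation. The ZipProbe class and CanOpenAsZip method are hypothetical names, and the stream is assumed to be readable.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical helper illustrating the try-catch option.
static class ZipProbe
{
    public static bool CanOpenAsZip(Stream stream)
    {
        try
        {
            // leaveOpen: true keeps the caller's stream usable afterwards.
            // The stream position is not restored by this sketch.
            using (var archive = new ZipArchive(stream, ZipArchiveMode.Read, leaveOpen: true))
            {
                // Touching Entries forces the central directory to be parsed.
                return archive.Entries.Count >= 0;
            }
        }
        catch (InvalidDataException)
        {
            // Thrown when the contents cannot be interpreted as a ZIP archive.
            return false;
        }
    }
}
```

Unlike a signature check, this also accepts archives that do not start with 50 4B 03 04, such as empty archives, at the cost of reading more of the file.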

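I used https://en.wikipedia.org/wiki/List_of_file_signatures, adding one extra byte to the signature for my zip files to differentiate between them and Word documents (these share the first four bytes). Be aware that this fifth byte is the low byte of the ZIP "version needed to extract" field, so 0x0a only matches archives whose first entry reports version 1.0; an archive whose first entry is deflated normally reports 2.0 (0x14) and will not match.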

Here is my code:

public class ZipFileUtilities
{
    // Local file header signature plus a fifth byte (0x0a)
    private static readonly byte[] ZipBytes1 = { 0x50, 0x4b, 0x03, 0x04, 0x0a };
    private static readonly byte[] GzipBytes = { 0x1f, 0x8b };
    // 1f 9d is the LZW "compress" (.Z) signature, often seen on .tar.Z files
    private static readonly byte[] TarBytes = { 0x1f, 0x9d };
    // 1f a0 is the LZH-compressed variant of the same format
    private static readonly byte[] LzhBytes = { 0x1f, 0xa0 };
    private static readonly byte[] Bzip2Bytes = { 0x42, 0x5a, 0x68 };
    private static readonly byte[] LzipBytes = { 0x4c, 0x5a, 0x49, 0x50 };
    // Empty ZIP archive
    private static readonly byte[] ZipBytes2 = { 0x50, 0x4b, 0x05, 0x06 };
    // Spanned ZIP archive
    private static readonly byte[] ZipBytes3 = { 0x50, 0x4b, 0x07, 0x08 };

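    public static byte[] GetFirstBytes(string filepath, int length)
    {
        // Read raw bytes directly; a StreamReader is meant for text.
        using (var fs = File.OpenRead(filepath))
        {
            var bytes = new byte[length];
            int total = 0, read;
            // Stream.Read may return fewer bytes than requested, so loop until done or EOF.
            while (total < length && (read = fs.Read(bytes, total, length - total)) > 0)
                total += read;

            return bytes;
        }
    }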

    public static bool IsZipFile(string filepath)
    {
        return IsCompressedData(GetFirstBytes(filepath, 5));
    }

    public static bool IsCompressedData(byte[] data)
    {
        foreach (var headerBytes in new[] { ZipBytes1, ZipBytes2, ZipBytes3, GzipBytes, TarBytes, LzhBytes, Bzip2Bytes, LzipBytes })
        {
            if (HeaderBytesMatch(headerBytes, data))
                return true;
        }

        return false;
    }

    private static bool HeaderBytesMatch(byte[] headerBytes, byte[] dataBytes)
    {
        if (dataBytes.Length < headerBytes.Length)
            throw new ArgumentOutOfRangeException(nameof(dataBytes), 
                $"Passed databytes length ({dataBytes.Length}) is shorter than the headerbytes ({headerBytes.Length})");

        for (var i = 0; i < headerBytes.Length; i++)
        {
            if (headerBytes[i] == dataBytes[i]) continue;

            return false;
        }

        return true;
    }

}

There may be better ways to code this, particularly the byte compare, but as it's a variable-length byte compare (depending on the signature being checked), I felt this code is at least readable.
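
One shorter way to write the variable-length byte compare is with LINQ. The sketch below uses the hypothetical names HeaderMatcher and StartsWith and deliberately differs from the code above in one respect: it returns false instead of throwing when the data is shorter than the signature.

```csharp
using System.Linq;

// Hypothetical helper, offered as an alternative to HeaderBytesMatch.
static class HeaderMatcher
{
    // True when 'data' begins with every byte of 'header'.
    public static bool StartsWith(byte[] data, byte[] header)
    {
        return data.Length >= header.Length
            && data.Take(header.Length).SequenceEqual(header);
    }
}
```

For a handful of bytes the LINQ overhead is negligible; an explicit loop remains a reasonable choice when the check runs in a hot path.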

If you are programming for the web, you can check the file's content type: application/zip

Thanks to dkackman and Kiquenet for the answers above. For completeness, the code below uses the signature to identify compressed (zip) files. You then have the added complexity that the newer MS Office file formats will also match this signature lookup (your .docx and .xlsx files etc.). As remarked upon elsewhere, these are indeed compressed archives: you can rename the files with a .zip extension and have a look at the XML inside.

The code below first checks for ZIP (compressed) data using the signatures above, and then runs a subsequent check for the MS Office packages. Note that to use System.IO.Packaging.Package you need a project reference to "WindowsBase" (that is, a .NET assembly reference).

    private const string SignatureZip = "50-4B-03-04";
    private const string SignatureGzip = "1F-8B-08";

    public static bool IsZip(this Stream stream)
    {
        if (stream.Position > 0)
        {
            stream.Seek(0, SeekOrigin.Begin);
        }

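        bool isZip = CheckSignature(stream, 4, SignatureZip);
        // Rewind in case CheckSignature advanced the stream, so the next check also starts at byte 0.
        stream.Seek(0, SeekOrigin.Begin);
        bool isGzip = CheckSignature(stream, 3, SignatureGzip);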

        bool isSomeKindOfZip = isZip || isGzip;

        if (isSomeKindOfZip && stream.IsPackage()) //Signature matches ZIP, but it's package format (docx etc).
        {
            return false;
        }

        return isSomeKindOfZip;
    }

    /// <summary>
    /// MS .docx, .xlsx and other extensions are (correctly) identified as zip files using signature lookup.
    /// This tests if System.IO.Packaging is able to open, and if package has parts, this is not a zip file.
    /// </summary>
    /// <param name="stream"></param>
    /// <returns></returns>
    private static bool IsPackage(this Stream stream)
    {
        Package package = Package.Open(stream, FileMode.Open, FileAccess.Read);
        return package.GetParts().Any();
    }
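
As an alternative that avoids the WindowsBase reference, an Office-style package can be recognised by the [Content_Types].xml entry that Open Packaging Conventions packages carry at the archive root. The following is a minimal sketch under that assumption, using the built-in ZipArchive class. OpcProbe and LooksLikeOpcPackage are hypothetical names, and the lookup is an exact, case-sensitive name match.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical helper, not part of the original answer.
static class OpcProbe
{
    public static bool LooksLikeOpcPackage(Stream stream)
    {
        try
        {
            // leaveOpen: true keeps the caller's stream usable afterwards.
            using (var archive = new ZipArchive(stream, ZipArchiveMode.Read, leaveOpen: true))
            {
                // Assumption: packages such as .docx and .xlsx contain this root entry.
                return archive.GetEntry("[Content_Types].xml") != null;
            }
        }
        catch (InvalidDataException)
        {
            // Not a readable ZIP archive at all, so not a package either.
            return false;
        }
    }
}
```

A plain zip that happens to contain an entry with that name would also be reported as a package, so treat the result as a heuristic.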

Licensed under: CC-BY-SA with attribution