Question

I need to search for virus signatures in files, and I am using Java to do this. I have already programmed the other features, such as gathering files and filtering them down to the ones that need to be searched. I just need a little help with the virus signature side.

What format should I use for the signatures (hashed string, binary, bytes)?

What method should I use to scan for a signature (search algorithm, etc.)?

I was thinking of turning the file into bytes and then using the Boyer–Moore string search algorithm to search for the signature bytes.

I want to load the virus signatures from a signature file and scan each file for them.

public void Search(File file) {
    if (!file.exists()) {
        cannotReadFileCount++;
        return;
    }

    if (file.isDirectory()) {
        if (!file.canRead()) {
            cannotReadDirCount++;
            return;
        }
        // listFiles() returns null on an I/O error, so guard against it.
        File[] listOfFiles = file.listFiles();
        if (listOfFiles != null) {
            for (File child : listOfFiles) {
                Search(child);
            }
        }
    } else if (file.isFile()) {
        if (!file.canRead()) {
            cannotReadFileCount++;
            return;
        }
        totalFileCount++;

        // Compare extensions case-insensitively, and stop at the first match
        // so the same file is not queued twice.
        String name = file.getName().toLowerCase();
        for (String extension : executableCriteriaList) {
            if (name.endsWith(extension.toLowerCase())) {
                // scanExecutableFile(file); HERE IS where i need to scan the file
                searchFiles.add(file);
                break;
            }
        }
    }
}

Thanks for the help.


Solution

If you were scanning for just one virus signature, then a single string search algorithm like Boyer-Moore would be a good choice. (There are other fast single search algorithms too.)

But a virus scanner typically looks for many virus signatures, and the signatures are typically not just simple sequence-of-byte signatures.

If you are looking for the (technically) best algorithm, then I suggest you read the Wikipedia page on String Search Algorithms, and consider all of the alternatives that it links to. That's only a start, since there are (apparently) other search algorithms that are not listed there.

As to the best representation of the signatures, that will depend on what search algorithms you use. But since you are looking for byte patterns in code objects, a byte-based representation (byte strings or byte-based patterns / regexes) seems most appropriate.
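
As an illustration of a byte-based representation, signatures are often written as hex strings in a text file and converted to byte arrays when loaded. The sketch below assumes that convention; the class name `HexSignature` and the method `parse` are hypothetical names chosen for this example, not part of any library.

```java
final class HexSignature {
    private HexSignature() {}

    /** Converts a hex string such as "4d5a9000" into the bytes 4D 5A 90 00. */
    static byte[] parse(String hex) {
        String cleaned = hex.trim();
        if (cleaned.length() % 2 != 0) {
            throw new IllegalArgumentException("Odd number of hex digits: " + hex);
        }
        byte[] out = new byte[cleaned.length() / 2];
        for (int i = 0; i < out.length; i++) {
            // Character.digit returns -1 for anything that is not a hex digit.
            int hi = Character.digit(cleaned.charAt(2 * i), 16);
            int lo = Character.digit(cleaned.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("Not a hex string: " + hex);
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }
}
```

Calling `HexSignature.parse(line)` on each line of a signature file would give you one `byte[]` per signature, ready to hand to whichever search algorithm you pick.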

(I don't see how hashes would actually help you with this problem ...)


But that assumes that you really need the best search technology that is available. It sounds like this is an assignment you are doing, and for that your original choice of Boyer-Moore is fine. A simple approach is to read each file into memory, and then do a Boyer-Moore search for each virus signature. That won't be as fast as a commercial / professional virus scanner, but it should be good enough for your purposes.
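
Here is a minimal sketch of that simple approach. Note that it uses Boyer-Moore-Horspool, a simplified variant of Boyer-Moore that keeps only the bad-character shift table, because it is much shorter to write out. The class name `SimpleSignatureScanner`, its methods, and the idea of passing signatures as a name-to-bytes map are all assumptions made for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class SimpleSignatureScanner {
    private SimpleSignatureScanner() {}

    /** Returns the index of the first occurrence of pattern in data, or -1. */
    static int indexOf(byte[] data, byte[] pattern) {
        int m = pattern.length;
        if (m == 0) {
            return 0;
        }

        // Bad-character table: how far the window may slide, keyed by the
        // data byte currently aligned with the last pattern position.
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[pattern[i] & 0xFF] = m - 1 - i;
        }

        int pos = 0;
        while (pos <= data.length - m) {
            // Compare the window against the pattern from right to left.
            int j = m - 1;
            while (j >= 0 && data[pos + j] == pattern[j]) {
                j--;
            }
            if (j < 0) {
                return pos;
            }
            pos += shift[data[pos + m - 1] & 0xFF];
        }
        return -1;
    }

    /** Reads the whole file into memory and returns the names of all signatures found. */
    static List<String> scan(Path file, Map<String, byte[]> signatures) throws IOException {
        byte[] data = Files.readAllBytes(file);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, byte[]> entry : signatures.entrySet()) {
            if (indexOf(data, entry.getValue()) >= 0) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

In the question's code, the commented-out `scanExecutableFile(file)` call could delegate to something like `SimpleSignatureScanner.scan(file.toPath(), signatures)`. Reading whole files into memory is fine for typical executables, but very large files would need a chunked read with overlap between chunks.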

OTHER TIPS

There are several algorithms that will help you. I suggest Aho-Corasick or Rabin-Karp, but a suffix tree may also come in handy. Rabin-Karp is the easiest to implement of those, but Aho-Corasick does not use hashes and so you don't need to take special care of collisions.
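
To give an idea of what Aho-Corasick looks like over raw bytes, here is a compact sketch. It builds a trie of all signatures, then fills in failure links breadth-first so the input is scanned once regardless of how many signatures there are. The class name `ByteAhoCorasick` is made up for this example, and the dense 256-entry table per state is a simplicity-over-memory choice that a real scanner would probably not make.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class ByteAhoCorasick {
    // next.get(state)[b] is the state reached from 'state' on byte value b.
    private final List<int[]> next = new ArrayList<>();
    // output.get(state) lists the ids of all patterns that end at 'state'.
    private final List<List<Integer>> output = new ArrayList<>();

    ByteAhoCorasick(List<byte[]> patterns) {
        newState(); // state 0 is the root
        for (int id = 0; id < patterns.size(); id++) {
            if (patterns.get(id).length == 0) {
                throw new IllegalArgumentException("Empty pattern at index " + id);
            }
            int state = 0;
            for (byte b : patterns.get(id)) {
                int c = b & 0xFF;
                if (next.get(state)[c] == -1) {
                    next.get(state)[c] = newState();
                }
                state = next.get(state)[c];
            }
            output.get(state).add(id);
        }
        buildFailureLinks();
    }

    private int newState() {
        int[] row = new int[256];
        Arrays.fill(row, -1); // -1 marks "no trie edge yet"
        next.add(row);
        output.add(new ArrayList<>());
        return next.size() - 1;
    }

    private void buildFailureLinks() {
        int[] fail = new int[next.size()];
        Deque<Integer> queue = new ArrayDeque<>();
        for (int c = 0; c < 256; c++) {
            int child = next.get(0)[c];
            if (child == -1) {
                next.get(0)[c] = 0; // unmatched bytes keep us at the root
            } else {
                queue.add(child);   // depth-1 states fail back to the root
            }
        }
        // Breadth-first, so a state's failure target is always complete first.
        while (!queue.isEmpty()) {
            int state = queue.remove();
            output.get(state).addAll(output.get(fail[state]));
            for (int c = 0; c < 256; c++) {
                int child = next.get(state)[c];
                if (child == -1) {
                    next.get(state)[c] = next.get(fail[state])[c];
                } else {
                    fail[child] = next.get(fail[state])[c];
                    queue.add(child);
                }
            }
        }
    }

    /** Returns the ids (indexes into the pattern list) of every pattern found in data. */
    Set<Integer> search(byte[] data) {
        Set<Integer> found = new TreeSet<>();
        int state = 0;
        for (byte b : data) {
            state = next.get(state)[b & 0xFF];
            found.addAll(output.get(state));
        }
        return found;
    }
}
```

You would build one `ByteAhoCorasick` from the full signature list at startup and reuse it for every file, which is where it pays off compared with running a single-pattern search once per signature.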

The Boyer–Moore technique isn't used for the virus signatures used by various antivirus software vendors. They mostly use MD5, SHA1, SHA256, or text fingerprints on either the whole file or sections of a file. The largest database you'll find is mostly SHA1 whole file hashes.

The source code for ClamAV (maintained by Cisco) is publicly available on GitHub, and the layout of its CVD files is documented, so you can crack them open and look at the various hashes. A CVD file is a gzipped TAR archive (.tar.gz) prefixed with a 512-byte header and given a .cvd extension. Some scripts exist to extract the tar.gz. Inside are several character-delimited text files in various formats; these are the virus definition "databases". The delimiter varies from file to file, but is often a colon.

When you look at that, you learn that virus signatures are done in various ways:

  • MD5 whole file hashes. This was the original technique, but it eventually produced false positives because MD5 only has so much address space. It is being phased out, but it is still used for older files that have not yet had a false positive: Clam and most other AV apps use it for at least some small percentage of their scans, and switch a signature to SHA1 once it hits a false positive.

  • SHA1 whole file hashes. This came after the MD5 because it has more address space. Unfortunately, though, this too ran out of address space and had false positives eventually, so they moved to SHA256. However, these are still used until they are marked as defunct because of a false positive, and then are switched to SHA256. You'll find with ClamAV that the SHA1 whole file hash is the most common hash recorded.

  • SHA256 whole file hashes. These are available, but not always used, because they increase the size of the definition files compared to SHA1 hashes. So, for now, virus definitions are stored as SHA1 whole file hashes by default, unless a SHA256 is necessary due to a false positive collision with another file.

  • PE section hashes -- stored in MD5, SHA1, and SHA256. Some viruses mutate, and the only way to catch them is to generate a hash based on what is called a PE section of an executable file. There are multiple PE sections in an executable. Again, Clam started with MD5, but then introduced SHA1 and SHA256 on false positive collision.

  • File Fingerprints. These are short UTF-8 text strings that are linked to malicious activity and are detected in files that may not be executable, such as web pages.

  • And others... Documentation: https://github.com/vrtadmin/clamav-devel/blob/master/docs/signatures.pdf
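
For the whole-file hash approach in the list above, the check boils down to hashing the file and looking the result up in a set of known-bad hashes. The following is only a sketch of that idea using the JDK's `MessageDigest`; the class name `FileHashChecker`, its methods, and the assumption that the set holds lowercase hex strings are all made up for this example, and it does not parse ClamAV's actual database files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Set;

final class FileHashChecker {
    private FileHashChecker() {}

    /** Hashes data with the named algorithm ("MD5", "SHA-1", "SHA-256") and returns lowercase hex. */
    static String hexDigest(byte[] data, String algorithm) {
        try {
            MessageDigest digest = MessageDigest.getInstance(algorithm);
            return toHex(digest.digest(data));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("Unsupported hash algorithm: " + algorithm, e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    /**
     * Returns true if the file's hash appears in knownBadHashes.
     * Assumes the set contains lowercase hex strings for the same algorithm.
     */
    static boolean isKnownBad(Path file, Set<String> knownBadHashes, String algorithm)
            throws IOException {
        byte[] data = Files.readAllBytes(file);
        return knownBadHashes.contains(hexDigest(data, algorithm));
    }
}
```

Because the lookup is a plain set membership test, the cost per file is dominated by reading and hashing it, no matter how many hashes the set holds. For large files you would want to feed the digest from a stream instead of reading everything into memory at once.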

Licensed under: CC-BY-SA with attribution