Question

Data in the form of search strings continue to grow as new virus variants are released, which prompts my question - how do AV engines search files for known signatures so efficiently? If I download a new file, my AV scanner rapidly identifies the file as being a threat or not, based on its signatures, but how can it do this so quickly? I'm sure by this point there are hundreds of thousands of signatures.

Was it helpful?

Solution

UPDATE: As tripleee pointed out, the Aho-Corasick algorithm seems very relevant to virus scanners. Here is some stuff to read:

http://www.dais.unive.it/~calpar/AA07-08/aho-corasick.pdf

http://www.researchgate.net/publication/4276168_Generalized_Aho-Corasick_Algorithm_for_Signature_Based_Anti-Virus_Applications/file/d912f50bd440de76b0.pdf

http://jason.spashett.com/av/index.htm

Aho-Corasick-like algorithm for use in anti-malware code

Below is my old answer. Its still relevant for easily detecting malware like worms which simply make copies of themselves:

I'll just write some of my thoughts on how AVs might work. I don't know for sure. If someone thinks the information is incorrect, please notify me.

There are many ways in which AVs detect possible threats. One way is signature-based detection.

A signature is just a unique fingerprint of a file (which is just a sequence of bytes). In terms of computer science, it can be called a hash. A single hash could take about 4/8/16 bytes. Assuming a size of 4 bytes (for example, CRC32), about 67 million signatures could be stored in 256MB.

All these hashes can be stored in a signature database. This database could be implemented with a balanced tree structure, so that insertion, deletion and search operations can be done in O(logn) time, which is pretty fast even for large values of n (n is the number of entries). Or else if a lot of memory is available, a hashtable can be used, which gives O(1) insertion, deletion and search. This is can be faster as n grows bigger and a good hashing technique is used.

So what an antivirus does roughly is that it calculates the hash of the file or just its critical sections (where malicious injections are possible), and searches its signature database for it. As explained above, the search is very fast, which enables scanning huge amounts of files in a short amount of time. If it is found, the file is categorized as malicious.

Similarly, the database can be updated quickly since insertion and deletion is fast too.

You could read these pages to get some more insight.

Which is faster, Hash lookup or Binary search?

https://security.stackexchange.com/questions/379/what-are-rainbow-tables-and-how-are-they-used

OTHER TIPS

Many signatures are anchored to a specific offset, or a specific section in the binary structure of the file. You can skip the parts of a binary which contain data sections with display strings, initialization data for internal structures, etc.

Many present-day worms are stand-alone files for which a whole-file signature (SHA1 hash or similar) is adequate.

The general question of how to scan for a large number of patterns in a file is best answered with a pointer to the Aho-Corasick algorithm.

I don't know how a practical AV works. but I think the question have some relative with finding words in a long text with a given dictionary.

For the above question, data structures like TRIE will make it very fast. processing a Length=N text dictionary of K words takes only O(N) time.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top