Question

I am very curious to know how malware detection techniques (like Google's Safe Browsing) work. Googling has not helped. I found something called cuckoobox that does this kind of thing.

How exactly does malware detection for a website work? What might the algorithm be? What algorithm does Google Safe Browsing use?

Is there a Python script available?


Solution

It is an interesting problem, and it is best served by combining several approaches.

Google probably keeps a list of malicious domains that it builds by crawling. Visit a domain: did it attempt to serve you an .exe without user interaction? Does the content seem to be gibberish? Check these and other such indicators, and if they fire, mark the domain as malicious. Visit another domain: did it redirect you to one that is already on your malicious list? Mark it as untrusted. Is the domain name a sensible word, and does it match the whois information, or is it gibberish?

You can then apply machine learning or regression analysis to increase confidence and reduce false positives. You could go further and run a light scan on some domains and a deep scan on others, because a deep scan may use something like Cuckoo, which takes more resources.
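The checks above can be sketched as a simple scoring function. This is only an illustration of the idea, not how Google does it: the seed list, the MIME types, the weights, and the entropy threshold are all made-up assumptions, and the function works on crawl observations you have already collected rather than fetching anything itself.

```python
import math
from collections import Counter
from urllib.parse import urlparse

# Hypothetical seed list of domains already judged malicious.
KNOWN_MALICIOUS = {"bad.example"}

# Assumed set of content types treated as "executable download".
EXECUTABLE_TYPES = {
    "application/x-msdownload",
    "application/x-dosexec",
    "application/vnd.microsoft.portable-executable",
}


def shannon_entropy(text):
    """Bits of entropy per character; random-looking names score higher."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def registered_label(host):
    """Naive guess at the registered name: the label before the TLD.

    This ignores multi-part suffixes such as co.uk.
    """
    labels = host.split(".")
    return labels[-2] if len(labels) >= 2 else labels[0]


def score_visit(url, redirect_chain, content_type, user_initiated):
    """Return (score, reasons) for one crawl observation.

    All weights and thresholds are illustrative values, not tuned ones.
    """
    score, reasons = 0, []
    host = urlparse(url).hostname or ""

    if host in KNOWN_MALICIOUS:
        score += 10
        reasons.append("domain is on the malicious list")

    if any((urlparse(hop).hostname or "") in KNOWN_MALICIOUS
           for hop in redirect_chain):
        score += 5
        reasons.append("redirects to a known-malicious domain")

    if content_type in EXECUTABLE_TYPES and not user_initiated:
        score += 4
        reasons.append("served an executable without user interaction")

    label = registered_label(host)
    if len(label) >= 10 and shannon_entropy(label) > 3.5:
        score += 1
        reasons.append("domain name looks like gibberish")

    return score, reasons
```

The scores from many visits could then feed the regression or machine-learning step; a high score could also be the trigger for the more expensive deep scan.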

Another approach is to keep a list of known exploits (their names and code signatures) for vulnerabilities in web browsers and common plugins, then check whether the website attempts to serve you an exploit you know about. To generate the list, scan CVE or another open database, fetch the exploits, compute a hash of each, and so on. This will not catch everything, but it will catch most of it.
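A minimal sketch of the hash-based signature check described above. The signature database here is a hypothetical in-memory dict with a placeholder entry; a real one would be populated from whatever exploit samples you collected.

```python
import hashlib


def sha256_hex(data):
    """Hex SHA-256 digest of a bytes payload."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical signature database: digest -> human-readable label.
# The payload and label below are placeholders, not a real exploit.
KNOWN_EXPLOITS = {
    sha256_hex(b"<script>/* placeholder exploit body */</script>"):
        "EXAMPLE-0001 (placeholder)",
}


def match_exploit(payload):
    """Return the label of a known exploit, or None if there is no match."""
    return KNOWN_EXPLOITS.get(sha256_hex(payload))
```

An exact hash only matches byte-identical payloads, so a variant that differs by a single byte slips through. That is one reason this approach cannot catch everything on its own.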

OTHER TIPS

Essentially, what browsers do is query Google's huge database of known malware sites for the URL or domain in question.
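To make the lookup idea concrete, here is a rough sketch of a client-side check against a blocklist of hashed URL expressions. It is loosely inspired by hash-prefix blocklists in general; the prefix length, the way host and path combinations are generated, and all function names are assumptions for illustration, not Google's actual protocol.

```python
import hashlib
from urllib.parse import urlsplit

PREFIX_LEN = 4  # bytes of the digest to keep; an assumed value


def candidate_expressions(url):
    """Yield host/path combinations, from most to least specific."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    labels = host.split(".")
    # Full host first, then shorter suffixes, keeping at least two labels.
    hosts = [".".join(labels[i:]) for i in range(max(len(labels) - 1, 1))]
    paths = [path] if path == "/" else [path, "/"]
    for h in hosts:
        for p in paths:
            yield h + p


def prefix(expression):
    """Truncated SHA-256 digest of one host/path expression."""
    return hashlib.sha256(expression.encode("utf-8")).digest()[:PREFIX_LEN]


def build_blocklist(expressions):
    """Turn plain expressions like 'malware.example/' into a prefix set."""
    return {prefix(e) for e in expressions}


def is_listed(url, blocklist):
    """True if any expression derived from the URL hits the blocklist."""
    return any(prefix(e) in blocklist for e in candidate_expressions(url))


blocklist = build_blocklist(["malware.example/"])
print(is_listed("http://cdn.malware.example/drop/payload.exe", blocklist))
```

Because only a short prefix is stored, a hit here should be treated as a possible match that still needs confirming against the full hash.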

How Google builds up that database is a different story. They probably work together with various researchers and antivirus products to detect already known threats. Apart from that, they probably have some automatic detection of "suspicious" URLs or document contents (Flash, PDF, Java or browser exploit triggers, shellcode, ROP chains, heap spray scripts, ...). After all, they already have to look at all the contents for indexing, so they can easily perform relatively complex analysis. They also know of URLs pointed to by spam and phishing mails through their mail service. What they probably don't do is manual malware analysis using sandboxing and such; that is the job of security and antivirus companies.
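As a toy illustration of automatic detection of "suspicious" document contents, the sketch below flags a few textual patterns in page source. The pattern set and its names are assumptions chosen for the example; real scanners use far richer analysis, and patterns this simple produce both false positives and false negatives.

```python
import re

# Assumed, deliberately small set of heuristics keyed by a description.
SUSPICIOUS_PATTERNS = {
    "nop-sled-like %u escapes": re.compile(r"(?:%u9090){4,}", re.IGNORECASE),
    "long unescape() argument": re.compile(
        r"unescape\(\s*['\"][^'\"]{200,}['\"]", re.IGNORECASE),
    "eval of String.fromCharCode": re.compile(
        r"eval\(\s*String\.fromCharCode", re.IGNORECASE),
    "hidden zero-size iframe": re.compile(
        r"<iframe[^>]*(?:width|height)\s*=\s*['\"]?0(?![\d.])", re.IGNORECASE),
}


def flag_content(html):
    """Return the sorted names of all heuristics that match the page source."""
    return sorted(name for name, pattern in SUSPICIOUS_PATTERNS.items()
                  if pattern.search(html))
```

A page that trips one or more of these could be queued for a closer look instead of being marked malicious outright.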

So all in all, this is quite a complex task. And no, there is no single Python script available that does that job (although if you're really interested in this, you'll find that there are actually a lot of small helper scripts and also more complex frameworks written in dynamic languages like Ruby or Python). Some projects you could look at to get started (and that are actually general enough to be very useful for other tasks as well):

Licensed under: CC-BY-SA with attribution