Question

I'm building an e-commerce website with a large database of products. Of course, it is nice when Google indexes all of the site's products. But what if a competitor wants to web-scrape the site and grab all of the images and product descriptions?

I was looking at some websites with similar product lists, and they place a CAPTCHA so that "only humans" can read the list of products. The drawback is that it is invisible to Google, Yahoo, and other "well-behaved" bots.


Solution

You can discover the IP addresses that Google and the other search engines use by checking visitor IPs with whois (on the command line or via a web site). Then, once you've accumulated a list of legitimate search-engine addresses, let them into your product list without the CAPTCHA.
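For illustration, a minimal sketch of what that gate might look like, assuming a Flask app; VERIFIED_BOT_IPS and captcha_passed() are placeholders you would fill in with your own research and your own CAPTCHA integration:

    # Sketch only: serve the product list without a CAPTCHA to IPs you have
    # already verified as legitimate crawlers. The IPs and helper below are
    # illustrative placeholders, not a recommendation of specific addresses.
    from flask import Flask, request

    app = Flask(__name__)

    # Populate this from your whois / reverse-lookup research.
    VERIFIED_BOT_IPS = {"66.249.65.32", "157.55.39.139"}

    def captcha_passed(req):
        # Wire this up to your CAPTCHA provider / session state.
        return False

    @app.route("/products")
    def product_list():
        if request.remote_addr in VERIFIED_BOT_IPS or captcha_passed(request):
            return "full product list here"
        return "please solve the CAPTCHA first", 403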

OTHER TIPS

If you're worried about competitors using your text or images, how about a watermark or customized text?

Let them take your images and you'd have your logo on their site!
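If you want to automate that, here is a rough sketch using the Pillow library (the original answer doesn't name a tool, so this is just one way to do it); the file names are examples only:

    # Stamp a logo onto a product image before serving it, so copied images
    # still carry your branding. Paths are illustrative.
    from PIL import Image

    def watermark(product_path, logo_path, out_path):
        product = Image.open(product_path).convert("RGBA")
        logo = Image.open(logo_path).convert("RGBA")
        # Place the logo in the bottom-right corner with a small margin.
        x = product.width - logo.width - 10
        y = product.height - logo.height - 10
        product.paste(logo, (x, y), logo)  # logo's alpha channel used as mask
        product.convert("RGB").save(out_path, "JPEG")

    watermark("product.jpg", "logo.png", "product_watermarked.jpg")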

Since a potential screen-scraping application can spoof the user agent and HTTP referrer (for images) in the header, and can use a request schedule similar to a human browser, it is not possible to completely stop professional scrapers. But you can check for these things nevertheless and prevent casual scraping. I personally find CAPTCHAs annoying for anything other than signing up on a site.

One technique you could try is the "honey pot" method: it can be done either by mining log files or via some simple scripting.

The basic process is to build your own "blacklist" of scraper IPs by looking for addresses that hit two or more unrelated products in a very short period of time. Chances are these IPs belong to machines. You can then do a reverse lookup on them to determine whether they are nice (like Googlebot or Slurp) or bad.
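A simple illustration of that log-mining idea, assuming a combined-format access log and product pages under /products/ (both assumptions, so adjust the regex, window, and threshold to your own setup):

    # Flag IPs that request several distinct product pages within a few seconds.
    import re
    from collections import defaultdict
    from datetime import datetime, timedelta

    LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "GET (/products/\S+)')
    WINDOW = timedelta(seconds=5)
    THRESHOLD = 3  # distinct products inside the window

    hits = defaultdict(list)   # ip -> list of (timestamp, url)
    suspects = set()

    with open("access.log") as f:
        for line in f:
            m = LOG_LINE.match(line)
            if not m:
                continue
            ip, ts, url = m.group(1), m.group(2), m.group(3)
            when = datetime.strptime(ts.split()[0], "%d/%b/%Y:%H:%M:%S")
            hits[ip].append((when, url))

    for ip, requests in hits.items():
        requests.sort()
        for i, (when, _) in enumerate(requests):
            recent = {u for t, u in requests[i:] if t - when <= WINDOW}
            if len(recent) >= THRESHOLD:
                suspects.add(ip)
                break

    print(suspects)  # candidate scraper IPs; reverse-look them up before blocking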

Blocking web scrapers is not easy, and it's even harder to avoid false positives.

In any case, you can add some netranges to a whitelist and not serve any CAPTCHA to them. All the well-known crawlers (Bing, Googlebot, Yahoo, etc.) always crawl from specific netranges, and all of those IP addresses resolve to specific reverse lookups.

A few examples:

Google IP 66.249.65.32 resolves to crawl-66-249-65-32.googlebot.com

Bing IP 157.55.39.139 resolves to msnbot-157-55-39-139.search.msn.com

Yahoo IP 74.6.254.109 resolves to h049.crawl.yahoo.net

So, addresses whose reverse lookups match '*.googlebot.com', '*.search.msn.com', or '*.crawl.yahoo.net' can be whitelisted.
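A minimal sketch of that check in Python, using the hostname suffixes listed above: reverse-resolve the visitor IP, check the suffix, then forward-resolve the hostname to confirm it maps back to the same IP, so nobody can fake the reverse record:

    import socket

    CRAWLER_SUFFIXES = (".googlebot.com", ".search.msn.com", ".crawl.yahoo.net")

    def is_known_crawler(ip):
        try:
            hostname = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        if not hostname.endswith(CRAWLER_SUFFIXES):
            return False
        try:
            # Forward-confirm the hostname really resolves back to this IP.
            return ip in socket.gethostbyname_ex(hostname)[2]
        except socket.gaierror:
            return False

    print(is_known_crawler("66.249.65.32"))  # True for a genuine Googlebot address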

There are plenty of whitelists available on the internet that you can use to implement this.

That said, I don't believe a CAPTCHA is a solution against advanced scrapers, since services such as deathbycaptcha.com or 2captcha.com promise to solve any kind of CAPTCHA within seconds.

Please have a look at our wiki, http://www.scrapesentry.com/scraping-wiki/, where we have written many articles on how to prevent, detect, and block web scrapers.

Perhaps I over-simplify, but if your concern is server performance, then providing an API would lessen the need for scrapers and save you bandwidth and processor time.
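As a rough illustration of that idea, again assuming Flask: a read-only JSON endpoint is much cheaper to serve than rendered pages and gives legitimate consumers of your data no reason to scrape the HTML. The route and field names below are made up:

    from flask import Flask, jsonify

    app = Flask(__name__)

    PRODUCTS = [
        {"id": 1, "name": "Example widget", "price": 9.99},
        {"id": 2, "name": "Example gadget", "price": 19.99},
    ]

    @app.route("/api/products")
    def api_products():
        return jsonify(PRODUCTS)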

Other thoughts listed here:

http://blog.screen-scraper.com/2009/08/17/further-thoughts-on-hindering-screen-scraping/

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow