Question

I'm developing a website and am sensitive to people screen scraping my data. I'm not worried about scraping one or two pages -- I'm more concerned about someone scraping thousands of pages as the aggregate of that data is much more valuable than a small percentage would be.

I can imagine strategies to block users based on heavy traffic from a single IP address, but the Tor network sets up many circuits that essentially mean a single user's traffic appears to come from different IP addresses over time.

I know that it is possible to detect Tor traffic as when I installed Vidalia with its Firefox extension, google.com presented me with a captcha.

So, how can I detect such requests?

(My website's in ASP.NET MVC 2, but I think any approach used here would be language independent)

Was it helpful?

Solution

I'm developing a website and am sensitive to people screen scraping my data

Forget about it. If it's on the web and someone wants it, it will be impossible to stop them from getting it. The more restrictions you put in place, the more you'll risk ruining user experience for legitimate users, who will hopefully be the majority of your audience. It also makes code harder to maintain.

I'll post countermeasures to any ideas future answers propose.

OTHER TIPS

You can check their ip address against a list of Tor Exit Nodes. I know for a fact this won't even slow someone down who is interested in scraping your site. Tor is too slow, most scrapers won't even consider it. There are tens of thousands of open proxy servers that can be easily scanned for or a list can be purchased. Proxy servers are nice because you can thread them or rotate if your request cap gets hit.

Google has been abused by tor users and most of the exit nodes are on Google black list and thats why you are getting a captcha.

Let me be perfectly clear: THERE IS NOTHING YOU CAN DO TO PREVENT SOMEONE FROM SCRAPING YOUR SITE.

By design of the tor network components it is not possible for the receiver to find out if the requester is the original source or if it's just a relayed request.

The behaviour you saw with Google was probably caused by a different security measure. Google detects if a logged-in user changes it's ip and presents a captcha just in case to prevent harmful interception and also allow the continuation of the session if an authenticated user really changed its IP (by re-logon to ISP, etc.).

I know this is old, but I got here from a Google search so I figured I'd get to the root concerns in the question here. I develop web applications, but I also do a ton of abusing and exploiting other peoples. I'm probably the guy you're trying to keep out.

Detecting tor traffic really isn't the route you want to go here. You can detect a good amount of open proxy servers by parsing request headers, but you've got tor, high anonymity proxies, socks proxies, cheap VPNs marketed directly to spammers, botnets and countless other ways to break rate limits. You also

If your main concern is a DDoS effect, don't worry about it. Real DDoS attacks take either muscle or some vulnerability that puts strain on your server. No matter what type of site you have, you're going to be flooded with hits from spiders as well as bad people scanning for exploits. Just a fact of life. In fact, this kind of logic on the server almost never scales well and can be the single point of failure that leaves you open to a real DDoS attack.

This can also be a single point of failure for your end users (including friendly bots). If a legitimate user or customer gets blocked you've got a customer service nightmare and if the wrong crawler gets blocked you're saying goodbye to your search traffic.

If you really don't want anybody grabbing your data, there are some things you can do. If it's a blog content or something, I generally say either don't worry about it or have summary only RSS feeds if you need feeds at all. The danger with scraped blog content is that it's actually pretty easy to take an exact copy of an article, spam links to it and rank it while knocking the original out of the search results. At the same time, because it's so easy people aren't going to put effort into targeting specific sites when they can scrape RSS feeds in bulk.

If your site is more of a service with dynamic content that's a whole other story. I actually scrape a lot of sites like this to "steal" huge amounts of structured proprietary data, but there are options to make it harder. You can limit the request per IP, but that's easy to get around with proxies. For some real protection relatively simple obfuscation goes a long way. If you try to do something like scrape Google results or download videos from YouTube you'll find out there's a lot to reverse engineer. I do both of these, but 99% of people who try fail because they lack the knowledge to do it. They can scrape proxies to get around IP limits but they're not breaking any encryption.

As an example, as far as I remember a Google result page comes with obfuscated javscript that gets injected into the DOM on page load, then some kind of tokens are set so you have to parse them out. Then there's an ajax request with those tokens that returns obfuscated JS or JSON that's decoded to build the results and so on and so on. This isn't hard to do on your end as the developer, but the vast majority of potential thieves can't handle it. Most of the ones that can won't put in the effort. I do this to wrap really valuable services Google but for most other services I just move on to some lower hanging fruit at different providers.

Hope this is useful for anyone coming across it.

I think the focus on how it is 'impossible' to prevent a determined and technically savvy user from scraping a website is given too much significance. @Drew Noakes states that the website contains information that when taken in aggregate has some 'value'. If a website has aggregate data that is readily accessible by unconstrained anonymous users, then yes, preventing scraping may be near 'impossible'.

I would suggest the problem to be solved is not how to prevent users from scraping the aggregate data, but rather what approaches could be used to remove the aggregate data from public access; thereby eliminating the target of the scrapers without the need to do the 'impossible', prevent scrapping.

The aggregate data should be treated like proprietary company information. Proprietary company information in general is not available publicly to anonymous users in an aggregate or raw form. I would argue that the solution to prevent the taking of valuable data would be to restrict and constrain access to the data, not to prevent scrapping of it when it is presented to the user.

1] User accounts/access – no one should ever have access to all the data in a within a given time period (data/domain specific). Users should be able to access data that is relevant to them, but clearly from the question, no user would have a legitimate purpose to query all the aggregate data. Without knowing the specifics of the site, I suspect that a legitimate user may need only some small subset of the data within some time period. Request that significantly exceed typical user needs should be blocked or alternatively throttled, so as to make scraping prohibitively time consuming and the scrapped data potentially stale.

2] Operations teams often monitor metrics to ensure that large distributed and complex systems are healthy. Unfortunately, it becomes very difficult to identify the causes of sporadic and intermittent problems, and often it is even difficult to identify that there is a problem as opposed to normal operational fluctuations. Operations teams often deal with statistical analysed historical data taken from many numerous metrics, and comparing them to current values to help identify significant deviations in system health, be they system up time, load, CPU utilization, etc.

Similarly, requests from users for data in amounts that are significantly greater than the norm could help identify individuals that are likely to be scrapping data; such an approach can even be automated and even extended further to look across multiple accounts for patterns that indicate scrapping. User 1 scrapes 10%, user 2 scrapes the next 10%, user 3 scrapes the next 10%, etc... Patterns like that (and others) could provide strong indicators of malicious use of the system by a single individual or group utilizing multiple accounts

3] Do not make the raw aggregate data directly accessible to end-users. Specifics matter here, but simply put, the data should reside on back end servers, and retrieved utilizing some domain specific API. Again, I assuming that you are not just serving up raw data, but rather responding to user requests for some subsets of the data. For example, if the data you have is detailed population demographics for a particular region, a legitimate end user would be interested in only a subset of that data. For example, an end user may want to know addresses of households with teenagers that reside with both parents in multi-unit housing or data on a specific city or county. Such a request would require the processing of the aggregate data to produce a resultant data set that is of interest to the end-user. It would prohibitively difficult to scrape every resultant data set retrieved from numerous potential permutations of the input query and reconstruct the aggregate data in its entirety. A scraper would also be constrained by the websites security, taking into account the # of requests/time, the total data size of the resultant data set, and other potential markers. A well developed API incorporating domain specific knowledge would be critical in ensuring that the API is comprehensive enough to serve its purpose but not overly general so as to return large raw data dumps.

The incorporation of user accounts in to the site, the establishment of usage baselines for users, the identification and throttling of users (or other mitigation approaches) that deviate significantly from typical usage patterns, and the creation of an interface for requesting processed/digested result sets (vs raw aggregate data) would create significant complexities for malicious individuals intent on stealing your data. It may be impossible to prevent scrapping of website data, but the 'impossibility' is predicated on the aggregate data being readily accessible to the scraper. You can't scrape what you can't see. So unless your aggregate data is raw unprocessed text (for example library e-books) end users should not have access to the raw aggregate data. Even in the library e-book example, significant deviation from acceptable usage patterns such as requesting large number of books in their entirety should be blocked or throttled.

You can detect Tor users using TorDNSEL - https://www.torproject.org/projects/tordnsel.html.en.

You can just use this command-line/library - https://github.com/assafmo/IsTorExit.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top