Question

I don't know much about SEO or how web spiders work, so forgive my ignorance here. I'm creating a site (using ASP.NET MVC) which has areas that display information retrieved from the database. The data is unique to the user, so there's no real server-side output caching going on. However, since the data can contain things the user may not wish to have displayed in search engine results, I'd like to prevent any spiders from accessing the search results page. Are there any special actions I should take to ensure that the search results directory isn't crawled? Also, would a spider even crawl a page that's dynamically generated, and would any actions preventing certain directories from being searched mess up my search engine rankings?

Edit: I should add that I'm reading up on the robots.txt protocol, but it relies on co-operation from the web crawler. I'd also like to stop data-mining users who will simply ignore the robots.txt file.

I appreciate any help!

Solution

You can prevent some malicious clients from hitting your server too heavily by implementing throttling on the server: "Sorry, your IP has made too many requests to this server in the past few minutes. Please try again later." In practice, though, assume that you can't stop a truly malicious user from bypassing any throttling mechanism you put in place.
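
As a rough illustration of that idea, here is a minimal sketch of per-IP throttling written as an ASP.NET MVC action filter. The ThrottleAttribute name, the cache key prefix, and the limits (10 requests per 30 seconds) are illustrative assumptions, not anything prescribed by the answer:

    using System;
    using System.Runtime.Caching;
    using System.Threading;
    using System.Web.Mvc;

    // Minimal sketch: limit each client IP to MaxRequests per WindowSeconds.
    public class ThrottleAttribute : ActionFilterAttribute
    {
        private class Counter { public int Count; }

        public int MaxRequests = 10;      // illustrative default
        public int WindowSeconds = 30;    // illustrative default

        public override void OnActionExecuting(ActionExecutingContext filterContext)
        {
            var ip = filterContext.HttpContext.Request.UserHostAddress;
            var key = "throttle:" + ip;

            // Reuse the cached counter for this IP, or add a fresh one that
            // expires at the end of the window.
            var fresh = new Counter();
            var counter = (Counter)MemoryCache.Default.AddOrGetExisting(
                key, fresh, DateTimeOffset.UtcNow.AddSeconds(WindowSeconds)) ?? fresh;

            if (Interlocked.Increment(ref counter.Count) > MaxRequests)
            {
                // 429 Too Many Requests, echoing the message suggested above.
                filterContext.Result = new HttpStatusCodeResult(429,
                    "Your IP has made too many requests to this server in the past few minutes. Please try again later.");
            }
        }
    }

Such a filter could then be applied to the search results action as [Throttle(MaxRequests = 10, WindowSeconds = 30)], though the exact limits would need tuning for real traffic.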

Given that, here's the more important question:

Are you comfortable with the information that you're making available for all the world to see? Are your users comfortable with this?

If the answer to those questions is no, then you should ensure that only authorized users are able to see the sensitive information. If the information isn't particularly sensitive but you just don't want clients crawling it, throttling is probably a good alternative. Is it even likely that you're going to be crawled anyway? If not, robots.txt should be just fine.

OTHER TIPS

It seems like you have two issues.

The first is a concern about certain data appearing in search results; the second is about malicious or unscrupulous users harvesting user-related data.

The first issue will be covered by appropriate use of a robots.txt file, as all the big search engines honour it.
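
For example, assuming the per-user results live under a hypothetical /Search path, a robots.txt file at the site root asking all crawlers to stay out of it would look like this:

    User-agent: *
    Disallow: /Search/

As the question notes, this relies entirely on the crawler's co-operation; it is a request, not an access control.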

The second issue seems more to do with data privacy. A few questions immediately spring to mind:

  • If there is user information which people may not want displayed, why are you making it available at all?
  • What is the privacy policy for such data?
  • Do users have the ability to control what information is made available?
  • If the information is potentially sensitive but important to the system, could it be restricted so it is only available to logged-in users?

Check out the Robots Exclusion Standard. It defines robots.txt, a text file that you put at the root of your site to tell a bot what it can and can't index. You will also want to think about what happens if a bot doesn't honour the robots.txt file.

Use a robots.txt file, as mentioned. If that is not enough, then you can:

  • Block unknown user agents - hard to maintain, and easy for a bot to forge a browser's user agent (although most legitimate bots won't); see the sketch below
  • Block unknown IP addresses - not useful for a public site
  • Require logins
  • Throttle user connections - tricky to tune, and you will still be disclosing information.

Perhaps use a combination. Either way it is a trade-off: if the public can browse to it, so can a bot. Be sure you don't block and alienate real users in your attempts to block bots.
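
As a rough sketch of the first option, assuming the check lives in the standard Global.asax.cs application class, you could reject any request whose User-Agent doesn't look like a mainstream browser. This is crude, easy to defeat, and may also turn away legitimate crawlers, which is exactly why the option is described as hard to maintain:

    using System;
    using System.Web;

    // Global.asax.cs - illustrative user-agent filtering, not a real whitelist.
    public class MvcApplication : HttpApplication
    {
        protected void Application_BeginRequest(object sender, EventArgs e)
        {
            // Very crude allow-list: most browsers (and many bots pretending
            // to be browsers) send a User-Agent beginning with "Mozilla".
            var userAgent = Request.UserAgent ?? string.Empty;

            if (!userAgent.StartsWith("Mozilla", StringComparison.OrdinalIgnoreCase))
            {
                Response.StatusCode = 403;      // Forbidden
                Response.SuppressContent = true;
                Response.End();                 // stop processing the request
            }
        }
    }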

A few options:

  • force the user to log in to view the content (see the sketch below)
  • add a CAPTCHA page before the content
  • embed the content in Flash
  • load the content dynamically with JavaScript
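
For the first option, ASP.NET MVC's built-in [Authorize] attribute is the usual mechanism. A minimal sketch, assuming a hypothetical SearchController for the per-user results page:

    using System.Web.Mvc;

    // Hypothetical controller serving the per-user search results.
    // [Authorize] turns away anonymous requests (including crawlers),
    // redirecting them to the login page instead of serving the content.
    [Authorize]
    public class SearchController : Controller
    {
        public ActionResult Results(string query)
        {
            // ... look up results for the currently logged-in user only ...
            return View();
        }
    }
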
Licensed under: CC-BY-SA with attribution