Question

Short question:

Has anybody got any C# code to parse robots.txt and then evaluate URLs against it, to see whether they would be excluded or not?

Long question:

I have been creating a sitemap for a new site yet to be released to Google. The sitemap has two modes, a user mode (like a traditional sitemap) and an 'admin' mode.

The admin mode will show all possible URLs on the site, including customized entry URLs or URLs for a specific outside partner - such as example.com/oprah for anyone who sees our site on Oprah. I want to track published links somewhere other than in an Excel spreadsheet.

I have to assume that someone might publish the /oprah link on their blog or somewhere else. We don't actually want this 'mini-Oprah site' to be indexed, because that would let non-Oprah viewers find the special Oprah offers.

So at the same time I was creating the sitemap, I also added URLs such as /oprah to be excluded from our robots.txt file.
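
(For reference, the rule itself is just a standard Disallow line in robots.txt; the path is the same illustrative /oprah example as above.)

    User-agent: *
    Disallow: /oprah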

Then (and this is the actual question) I thought, 'wouldn't it be nice to be able to show on the sitemap whether or not files are indexed and visible to robots?' This would be quite simple - just parse robots.txt and then evaluate a link against it.
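
In rough terms I am picturing something along these lines - an untested sketch only, with a made-up class name, that handles only 'User-agent: *' groups and plain Disallow prefixes (no wildcards, no Allow rules):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Untested sketch: collects Disallow prefixes from "User-agent: *" groups
    // and checks whether a given path starts with any of them.
    public class SimpleRobotsTxt
    {
        private readonly List<string> _disallowedPrefixes = new List<string>();

        public SimpleRobotsTxt(string robotsTxtContent)
        {
            bool inWildcardGroup = false;
            foreach (var rawLine in robotsTxtContent.Split('\n'))
            {
                // Strip comments and surrounding whitespace.
                var line = rawLine.Split('#')[0].Trim();
                if (line.Length == 0) continue;

                var parts = line.Split(new[] { ':' }, 2);
                if (parts.Length != 2) continue;

                var field = parts[0].Trim().ToLowerInvariant();
                var value = parts[1].Trim();

                if (field == "user-agent")
                    inWildcardGroup = (value == "*");
                else if (field == "disallow" && inWildcardGroup && value.Length > 0)
                    _disallowedPrefixes.Add(value);
            }
        }

        // True if the path (e.g. "/oprah") matches a Disallow prefix for all robots.
        public bool IsDisallowed(string path)
        {
            return _disallowedPrefixes.Any(prefix =>
                path.StartsWith(prefix, StringComparison.Ordinal));
        }
    }

Feeding it the downloaded robots.txt content (or File.ReadAllText on a local copy) and calling IsDisallowed("/oprah") would be enough for the sitemap flag.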

However this is a 'bonus feature' and I certainly don't have time to go off and write it properly (even though it's probably not that complex) - so I was wondering if anyone has already written any code to parse robots.txt?


Solution

Hate to say it, but just google "C# robots.txt parser" and click the first hit. It's a CodeProject article about a simple search engine implemented in C# called "Searcharoo", and it contains a class, Searcharoo.Indexer.RobotsTxt, described as:

  1. Check for, and if present, download and parse the robots.txt file on the site
  2. Provide an interface for the Spider to check each Url against the robots.txt rules
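
The exact class members are in the article, but whichever parser you end up with, wiring it into a sitemap comes down to something like the sketch below. This is my own illustration, not Searcharoo's API: the isDisallowed delegate stands in for whatever check method your chosen parser exposes.

    using System;
    using System.Collections.Generic;
    using System.Net;

    public static class SitemapRobotsCheck
    {
        // Downloads robots.txt from the site root and flags each sitemap path
        // as blocked or not. The isDisallowed delegate stands in for whatever
        // check your chosen robots.txt parser provides.
        public static IDictionary<string, bool> CheckPaths(
            Uri siteRoot,
            IEnumerable<string> sitemapPaths,
            Func<string, string, bool> isDisallowed)
        {
            string robotsContent;
            using (var client = new WebClient())
            {
                robotsContent = client.DownloadString(new Uri(siteRoot, "/robots.txt"));
            }

            var results = new Dictionary<string, bool>();
            foreach (var path in sitemapPaths)
            {
                // True means robots.txt would exclude this path.
                results[path] = isDisallowed(robotsContent, path);
            }
            return results;
        }
    }

The admin sitemap can then render a 'blocked by robots.txt' flag next to each entry from the returned dictionary.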

OTHER TIPS

I like the code and tests in http://code.google.com/p/robotstxt/ and would recommend it as a starting point.

A bit of self-promotion, but since I needed a similar parser and couldn't find anything I was happy with, I created my own:

http://nrobots.codeplex.com/

I'd love any feedback.

Licensed under: CC-BY-SA with attribution