Question

Let's say I have a web site for hosting community generated content that targets a very specific set of users. Now, let's say in the interest of fostering a better community I have an off-topic area where community members can post or talk about anything they want, regardless of the site's main theme.

Now, I want most of the content to get indexed by Google. The notable exception is the off-topic content. Each thread has its own page, but all the threads are listed in the same folder, so I can't just exclude search engines from a folder somewhere. It has to be per-page. A traditional robots.txt file would get huge, so how else could I accomplish this?


Solution

This will work for all well-behaved search engines. Just add it to the <head> of each page you want excluded:

<meta name="robots" content="noindex, nofollow" />
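Since the exclusion has to be per-page, the tag would normally be emitted by whatever template renders a thread. Here is a minimal sketch of that idea in Python. The category label `off-topic` and the helper names `robots_meta_tag` and `render_head` are assumptions made for illustration, not part of any particular framework.

```python
from html import escape

# Hypothetical label for threads that live in the off-topic area.
OFF_TOPIC = "off-topic"


def robots_meta_tag(category: str) -> str:
    """Return the robots meta tag for off-topic threads, else an empty string."""
    if category == OFF_TOPIC:
        return '<meta name="robots" content="noindex, nofollow" />'
    return ""


def render_head(title: str, category: str) -> str:
    """Build a minimal <head>, adding the robots tag only when it is needed."""
    parts = [f"<title>{escape(title)}</title>"]
    tag = robots_meta_tag(category)
    if tag:
        parts.append(tag)
    return "<head>\n  " + "\n  ".join(parts) + "\n</head>"


if __name__ == "__main__":
    print(render_head("Weekend plans?", "off-topic"))
    print(render_head("On-topic thread", "main"))
```

Only the off-topic thread gets the tag, so the rest of the site stays indexable without maintaining a list of URLs anywhere.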

OTHER TIPS

If using Apache, I'd use mod_rewrite to alias robots.txt to a script that dynamically generates the necessary content.

Edit: If using IIS, you could use ISAPI_Rewrite to do the same.

Similarly to @James Marshall's suggestion: in ASP.NET you could use an HttpHandler to route requests for robots.txt to a script that generates the content.

You can implement it by replacing robots.txt with a dynamic script that generates the output. With Apache, you could use a simple .htaccess rule to achieve that:

RewriteRule  ^robots\.txt$ /robots.php [NC,L]
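The script that such a rule points at only has to print one Disallow line per excluded thread. The rule above targets a PHP script; the sketch below shows the same idea in Python as a small WSGI application. The function names and the hard-coded paths in `load_off_topic_paths` are hypothetical; a real site would query its own thread storage there.

```python
def load_off_topic_paths():
    """Hypothetical stand-in for a database query returning off-topic thread URLs."""
    return ["/threads/42", "/threads/7"]


def build_robots_txt(off_topic_paths):
    """Return robots.txt content with one Disallow line per excluded path."""
    lines = ["User-agent: *"]
    paths = sorted(set(off_topic_paths))
    if not paths:
        # An empty Disallow value means nothing is blocked.
        lines.append("Disallow:")
    for path in paths:
        lines.append(f"Disallow: {path}")
    return "\n".join(lines) + "\n"


def app(environ, start_response):
    """WSGI entry point that serves the generated file as plain text."""
    body = build_robots_txt(load_off_topic_paths()).encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


if __name__ == "__main__":
    print(build_robots_txt(load_off_topic_paths()), end="")
```

Generating the file on request keeps it in sync with the thread list, although the output still grows with the number of off-topic threads.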

Just for that thread, make sure the <head> contains a noindex meta tag. That's one more way to tell search engines not to index your page, other than blocking it in robots.txt.

Just keep in mind that a robots.txt Disallow will NOT prevent Google from indexing pages that have links from external sites; all it does is prevent crawling. See http://www.webmasterworld.com/google/4490125.htm or http://www.stonetemple.com/articles/interview-matt-cutts.shtml.

You can keep search engines from reading or indexing specific content by using robots meta tags. Well-behaved spiders follow these instructions and index only the pages you want indexed.

To block dynamic pages (URLs with query strings) via robots.txt, use rules like these:

User-agent: *
Disallow: /setnewsprefs?
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow