Question

I have recently started using Google Webmaster Tools.

I was quite surprised to see just how many links Google is trying to index.

http://www.example.com/?c=123
http://www.example.com/?c=82
http://www.example.com/?c=234
http://www.example.com/?c=991

These are all campaigns that exist as links from partner sites.

For right now they're all being denied by my robots.txt file until the site is complete - as is EVERY page on the site.

I'm wondering what the best approach to dealing with links like this is - before I make my robots.txt file less restrictive.

I'm concerned that they will be treated as different URLs and start appearing in Google's search results. They all correspond to the same page - give or take. I don't want people finding them as they are and clicking on them.

My best idea so far is to render any page that contains a query string as follows:

 // DO NOT TRY THIS AT HOME. See edit below
 <%-- Emit a noindex/nofollow tag whenever the request carries a query string --%>
 <% if (Request.QueryString.Count > 0) { %>

    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

 <% } %>

Do I need to do this? Is this the best approach?

Edit: This turns out NOT TO BE A GOOD APPROACH. It turns out that Google is seeing NOINDEX on a page that has the same content as another page that does not have NOINDEX. Apparently it figures they're the same thing and the NOINDEX takes precedence. My site completely disappeared from Google as a result. Caveat: it could have been something else I did at the same time, but I wouldn't risk this approach.


Solution

This is the sort of thing that rel="canonical" was designed for. Google posted a blog article about it.
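
For example, a minimal sketch of the tag (assuming the campaign parameter is the only thing you need to collapse, and an ASP.NET page like the snippet in the question) - placed in the page's <head>, it tells Google that every ?c= variant is the same document:

    <%-- Sketch: point every ?c= variant back at the clean URL.
         GetLeftPart(UriPartial.Path) returns the current URL without its query string. --%>
    <link rel="canonical" href="<%= Request.Url.GetLeftPart(UriPartial.Path) %>" />

Unlike the NOINDEX approach from the question, this keeps the page indexable - it just consolidates the duplicate URLs onto one.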

OTHER TIPS

Yes, Google would interpret them as different URLs.

Depending on your web server you could use a rewrite filter to remove the parameter for search engines, e.g. UrlRewriteFilter for Tomcat, or mod_rewrite for Apache.

Personally I'd just redirect to the same page with the tracking parameter removed.
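
A rough sketch of that in ASP.NET (hypothetical - it assumes the campaign value has already been logged before the redirect, that c is the only parameter to strip, and that you're on .NET 4+ for RedirectPermanent):

    // Sketch: 301-redirect campaign URLs to the clean URL once the c value has been recorded.
    protected void Page_Load(object sender, EventArgs e)
    {
        if (!string.IsNullOrEmpty(Request.QueryString["c"]))
        {
            string cleanUrl = Request.Url.GetLeftPart(UriPartial.Path);
            Response.RedirectPermanent(cleanUrl); // permanent redirect, so engines drop the ?c= variants
        }
    }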

That seems like the best approach unless the page exists in its own folder, in which case you can modify the robots.txt file just to ignore that folder.
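
For example, if those campaign pages all lived under a /campaigns/ folder (a hypothetical path), the robots.txt entry would simply be:

    User-agent: *
    Disallow: /campaigns/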

For resources that should not be indexed I prefer to do a simple return in the page load:

if (IsBot(Request.UserAgent))
    return;
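
IsBot isn't a framework method - a rough sketch of such a helper, matching a few common crawler user-agent substrings (illustrative, not exhaustive), could be:

    // Hypothetical helper: crude user-agent check for common crawlers.
    private static bool IsBot(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent))
            return false;

        string[] botTokens = { "googlebot", "bingbot", "slurp", "baiduspider" };
        string ua = userAgent.ToLowerInvariant();

        foreach (string token in botTokens)
        {
            if (ua.Contains(token))
                return true;
        }
        return false;
    }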
Licensed under: CC-BY-SA with attribution