Question

I have a serious question. Is it ever ethical to ignore the presence of a robots.txt file on a website? These are some of the considerations I've got in mind:

  1. If someone puts a web site up they're expecting some visits. Granted, web crawlers are using bandwidth without clicking on ads that may support the site but the site owner is putting their site on the web, right, so how reasonable is it for them to expect that they'll never get visited by a bot?

  2. Some sites apparently use a robots.txt exactly in order to keep their site from being crawled by Google or some other utility that might grab prices and therefore allow people to do price comparisons easily. They have private search engines on the site so they obviously want people to be able to search the site; apparently they just don't want people to be able to easily compare their information with other vendors.

As I said, I'm not trying to be argumentative; I would just like to know if anyone has ever come up with a case where it's ethically permissible to ignore the presence of a robots.txt file? I cannot think of a case where it's permissible to ignore the robots.txt mainly because people (or businesses) are paying money to put up their web sites so they should be able to tell the Googles/Yahoos/Other SE's of the world that they don't want to be on their indices.

To put this discussion in context, I'd like to create a price comparison website and one of the major vendors has a robots.txt that basically prevents anyone from grabbing their prices. I'd like to be able to get their information but, as I said, I can't justify simply ignoring the wishes of the site owner.

I have seen some very sharp discussion here and that's why I would like to hear the opinions of developers that follow Stack Overflow.

By the way, there is some discussion of this topic on a Hacker News question but they seem to mainly focus on the legal aspects of this.

Was it helpful?

Solution

Arguments:

  1. A robots.txt file is an implied license, especially since you are aware of it. Thus, continuing to scrape their site could be seen as unauthorized access (i.e., hacking). Sucks, but arguments like this have been made in other legal cases recently (not directly related to robots.txt, but in relation to other "passive controls".)
  2. Grabbing prices violates no copyright law, including DMCA, since copyright does not include factual information, only creative.
  3. Ethically, you should not grab prices because the vendor should have the ability to change prices without worrying about being accused of a bait/switch by people coming from your site.
  4. Have you taken the high road, explaining the site to them and saying you'd love to include them in your list of vendors? Maybe they will love the idea and actually expose the data in a way that is easy for you to consume and less resource-intensive for them to produce.
  5. There are no laws written directly about robots.txt because netiquette is generally followed. Don't be one of the "bad guys."
  6. Some people filter robots because they use URL links to perform "actions" like adding things to carts, and robots leave them with massive numbers of abandoned shopping carts in their database.
  7. Some people filter robots because they have exclusive prices that they can't advertise openly based on agreements with their vendors. You could be putting them in a bad position by exposing those prices on your site.
  8. In this economy, if a company doesn't want to do everything possible to advertise themselves, it's their own fault that you don't include them.

OTHER TIPS

The other use of robots.txt is to help protect web spiders from themselves. It's relatively easy for a web spider to get mired in an infinitely deep forest of links, and a properly constructed robots.txt file will tell the spider that "you don't need to go here".

Many people have tried to build businesses off building "price comparison" engines that scraped major sites.

Once you start getting any sort of traffic/revenue to speak of, you will receive a cease and desist. It's happened to dozens, if not hundreds of projects. I even worked on a small project that received a C&D from Craigslist.

You know how they say "It's easier to ask forgiveness than it is to get permission"? It doesn't hold true with page scraping. Get permission, or you will be hearing from their lawyers.

If you're lucky, it'll be early on, when you've got nothing to lose. If it's late, you may lose your business and all your work overnight, with a single letter.

Getting permission shouldn't be hard. Unless you're doing something sneaky, you're likely going to drive them additional traffic. Hell, once your product takes off, sites may be begging you, or even paying you to add their data.

"No" means "no".

To answer the narrow question, for the price comparison website you're probably best grabbing the price in real time, rather then scrapping the database in advance. Hard to imagine that being a problem.

One reason we allow robots to dig through the web without complaint is that we have a way to stop them if we want to. Protects both sides.

Remember the uproar when Cuil's robots were accused of going over-the-top, apparently acting like a DoS attack in some cases and using up bandwidth allowances of some small sites?

If too many people violate robots.txt we might get something worse.

An interesting IRL version of story involving The Harvard Coop: Coop Calls Cops On ISBN Copiers.

Short answer: No.

On the narrow issue: If a seller says that their prices are secret, I think you have to respect that. I'd contact them and ask if they really don't want price comparison engines like yours to include them, or if the "no trespassing" sign is for technical reasons. If the latter, perhaps they'll provide you with an alternative. If the former, then I'd say too bad, they don't get included, they lose some business, and it's their problem.

Tangential rant: Personally, I get pretty annoyed with companies that make me jump through hoops to find out the price of their products, places that make me call and talk to a salesman so he can give me a hard-sell pitch, or worse, make me give them my phone number so their salesman can call and harass me. I figure that if they're afraid to tell me the price, it probably means that it's too high.

In general: A robots.txt file is like a "No Trespassing" sign. It's the owner's right to say who is allowed on their property. If you think their reasons are dumb, you can politely suggest they take the sign down. But you don't have the right to disregard their wishes. If someone puts a No Trespassing sign on his yard, and I say, "Hey, I just want to take a quick short cut, what's the big deal?" -- Maybe I'm stepping on his prized Bulgarian violet bulbs and destroying a valuable investment. Maybe I'm crossing his people's sacred burial ground and offending their religious sensibilities. Or maybe he's just an ornery jerk. But it's still his property and his right. Oh, and if I fall into the dangerous sinkhole after ignoring the No Trespassing sign, who's to blame? (In America, I could probably still sue him for all he's worth despite the fact that he warned me, but is that right?)

I'm showing some ignorance here, but I always thought a bot was something only sent out by a search engine. Like Google or Yahoo.

Thus, if you wrote an application that searched content on the internet, I wouldn't consider that a search engine bot, which to my knowledge is what robots.txt is trying to block.

But this may just be selective ignorance, because I might do it until the webmaster of that site contacted me and asked me to stop :)

If people make it available to public access, they shouldn't try to put limits on it. Adding a robots.txt file to your site is the equivalent to putting a sign on your lawn that says "Please don't look at me."

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top