Question

I am about to create a robots.txt file.

I am using Notepad.

How should I save the file? UTF8, ANSI or what?

Also, should it be a capital R?

And in the file, I am specifying a sitemap location. Should this be with a capital S?

  User-agent: *
  Sitemap: http://www.domain.se/sitemap.xml

Thanks


Solution

Since the file should consist of only ASCII characters, it normally doesn't matter if you save it as ANSI or UTF-8.

However, if you have a choice, pick ANSI: when you save a file as UTF-8, Notepad adds the Unicode byte order mark (BOM) to the front of the file, which may make the file unreadable for interpreters that only understand ASCII.
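To make the difference concrete, here is a minimal sketch in Python using the two lines from the question. It assumes that Python's `utf-8-sig` codec is a fair stand-in for an editor that saves "UTF-8 with BOM"; the variable names are only illustrative.

```python
import codecs

# The example content from the question; it contains only ASCII characters.
content = "User-agent: *\nSitemap: http://www.domain.se/sitemap.xml\n"

ascii_bytes = content.encode("ascii")
utf8_bytes = content.encode("utf-8")          # UTF-8 without a BOM
utf8_bom_bytes = content.encode("utf-8-sig")  # UTF-8 with a leading BOM

# For ASCII-only text, ASCII and BOM-less UTF-8 produce identical bytes.
print(ascii_bytes == utf8_bytes)

# The BOM variant differs only by the three extra bytes at the front.
print(utf8_bom_bytes[:3] == codecs.BOM_UTF8)
```

So the encoding name itself is not the problem for an ASCII-only file; the three extra bytes at the start are.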

OTHER TIPS

As for the encoding: @Roland already nailed it. The file should contain only directives and URLs. Non-ASCII characters are not allowed in URLs unless they are percent-encoded, so saving the file as ASCII should be just fine.

If you need to serve UTF-8 for some reason, make sure the charset is specified correctly in the Content-Type header of the text file (for example, text/plain; charset=utf-8). You will have to set this in your web server's settings.

As to case sensitivity:

  • According to robotstxt.org, the robots.txt file needs to be lowercase:

    Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT".

  • The keywords are probably case-insensitive - I can't find a reference on that - but I would tend to do what everyone else does: use the capitalized versions (Sitemap).
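As a rough illustration of what case-insensitive keywords would mean in practice, here is a small Python sketch of a line parser that lowercases the keyword before comparing it. This is an assumption about how a tolerant crawler might behave, not the behaviour of any particular crawler; `parse_robots_lines` is a hypothetical helper name.

```python
def parse_robots_lines(text: str) -> list[tuple[str, str]]:
    """Split robots.txt text into (keyword, value) pairs with lowercased keywords."""
    records = []
    for raw_line in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so "http://..." values stay intact.
        field, value = line.split(":", 1)
        records.append((field.strip().lower(), value.strip()))
    return records


sample = "User-agent: *\nSITEMAP: http://www.domain.se/sitemap.xml\n"
print(parse_robots_lines(sample))
```

With a parser like this, "Sitemap", "sitemap" and "SITEMAP" are all treated the same, which is why following the common capitalized spelling is a safe choice.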

I believe robots.txt "should" be UTF-8 encoded.

"The expected file format is plain text encoded in UTF-8. The file consists of records (lines) separated by CR, CR/LF or LF."

(from https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt)

However, Notepad and some other programs insert a 3-byte BOM (byte order mark) at the beginning of the file, which prevents Google from reading the first line (it shows an "invalid syntax" error).

Either remove the BOM or, much easier, add a line break at the top of the file so that the first line of instructions is on line two.

The "invalid syntax" error caused by the BOM then only affects the first line, which is now empty.

The rest of the lines will be read successfully.
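If you would rather remove the BOM than work around it, a minimal Python sketch could look like this. The helper name `strip_utf8_bom` is hypothetical, and the sketch only handles the UTF-8 BOM, not UTF-16 or UTF-32 marks.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM; other input is returned unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Simulate a file saved as "UTF-8 with BOM".
with_bom = codecs.BOM_UTF8 + b"User-agent: *\n"
cleaned = strip_utf8_bom(with_bom)
print(cleaned)
```

To apply it to a real file, read the file in binary mode, pass the bytes through the function, and write the result back in binary mode.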

I think you're overthinking this. I always use lowercase, just because it's easier.

You can view SO's robots.txt. https://stackoverflow.com/robots.txt

I recommend encoding robots.txt either in UTF-8 without a BOM or in ASCII.

For URLs that contain non-ASCII characters, I suggest either using UTF-8, which is fine in most cases, or percent-encoding (URL-encoding) the URL so that every character is represented in ASCII.
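As a sketch of the percent-encoding option, the following Python snippet encodes the path of a URL using the standard library. The helper name `percent_encode_path` and the example URL with "ö" are made up for illustration, and the sketch only touches the path, not the host name or query string.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_path(url: str) -> str:
    """Percent-encode non-ASCII characters in the path component of a URL."""
    parts = urlsplit(url)
    # Keep "/" as a separator and "%" so existing escapes are not encoded twice.
    encoded_path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, encoded_path, parts.query, parts.fragment)
    )


# Hypothetical sitemap URL containing a non-ASCII character.
print(percent_encode_path("http://www.domain.se/sidkarta/översikt.xml"))
# prints: http://www.domain.se/sidkarta/%C3%B6versikt.xml
```

The encoded form contains only ASCII characters, so the file can then be saved as plain ASCII without losing anything.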

Take a look at Wikipedia's robots.txt file - it's UTF-8 encoded.


I suggest you use ANSI, because if your robots.txt is saved as UTF-8 with a BOM, it will be marked as faulty in Google's Search Console due to the Unicode byte order mark added to its beginning (as mentioned by Roland Illig above).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow