robots.txt: What encoding?
26-09-2019
Question
I am about to create a robots.txt file.
I am using notepad.
How should I save the file: UTF-8, ANSI, or something else?
Also, should the filename start with a capital R?
And in the file, I am specifying a sitemap location. Should this be with a capital S?
User-agent: *
Sitemap: http://www.domain.se/sitemap.xml
Thanks
Solution
Since the file should consist of only ASCII characters, it normally doesn't matter whether you save it as ANSI or UTF-8.
However, choose ANSI if you have the option: when you save a file as UTF-8, Notepad prepends the Unicode byte order mark (BOM), which may make the file unreadable for parsers that only understand ASCII.
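If you generate the file with a script instead of Notepad, you can avoid the BOM entirely. A minimal Python sketch (the rules are the ones from the question):

```python
# Write robots.txt as pure ASCII: open() with encoding="ascii" never emits a
# BOM, and it raises an error if a non-ASCII character slips into the rules.
# (encoding="utf-8", as opposed to "utf-8-sig", would also write no BOM.)
rules = "User-agent: *\nSitemap: http://www.domain.se/sitemap.xml\n"

with open("robots.txt", "w", encoding="ascii", newline="\n") as f:
    f.write(rules)
```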
OTHER TIPS
As for the encoding: @Roland already nailed it. Apart from the directive keywords, the file contains only URLs, and non-ASCII characters in URLs are illegal, so saving the file as ASCII should be just fine.
If you need to serve UTF-8 for some reason, make sure this is specified correctly in the Content-Type header of the text file. You will have to set this in your web server's settings.
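What a correct header looks like can be sketched with Python's standard-library header parsing; the header value below is an assumed example, not something from the question:

```python
from email.message import Message

# A hypothetical Content-Type header as a web server might send it for a
# UTF-8 robots.txt; the charset parameter is the part that matters.
headers = Message()
headers["Content-Type"] = "text/plain; charset=utf-8"

media_type = headers.get_content_type()   # "text/plain"
charset = headers.get_param("charset")    # "utf-8"
print(media_type, charset)
```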
As to case sensitivity:
According to robotstxt.org, the robots.txt file needs to be lowercase:
Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT".
The keywords are probably case-insensitive - I can't find a reference on that - but I would tend to do what everyone else does: use capitalized versions (Sitemap).
I believe robots.txt "should" be UTF-8 encoded:
"The expected file format is plain text encoded in UTF-8. The file consists of records (lines) separated by CR, CR/LF or LF."
(from https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt)
But Notepad and other programs will insert a 3-byte BOM (byte order mark) at the beginning of the file, which leaves Google unable to read the first line (it reports an "invalid syntax" error).
Either remove the BOM or, much easier, add a line break on the first row so that the first line of instructions starts on line two.
The "invalid syntax" error caused by the BOM will then only affect the first line, which is now empty; the rest of the lines will be read successfully.
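The stray BOM can also be stripped programmatically. A small Python sketch of both halves - first simulating Notepad's UTF-8 save, then removing the three BOM bytes (the filename and rules are examples):

```python
BOM = b"\xef\xbb\xbf"  # the 3-byte UTF-8 byte order mark

# Simulate what Notepad's "UTF-8" save produces: a BOM before the first rule.
with open("robots.txt", "wb") as f:
    f.write(BOM + b"User-agent: *\nSitemap: http://www.domain.se/sitemap.xml\n")

# Strip the BOM if present; the directives themselves are left untouched.
# (Reading with encoding="utf-8-sig" would drop the BOM transparently, too.)
data = open("robots.txt", "rb").read()
if data.startswith(BOM):
    with open("robots.txt", "wb") as f:
        f.write(data[len(BOM):])
```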
I think you're overthinking this. I always use lowercase, simply because it's easier.
You can view Stack Overflow's robots.txt at https://stackoverflow.com/robots.txt.
I recommend either encoding robots.txt in UTF-8 without a BOM, or encoding it in ASCII.
For URLs that contain non-ASCII characters, I suggest either using UTF-8, which is fine in most cases, or percent-encoding them so that every character is represented in ASCII.
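Percent-encoding can be done with Python's standard library; the Swedish path below is a made-up example, not one from the question:

```python
from urllib.parse import quote

# quote() UTF-8-encodes each non-ASCII character and writes it as %XX
# escapes, so the resulting line in robots.txt is plain ASCII.
path = "/sidor/smörgåsbord.html"
ascii_path = quote(path, safe="/")
print(ascii_path)  # /sidor/sm%C3%B6rg%C3%A5sbord.html
```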
Take a look at Wikipedia's robots.txt file - it's UTF-8 encoded.
See references:
- http://hakre.wordpress.com/2010/07/20/encoding-of-the-robots-txt-file/
- http://www.bing.com/blogs/site_blogs/b/webmaster/archive/2009/11/05/robots-speaking-many-languages.aspx
- http://vincentwehren.com/2011/04/09/robots-txt-utf-8-and-the-utf-8-signature/
- http://www.seroundtable.com/archives/017801.html
I suggest using ANSI, because if your robots.txt is saved as UTF-8, it will be marked as faulty in Google's Search Console due to the byte order mark added at its beginning (as Roland Illig mentioned above).