Question

I see that many sites (Amazon, Wikipedia, and others) use UTF-8-encoded, percent-escaped Unicode in their URLs, and that those URLs are displayed in readable form by (at least) Chrome.

For example, we would represent http://ja.wikipedia.org/wiki/メインページ as http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8 when writing our HTTP headers, and Chrome and Firefox seem to handle this gracefully. (I didn't test on IE.)

Is there a governing standard for this behavior? Or is it strictly a de facto standard? Or is it completely non-standard?

I'd really like to see a link to the defining paragraph of some RFC.

No correct solution

OTHER TIPS

The URI standard (RFC 3986, section 2.5) says:

When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded.

That seems pretty definitive.
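
As a rough illustration of the procedure quoted above (encode the text as UTF-8 octets, then percent-encode every octet outside the unreserved set), here is a minimal Python sketch using the standard library's urllib.parse. The variable names are only illustrative, and keeping "/" unescaped is an assumption made so the path separators survive.

```python
from urllib.parse import quote, unquote

# Path component taken from the Wikipedia URL in the question.
path = "/wiki/メインページ"

# quote() encodes the string as UTF-8 by default, then percent-encodes
# every byte that is not an unreserved character or listed in `safe`.
encoded = quote(path, safe="/")

# unquote() reverses the process, decoding the octets as UTF-8.
decoded = unquote(encoded)

print(encoded)
print(decoded)
```

The encoded form should match the escaped path shown in the question, which suggests the sites mentioned there are following this same recipe.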

I'm still unsure about when it was ratified, or the current browser support.

RFC 3987 defines Internationalized Resource Identifiers (IRIs), which extend URIs so that Unicode characters can appear directly. RFC 3986, the generic URI standard, restricts a URI itself to a subset of ASCII, so non-ASCII characters cannot appear in it unencoded. Anyone not using IRIs has to choose their own way of encoding such characters for their own needs. Percent-encoding the UTF-8 octets is one way, but it is certainly not the only one actually in use.
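
For comparison, a simplified sketch of how an IRI might be mapped to a plain ASCII URI: the host is converted with IDNA and the remaining components have their UTF-8 octets percent-encoded. The function name iri_to_uri and the SAFE constant are hypothetical, and the sketch assumes there is no userinfo and no IPv6 literal in the authority; it does not implement every rule in RFC 3987.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched: URI delimiters, plus "%" so that existing
# percent-escapes are not encoded a second time. (Illustrative choice.)
SAFE = "/:@!$&'()*+,;=%?"


def iri_to_uri(iri: str) -> str:
    """Map an IRI to an ASCII-only URI (simplified; see assumptions above)."""
    parts = urlsplit(iri)

    # Host names use IDNA (Punycode) rather than percent-encoding.
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"

    # Path, query, and fragment: UTF-8 octets, percent-encoded.
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=SAFE),
        quote(parts.query, safe=SAFE),
        quote(parts.fragment, safe=SAFE),
    ))


print(iri_to_uri("http://ja.wikipedia.org/wiki/メインページ"))
```

Note that the host and the path are treated differently here: percent-encoding applies to the path, query, and fragment, while the host goes through IDNA.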

Licensed under: CC-BY-SA with attribution