How to transform an encoded URL into readable text?

https://stackoverflow.com/questions/18683371

28-06-2022

Question

This is about Bangla Unicode text, but it can be a problem for any language that doesn't use Latin glyphs.
I host a Bangla blog with all its text and categories in Bangla (I prefer not to say Bengali, because the name of the language is Bangla rather than Bengali).

So the Bangla category "বাংলা" has a URL like:
http://www.example.com/category/বাংলা

But whenever I copy the URL from the address bar and paste it into a chat panel or somewhere else, it changes into some strange characters, for example:
http://www.example.com/category/%E0%A6%B8%E0%A7%8D%E0%A6%A8%E0 (*)

(*) This is just an example, not the exact gibberish for the word "বাংলা".

So in many cases I get encoded URLs like the one above, with no way to tell which Unicode text they represent. Recently, one of my plugins has been logging some 404 errors. In its log I found a URI like:

/category/%E0%A6%B8%E0%A7%8D%E0%A6%A8%E0%A6%BE%E0%A7%9F%E0%A7%81%E0%A6%AC%E0%A6%BF%E0%A6%A6%E0%A7%8D%E0%A6%AF%E0

I used Jetpack's Omnisearch to look for a match, but the result was empty. I can't even tell which category is causing this 404.

So here comes the question:

How can I transform the encoded URL to readable glyphs?

Solution

http://www.example.com/category/বাংলা

isn't a URL; URLs can only contain ASCII characters. This is an IRI.

http://www.example.com/category/%E0%A6%AC%E0%A6%BE%E0%A6%82%E0%A6%B2%E0%A6%BE

is the URI representation of that IRI. They are otherwise equivalent. A browser may display the ‘pretty’ IRI version in the user interface, but put the URI version on the clipboard so that you can paste it into other tools that don't support IRI.
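To illustrate the equivalence, here is a minimal Python sketch (Python is my choice here, not something the question specifies) that percent-encodes the path of the IRI and decodes it again using the standard library's `urllib.parse`. The variable names are only illustrative, and only the path is handled; a non-ASCII host name would need separate treatment.

```python
from urllib.parse import quote, unquote

# Path part of the IRI from the question (illustrative variable names).
iri_path = "/category/বাংলা"

# quote() encodes the text as UTF-8 and percent-escapes every non-ASCII byte;
# "/" is left alone by default.
uri_path = quote(iri_path)
print(uri_path)
# /category/%E0%A6%AC%E0%A6%BE%E0%A6%82%E0%A6%B2%E0%A6%BE

# unquote() reverses the process, so the round trip gives back the IRI path.
print(unquote(uri_path) == iri_path)
# True
```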

The 404 address you pasted translates to:

/category/স্নায়ুবিদ্য�

where the last character is a � because it is an invalid, truncated UTF-8 sequence. (This is probably why the request failed.) Someone may have mis-pasted a partial URI here.
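The decoding itself can be reproduced with any percent-decoding function. As a sketch, assuming Python is available, `urllib.parse.unquote` turns the logged URI back into readable glyphs; by default it substitutes U+FFFD (�) for invalid UTF-8, and with `errors="strict"` it raises instead, which gives a simple way to flag damaged URIs. The helper name `has_invalid_utf8` is hypothetical.

```python
from urllib.parse import unquote

# URI copied from the 404 log in the question.
logged = ("/category/%E0%A6%B8%E0%A7%8D%E0%A6%A8%E0%A6%BE%E0%A7%9F%E0%A7%81"
          "%E0%A6%AC%E0%A6%BF%E0%A6%A6%E0%A7%8D%E0%A6%AF%E0")

# Default errors="replace": invalid bytes become U+FFFD instead of raising.
decoded = unquote(logged)
print(decoded)


def has_invalid_utf8(uri: str) -> bool:
    """Return True if the percent-decoded bytes are not valid UTF-8."""
    try:
        unquote(uri, errors="strict")
    except UnicodeDecodeError:
        return True
    return False


# The trailing lone %E0 is the start of a three-byte sequence that never
# completes, so the strict decode fails.
print(has_invalid_utf8(logged))
```

In a browser console, `decodeURIComponent` does the same job, but it throws on a malformed sequence like this one rather than substituting �.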

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow