Question

I'm working on Wikipedia. In many articles I've noticed that some URLs, for example https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99, are very long. The example URL could be replaced with "https://www.google.com/search?q=%26ฉัน" (ฉัน is a Thai word), which is shorter and cleaner. However, when I use the urllib.unquote function to decode the URL, it decodes %26 as well and returns "https://www.google.com/search?q=&ฉัน". As you might have noticed, this URL is useless: the decoded & now acts as a query-parameter separator, so it is no longer a valid link to the same search.

Therefore, I want to know how to decode the link so that it stays valid. I think that decoding only the non-ASCII characters would give a valid URL. Is that correct, and how can I do it?

Thanks :)

Solution

The easiest way is to replace every URL-encoded sequence below %80 (%00-%7F) with a placeholder, URL-decode the result, and then put the original encoded sequences back in place of the placeholders.
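
The placeholder approach could be sketched roughly as follows. This assumes Python 3, where the function lives at urllib.parse.unquote (the question uses the Python 2 name urllib.unquote), and it assumes the URL never contains a raw NUL character, which is used here as the placeholder. The name unquote_non_ascii is made up for this example.

```python
import re
from urllib.parse import unquote

# Percent-escapes for bytes %00-%7F, i.e. the ASCII range.
ASCII_ESCAPE = re.compile(r"%[0-7][0-9A-Fa-f]")

# Assumption: a raw NUL never appears in a real URL, so it is a safe placeholder.
PLACEHOLDER = "\x00"


def unquote_non_ascii(url):
    """Decode only the non-ASCII percent-escapes of a UTF-8 encoded URL."""
    saved = []

    def stash(match):
        # Remember the original escape and leave a placeholder behind.
        saved.append(match.group(0))
        return PLACEHOLDER

    masked = ASCII_ESCAPE.sub(stash, url)
    decoded = unquote(masked)  # only escapes %80 and above are left to decode
    restore = iter(saved)
    # Put the stashed escapes back, in their original order.
    return re.sub(PLACEHOLDER, lambda m: next(restore), decoded)


print(unquote_non_ascii(
    "https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99"))
```

Note that unquote replaces byte sequences that are not valid UTF-8 with U+FFFD by default, so this sketch is only safe on URLs that really are UTF-8 encoded.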

Another way is to look for UTF-8 sequences. Your URL appears to be encoded in UTF-8, and Wikipedia uses UTF-8. See the Wikipedia entry for UTF-8 for how UTF-8 characters are encoded.

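So, when encoded in URLs, each valid non-ASCII UTF-8 character follows one of these patterns:

  • (%C2-%DF)(%80-%BF)
  • (%E0-%EF)(%80-%BF)(%80-%BF)
  • (%F0-%F4)(%80-%BF)(%80-%BF)(%80-%BF)

The lead bytes %C0 and %C1 can only start overlong encodings, and RFC 3629 caps UTF-8 at four bytes with a lead byte no higher than %F4. The 5- and 6-byte forms from the original UTF-8 design (lead bytes %F8-%FB and %FC-%FD) therefore never occur in valid UTF-8.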

So you can match these patterns in the URL and unquote each character separately.
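
A minimal sketch of this pattern-matching approach, assuming Python 3 (urllib.parse). The regular expression only accepts lead bytes that are valid under RFC 3629 (%C2-%DF, %E0-%EF, %F0-%F4), and each match is additionally decoded strictly so that anything the patterns still let through, such as encoded surrogates, is left untouched. The names unquote_utf8_only and UTF8_ESCAPED_CHAR are invented for illustration.

```python
import re
from urllib.parse import unquote_to_bytes

_HEX = "[0-9A-Fa-f]"
_CONT = "%[89ABab]" + _HEX  # continuation byte, %80-%BF

# One percent-encoded non-ASCII UTF-8 character (2, 3 or 4 bytes).
UTF8_ESCAPED_CHAR = re.compile(
    "%(?:[Cc][2-9A-Fa-f]|[Dd]" + _HEX + ")" + _CONT  # lead byte %C2-%DF
    + "|%[Ee]" + _HEX + _CONT * 2                     # lead byte %E0-%EF
    + "|%[Ff][0-4]" + _CONT * 3                       # lead byte %F0-%F4
)


def _decode_char(match):
    try:
        # Strict decoding rejects overlong forms, surrogates and the like.
        return unquote_to_bytes(match.group(0)).decode("utf-8")
    except UnicodeDecodeError:
        return match.group(0)  # not valid UTF-8: keep the escape as it was


def unquote_utf8_only(url):
    """Decode percent-encoded non-ASCII UTF-8 characters, one at a time."""
    return UTF8_ESCAPED_CHAR.sub(_decode_char, url)


print(unquote_utf8_only(
    "https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99"))
```

Because ASCII escapes such as %26 never match the patterns, they survive unchanged. Stray bytes that do not form a complete character are also left encoded rather than turned into replacement characters.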


However, remember that not all URLs are encoded in UTF-8.

Some older websites still use other character sets, such as Windows-874 for Thai.

In such cases, "ฉัน" is encoded for that particular website as "%A9%D1%B9" instead of "%E0%B8%89%E0%B8%B1%E0%B8%99". If you decode it using urllib.unquote, you will get garbled text like "?ѹ" instead of "ฉัน", and that could break the link.
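
The difference between the two encodings can be reproduced with a short sketch. It assumes Python 3, where quote and unquote in urllib.parse accept an encoding argument, and it uses Python's "cp874" codec for Windows-874.

```python
from urllib.parse import quote, unquote

word = "ฉัน"

utf8_form = quote(word)                     # UTF-8 is the default encoding
cp874_form = quote(word, encoding="cp874")  # Windows-874

print(utf8_form)   # %E0%B8%89%E0%B8%B1%E0%B8%99
print(cp874_form)  # %A9%D1%B9

# Decoding the Windows-874 form as UTF-8 gives garbled text:
# a replacement character followed by the Cyrillic letter "ѹ".
print(unquote(cp874_form))

# Decoding with the matching character set recovers the word.
print(unquote(cp874_form, encoding="cp874"))
```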

So you have to be careful and check whether decoding the URL breaks the link. Make sure that the URL you're decoding is in UTF-8.
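
One way to approximate that check, as a sketch only: decode the escapes to raw bytes and see whether they form valid UTF-8. This assumes Python 3; escapes_are_utf8 is a hypothetical helper name and the example.com URL is made up. The test is a heuristic, since a byte sequence from a legacy character set can occasionally be valid UTF-8 by coincidence.

```python
from urllib.parse import unquote_to_bytes


def escapes_are_utf8(url):
    """Return True if the URL's percent-escapes decode to valid UTF-8."""
    try:
        unquote_to_bytes(url).decode("utf-8")  # strict: raises on bad bytes
    except UnicodeDecodeError:
        return False
    return True


# UTF-8 encoded query: reasonably safe to shorten.
print(escapes_are_utf8(
    "https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99"))

# Windows-874 encoded query (hypothetical URL): leave it alone.
print(escapes_are_utf8("http://example.com/?q=%A9%D1%B9"))
```

A URL that fails this check is best left in its original encoded form.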

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow