I found out that this question is almost a duplicate of this one, which has a good answer that refers to the htmlentities gem.
Incompleteness of `CGI.unescapeHTML`
Question
`CGI.unescapeHTML` seems to unescape characters that cannot be expressed literally in HTML, such as "<", which is escaped in HTML as "&lt;":
require "cgi"
CGI.unescapeHTML("&lt;") # => "<"
But when it comes to characters that can be expressed either literally or as an entity, it does not seem to unescape them. For example, "§" can also be expressed in HTML as "&sect;", and the latter is not unescaped by this method:
CGI.unescapeHTML("&sect;") # => "&sect;"
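The behaviour described above is easy to check directly. The following is a minimal sketch, assuming a UTF-8 string and only the standard `cgi` library; the variable names are illustrative. Numeric character references for the same character are included for comparison:

```ruby
require "cgi"

# Named entity outside the basic five: returned unchanged.
named = CGI.unescapeHTML("&sect;")

# Decimal and hexadecimal numeric references for the same character (U+00A7).
decimal = CGI.unescapeHTML("&#167;")
hex = CGI.unescapeHTML("&#xA7;")

puts named    # &sect;
puts decimal  # §
puts hex      # §
```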
- Is this a feature? Is there any way to completely unescape HTML strings, including these characters?
- I can find descriptions of `CGI.escapeHTML` and `CGI.unescapeHTML` in the RDoc for Ruby 1.9.3, but I cannot find them for the newest Ruby. What happened to them? Are they deprecated, or was there some change around these methods?
Answer
You will find the documentation for newer Ruby versions in `CGI::Util`. `CGI::Util` also defines a constant with special characters and their escaped values. That list is pretty short:
> CGI::Util::TABLE_FOR_ESCAPE_HTML__
{
  "'" => "&#39;",
  "&" => "&amp;",
  "\"" => "&quot;",
  "<" => "&lt;",
  ">" => "&gt;"
}
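To see that table in action, here is a small sketch that escapes exactly those five characters and then reverses the operation. It only uses `CGI.escapeHTML` and `CGI.unescapeHTML`; the variable names are illustrative.

```ruby
require "cgi"

# The five characters from the table, in the same order.
special = "'&\"<>"

escaped = CGI.escapeHTML(special)
puts escaped  # &#39;&amp;&quot;&lt;&gt;

# Characters outside the table pass through untouched.
puts CGI.escapeHTML("§")  # §

# Unescaping restores the original five characters.
round_trip = CGI.unescapeHTML(escaped)
puts round_trip == special  # true
```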
Looking into the implementation of `unescapeHTML`, you will find some more replacements, namely numeric character references, whose handling depends on the encoding of the string:
# File lib/cgi/util.rb, line 43
def unescapeHTML(string)
  return string unless string.include? '&'
  enc = string.encoding
  if enc != Encoding::UTF_8 && [Encoding::UTF_16BE, Encoding::UTF_16LE, Encoding::UTF_32BE, Encoding::UTF_32LE].include?(enc)
    return string.gsub(Regexp.new('&(apos|amp|quot|gt|lt|#[0-9]+|#x[0-9A-Fa-f]+);'.encode(enc))) do
      case $1.encode(Encoding::US_ASCII)
      when 'apos' then "'".encode(enc)
      when 'amp' then '&'.encode(enc)
      when 'quot' then '"'.encode(enc)
      when 'gt' then '>'.encode(enc)
      when 'lt' then '<'.encode(enc)
      when /\A#0*(\d+)\z/ then $1.to_i.chr(enc)
      when /\A#x([0-9a-f]+)\z/i then $1.hex.chr(enc)
      end
    end
  end
  asciicompat = Encoding.compatible?(string, "a")
  string.gsub(/&(apos|amp|quot|gt|lt|\#[0-9]+|\#[xX][0-9A-Fa-f]+);/) do
    match = $1.dup
    case match
    when 'apos' then "'"
    when 'amp' then '&'
    when 'quot' then '"'
    when 'gt' then '>'
    when 'lt' then '<'
    when /\A#0*(\d+)\z/
      n = $1.to_i
      if enc == Encoding::UTF_8 or
          enc == Encoding::ISO_8859_1 && n < 256 or
          asciicompat && n < 128
        n.chr(enc)
      else
        "&##{$1};"
      end
    when /\A#x([0-9a-f]+)\z/i
      n = $1.hex
      if enc == Encoding::UTF_8 or
          enc == Encoding::ISO_8859_1 && n < 256 or
          asciicompat && n < 128
        n.chr(enc)
      else
        "&#x#{$1};"
      end
    else
      "&#{match};"
    end
  end
end
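The encoding-dependent branches in that implementation can be illustrated with a short sketch. It feeds the same two numeric references through `CGI.unescapeHTML` in three encodings; the input string and variable names are my own, chosen so that one code point fits into ISO-8859-1 and the other does not.

```ruby
require "cgi"

# Section sign (167) fits into ISO-8859-1; the euro sign (8364) does not.
input = "&#167; &#8364;"

# UTF-8: every numeric reference is converted.
utf8 = CGI.unescapeHTML(input)

# ISO-8859-1: only code points below 256 are converted.
latin1 = CGI.unescapeHTML(input.encode(Encoding::ISO_8859_1))

# US-ASCII: only code points below 128 are converted, so both stay escaped.
ascii = CGI.unescapeHTML(input.encode(Encoding::US_ASCII))

puts utf8                            # § €
puts latin1.encode(Encoding::UTF_8)  # § &#8364;
puts ascii                           # &#167; &#8364;
```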
So: yes, it only unescapes a subset: the five named entities from the table above plus numeric character references.
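If you only need a handful of additional named entities, one possible workaround is to resolve them yourself before handing the string to `CGI.unescapeHTML`. The sketch below is not part of the `cgi` library: `EXTRA_ENTITIES` and `unescape_more` are hypothetical names, and the table is deliberately tiny. For complete coverage, the htmlentities gem mentioned in the question is likely the better choice.

```ruby
require "cgi"

# Hypothetical, deliberately incomplete table of extra named entities.
EXTRA_ENTITIES = {
  "sect" => "§",
  "copy" => "©",
  "nbsp" => "\u00A0"
}.freeze

# Hypothetical helper: resolve the extra named entities, then let
# CGI.unescapeHTML handle the basic five and numeric references.
def unescape_more(string)
  replaced = string.gsub(/&([a-zA-Z][a-zA-Z0-9]*);/) do
    # Unknown names (including amp, lt, ...) are left as they are here.
    EXTRA_ENTITIES.fetch($1, $&)
  end
  CGI.unescapeHTML(replaced)
end

puts unescape_more("&sect; 1 &lt; 2")  # § 1 < 2
puts unescape_more("&amp;sect;")       # &sect;
```

Resolving the extra entities first keeps the unescaping single-pass: "&amp;sect;" is consumed as "&amp;" followed by plain text, so it comes out as "&sect;" rather than "§".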