Question

CGI.unescapeHTML seems to unescape characters that cannot be expressed literally in HTML, such as "<", which is escaped in HTML as "&lt;":

require "cgi"
CGI.unescapeHTML("&lt;") # => "<"

But when it comes to characters that can be expressed either literally or as an entity, it does not seem to unescape them. For example, "§" can also be expressed in HTML as "&sect;", and the latter is not unescaped by this method:

CGI.unescapeHTML("&sect;") # => "&sect;"

  1. Is this a feature? Is there any way to completely unescape HTML strings including these characters?
  2. I can find descriptions of CGI.escapeHTML and CGI.unescapeHTML in the RDoc for Ruby 1.9.3, but not for the newest Ruby. What happened to them? Are they deprecated, or did something change around these methods?

Solution 2

I found out that this question is almost a duplicate of this one, which has a good answer that refers to the htmlentities gem.
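If pulling in a gem is not an option, the same idea can be sketched with the standard library alone. The helper name unescape_html_with_names and the EXTRA_ENTITIES table below are hypothetical, and the table is deliberately tiny; a real one would need the full list of HTML named entities, which is what a gem such as htmlentities provides. Strings are assumed to be UTF-8.

```ruby
require "cgi"

# Hypothetical, intentionally incomplete table of extra named entities.
EXTRA_ENTITIES = {
  "sect" => "\u00A7", # section sign
  "copy" => "\u00A9", # copyright sign
  "nbsp" => "\u00A0"  # no-break space
}.freeze

# Hypothetical helper: resolve the extra named entities first, then let
# CGI.unescapeHTML handle &amp;, &lt;, &gt;, &quot;, &apos; and numeric
# references. Doing it in this order keeps "&amp;sect;" from being
# decoded twice.
def unescape_html_with_names(string)
  replaced = string.gsub(/&([A-Za-z][A-Za-z0-9]*);/) do |match|
    EXTRA_ENTITIES.fetch($1, match) # unknown names are left as they are
  end
  CGI.unescapeHTML(replaced)
end

unescape_html_with_names("&sect; 1 &lt; 2") # => "§ 1 < 2"
unescape_html_with_names("&amp;sect;")      # => "&sect;"
```

Unknown names such as "&unknown;" pass through unchanged, which mirrors how CGI.unescapeHTML treats entities it does not know.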

OTHER TIPS

You will find the documentation for newer Ruby versions in CGI::Util. CGI::Util also defines a constant with special characters and their escaped values. That list is pretty short:

> CGI::Util::TABLE_FOR_ESCAPE_HTML__
{
    "'"  => "&#39;",
    "&"  => "&amp;",
    "\"" => "&quot;",
    "<"  => "&lt;",
    ">"  => "&gt;"
}
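
A quick way to see that table in action is to escape a string that contains all five characters. This is only an illustrative check using the standard library; the sample strings are made up, and UTF-8 is assumed.

```ruby
require "cgi"

# Only the five characters listed in the table are escaped.
CGI.escapeHTML(%q(<a href="x">Tom & Jerry's</a>))
# => "&lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&#39;s&lt;/a&gt;"

# "§" is not in the table, so it passes through unchanged.
CGI.escapeHTML("§") # => "§"
```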

Looking into the implementation of unescapeHTML, you will find a few more replacements (&apos; and numeric character references such as &#167; or &#xA7;) whose handling depends on the encoding of the string:

# File lib/cgi/util.rb, line 43
def unescapeHTML(string)
  return string unless string.include? '&'
  enc = string.encoding
  if enc != Encoding::UTF_8 && [Encoding::UTF_16BE, Encoding::UTF_16LE, Encoding::UTF_32BE, Encoding::UTF_32LE].include?(enc)
    return string.gsub(Regexp.new('&(apos|amp|quot|gt|lt|#[0-9]+|#x[0-9A-Fa-f]+);'.encode(enc))) do
      case $1.encode(Encoding::US_ASCII)
      when 'apos'                then "'".encode(enc)
      when 'amp'                 then '&'.encode(enc)
      when 'quot'                then '"'.encode(enc)
      when 'gt'                  then '>'.encode(enc)
      when 'lt'                  then '<'.encode(enc)
      when /\A#0*(\d+)\z/        then $1.to_i.chr(enc)
      when /\A#x([0-9a-f]+)\z/i  then $1.hex.chr(enc)
      end
    end
  end
  asciicompat = Encoding.compatible?(string, "a")
  string.gsub(/&(apos|amp|quot|gt|lt|\#[0-9]+|\#[xX][0-9A-Fa-f]+);/) do
    match = $1.dup
    case match
    when 'apos'                then "'"
    when 'amp'                 then '&'
    when 'quot'                then '"'
    when 'gt'                  then '>'
    when 'lt'                  then '<'
    when /\A#0*(\d+)\z/
      n = $1.to_i
      if enc == Encoding::UTF_8 or
        enc == Encoding::ISO_8859_1 && n < 256 or
        asciicompat && n < 128
        n.chr(enc)
      else
        "&##{$1};"
      end
    when /\A#x([0-9a-f]+)\z/i
      n = $1.hex
      if enc == Encoding::UTF_8 or
        enc == Encoding::ISO_8859_1 && n < 256 or
        asciicompat && n < 128
        n.chr(enc)
      else
        "&#x#{$1};"
      end
    else
      "&#{match};"
    end
  end
end

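So yes, it only unescapes a subset: the five named entities handled in the code above, plus numeric character references where the string's encoding can represent the character. Other named entities such as "&sect;" are left untouched.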
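
The implementation quoted above can be checked with a few calls. This is a small sketch that assumes UTF-8 string literals unless an encoding is set explicitly; the sample inputs are made up for illustration.

```ruby
require "cgi"

# Named entities from the short list are decoded.
CGI.unescapeHTML("&lt;")     # => "<"
CGI.unescapeHTML("&apos;")   # => "'"

# Numeric references (decimal and hex) are decoded for UTF-8 strings.
CGI.unescapeHTML("&#167;")   # => "§"
CGI.unescapeHTML("&#xA7;")   # => "§"

# Any other named entity is left as it is.
CGI.unescapeHTML("&sect;")   # => "&sect;"

# In a US-ASCII string, code points of 128 and above cannot be
# represented, so the numeric reference stays escaped.
CGI.unescapeHTML("&#167;".encode("US-ASCII")) # => "&#167;"
```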

Licensed under: CC-BY-SA with attribution