Question

CGI.unescapeHTML seems to unescape characters that cannot be expressed literally in HTML, such as "<", which is escaped in HTML as "&lt;":

require "cgi"
CGI.unescapeHTML("&lt;") # => "<"

But characters that can be written either literally or as an entity do not seem to be unescaped. For example, "§" can also be expressed in HTML as "&sect;", and the latter is not unescaped by this method:

CGI.unescapeHTML("&sect;") # => "&sect;"

  1. Is this a feature? Is there any way to completely unescape HTML strings including these characters?
  2. I can find documentation for CGI.escapeHTML and CGI.unescapeHTML in the RDoc for Ruby 1.9.3, but not for the newest Ruby. What happened to them? Are they deprecated, or did something change around these methods?

Solution

I found out that this question was almost a duplicate of this one, which has a good answer that recommends the htmlentities gem.

Other tips

You will find the documentation for newer Ruby versions in CGI::Util. CGI::Util also defines a constant with special characters and their escaped values. That list is pretty short:

> CGI::Util::TABLE_FOR_ESCAPE_HTML__
{
    "'"  => "&#39;",
    "&"  => "&amp;",
    "\"" => "&quot;",
    "<"  => "&lt;",
    ">"  => "&gt;"
}
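As a quick illustration of that table, the following sketch runs CGI.escapeHTML on a string containing all five characters plus "§". The sample string is made up for this example; the point is only that "§" has no entry in the table, so it passes through untouched.

```ruby
require "cgi"

# Made-up sample input containing every character from the table, plus "§".
sample = %q(<a href="x">Tom & 'Jerry' §</a>)

escaped = CGI.escapeHTML(sample)
# Only the five table characters are replaced; "§" stays literal.
puts escaped

# Unescaping reverses exactly those replacements.
round_trip = CGI.unescapeHTML(escaped)
puts round_trip == sample
```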

Looking into the implementation of unescapeHTML, you will find a few more replacements, some of which depend on the encoding of the string:

# File lib/cgi/util.rb, line 43
def unescapeHTML(string)
  return string unless string.include? '&'
  enc = string.encoding
  if enc != Encoding::UTF_8 && [Encoding::UTF_16BE, Encoding::UTF_16LE, Encoding::UTF_32BE, Encoding::UTF_32LE].include?(enc)
    return string.gsub(Regexp.new('&(apos|amp|quot|gt|lt|#[0-9]+|#x[0-9A-Fa-f]+);'.encode(enc))) do
      case $1.encode(Encoding::US_ASCII)
      when 'apos'                then "'".encode(enc)
      when 'amp'                 then '&'.encode(enc)
      when 'quot'                then '"'.encode(enc)
      when 'gt'                  then '>'.encode(enc)
      when 'lt'                  then '<'.encode(enc)
      when /\A#0*(\d+)\z/        then $1.to_i.chr(enc)
      when /\A#x([0-9a-f]+)\z/i  then $1.hex.chr(enc)
      end
    end
  end
  asciicompat = Encoding.compatible?(string, "a")
  string.gsub(/&(apos|amp|quot|gt|lt|\#[0-9]+|\#[xX][0-9A-Fa-f]+);/) do
    match = $1.dup
    case match
    when 'apos'                then "'"
    when 'amp'                 then '&'
    when 'quot'                then '"'
    when 'gt'                  then '>'
    when 'lt'                  then '<'
    when /\A#0*(\d+)\z/
      n = $1.to_i
      if enc == Encoding::UTF_8 or
        enc == Encoding::ISO_8859_1 && n < 256 or
        asciicompat && n < 128
        n.chr(enc)
      else
        "&##{$1};"
      end
    when /\A#x([0-9a-f]+)\z/i
      n = $1.hex
      if enc == Encoding::UTF_8 or
        enc == Encoding::ISO_8859_1 && n < 256 or
        asciicompat && n < 128
        n.chr(enc)
      else
        "&#x#{$1};"
      end
    else
      "&#{match};"
    end
  end
end
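The branches above can be exercised directly. This is a small sketch, assuming UTF-8 string literals (the Ruby default); the variable names are only for illustration. It shows that numeric character references for "§" are decoded in a UTF-8 string, that the named entity is not, and that a US-ASCII string keeps references above 127 as they are.

```ruby
require "cgi"

# Named entities from the short list are decoded.
named = CGI.unescapeHTML("&lt;&amp;&gt;")            # "<&>"

# Numeric references are decoded when the encoding can represent them.
decimal = CGI.unescapeHTML("&#167;")                 # "§" in a UTF-8 string
hex     = CGI.unescapeHTML("&#xA7;")                 # "§" in a UTF-8 string

# Any other named entity falls through to the else branch and is kept.
unknown = CGI.unescapeHTML("&sect;")                 # "&sect;"

# In a US-ASCII string only code points below 128 are decoded.
ascii_in  = "&#65;&#167;".dup.force_encoding(Encoding::US_ASCII)
ascii_out = CGI.unescapeHTML(ascii_in)               # "A&#167;"

puts [named, decimal, hex, unknown, ascii_out].inspect
```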

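So yes, it only unescapes a subset: the five named entities &apos;, &amp;, &quot;, &gt; and &lt;, plus numeric character references. Any other named entity, such as &sect;, is returned unchanged.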
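If pulling in the htmlentities gem is not an option, one possible workaround is to resolve a few extra named entities yourself before handing the string to CGI.unescapeHTML. The following is only a minimal sketch: EXTRA_ENTITIES and unescape_html_extended are hypothetical names, and the table is deliberately tiny rather than a complete list of HTML entities.

```ruby
require "cgi"

# Hypothetical, intentionally incomplete table of extra named entities.
EXTRA_ENTITIES = {
  "&sect;" => "§",
  "&copy;" => "©",
  "&nbsp;" => "\u00A0"
}.freeze

# Hypothetical helper: replaces the extra entities, then lets CGI handle
# the five built-in names and the numeric references.
def unescape_html_extended(string)
  expanded = string.gsub(/&[A-Za-z][A-Za-z0-9]*;/) do |entity|
    # Names missing from the table are passed through for CGI to decide.
    EXTRA_ENTITIES.fetch(entity, entity)
  end
  CGI.unescapeHTML(expanded)
end

puts unescape_html_extended("&sect; 1 &lt; 2")
```

The extra entities are resolved first on purpose. Running CGI.unescapeHTML first would turn "&amp;sect;" into "&sect;", which the second step would then decode again into "§", so escaped text would be unescaped twice.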

License: CC-BY-SA with attribution
Not affiliated with StackOverflow