Question

I'm using the PHP tidy extension to clean up PHP output. I know that tidy removes invalid tags and can't even handle the HTML5 doctype, but I'm using the <menu> tag, which used to be in the HTML specification. However, it gets changed to <ul> anyway.

Oddly enough, it didn't do so before. I changed the tidy config and it broke. Now I've turned off all the options that mess with tags, but that didn't help.

My script is quite verbose:

$tidy_config = array(
    'char-encoding' => 'utf8',
    'output-encoding' => 'utf8',
    'output-html' => true,
    'numeric-entities' => false,
    'ascii-chars' => false,
    'doctype' => 'loose',
    'clean' => false,
    'bare' => false,
    'fix-uri' => true,
    'indent' => true,
    'indent-spaces' => 2,
    'tab-size' => 2,
    'wrap-attributes' => true,
    'wrap' => 0,
    'indent-attributes' => true,
    'join-classes' => false,
    'join-styles' => false,
    'fix-bad-comments' => true,
    'fix-backslash' => true,
    'replace-color' => false,
    'wrap-asp' => false,
    'wrap-jste' => false,
    'wrap-php' => false,
    'wrap-sections' => false,
    'drop-proprietary-attributes' => false,
    'hide-comments' => false,
    'hide-endtags' => false,
    'drop-empty-paras' => true,
    'quote-ampersand' => true,
    'quote-marks' => true,
    'quote-nbsp' => true,
    'vertical-space' => true,
    'wrap-script-literals' => false,
    'tidy-mark' => true,
    'merge-divs' => false,
    'repeated-attributes' => 'keep-last',
    'break-before-br' => false
);

$tidy_config2 = array(
    'tidy-mark' => false,
    'vertical-space' => false,
    'hide-comments' => true,
    'indent-spaces' => 0,
    'tab-size' => 1,
    'wrap-attributes' => false,
    'numeric-entities' => true,
    'ascii-chars' => true,
    'hide-endtags' => true,
    'indent' => false
);
$tidy_config = array_merge($tidy_config, $tidy_config2);

$dtm = preg_match(self::doctypeMatch, $output, $dt);
$output = tidy_repair_string($output, $tidy_config, 'utf8');

// tidy screws up doctype --fixed
if($dtm)
    $output = preg_replace(self::doctypeMatch, $dt[0], $output);

$output = preg_replace('!>[\n\r]+<!', '><', $output);

unset($tidy_config);

return $output;

Note that the real script is more complicated than this (hence the two arrays). I've just cut out the unnecessary code.


Solution

DISCLAIMER:

I don't think my answer is very... neat. It's more of a hackish way to use HTMLTidy with HTML5 (which it currently does not support). To accomplish that I use regex to parse HTML, which, according to most, is the root of all evil or the Cthulhu way. If someone knows a better way, please enlighten us, since I don't feel very secure using regex to parse HTML. I've tested it with many examples, but I'm quite sure it's not bulletproof.

Intro

The menu tag was deprecated in HTML4 and XHTML1, being replaced by ul (unordered list). It was, however, redefined in HTML5 and hence is a valid tag according to the HTML5 specification. Since HTMLTidy does not support HTML5 and uses the XHTML or HTML specifications, as the OP pointed out, it replaces the then-deprecated menu tag with ul (or adds the ul tag), even when you specifically tell it not to.

My suggestion

This function replaces the menu tag with a custom tag prior to parsing it with tidy. It then replaces the custom tag with menu again.

function tidyHTML5($buffer)
{
    $buffer = str_replace('<menu', '<mytag', $buffer);
    $buffer = str_replace('menu>', 'mytag>', $buffer);
    $tidy = new tidy();
    $options = array(
            'hide-comments'         => true,
            'tidy-mark'             => false,
            'indent'                => true,
            'indent-spaces'         => 4,
            'new-blocklevel-tags'   => 'menu,mytag,article,header,footer,section,nav',
            'new-inline-tags'       => 'video,audio,canvas,ruby,rt,rp',
            'doctype'               => '<!DOCTYPE HTML>',
            //'sort-attributes'     => 'alpha',
            'vertical-space'        => false,
            'output-xhtml'          => true,
            'wrap'                  => 180,
            'wrap-attributes'       => false,
            'break-before-br'       => false,
            'char-encoding'         => 'utf8',
            'input-encoding'        => 'utf8',
            'output-encoding'       => 'utf8'
    );

    $tidy->parseString($buffer, $options, 'utf8');
    $tidy->cleanRepair();

    $html = '<!DOCTYPE HTML>' . PHP_EOL . $tidy->html();
    $html = str_replace('<html lang="en" xmlns="http://www.w3.org/1999/xhtml">', '<html>', $html);
    $html = str_replace('<html xmlns="http://www.w3.org/1999/xhtml">', '<html>', $html);

    // Hackish stuff starts here
    // We use regex to parse HTML, which is usually a bad idea,
    // but currently there is no alternative, since tidy is not MENU TAG friendly
    preg_match_all('/\<mytag(?:[^\>]*)\>\s*\<ul>/', $html, $matches);
    // $matches[0] holds the full matches: an opening <mytag ...> followed by the <ul> that tidy added
    foreach ($matches[0] as $m) {
        $mo = $m;
        $m = str_replace('mytag', 'menu', $m);
        $m = str_replace('<ul>', '', $m);
        $html = str_replace($mo, $m, $html);
    }
    $html = str_replace('<mytag', '<menu', $html);
    $html = str_replace('</ul></mytag>', '</menu>', $html);
    $html = str_replace('mytag>', 'menu>', $html);
    return $html;
}

TEST:

header("Content-type: text/plain");
echo tidyHTML5('<menu><li>Lorem ipsum</li></menu><div></div><menu   ><a href="#">lala</a><form id="jj"><button>btn</button></form></menu><menu style="color: white" id="nhecos"><li>blabla</li><li>sdfsdfsdf</li></menu>');

OUTPUT:

<!DOCTYPE HTML>
<html>
    <head>
        <title></title>
    </head>
    <body>
        <menu>

            <li>Lorem ipsum
            </li>
        </menu><menu style="color: white" id="nhecos">

            <li>blabla
            </li>
            <li>sdfsdfsdf
            </li>
        </menu>
    </body>
</html>
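
One caveat with the plain str_replace swap: it also rewrites any tag whose name merely starts with menu, such as <menuitem>. Below is a minimal sketch of a stricter swap. The helper name swapTagName is made up for this illustration and is not part of tidy or of the function above; it only renames tags whose full name matches.

```php
<?php
// Hypothetical helper (an assumption for illustration, not part of tidy):
// renames opening and closing tags whose name is exactly $from and leaves
// longer names such as <menuitem> untouched.
function swapTagName(string $html, string $from, string $to): string
{
    // Match "<menu" or "</menu" only when the name ends right there,
    // i.e. the next character is whitespace, "/" or ">".
    $pattern = '~<(/?)' . preg_quote($from, '~') . '(?=[\s/>])~i';
    return preg_replace($pattern, '<${1}' . $to, $html);
}

$in  = '<menu id="m"><menuitem>x</menuitem><li>y</li></menu>';
$tmp = swapTagName($in, 'menu', 'mytag');  // run this before handing the markup to tidy
$out = swapTagName($tmp, 'mytag', 'menu'); // run this on tidy's output
```

The regex still only looks at tag names, so it does not make the approach a real HTML parser; it just narrows what gets renamed.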

Other tips

According to the W3C tidy-html5 fork, the correct configuration for the new tags should be:

'new-blocklevel-tags' => 'article aside audio bdi canvas details dialog figcaption figure footer header hgroup main menu menuitem nav section source summary template track video',
'new-empty-tags' => 'command embed keygen source track wbr',
'new-inline-tags' => 'audio command datalist embed keygen mark menuitem meter output progress source time video wbr',

You'll notice that new-blocklevel-tags has a weird temp tag defined. It is supposed to stand in for the old, obsolete menu tag, which, as @tivie mentioned in his answer, you have to replace.
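
For reference, here is a minimal sketch of how these three options could be handed to the PHP tidy extension. The function names html5TagOptions and repairWithHtml5Tags are assumptions made for this example; the option values are copied from the configuration above, and show-body-only and tidy-mark are added only to keep the output short.

```php
<?php
// Option values copied from the configuration above.
// Both function names are hypothetical and exist only for this sketch.
function html5TagOptions(): array
{
    return [
        'new-blocklevel-tags' => 'article aside audio bdi canvas details dialog figcaption figure footer header hgroup main menu menuitem nav section source summary template track video',
        'new-empty-tags'      => 'command embed keygen source track wbr',
        'new-inline-tags'     => 'audio command datalist embed keygen mark menuitem meter output progress source time video wbr',
    ];
}

function repairWithHtml5Tags(string $html): ?string
{
    if (!extension_loaded('tidy')) {
        return null; // the tidy extension is not available
    }

    // Merge the tag lists with a couple of output options.
    $options = html5TagOptions() + ['show-body-only' => true, 'tidy-mark' => false];
    $result  = tidy_repair_string($html, $options, 'utf8');

    // Treat a failed or empty repair as "no result".
    return (is_string($result) && $result !== '') ? $result : null;
}
```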

Also, the tags audio and video appear in both new-blocklevel-tags and new-inline-tags, and that changes the way tidy outputs the HTML. With the configuration as it is:

<video src="movie.webm">
<track kind="subtitles" label="English" src="subtitles.vtt" srclang="en"></video>

If you drop video from the new-inline-tags:

<video src="movie.webm">
  <track kind="subtitles" label="English" src="subtitles.vtt" srclang="en">
</video>

Dropping video from new-blocklevel-tags yields:

<video src="movie.webm">
<track kind="subtitles" label="English" src="subtitles.vtt" srclang="en"></video>

Personally, I prefer audio and video to behave like block level tags, but that's up to you.
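
If you go the block-level route, the inline list just needs audio and video removed. A small sketch, assuming a hypothetical helper withoutTags (not a tidy function) that filters a space-separated tag list:

```php
<?php
// Hypothetical helper: removes the given tag names from a
// space-separated tidy tag list, keeping the remaining order.
function withoutTags(string $tagList, array $drop): string
{
    $tags = preg_split('/\s+/', trim($tagList), -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', array_diff($tags, $drop));
}

$inline = 'audio command datalist embed keygen mark menuitem meter output progress source time video wbr';

// Value to use for 'new-inline-tags' when audio and video should be block level only.
$inlineOnly = withoutTags($inline, ['audio', 'video']);
```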

Additionally, tags.c also defines command as CM_HEAD and embed as CM_IMG. Unfortunately, I have no idea what these stand for, and I don't think it's possible to emulate them.

One other thing: if you don't define new-empty-tags, you'll get weird outputs:

<video src="movie.webm">
  <track kind="subtitles" label="English" src="subtitles.vtt" srclang="en">
  </track>
</video>

Addendum

If you also want to support the WHATWG recommendation, you should add the tags:


Here's my complete approach:

function Tidy5($string, $options = null, $encoding = 'utf8')
{
   if (extension_loaded('tidy') === true)
   {
      $default = array
      (
         'anchor-as-name' => false,
         'break-before-br' => true,
         'char-encoding' => $encoding,
         'decorate-inferred-ul' => false,
         'doctype' => 'omit',
         'drop-empty-paras' => false,
         'drop-font-tags' => true,
         'drop-proprietary-attributes' => false,
         'force-output' => false,
         'hide-comments' => false,
         'indent' => true,
         'indent-attributes' => false,
         'indent-spaces' => 2,
         'input-encoding' => $encoding,
         'join-styles' => false,
         'logical-emphasis' => false,
         'merge-divs' => false,
         'merge-spans' => false,
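         // menutidy is the temporary stand-in that <menu> is renamed to further down, so tidy has to know it
         'new-blocklevel-tags' => 'article aside audio bdi canvas details dialog figcaption figure footer header hgroup main menu menuitem menutidy nav section source summary template track video',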
         'new-empty-tags' => 'command embed keygen source track wbr',
         'new-inline-tags' => 'audio command datalist embed keygen mark menuitem meter output progress source time video wbr',
         'newline' => 0,
         'numeric-entities' => false,
         'output-bom' => false,
         'output-encoding' => $encoding,
         'output-html' => true,
         'preserve-entities' => true,
         'quiet' => true,
         'quote-ampersand' => true,
         'quote-marks' => false,
         'repeated-attributes' => 1,
         'show-body-only' => true,
         'show-warnings' => false,
         'sort-attributes' => 1,
         'tab-size' => 4,
         'tidy-mark' => false,
         'vertical-space' => true,
         'wrap' => 0,
      );

      $doctype = $menu = null;

      if ((strncasecmp($string, '<!DOCTYPE', 9) === 0) || (strncasecmp($string, '<html', 5) === 0))
      {
         $doctype = '<!DOCTYPE html>'; $options['show-body-only'] = false;
      }

      $options = (is_array($options) === true) ? array_merge($default, $options) : $default;

      if (strpos($string, '<menu') !== false)
      {
         $menu = array
         (
            '<menu' => '<menutidy',
            '</menu' => '</menutidy',
         );
      }

      if (isset($menu) === true)
      {
         $string = str_replace(array_keys($menu), $menu, $string);
      }

      $string = tidy_repair_string($string, $options, $encoding);

      if (empty($string) !== true)
      {
         if (isset($menu) === true)
         {
            $string = str_replace($menu, array_keys($menu), $string);
         }

         if (isset($doctype) === true)
         {
            $string = $doctype . "\n" . $string;
         }

         return $string;
      }
   }

   return false;
}
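
Note that the doctype check in Tidy5 compares the very first characters of the input, so leading whitespace or a byte order mark makes a full document look like a fragment. A small sketch of a more tolerant check, assuming a hypothetical helper looksLikeFullDocument that is not part of the function above:

```php
<?php
// Hypothetical helper: reports whether the markup starts with a doctype
// or an <html> tag, ignoring a UTF-8 BOM and leading whitespace.
function looksLikeFullDocument(string $html): bool
{
    $html = ltrim(preg_replace('/^\xEF\xBB\xBF/', '', $html));

    return strncasecmp($html, '<!DOCTYPE', 9) === 0
        || strncasecmp($html, '<html', 5) === 0;
}
```

The result could replace the two strncasecmp calls in the condition that sets $doctype.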
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow