Question

I have a WYSIWYG editor on a site. The problem is that users are copying and pasting a lot of data into it, leaving many unclosed and improperly formatted div tags that break the site layout.

Is there an easy way to strip all occurrences of <div> and </div>?

str_replace won't work because some of the divs have styling and other attributes in them, so it would need to account for <div style="some styling">, <div align="center">, etc.

I'm guessing this could be done with a regular expression, but I'm a total beginner when it comes to those.

Solution

No. You do NOT ever parse/manipulate HTML with regular expressions.

Regexes cannot be bargained with. They can't be reasoned with. They don't understand HTML, they don't understand XML. And they absolutely will NOT stop until your DOM tree is dead.

Use HTML Purifier and/or DOM to manipulate the tree.
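
As a minimal sketch of the DOM approach, assuming PHP's bundled DOM extension (DOMDocument) is available: the hypothetical helper `unwrap_divs()` below parses the pasted fragment, lets the parser repair unclosed tags, and then replaces each div with its own children. The function name, the sample input, and the UTF-8 prefix workaround are illustrative assumptions, not part of the original answer.

```php
<?php
// Hypothetical helper: removes every <div> element but keeps its content.
function unwrap_divs(string $html): string {
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them;
    // pasted markup is expected to be malformed.
    $previous = libxml_use_internal_errors(true);

    // The explicit <body> wrapper gives us a known container to serialize
    // later. The XML prefix is a common workaround to hint UTF-8 to libxml.
    $doc->loadHTML(
        '<?xml encoding="UTF-8"><html><body>' . $html . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list, so snapshot it before
    // modifying the tree.
    $divs = iterator_to_array($doc->getElementsByTagName('div'));

    foreach ($divs as $div) {
        // Move each child in front of the div, then drop the empty div.
        while ($div->firstChild !== null) {
            $div->parentNode->insertBefore($div->firstChild, $div);
        }
        $div->parentNode->removeChild($div);
    }

    // Serialize only the children of <body>, i.e. the cleaned fragment.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out  = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Usage: the outer div is never closed, as in the pasted content described.
echo unwrap_divs('<div style="color:red"><p>Hello</p><div align="center">World');
```

Because the parser builds a tree first, attributes, nesting, and missing closing tags need no special handling here. The parser may still normalize other markup in the fragment, so the result is worth checking against real editor output.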

Other tips

It is better to use DOM for HTML parsing, but if you have no choice but to use a regex, you can do it like this:

    $patterns = array();
    // Opening tags with any attributes; \b keeps names like <divider> intact.
    $patterns[0] = '/<div\b[^>]*>/i';
    // Closing tags.
    $patterns[1] = '/<\/div>/i';
    $replacements = array();
    $replacements[0] = '';
    $replacements[1] = '';
    echo preg_replace($patterns, $replacements, $html);
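
The two patterns can also be folded into one. The sketch below is only an illustration: `strip_div_tags_regex()` is a hypothetical name, and it assumes no attribute value contains a literal `>`, which this pattern cannot handle.

```php
<?php
// Hypothetical helper: strips opening and closing div tags in one pass.
function strip_div_tags_regex(string $html): string {
    // </?   optional slash, so both <div ...> and </div> match
    // div\b the tag name as a whole word (leaves <divider> alone)
    // [^>]* any attributes up to the closing bracket
    // i     case-insensitive, so <DIV> is matched too
    $result = preg_replace('~</?div\b[^>]*>~i', '', $html);

    // preg_replace() returns null on failure; fall back to the input.
    return $result ?? $html;
}

echo strip_div_tags_regex('<div style="some styling"><DIV align="center">text</div>');
// prints "text"
```

This only deletes the tags themselves. It does nothing about other broken markup in the pasted content, which is the main reason the DOM route is preferred.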

Here is a simplified example of how you could do this with PHP:

    <?php
    /**
     * Strips every <div ...> and </div> tag from $text, keeping the content between them.
     */
    function strip_divs(&$text, $id = 'html') {
      $replacements = array();
      worker($text, $replacements, $id);

      foreach ($replacements as $key => $val) {
        $text = mb_str_replace($key, $val, $text);
      }

      return $text;
    }

    function worker(&$body, &$replacements, $id) {
      static $call_count;
      if (empty($call_count)) {
        $call_count = array();
      }
      if (empty($call_count[$id])) {
        $call_count[$id] = 0;
      }

      if (mb_strpos($body, '</div>') !== FALSE) {
        $body = mb_str_replace('</div>', '', $body);
      }

      if (mb_strpos($body, '<di') !== FALSE) {
        $call_count[$id] ++;
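        // Rebuilds the full opening tag, attributes included. Searching from
        // '<di' keeps trim() in xml_get() from eating the space before them.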
        $rm               = '<di' . xml_get($body, '<di', '>') . '>';
        // Builds the replacements HTML
        $replacement_html = '';

        $next_id                       = count($replacements);
        $replacement_id                = "[[div-$next_id]]";
        $replacements[$replacement_id] = $replacement_html;

        $body = mb_str_replace($rm, $replacement_id, $body);

        if (mb_strpos($body, '<di') !== FALSE && $call_count[$id] < 200) {
          worker($body, $replacements, $id);
        }
      }
    }


    /**
     * Returns text by specifying a start and end point
     *
     * @param str $str
     *   The text to search
     * @param str $start
     *   The beginning identifier
     * @param str $end
     *   The ending identifier
     */
    function xml_get($str, $start, $end) {
      $str = "|" . $str . "|";
      $len = mb_strlen($start);
      if (mb_strpos($str, $start) > 0) {
        $int_start = mb_strpos($str, $start) + $len;
        $temp      = right($str, (mb_strlen($str) - $int_start));
        $int_end   = mb_strpos($temp, $end);
        $return    = trim(left($temp, $int_end));
        return $return;
      }
      else {
        return FALSE;
      }
    }

    function right($str, $count) {
      return mb_substr($str, ($count * -1));
    }

    function left($str, $count) {
      return mb_substr($str, 0, $count);
    }

    /**
     * Multibyte str replace
     */
    if (!function_exists('mb_str_replace')) {

      function mb_str_replace($search, $replace, $subject, &$count = 0) {
        if (!is_array($subject)) {
          $searches     = is_array($search) ? array_values($search) : array($search);
          $replacements = is_array($replace) ? array_values($replace) : array($replace);
          $replacements = array_pad($replacements, count($searches), '');
          foreach ($searches as $key => $search) {
            $parts   = mb_split(preg_quote($search), $subject);
            $count += count($parts) - 1;
            $subject = implode($replacements[$key], $parts);
          }
        }
        else {
          foreach ($subject as $key => $value) {
            $subject[$key] = mb_str_replace($search, $replace, $value, $count);
          }
        }
        return $subject;
      }

    }

    $html = <<<HTML
    <table>
        <tbody>
            <tr>
                <td class="votecell">
                    <div class="vote">
                        <input type="hidden" name="_id_" value="9607101">
                        <a class="vote-up-off" title="This question shows research effort; it is useful and clear">up vote</a>
                        <span itemprop="upvoteCount" class="vote-count-post ">0</span>
                        <a class="vote-down-off" title="This question does not show any research effort; it is unclear or not useful">down vote</a>
                        <a class="star-off" href="#">favorite</a>
                        <div class="favoritecount"><b></b></div>
                    </div>
                </td>
                <td class="postcell">
                    <div>
                        <div class="post-text" itemprop="text">
                            <p>I have a wysiwyg on a site. The problem is that the users are copy pasting a lot of data in to it leaving a lot of unclosed and improperly formatted div tags that are breaking the site layout. </p>
                            <p>Is there an easy an easy way to strip all occurrences of <code>&lt;div&gt;</code> and <code>&lt;/div&gt;</code>?</p>
                            <p>str_replace won't work because some of the divs have styling and other things in them so it would need to account for <code>&lt;div style="some styling"&gt; &lt;div align="center"&gt;</code> etc</p>
                            <p>I'm guessing this could be done with a regular expression but I am total a total beginner when it comes to those. </p>
                            <p>Thanks a lot,
                                Martin
                            </p>
                        </div>
                        <div class="post-taglist">
                            <a href="/questions/tagged/php" class="post-tag js-gps-track" title="show questions tagged 'php'" rel="tag">php</a> <a href="/questions/tagged/regex" class="post-tag js-gps-track" title="show questions tagged 'regex'" rel="tag">regex</a> <a href="/questions/tagged/replace" class="post-tag js-gps-track" title="show questions tagged 'replace'" rel="tag">replace</a> <a href="/questions/tagged/str-replace" class="post-tag js-gps-track" title="" rel="tag">str-replace</a> <a href="/questions/tagged/strip-tags" class="post-tag js-gps-track" title="show questions tagged 'strip-tags'" rel="tag">strip-tags</a>
                        </div>
                        <table class="fw">
                            <tbody>
                                <tr>
                                    <td class="vt">
                                        <div class="post-menu"><a href="/q/9607101" title="short permalink to this question" class="short-link" id="link-post-9607101">share</a><span class="lsep">|</span><a href="/posts/9607101/edit" class="suggest-edit-post" title="">improve this question</a></div>
                                    </td>
                                    <td align="right" class="post-signature">
                                        <div class="user-info ">
                                            <div class="user-action-time">
                                                <a href="/posts/9607101/revisions" title="show all edits to this post">edited <span title="2012-03-07 18:32:29Z" class="relativetime">Mar 7 '12 at 18:32</span></a>
                                            </div>
                                            <div class="user-gravatar32">
                                            </div>
                                            <div class="user-details">
                                                <div class="-flair">
                                                </div>
                                            </div>
                                        </div>
                                    </td>
                                    <td class="post-signature owner">
                                        <div class="user-info ">
                                            <div class="user-action-time">
                                                asked <span title="2012-03-07 18:31:11Z" class="relativetime">Mar 7 '12 at 18:31</span>
                                            </div>
                                            <div class="user-gravatar32">
                                                <a href="/users/702826/martin-hunt">
                                                    <div class="gravatar-wrapper-32"><img src="https://www.gravatar.com/avatar/a578c3eae229c86dbe46d4b1603e071b?s=32&amp;d=identicon&amp;r=PG" alt="" width="32" height="32"></div>
                                                </a>
                                            </div>
                                            <div class="user-details">
                                                <a href="/users/702826/martin-hunt">Martin Hunt</a>
                                                <div class="-flair">
                                                    <span class="reputation-score" title="reputation score " dir="ltr">313</span><span title="7 silver badges"><span class="badge2"></span><span class="badgecount">7</span></span><span title="20 bronze badges"><span class="badge3"></span><span class="badgecount">20</span></span>
                                                </div>
                                            </div>
                                        </div>
                                    </td>
                                </tr>
                            </tbody>
                        </table>
                    </div>
                </td>
            </tr>
            <tr>
                <td class="votecell"></td>
                <td>
                    <div id="comments-9607101" class="comments ">
                        <table>
                            <tbody data-remaining-comments-count="0" data-canpost="false" data-cansee="true" data-comments-unavailable="false" data-addlink-disabled="true">
                                <tr id="comment-12187969" class="comment ">
                                    <td class="comment-actions">
                                        <table>
                                            <tbody>
                                                <tr>
                                                    <td class=" comment-score">
                                                        <span title="number of 'useful comment' votes received" class="cool">1</span>
                                                    </td>
                                                    <td>
                                                        &nbsp;
                                                    </td>
                                                </tr>
                                            </tbody>
                                        </table>
                                    </td>
                                    <td class="comment-text">
                                        <div style="display: block;" class="comment-body">
                                            <span class="comment-copy">So you need to remove all the div tags but not the content between the div. Am I right?</span>
                                            –&nbsp;<a href="/users/500725/siva-charan" title="14,075 reputation" class="comment-user">Siva Charan</a>
                                            <span class="comment-date" dir="ltr"><a class="comment-link" href="#comment12187969_9607101"><span title="2012-03-07 18:34:11Z" class="relativetime-clean">Mar 7 '12 at 18:34</span></a></span>
                                        </div>
                                    </td>
                                </tr>
                                <tr id="comment-12189778" class="comment ">
                                    <td>
                                        <table>
                                            <tbody>
                                                <tr>
                                                    <td class=" comment-score">
                                                        &nbsp;&nbsp;
                                                    </td>
                                                    <td>
                                                        &nbsp;
                                                    </td>
                                                </tr>
                                            </tbody>
                                        </table>
                                    </td>
                                    <td class="comment-text">
                                        <div style="display: block;" class="comment-body">
                                            <span class="comment-copy"><a href="http://stackoverflow.com/a/4667535/208809">Replace the XPath with <code>//div[not[@*]]</code></a> to remove all div elements (incl. content) without attributes.</span>
                                            –&nbsp;<a href="/users/208809/gordon" title="225,421 reputation" class="comment-user">Gordon</a>
                                            <span class="comment-date" dir="ltr"><a class="comment-link" href="#comment12189778_9607101"><span title="2012-03-07 19:58:21Z" class="relativetime-clean">Mar 7 '12 at 19:58</span></a></span>
                                            <span class="edited-yes" title="this comment was edited 2 times"></span>
                                        </div>
                                    </td>
                                </tr>
                            </tbody>
                        </table>
                    </div>
                    <div id="comments-link-9607101" data-rep="50" data-anon="true">
                        <a class="js-add-link comments-link disabled-link " title="Use comments to ask for more information or suggest improvements. Avoid answering questions in comments.">add a comment</a><span class="js-link-separator dno">&nbsp;|&nbsp;</span>
                        <a class="js-show-link comments-link dno" title="expand to show all comments on this post" href="#" onclick=""></a>
                    </div>
                </td>
            </tr>
        </tbody>
    </table>
    HTML;

    echo strip_divs($html);

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow