Using PHP substr() and strip_tags() while retaining formatting and without breaking HTML

https://stackoverflow.com/questions/2398725

25-09-2019

Question

I have various HTML strings to cut to 100 characters (of the stripped content, not the original) without removing tags and without breaking the HTML.

Original HTML string (288 characters):

$content = "<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div over <div class='nestedDivClass'>there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the air
<span>everywhere</span>, it's a HTML taggy kind of day.</strong></div>";

Standard trim: trimming to 100 characters breaks the HTML, and the stripped content comes to only ~40 characters:

$content = substr($content, 0, 100)."..."; /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div ove... */
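The gap between markup length and visible length can be measured directly. The snippet below is only an illustrative check and is not part of the original question: it cuts the sample string at 100 characters and compares that with the length of the text left after strip_tags().

```php
<?php
// Same sample string as in the question.
$content = "<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div over <div class='nestedDivClass'>there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the air
<span>everywhere</span>, it's a HTML taggy kind of day.</strong></div>";

$cut = substr($content, 0, 100);

echo strlen($cut), "\n";             // length of the markup that was kept
echo strlen(strip_tags($cut)), "\n"; // length of the text a reader would actually see
```

The second number is well under half of the first, which is the "~40 characters" problem described above.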

Stripped HTML: outputs the correct character count but obviously loses the formatting:

$content = substr(strip_tags($content), 0, 100)."..."; /* output:
With a span over here and a nested div over there and a lot of other nested
texts and tags in the ai... */

Partial solution: using HTML Tidy or HTML Purifier to close the open tags yields clean HTML, but the cut is still 100 characters of markup, not 100 characters of displayed content.

$content = substr($content, 0, 100)."...";
$tidy = new tidy; $tidy->parseString($content); $tidy->cleanRepair(); /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div ove</div></div>... */

Challenge: to output clean HTML and n characters (excluding the character count of the HTML elements):

$content = cutHTML($content, 100); /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div over <div class='nestedDivClass'>there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the
ai</strong></div>... */

Solution

Not amazing, but it works.

function html_cut($text, $max_length)
{
    $tags   = array();
    $result = "";

    $is_open   = false;
    $grab_open = false;
    $is_close  = false;
    $in_double_quotes = false;
    $in_single_quotes = false;
    $tag = "";

    $i = 0;
    $stripped = 0;

    $stripped_text = strip_tags($text);

    while ($i < strlen($text) && $stripped < strlen($stripped_text) && $stripped < $max_length)
    {
        $symbol  = $text[$i]; // square brackets: the {} string offset syntax was removed in PHP 8
        $result .= $symbol;

        switch ($symbol)
        {
           case '<':
                $is_open   = true;
                $grab_open = true;
                break;

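            case '"':
                // Track quotes only inside a tag; in text they count as visible characters
                if ($is_open)
                    $in_double_quotes = !$in_double_quotes;
                else if (!$is_close)
                    $stripped++;

                break;

            case "'":
                // Same for single quotes, so the apostrophe in "it's" is not taken for an attribute quote
                if ($is_open)
                    $in_single_quotes = !$in_single_quotes;
                else if (!$is_close)
                    $stripped++;

                break;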

            case '/':
                if ($is_open && !$in_double_quotes && !$in_single_quotes)
                {
                    $is_close  = true;
                    $is_open   = false;
                    $grab_open = false;
                }

                break;

            case ' ':
                if ($is_open)
                    $grab_open = false;
                else
                    $stripped++;

                break;

            case '>':
                if ($is_open)
                {
                    $is_open   = false;
                    $grab_open = false;
                    array_push($tags, $tag);
                    $tag = "";
                }
                else if ($is_close)
                {
                    $is_close = false;
                    array_pop($tags);
                    $tag = "";
                }

                break;

            default:
                if ($grab_open || $is_close)
                    $tag .= $symbol;

                if (!$is_open && !$is_close)
                    $stripped++;
        }

        $i++;
    }

    while ($tags)
        $result .= "</".array_pop($tags).">";

    return $result;
}

Usage example:

$content = html_cut($content, 100);

Other tips

I'm not claiming to have invented this, but there is a very complete Text::truncate() method in CakePHP which does what you want:

function truncate($text, $length = 100, $ending = '...', $exact = true, $considerHtml = false) {
    if (is_array($ending)) {
        extract($ending);
    }
    if ($considerHtml) {
        if (mb_strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
            return $text;
        }
        $totalLength = mb_strlen($ending);
        $openTags = array();
        $truncate = '';
        preg_match_all('/(<\/?([\w+]+)[^>]*>)?([^<>]*)/', $text, $tags, PREG_SET_ORDER);
        foreach ($tags as $tag) {
            if (!preg_match('/img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param/s', $tag[2])) {
                if (preg_match('/<[\w]+[^>]*>/s', $tag[0])) {
                    array_unshift($openTags, $tag[2]);
                } else if (preg_match('/<\/([\w]+)[^>]*>/s', $tag[0], $closeTag)) {
                    $pos = array_search($closeTag[1], $openTags);
                    if ($pos !== false) {
                        array_splice($openTags, $pos, 1);
                    }
                }
            }
            $truncate .= $tag[1];

            $contentLength = mb_strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $tag[3]));
            if ($contentLength + $totalLength > $length) {
                $left = $length - $totalLength;
                $entitiesLength = 0;
                if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $tag[3], $entities, PREG_OFFSET_CAPTURE)) {
                    foreach ($entities[0] as $entity) {
                        if ($entity[1] + 1 - $entitiesLength <= $left) {
                            $left--;
                            $entitiesLength += mb_strlen($entity[0]);
                        } else {
                            break;
                        }
                    }
                }

                $truncate .= mb_substr($tag[3], 0 , $left + $entitiesLength);
                break;
            } else {
                $truncate .= $tag[3];
                $totalLength += $contentLength;
            }
            if ($totalLength >= $length) {
                break;
            }
        }

    } else {
        if (mb_strlen($text) <= $length) {
            return $text;
        } else {
            $truncate = mb_substr($text, 0, $length - strlen($ending));
        }
    }
    if (!$exact) {
        $spacepos = mb_strrpos($truncate, ' ');
        if (isset($spacepos)) {
            if ($considerHtml) {
                $bits = mb_substr($truncate, $spacepos);
                preg_match_all('/<\/([a-z]+)>/', $bits, $droppedTags, PREG_SET_ORDER);
                if (!empty($droppedTags)) {
                    foreach ($droppedTags as $closingTag) {
                        if (!in_array($closingTag[1], $openTags)) {
                            array_unshift($openTags, $closingTag[1]);
                        }
                    }
                }
            }
            $truncate = mb_substr($truncate, 0, $spacepos);
        }
    }

    $truncate .= $ending;

    if ($considerHtml) {
        foreach ($openTags as $tag) {
            $truncate .= '</'.$tag.'>';
        }
    }

    return $truncate;
}

Use PHP's DOMDocument class to normalize an HTML fragment:

$dom= new DOMDocument();
$dom->loadHTML('<div><p>Hello World');      
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
echo($dom->saveXml($body->item(0)));

This question is similar to an earlier question, and I've copied and pasted one solution here. If the HTML is submitted by users, you'll also need to filter out potential JavaScript attack vectors such as onmouseover="do_something_evil()" or <a href="javascript:more_evil();">...</a>. Tools like HTML Purifier were designed to catch and solve these problems and are far more comprehensive than any code that I could post.

Use an HTML parser and stop after 100 characters of text.
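A minimal sketch of that idea, assuming the dom and mbstring extensions and UTF-8 input. The function name truncate_html_text() is hypothetical and does not come from any answer here. It walks the text nodes in document order, shortens the node in which the limit falls, and removes everything after it, so the parser takes care of closing the tags.

```php
<?php
// Hypothetical helper, not taken from any answer on this page.
// Assumes the dom and mbstring extensions and UTF-8 input.
function truncate_html_text(string $html, int $limit): string
{
    $dom = new DOMDocument();
    // The XML declaration is a common trick to make libxml read the input as UTF-8;
    // the wrapper <div> gives the fragment a single root element.
    @$dom->loadHTML('<?xml encoding="utf-8" ?><div>' . $html . '</div>');
    $root  = $dom->getElementsByTagName('div')->item(0);
    $xpath = new DOMXPath($dom);

    $remaining = $limit;
    // Snapshot the text nodes in document order before changing the tree.
    foreach (iterator_to_array($xpath->query('.//text()', $root)) as $text) {
        $length = mb_strlen($text->nodeValue, 'UTF-8');
        if ($length < $remaining) {
            $remaining -= $length;   // the whole node fits, keep going
            continue;
        }
        // The limit falls inside (or at the end of) this node: shorten it,
        // then drop every node that follows it in document order.
        $text->nodeValue = mb_substr($text->nodeValue, 0, max($remaining, 0), 'UTF-8');
        foreach (iterator_to_array($xpath->query('following::node()', $text)) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
        break;
    }

    // Serialize only the children of the wrapper, so the wrapper itself is not returned.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo truncate_html_text('<p>Hello <b>brave new</b> world</p>', 9), "\n";
```

Whitespace between tags counts toward the limit in this sketch, and no ellipsis is appended. Both are design choices that could reasonably go the other way.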

You should use HTML Tidy. You cut the string and then run Tidy to close the tags.

(Credit where credit is due)

I made another function to do it; it supports UTF-8:

/**
 * Limit a string without breaking HTML tags.
 * Supports UTF8
 * 
 * @param string $value
 * @param int $limit Default 100
 */
function str_limit_html($value, $limit = 100)
{

    if (mb_strwidth($value, 'UTF-8') <= $limit) {
        return $value;
    }

    // Strip text with HTML tags, sum html len tags too.
    // Is there another way to do it?
    do {
        $len          = mb_strwidth($value, 'UTF-8');
        $len_stripped = mb_strwidth(strip_tags($value), 'UTF-8');
        $len_tags     = $len - $len_stripped;

        $value = mb_strimwidth($value, 0, $limit + $len_tags, '', 'UTF-8');
    } while ($len_stripped > $limit);

    // Load as HTML ignoring errors
    $dom = new DOMDocument();
    @$dom->loadHTML('<?xml encoding="utf-8" ?>'.$value, LIBXML_HTML_NODEFDTD);

    // Fix the html errors
    $value = $dom->saveHtml($dom->getElementsByTagName('body')->item(0));

    // Remove body tag
    $value = mb_strimwidth($value, 6, mb_strwidth($value, 'UTF-8') - 13, '', 'UTF-8'); // <body> and </body>
    // Remove empty tags
    return preg_replace('/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>/', '', $value);
}

See demo.

I recommend using html_entity_decode at the start of the function, so that it preserves UTF-8 characters:

 $value = html_entity_decode($value);
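One reading of this tip is that an entity such as &eacute; should count as the single character a reader sees, not as eight bytes of markup. The snippet below is an illustrative sketch with a made-up sample string, assuming the mbstring extension. Note that decoding also turns &lt; into a literal <, which changes the markup if the result is parsed as HTML afterwards.

```php
<?php
// Made-up sample input, for illustration only.
$encoded = 'caf&eacute; &amp; cr&egrave;me';
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo strlen($encoded), "\n";             // bytes in the entity-encoded markup
echo mb_strlen($decoded, 'UTF-8'), "\n"; // characters a reader actually sees
```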

Regardless of the 100-character count issues you state at the beginning, you indicate the following in the challenge:

output the character count of strip_tags (the number of characters in the actual displayed text of the HTML)
retain the HTML formatting
close any unfinished HTML tag

Here is my proposal: basically, I parse through each character, counting as I go. I make sure NOT to count any characters inside an HTML tag. I also check at the end to make sure I am not in the middle of a word when I stop. Once I stop, I backtrack to the first available SPACE or > as a stopping point.

$position = 0;
$length = strlen($content);
$data = array();

// process the content, putting each 100-character section into an array
while ($position < $length)
{
    $next_position = get_position($content, $position, 100);
    // the third argument of substr() is a length, not an end offset
    $data[] = substr($content, $position, $next_position - $position);
    $position = $next_position;
}

// show the array
print_r($data);

function get_position($content, $position, $chars = 100)
{
    $start = $position;
    $end   = strlen($content);
    $count = 0;
    // count up to $chars characters, skipping over all of the HTML
    while ($count < $chars && $position < $end) {
        if ($content[$position] == '<') {
            // jump past the closing '>' (or to the end if the tag is unterminated)
            $close = strpos($content, '>', $position);
            $position = ($close === false) ? $end : $close + 1;
            continue;
        }
        $count++;
        $position++;
    }
    // the rest of the content fits in this section
    if ($position >= $end) {
        return $end;
    }
    // find out where there is a logical break before the limit
    $data = substr($content, 0, $position);

    $space = strrpos($data, " ");
    $tag = strrpos($data, ">");

    // the logical break is the last space, or just after the last tag
    $break = max((int) $space, $tag === false ? 0 : $tag + 1);

    // never return a position at or before the start, or the caller would loop forever
    return $break > $start ? $break : $position;
}

This will also count line breaks and similar characters. Since they take up space, I have not removed them.

Here is a function I'm using in one of my projects. It's based on DOMDocument, works with HTML5, and is about 2x faster than the other solutions I've tried (at least on my machine: 0.22 ms vs 0.43 ms using html_cut($text, $max_length) from the top answer, on a string with 500 characters of text and a limit of 400).

function cut_html ($html, $limit) {
    $dom = new DOMDocument();
    $dom->loadHTML(mb_convert_encoding("<div>{$html}</div>", "HTML-ENTITIES", "UTF-8"), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    cut_html_recursive($dom->documentElement, $limit);
    return substr($dom->saveHTML($dom->documentElement), 5, -6);
}

function cut_html_recursive ($element, $limit) {
    if($limit > 0) {
        if($element->nodeType == 3) {
            $limit -= strlen($element->nodeValue);
            if($limit < 0) {
                $element->nodeValue = substr($element->nodeValue, 0, strlen($element->nodeValue) + $limit);
            }
        }
        else {
            for($i = 0; $i < $element->childNodes->length; $i++) {
                if($limit > 0) {
                    $limit = cut_html_recursive($element->childNodes->item($i), $limit);
                }
                else {
                    $element->removeChild($element->childNodes->item($i));
                    $i--;
                }
            }
        }
    }
    return $limit;
}

Here's my try at the cutter. Maybe you guys can catch some bugs. The problem I found with the other parsers is that they don't close tags properly and they cut in the middle of a word (blah).

function cutHTML($string, $length, $patternsReplace = false) {
    $i = 0;
    $count = 0;
    $isParagraphCut = false;
    $htmlOpen = false;
    $openTag = false;
    $tagsStack = array();

    while ($i < strlen($string)) {
        $char = substr($string, $i, 1);
        if ($count >= $length) {
            $isParagraphCut = true;
            break;
        }

        if ($htmlOpen) {
            if ($char === ">") {
                $htmlOpen = false;
            }
        } else {
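            if ($char === "<") {
                // Read the tag name that follows "<" or "</", starting at offset $i
                if (preg_match('/\G<(\/?)([a-zA-Z][a-zA-Z0-9]*)/', $string, $m, 0, $i)) {
                    if ($m[1] === '/') {
                        // closing tag: drop the innermost open tag
                        array_shift($tagsStack);
                    } elseif (!in_array(strtolower($m[2]), array('img', 'br'))) {
                        // keep the innermost tag first, so the tags are closed in the right order later
                        array_unshift($tagsStack, $m[2]);
                    }
                }
                $htmlOpen = true;
            }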
        }

        if (!$htmlOpen && $char != ">") {
            $count++;
        }

        $i++;
    }

    if ($isParagraphCut) {
        $j = $i;
        while ($j > 0) {
            $char = substr($string, $j, 1);
            if ($char === " " || $char === ";" || $char === "." || $char === "," || $char === "<" || $char === "(" || $char === "[") {
                break;
            } else if ($char === ">") {
                $j++;
                break;
            }
            $j--;
        }
        $string = substr($string, 0, $j);
        foreach($tagsStack as $tag){
            $tag = strtolower($tag);
            if($tag !== "img" && $tag !== "br"){
                $string .= "</$tag>";
            }
        }
        $string .= "...";
    }

    if ($patternsReplace) {
        foreach ($patternsReplace as $value) {
            if (isset($value['pattern']) && isset($value["replace"])) {
                $string = preg_replace($value["pattern"], $value["replace"], $string);
            }
        }
    }
    return $string;
}

Try this function

// trim the string function
function trim_word($text, $length, $startPoint=0, $allowedTags=""){
    $text = html_entity_decode(htmlspecialchars_decode($text));
    $text = strip_tags($text, $allowedTags);
    return $text = substr($text, $startPoint, $length);
}

echo trim_word("<h2 class='zzzz'>abcasdsdasasdas</h2>","6");

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow