Trim a string to a given length, ignoring HTML

https://stackoverflow.com/questions/736155

09-09-2019

Question

This is a challenging problem. Our application lets users post news items to the home page. The news is entered through a rich text editor that allows HTML. On the home page we want to show only a truncated summary of each news item.

For example, here is the full text we are displaying, including HTML:

In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us.

We want to trim the news item to 250 characters, but exclude HTML from the count.

The method we currently use for trimming counts the HTML, so news posts that are heavy on HTML get truncated far too early.

For example, if the example above included tons of HTML, it could end up looking like this:

In an attempt to make a bit more space in the office, kitchen, I've pulled...

This is not what we want.

Does anyone have a way of tokenizing HTML tags so as to keep their position in the string, perform a length check and/or trim on the string, and then restore the HTML inside the string at its old location?

Solution

Start at the first character of the post and step over each character in turn. Every time you step over a character, increment a counter. When you find a '<' character, stop incrementing the counter until you hit a '>' character. Your position when the counter reaches 250 is where you actually want to cut.

Note that this leaves another problem you will have to deal with: an HTML tag that is opened but not closed before the cutoff.
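
The counting step described above can be sketched in Java roughly as follows. The class and method names (VisibleCut, cutIndex) are made up for illustration, and the sketch assumes that every '<' starts a tag and that attribute values never contain '>'. It only finds the cut position; it does not close tags left open.

```java
final class VisibleCut {

    /**
     * Returns the index just past the limit-th character found outside of tags,
     * or html.length() if the text has fewer visible characters than that.
     */
    static int cutIndex(String html, int limit) {
        if (limit <= 0) {
            return 0;
        }
        int visible = 0;        // characters counted so far, tags excluded
        boolean inTag = false;  // true between '<' and the matching '>'
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                visible++;
                if (visible == limit) {
                    return i + 1;
                }
            }
        }
        return html.length();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        // prints: <p>Hello <b>wo
        System.out.println(html.substring(0, cutIndex(html, 8)));
    }
}
```

The output of the small example shows the follow-up problem mentioned above: both <p> and <b> are still open at the cut.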

Other tips

Following the two-state finite state machine suggestion, I've just developed a simple HTML parser for this purpose, in Java:

http://pastebin.com/jCRqiwNH

and here is a test case:

http://pastebin.com/37gCS4tV

And here is the Java code:

import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

public class HtmlShortener {

    private static final String TAGS_TO_SKIP = "br,hr,img,link";
    private static final String[] tagsToSkip = TAGS_TO_SKIP.split(",");
    private static final int STATUS_READY = 0;

    private int cutPoint = -1;
    private String htmlString = "";

    final List<String> tags = new LinkedList<String>();

    StringBuilder sb = new StringBuilder("");
    StringBuilder tagSb = new StringBuilder("");

    int charCount = 0;
    int status = STATUS_READY;

    public HtmlShortener(String htmlString, int cutPoint){
        this.cutPoint = cutPoint;
        this.htmlString = htmlString;
    }

    public String cut(){

        // reset 
        tags.clear();
        sb = new StringBuilder("");
        tagSb = new StringBuilder("");
        charCount = 0;
        status = STATUS_READY;

        String tag = "";

        if (cutPoint < 0){
            return htmlString;
        }

        if (null != htmlString){

            if (cutPoint == 0){
                return "";
            }

            for (int i = 0; i < htmlString.length(); i++){

                String strC = htmlString.substring(i, i+1);


                if (strC.equals("<")){

                    // new tag or tag closure

                    // previous tag reset
                    tagSb = new StringBuilder("");
                    tag = "";

                    // find tag type and name
                    for (int k = i; k < htmlString.length(); k++){

                        String tagC = htmlString.substring(k, k+1);
                        tagSb.append(tagC);

                        if (tagC.equals(">")){
                            tag = getTag(tagSb.toString());
                            if (tag.startsWith("/")){

                                // closure
                                if (!isToSkip(tag) && !tags.isEmpty()){
                                    // the isEmpty check guards against a stray closing tag with nothing open
                                    sb.append("</").append(tags.get(tags.size() - 1)).append(">");
                                    tags.remove((tags.size() - 1));
                                }

                            } else {

                                // new tag
                                sb.append(tagSb.toString());

                                if (!isToSkip(tag)){
                                    tags.add(tag);  
                                }

                            }

                            i = k;
                            break;
                        }

                    }

                } else {

                    sb.append(strC);
                    charCount++;

                }

                // cut check
                if (charCount >= cutPoint){

                    // close previously open tags
                    Collections.reverse(tags);
                    for (String t : tags){
                        sb.append("</").append(t).append(">");
                    }
                    break;
                } 

            }

            return sb.toString();

        } else {
            return null;
        }

    }

    private boolean isToSkip(String tag) {

        if (tag.startsWith("/")){
            tag = tag.substring(1, tag.length());
        }

        for (String tagToSkip : tagsToSkip){
            if (tagToSkip.equals(tag)){
                return true;
            }
        }

        return false;
    }

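    private String getTag(String tagString) {

        // strip the angle brackets first
        String name = tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(">"));

        if (name.contains(" ")){
            // tag with attributes: keep only the name
            name = name.substring(0, name.indexOf(" "));
        }

        // "<br/>" would otherwise yield "br/" and miss the skip list
        if (name.length() > 1 && name.endsWith("/")){
            name = name.substring(0, name.length() - 1);
        }

        return name;
    }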

}

If I understand the problem correctly, you want to keep the HTML formatting, but not count it as part of the length of the string you are keeping.

This can be done with code that implements a simple finite state machine (http://en.wikipedia.org/wiki/Finite_state_machine).

2 states: Intag, OutOfTag
  Intag:
    - Goes to OutOfTag if a > character is encountered
    - Stays in the same state on any other character
  OutOfTag:
    - Goes to Intag if a < character is encountered
    - Stays in the same state on any other character

Your starting state will be OutOfTag.
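
As a minimal sketch, the two states and their transitions could be written in Java like this. The names TagStateMachine, State, next and countOutOfTag are illustrative and do not come from the answer.

```java
final class TagStateMachine {

    enum State { OUT_OF_TAG, IN_TAG }

    /** Applies one transition of the two-state machine for the character c. */
    static State next(State current, char c) {
        if (current == State.IN_TAG) {
            return c == '>' ? State.OUT_OF_TAG : State.IN_TAG;
        }
        return c == '<' ? State.IN_TAG : State.OUT_OF_TAG;
    }

    /** Counts the characters that are read while the machine stays in OUT_OF_TAG. */
    static int countOutOfTag(String html) {
        State state = State.OUT_OF_TAG; // starting state
        int count = 0;
        for (int i = 0; i < html.length(); i++) {
            State following = next(state, html.charAt(i));
            // A '<' read in OUT_OF_TAG moves to IN_TAG and is not counted.
            if (state == State.OUT_OF_TAG && following == State.OUT_OF_TAG) {
                count++;
            }
            state = following;
        }
        return count;
    }

    public static void main(String[] args) {
        // prints: 8
        System.out.println(countOutOfTag("<b>hi</b> there"));
    }
}
```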

You implement a finite state machine by processing one character at a time. Processing each character takes you to a new state.

As you run your text through the finite state machine, you also want to keep an output buffer and a variable holding the length encountered so far (so you know when to stop).

1. Increment the length variable each time you are in the OutOfTag state and process another character. Optionally, you can choose not to increment it for whitespace characters.
2. The algorithm ends when there are no more characters or when you have reached the desired length mentioned in #1.
3. In your output buffer, include the characters you encounter up to the length mentioned in #1.
4. Keep a stack of unclosed tags. When you reach the length, add an end tag for each element on the stack. As you run through the algorithm you can tell when you encounter a tag by keeping a current_tag variable. This current_tag variable is started when you enter the Intag state and ended when you enter the OutOfTag state (or when a whitespace character is encountered while in the Intag state). If you have a start tag, push it onto the stack. If you have an end tag, pop it from the stack.
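
Put together, the four steps above might look like the following Java sketch. It is an illustration under stated assumptions, not the answer's own code: HtmlTruncator, truncate and recordTag are hypothetical names, the list of tags that never get a closing tag simply reuses br, hr, img and link from the Java answer earlier in this thread, whitespace is counted like any other character, and the input is assumed to have no '>' inside attribute values or comments.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class HtmlTruncator {

    // Tags that never get a closing tag (assumed list, borrowed from the Java answer above).
    private static final List<String> VOID_TAGS = Arrays.asList("br", "hr", "img", "link");

    static String truncate(String html, int limit) {
        StringBuilder out = new StringBuilder();          // output buffer
        StringBuilder currentTag = new StringBuilder();   // text between '<' and '>'
        Deque<String> open = new ArrayDeque<>();          // stack of unclosed tag names
        boolean inTag = false;
        int length = 0;                                   // characters seen outside tags

        for (int i = 0; i < html.length() && length < limit; i++) {
            char c = html.charAt(i);
            out.append(c);
            if (inTag) {
                if (c == '>') {
                    inTag = false;
                    recordTag(currentTag.toString(), open);
                } else {
                    currentTag.append(c);
                }
            } else if (c == '<') {
                inTag = true;
                currentTag.setLength(0);
            } else {
                length++;
            }
        }

        // Close whatever is still open, innermost tag first.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    /** Pushes a start tag onto the stack, pops for an end tag, ignores self-closing tags. */
    private static void recordTag(String body, Deque<String> open) {
        if (body.isEmpty() || body.startsWith("!") || body.endsWith("/")) {
            return; // comment, doctype, or self-closing tag such as <br/>
        }
        if (body.startsWith("/")) {
            if (!open.isEmpty()) {
                open.pop();
            }
            return;
        }
        String name = body.split("\\s+", 2)[0]; // drop any attributes
        if (!VOID_TAGS.contains(name.toLowerCase())) {
            open.push(name);
        }
    }

    public static void main(String[] args) {
        // prints: <p>Hello <b>wo</b></p>
        System.out.println(truncate("<p>Hello <b>world</b></p>", 8));
    }
}
```

Because the counter only advances outside tags, the loop can never stop in the middle of a tag unless the input itself ends there.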

Here is the implementation I came up with, in C#:

public static string TrimToLength(string input, int length)
{
  if (string.IsNullOrEmpty(input))
    return string.Empty;

  if (input.Length <= length)
    return input;

  bool inTag = false;
  int targetLength = 0;

  for (int i = 0; i < input.Length; i++)
  {
    char c = input[i];

    if (c == '>')
    {
      inTag = false;
      continue;
    }

    if (c == '<')
    {
      inTag = true;
      continue;
    }

    if (inTag || char.IsWhiteSpace(c))
    {
      continue;
    }

    targetLength++;

    if (targetLength == length)
    {
      // ConvertToXhtml is a helper that is not shown in this post
      return ConvertToXhtml(input.Substring(0, i + 1));
    }
  }

  return input;
}

And a few unit tests I used while doing TDD:

[Test]
public void Html_TrimReturnsEmptyStringWhenNullPassed()
{
  Assert.That(Html.TrimToLength(null, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsEmptyStringWhenEmptyPassed()
{
  Assert.That(Html.TrimToLength(string.Empty, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsUnmodifiedStringWhenSameAsLength()
{
  string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                  "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                  "<br/>" +
                  "In an attempt to make a bit more space in the office, kitchen, I";

  Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(source));
}

[Test]
public void Html_TrimWellFormedHtml()
{
  string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
             "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
             "<br/>" +
             "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
             "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>" +
             "</div>";

  string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                    "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                    "<br/>" +
                    "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

  Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(expected));
}

[Test]
public void Html_TrimMalformedHtml()
{
  string malformedHtml = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                         "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                         "<br/>" +
                         "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
                         "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>";

  string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
              "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
              "<br/>" +
              "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

  Assert.That(Html.TrimToLength(malformedHtml, 250), Is.EqualTo(expected));
}

I'm aware this comes quite a while after the question was posted, but I had a similar problem and this is how I ended up solving it. My concern would be the speed of regular expressions versus iterating through an array.

Also, if you have a space before an HTML tag and another one after it, this does not fix that.

private string HtmlTrimmer(string input, int len)
{
    if (string.IsNullOrEmpty(input))
        return string.Empty;
    if (input.Length <= len)
        return input;

    // this is necessary because regex "^" applies to the start of the string, not where you tell it to start from
    string inputCopy;
    string tag;

    string result = "";
    int strLen = 0;
    int strMarker = 0;
    int inputLength = input.Length;     

    Stack stack = new Stack(10);
    Regex text = new Regex("^[^<&]+");                
    Regex singleUseTag = new Regex("^<[^>]*?/>");            
    Regex specChar = new Regex("^&[^;]*?;");
    Regex htmlTag = new Regex("^<.*?>");

    while (strLen < len)
    {
        inputCopy = input.Substring(strMarker);
        //If the marker is at the end of the string OR 
        //the sum of the remaining characters and those analyzed is less than the maxlength
        if (strMarker >= inputLength || (inputLength - strMarker) + strLen < len)
            break;

        //Match regular text
        result += text.Match(inputCopy,0,len-strLen);
        strLen += result.Length - strMarker;
        strMarker = result.Length;

        inputCopy = input.Substring(strMarker);
        if (singleUseTag.IsMatch(inputCopy))
            result += singleUseTag.Match(inputCopy);
        else if (specChar.IsMatch(inputCopy))
        {
            //think of &nbsp; as 1 character instead of 5
            result += specChar.Match(inputCopy);
            ++strLen;
        }
        else if (htmlTag.IsMatch(inputCopy))
        {
            tag = htmlTag.Match(inputCopy).ToString();
            //This only works if this is valid Markup...
            if(tag[1]=='/')         //Closing tag
                stack.Pop();
            else                    //not a closing tag
                stack.Push(tag);
            result += tag;
        }
        else    //Bad syntax
            result += input[strMarker];

        strMarker = result.Length;
    }

    while (stack.Count > 0)
    {
        tag = stack.Pop().ToString();
        result += tag.Insert(1, "/");
    }
    if (strLen == len)
        result += "...";
    return result;
}
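
The idea in the code above of treating an entity such as &nbsp; as one character can also be sketched in Java with a single tokenizing regex. This is a simplified illustration rather than a port of the C# method: EntityAwareCounter and visibleLength are hypothetical names, and it only measures the visible length without truncating anything.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityAwareCounter {

    // One token is a whole tag, a character reference such as &nbsp; or &#160;,
    // or any other single character.
    private static final Pattern TOKEN =
            Pattern.compile("<[^>]*>|&#?[A-Za-z0-9]+;|.", Pattern.DOTALL);

    /** Counts visible characters: tags count as 0, each entity counts as 1. */
    static int visibleLength(String html) {
        int count = 0;
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String token = m.group();
            boolean isTag = token.length() > 1 && token.startsWith("<");
            if (!isTag) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // prints: 5
        System.out.println(visibleLength("a&nbsp;<b>b</b>&amp;c"));
    }
}
```

A bare ampersand that is not followed by a well-formed reference, as in "AT&T", falls through to the single-character branch and is counted as one ordinary character.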

You can try the following npm package:

trim-html

It cuts off enough text inside the HTML tags, preserves the original HTML structure, removes HTML tags after the limit is reached, and closes any tags left open.

Wouldn't the fastest way be to use jQuery's text() method?

For example:

<ul>
  <li>One</li>
  <li>Two</li>
  <li>Three</li>
</ul>

var text = $('ul').text();

This would give you the value OneTwoThree in the variable text. That lets you get the actual length of the text without the HTML included.
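
Outside the browser, a rough server-side approximation of the same measurement is to strip anything that looks like a tag before taking the length. The Java sketch below uses hypothetical names (PlainText, stripTags, needsTruncation); unlike jQuery's text(), it does not parse the HTML or decode entities, so treat it only as a quick check of whether a post needs truncating at all.

```java
final class PlainText {

    /** Removes everything between '<' and the next '>', leaving only the text. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    /** True when the text without tags is longer than the limit. */
    static boolean needsTruncation(String html, int limit) {
        return stripTags(html).length() > limit;
    }

    public static void main(String[] args) {
        // prints: OneTwoThree
        System.out.println(stripTags("<ul><li>One</li><li>Two</li><li>Three</li></ul>"));
    }
}
```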

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow