Trim a string to a length, ignoring HTML

https://stackoverflow.com/questions/736155

09-09-2019

Question

This problem is a tough one. Our application allows users to post news on the home page. That news is entered via a rich text editor that allows HTML. On the home page, we want to display only a truncated summary of the news item.

For example, here is the full text we are displaying, including HTML:

In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us.

We want to trim the news item to 250 characters, but exclude the HTML from the count.

The method we currently use for trimming counts the HTML as well, so HTML-heavy news posts end up truncated far too early.

For instance, if the example above included a ton of HTML, it could end up looking like this:

In an attempt to make a bit more space in the office, kitchen, I've pulled ...

This is not what we want.

Does anyone have a way of tokenizing the HTML tags so that their position in the string is preserved, performing a length check and/or trim on the string, and then restoring the HTML inside the string at its old location?

Solution

Start at the first character of the post and step over each character. Every time you step over a character, increment a counter. When you find a "<" character, stop incrementing the counter until you hit a ">" character. Your position when the counter reaches 250 is where you actually want to cut.

Note that this leaves another problem you will have to deal with: an HTML tag that was opened but not closed before the cutoff.
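
The counting step described above can be sketched in Java roughly as follows. This is only an illustration, not code from the answer: the class name `CutIndex` and the method `cutIndex` are made up for the example, and it assumes that every "<" in the input starts a tag and every ">" ends one.

```java
public class CutIndex {

    /**
     * Returns the index in html just past the limit-th visible character,
     * or html.length() if the text has fewer visible characters than that.
     * Assumption: '<' always opens a tag and '>' always closes one.
     */
    public static int cutIndex(String html, int limit) {
        if (limit <= 0) {
            return 0;
        }
        boolean inTag = false;
        int visible = 0;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;           // stop counting until the tag ends
            } else if (c == '>') {
                inTag = false;          // counting resumes with the next character
            } else if (!inTag) {
                visible++;
                if (visible == limit) {
                    return i + 1;       // cut right after this character
                }
            }
        }
        return html.length();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        // prints "<p>Hello <b>w"
        System.out.println(html.substring(0, cutIndex(html, 7)));
    }
}
```

The output of `main` also shows the caveat from the note: the `<p>` and `<b>` tags are left open after the cut and still need to be closed.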

Other tips

Following the 2-state finite state machine suggestion, I've just developed a simple HTML parser for this purpose, in Java:

http://pastebin.com/jcrqiwnh.

And here is the test case:

http://pastebin.com/37gcs4tv.

And here is the Java code:

import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

public class HtmlShortener {

    private static final String TAGS_TO_SKIP = "br,hr,img,link";
    private static final String[] tagsToSkip = TAGS_TO_SKIP.split(",");
    private static final int STATUS_READY = 0;

    private int cutPoint = -1;
    private String htmlString = "";

    final List<String> tags = new LinkedList<String>();

    StringBuilder sb = new StringBuilder("");
    StringBuilder tagSb = new StringBuilder("");

    int charCount = 0;
    int status = STATUS_READY;

    public HtmlShortener(String htmlString, int cutPoint){
        this.cutPoint = cutPoint;
        this.htmlString = htmlString;
    }

    public String cut(){

        // reset 
        tags.clear();
        sb = new StringBuilder("");
        tagSb = new StringBuilder("");
        charCount = 0;
        status = STATUS_READY;

        String tag = "";

        if (cutPoint < 0){
            return htmlString;
        }

        if (null != htmlString){

            if (cutPoint == 0){
                return "";
            }

            for (int i = 0; i < htmlString.length(); i++){

                String strC = htmlString.substring(i, i+1);


                if (strC.equals("<")){

                    // new tag or tag closure

                    // previous tag reset
                    tagSb = new StringBuilder("");
                    tag = "";

                    // find tag type and name
                    for (int k = i; k < htmlString.length(); k++){

                        String tagC = htmlString.substring(k, k+1);
                        tagSb.append(tagC);

                        if (tagC.equals(">")){
                            tag = getTag(tagSb.toString());
                            if (tag.startsWith("/")){

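                                // closure: only pop when an open tag was recorded,
                                // so a stray closing tag cannot cause an index error
                                if (!isToSkip(tag) && !tags.isEmpty()){
                                    sb.append("</").append(tags.get(tags.size() - 1)).append(">");
                                    tags.remove(tags.size() - 1);
                                }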

                            } else {

                                // new tag
                                sb.append(tagSb.toString());

                                if (!isToSkip(tag)){
                                    tags.add(tag);  
                                }

                            }

                            i = k;
                            break;
                        }

                    }

                } else {

                    sb.append(strC);
                    charCount++;

                }

                // cut check
                if (charCount >= cutPoint){

                    // close previously open tags
                    Collections.reverse(tags);
                    for (String t : tags){
                        sb.append("</").append(t).append(">");
                    }
                    break;
                } 

            }

            return sb.toString();

        } else {
            return null;
        }

    }

    private boolean isToSkip(String tag) {

        if (tag.startsWith("/")){
            tag = tag.substring(1, tag.length());
        }

        for (String tagToSkip : tagsToSkip){
            if (tagToSkip.equals(tag)){
                return true;
            }
        }

        return false;
    }

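    private String getTag(String tagString) {

        // the tag name ends at the first space (attributes follow) or at ">"
        int end = tagString.contains(" ") ? tagString.indexOf(" ") : tagString.indexOf(">");
        String name = tagString.substring(tagString.indexOf("<") + 1, end);

        // drop the trailing slash of self-closing tags such as <br/>,
        // otherwise "br/" would not match the skip list
        if (name.length() > 1 && name.endsWith("/")){
            name = name.substring(0, name.length() - 1);
        }

        return name;

    }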

}

If I understand the problem correctly, you want to keep the HTML formatting, but you don't want to count it as part of the length of the string you are keeping.

You can accomplish this with code that implements a simple finite state machine.

2 states: InTag, OutOfTag
InTag:
- Goes to OutOfTag if a > character is encountered
- Goes to itself if any other character is encountered
OutOfTag:
- Goes to InTag if a < character is encountered
- Goes to itself if any other character is encountered

Your starting state will be OutOfTag.

You implement a finite state machine by processing 1 character at a time. Processing each character brings you to a new state.

While running your text through the finite state machine, you also want to keep an output buffer and a length-encountered-so-far variable (so you know when to stop).

1. Increment your length variable each time you are in the OutOfTag state and you process another character. You can optionally not increment this variable if you have a whitespace character.
2. You end the algorithm when you have no more characters or you have reached the desired length mentioned in #1.
3. In your output buffer, include the characters you encounter up to the length mentioned in #1.
4. Keep a stack of unclosed tags. When you reach the length, add an end tag for each element in the stack. As you run through the algorithm, you can tell when you encounter a tag by keeping a current_tag variable. This current_tag variable is started when you enter the InTag state, and it is ended when you enter the OutOfTag state (or when a whitespace character is encountered while in the InTag state). If you have a start tag, you push it onto the stack. If you have an end tag, you pop it from the stack.
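
The four steps above could be put together in Java along these lines. This is a hedged sketch rather than the answerer's code: `TagAwareTruncator`, `truncate`, `recordTag` and `VOID_TAGS` are names invented for the example, whitespace is counted like any other character, and the input is assumed to be reasonably well-formed HTML.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Locale;

public class TagAwareTruncator {

    private enum State { OUT_OF_TAG, IN_TAG }

    // Tags that never get a closing tag (assumed list, for illustration only).
    private static final List<String> VOID_TAGS = Arrays.asList("br", "hr", "img", "link");

    /** Keeps at most limit visible characters and closes tags left open at the cut. */
    public static String truncate(String html, int limit) {
        StringBuilder out = new StringBuilder();         // output buffer
        StringBuilder currentTag = new StringBuilder();  // text between '<' and '>'
        Deque<String> open = new ArrayDeque<>();         // stack of unclosed tag names
        State state = State.OUT_OF_TAG;
        int length = 0;                                  // visible characters so far
        int tagStart = 0;                                // where the current tag began in out

        for (int i = 0; i < html.length() && length < limit; i++) {
            char c = html.charAt(i);
            out.append(c);
            if (state == State.OUT_OF_TAG) {
                if (c == '<') {
                    state = State.IN_TAG;
                    tagStart = out.length() - 1;
                    currentTag.setLength(0);
                } else {
                    length++;
                }
            } else if (c == '>') {
                state = State.OUT_OF_TAG;
                recordTag(currentTag.toString(), open);
            } else {
                currentTag.append(c);
            }
        }
        if (state == State.IN_TAG) {
            out.setLength(tagStart);                     // drop a tag that never ended
        }
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append(">");
        }
        return out.toString();
    }

    // Pushes start tags, pops on end tags, ignores void and self-closing tags.
    private static void recordTag(String raw, Deque<String> open) {
        String body = raw.trim();
        if (body.isEmpty() || body.startsWith("!") || body.endsWith("/")) {
            return;
        }
        boolean closing = body.startsWith("/");
        String name = (closing ? body.substring(1) : body).trim().split("\\s+")[0];
        if (closing) {
            if (!open.isEmpty()) {
                open.pop();
            }
        } else if (!VOID_TAGS.contains(name.toLowerCase(Locale.ROOT))) {
            open.push(name);
        }
    }
}
```

For example, `TagAwareTruncator.truncate("<div><p>Hello <b>world</b></p></div>", 7)` returns `<div><p>Hello <b>w</b></p></div>`: seven visible characters are kept and the three open tags are closed in reverse order.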

Here is the implementation I came up with, in C#:

public static string TrimToLength(string input, int length)
{
  if (string.IsNullOrEmpty(input))
    return string.Empty;

  if (input.Length <= length)
    return input;

  bool inTag = false;
  int targetLength = 0;

  for (int i = 0; i < input.Length; i++)
  {
    char c = input[i];

    if (c == '>')
    {
      inTag = false;
      continue;
    }

    if (c == '<')
    {
      inTag = true;
      continue;
    }

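    // Characters inside a tag and whitespace do not count toward the length.
    if (inTag || char.IsWhiteSpace(c))
    {
      continue;
    }

    targetLength++;

    if (targetLength == length)
    {
      // ConvertToXhtml is a helper that is not shown here.
      return ConvertToXhtml(input.Substring(0, i + 1));
    }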
  }

  return input;
}

And some unit tests I wrote while doing TDD:

[Test]
public void Html_TrimReturnsEmptyStringWhenNullPassed()
{
  Assert.That(Html.TrimToLength(null, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsEmptyStringWhenEmptyPassed()
{
  Assert.That(Html.TrimToLength(string.Empty, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsUnmodifiedStringWhenSameAsLength()
{
  string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                  "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                  "<br/>" +
                  "In an attempt to make a bit more space in the office, kitchen, I";

  Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(source));
}

[Test]
public void Html_TrimWellFormedHtml()
{
  string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
             "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
             "<br/>" +
             "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
             "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>" +
             "</div>";

  string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                    "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                    "<br/>" +
                    "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

  Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(expected));
}

[Test]
public void Html_TrimMalformedHtml()
{
  string malformedHtml = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                         "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                         "<br/>" +
                         "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
                         "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>";

  string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
              "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
              "<br/>" +
              "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

  Assert.That(Html.TrimToLength(malformedHtml, 250), Is.EqualTo(expected));
}

I realize this comes quite a while after the posted date, but I had a similar problem and this is how I ended up solving it. My concern would be the speed of Regex versus iterating through an array.

Also, if you have a space before an HTML tag and another after it, this doesn't fix that.

private string HtmlTrimmer(string input, int len)
{
    if (string.IsNullOrEmpty(input))
        return string.Empty;
    if (input.Length <= len)
        return input;

    // this is necessary because regex "^" applies to the start of the string, not where you tell it to start from
    string inputCopy;
    string tag;

    string result = "";
    int strLen = 0;
    int strMarker = 0;
    int inputLength = input.Length;     

    Stack stack = new Stack(10);
    Regex text = new Regex("^[^<&]+");                
    Regex singleUseTag = new Regex("^<[^>]*?/>");            
    Regex specChar = new Regex("^&[^;]*?;");
    Regex htmlTag = new Regex("^<.*?>");

    while (strLen < len)
    {
        inputCopy = input.Substring(strMarker);
        //If the marker is at the end of the string OR 
        //the sum of the remaining characters and those analyzed is less than the max length
        if (strMarker >= inputLength || (inputLength - strMarker) + strLen < len)
            break;

        //Match regular text
        result += text.Match(inputCopy,0,len-strLen);
        strLen += result.Length - strMarker;
        strMarker = result.Length;

        inputCopy = input.Substring(strMarker);
        if (singleUseTag.IsMatch(inputCopy))
            result += singleUseTag.Match(inputCopy);
        else if (specChar.IsMatch(inputCopy))
        {
            //think of &nbsp; as 1 character instead of 6
            result += specChar.Match(inputCopy);
            ++strLen;
        }
        else if (htmlTag.IsMatch(inputCopy))
        {
            tag = htmlTag.Match(inputCopy).ToString();
            //This only works if this is valid Markup...
            if(tag[1]=='/')         //Closing tag
                stack.Pop();
            else                    //not a closing tag
                stack.Push(tag);
            result += tag;
        }
        else    //Bad syntax
            result += input[strMarker];

        strMarker = result.Length;
    }

    while (stack.Count > 0)
    {
        tag = stack.Pop().ToString();
        result += tag.Insert(1, "/");
    }
    if (strLen == len)
        result += "...";
    return result;
}
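
The idea of counting an entity such as `&nbsp;` as a single character while giving tags zero length can also be illustrated on its own. The Java sketch below is an assumption-laden example, not a port of the C# method above: `VisibleLength`, `visibleLength` and the `TOKEN` pattern are hypothetical names, and the pattern is a simplification that does not validate entity names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VisibleLength {

    // One token per match: a whole tag, an entity such as &nbsp;,
    // or any single other character (including a lone '<' or '&').
    private static final Pattern TOKEN =
            Pattern.compile("<[^>]*>|&[^;\\s<&]*;|.", Pattern.DOTALL);

    /** Counts visible characters: a tag counts as 0, an entity as 1, any other character as 1. */
    public static int visibleLength(String html) {
        int count = 0;
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String token = m.group();
            boolean isTag = token.length() > 1 && token.charAt(0) == '<';
            if (!isTag) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "Fish", three entities, "chips": 4 + 3 + 5 = 12
        System.out.println(visibleLength("<p>Fish&nbsp;&amp;&nbsp;chips</p>"));
    }
}
```

Scanning the input once with a single matcher also avoids re-creating substrings on every iteration, which may ease the speed concern mentioned above, although that has not been measured here.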

You can try the following npm package:

trim-html

It trims the text inside HTML tags, preserves the original HTML markup, strips HTML tags once the limit is reached, and closes any opened tags.

Wouldn't the fastest way be to use jQuery's text() method?

For example:

<ul>
  <li>One</li>
  <li>Two</li>
  <li>Three</li>
</ul>

var text = $('ul').text();

This will give the value OneTwoThree (plus any whitespace that sits between the elements) in the text variable. That would let you get the actual length of the text without the HTML included.

Licensed under: CC-BY-SA with attribution
