Trim a string to a given length, ignoring HTML

https://stackoverflow.com/questions/736155

09-09-2019
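Question

This is a challenging one. Our application allows users to post news on the homepage. That news is entered through a rich text editor that allows HTML. On the homepage we want to display only a truncated summary of each news item.

For example, here is the full text we are displaying, including HTML:

In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us.

We want to trim the news item to 250 characters, but exclude the HTML from that count.

The method we currently use for trimming counts the HTML as well, so news posts that are heavy on HTML get truncated far too early.

For instance, if the example above included a lot of HTML, it could end up looking like this:

In an attempt to make a bit more space in the office, kitchen, I've...

This is not what we want.

Does anyone have a way of tokenizing the HTML tags so that we can keep track of their positions in the string, perform the length check and/or trim on the string, and then restore the HTML at its old locations?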

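Solution

Start at the first character of the post and step over each character in turn. Every time you step over a character, increment a counter. When you find a '<' character, stop incrementing the counter until you hit a '>' character. The position you are at when the counter reaches 250 is where you actually want to cut.

Note that this leaves another problem to deal with: an HTML tag that was opened but not yet closed before the cut.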
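
The counting rule described above can be sketched in Java roughly as follows. This is only a minimal illustration, assuming that every '<' starts a tag and that the next '>' ends it. The names HtmlCutPoint and cutIndex are made up for this example and do not come from the answer.

```java
final class HtmlCutPoint {

    /**
     * Returns the index in html just past the limit-th visible character,
     * or html.length() if the visible text is shorter than limit.
     * Assumption: '<' always opens a tag and the next '>' closes it.
     */
    static int cutIndex(String html, int limit) {
        if (limit <= 0) {
            return 0;
        }
        boolean inTag = false;
        int visible = 0;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;            // stop counting until the tag ends
            } else if (c == '>' && inTag) {
                inTag = false;           // resume counting after the tag
            } else if (!inTag) {
                visible++;               // a character the reader actually sees
                if (visible == limit) {
                    return i + 1;        // cut just after this character
                }
            }
        }
        return html.length();
    }
}
```

For example, html.substring(0, HtmlCutPoint.cutIndex(html, 250)) keeps the first 250 visible characters. With the input <b>Hello</b> world and a limit of 5, the cut lands right after <b>Hello, which leaves the <b> tag unclosed and illustrates the remaining problem noted above.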

Other tips

Following the suggestion of a 2-state finite machine, I have developed a simple HTML parser in Java for this purpose:

http://pastebin.com/jCRqiwNH

and here is a test case:

http://pastebin.com/37gCS4tV

And here is the Java code:

import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

public class HtmlShortener {

    private static final String TAGS_TO_SKIP = "br,hr,img,link";
    private static final String[] tagsToSkip = TAGS_TO_SKIP.split(",");
    private static final int STATUS_READY = 0;

    private int cutPoint = -1;
    private String htmlString = "";

    final List<String> tags = new LinkedList<String>();

    StringBuilder sb = new StringBuilder("");
    StringBuilder tagSb = new StringBuilder("");

    int charCount = 0;
    int status = STATUS_READY;

    public HtmlShortener(String htmlString, int cutPoint){
        this.cutPoint = cutPoint;
        this.htmlString = htmlString;
    }

    public String cut(){

        // reset 
        tags.clear();
        sb = new StringBuilder("");
        tagSb = new StringBuilder("");
        charCount = 0;
        status = STATUS_READY;

        String tag = "";

        if (cutPoint < 0){
            return htmlString;
        }

        if (null != htmlString){

            if (cutPoint == 0){
                return "";
            }

            for (int i = 0; i < htmlString.length(); i++){

                String strC = htmlString.substring(i, i+1);


                if (strC.equals("<")){

                    // new tag or tag closure

                    // previous tag reset
                    tagSb = new StringBuilder("");
                    tag = "";

                    // find tag type and name
                    for (int k = i; k < htmlString.length(); k++){

                        String tagC = htmlString.substring(k, k+1);
                        tagSb.append(tagC);

                        if (tagC.equals(">")){
                            tag = getTag(tagSb.toString());
                            if (tag.startsWith("/")){

                                // closure (a stray closing tag with nothing open is ignored)
                                if (!isToSkip(tag) && !tags.isEmpty()){
                                    sb.append("</").append(tags.get(tags.size() - 1)).append(">");
                                    tags.remove((tags.size() - 1));
                                }

                            } else {

                                // new tag
                                sb.append(tagSb.toString());

                                if (!isToSkip(tag)){
                                    tags.add(tag);  
                                }

                            }

                            i = k;
                            break;
                        }

                    }

                } else {

                    sb.append(strC);
                    charCount++;

                }

                // cut check
                if (charCount >= cutPoint){

                    // close previously open tags
                    Collections.reverse(tags);
                    for (String t : tags){
                        sb.append("</").append(t).append(">");
                    }
                    break;
                } 

            }

            return sb.toString();

        } else {
            return null;
        }

    }

    private boolean isToSkip(String tag) {

        if (tag.startsWith("/")){
            tag = tag.substring(1, tag.length());
        }

        for (String tagToSkip : tagsToSkip){
            if (tagToSkip.equals(tag)){
                return true;
            }
        }

        return false;
    }

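    private String getTag(String tagString) {

        String name;
        if (tagString.contains(" ")){
            // tag with attributes
            name = tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(" "));
        } else {
            // simple tag
            name = tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(">"));
        }

        // drop the trailing slash of a self-closing tag such as <br/>,
        // so that the name matches the entries in TAGS_TO_SKIP
        if (name.length() > 1 && name.endsWith("/")){
            name = name.substring(0, name.length() - 1);
        }

        return name;
    }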

}

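If I understand the problem correctly, you want to keep the HTML formatting, but you do not want it counted as part of the length of the string you are keeping.

You can accomplish this with code that implements a simple finite state machine.

2 states: InTag, OutOfTag
InTag:
- Goes to OutOfTag if a > character is encountered
- Goes to itself if any other character is encountered
OutOfTag:
- Goes to InTag if a < character is encountered
- Goes to itself if any other character is encountered

Your starting state is OutOfTag.

You implement a finite state machine by processing one character at a time. Processing each character brings you to a new state.

As you run your text through the finite state machine, you also want to keep an output buffer and a variable holding the length encountered so far (so that you know when to stop).

1. Increment your Length variable each time you are in the OutOfTag state and you process another character. You can optionally skip the increment for whitespace characters.
2. End the algorithm when you run out of characters or reach the desired length mentioned in #1.
3. In your output buffer, include the characters you encounter up to the length mentioned in #1.
4. Keep a stack of unclosed tags. When you reach the length, add an end tag for each element in the stack. As you run through the algorithm, you can tell when you have encountered a tag by keeping a current_tag variable. This current_tag variable is started when you enter the InTag state and ended when you enter the OutOfTag state (or when a whitespace character is encountered while in the InTag state). If you have a start tag, push it onto the stack. If you have an end tag, pop it from the stack.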
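
The four steps above could be put together along the following lines. This is a hedged sketch rather than the answerer's code: the class name HtmlTruncator, the method names and the VOID_TAGS list are assumptions made for illustration, whitespace is counted like any other character, and comments, scripts and '>' inside attribute values are not handled.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class HtmlTruncator {

    private enum State { OUT_OF_TAG, IN_TAG }

    // Elements that never take an end tag; an illustrative, non-exhaustive list.
    private static final List<String> VOID_TAGS =
            Arrays.asList("br", "hr", "img", "input", "link", "meta");

    static String truncate(String html, int limit) {
        StringBuilder out = new StringBuilder();         // output buffer (step 3)
        StringBuilder currentTag = new StringBuilder();  // text between '<' and '>'
        Deque<String> open = new ArrayDeque<>();         // unclosed tags (step 4)
        State state = State.OUT_OF_TAG;
        int length = 0;                                  // visible characters so far (step 1)

        // Step 2: stop when the input runs out or the limit is reached.
        for (int i = 0; i < html.length() && length < limit; i++) {
            char c = html.charAt(i);
            out.append(c);
            if (state == State.OUT_OF_TAG) {
                if (c == '<') {
                    state = State.IN_TAG;
                    currentTag.setLength(0);
                } else {
                    length++;
                }
            } else if (c == '>') {
                state = State.OUT_OF_TAG;
                recordTag(currentTag.toString(), open);
            } else {
                currentTag.append(c);
            }
        }

        // Close whatever is still open, innermost tag first.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    // Pushes a start tag or pops a matching end tag; body is the text between '<' and '>'.
    private static void recordTag(String body, Deque<String> open) {
        boolean closing = body.startsWith("/");
        boolean selfClosing = body.endsWith("/");
        String name = body.replaceFirst("^/", "").split("[\\s/]", 2)[0].toLowerCase();
        if (name.isEmpty() || name.startsWith("!")) {
            return;                                      // "<>", doctype or comment opener
        }
        if (closing) {
            if (name.equals(open.peek())) {
                open.pop();                              // mismatched end tags are ignored
            }
        } else if (!selfClosing && !VOID_TAGS.contains(name)) {
            open.push(name);
        }
    }
}
```

With these assumptions, HtmlTruncator.truncate("<p>Hello <b>big world</b></p>", 9) stops after "big" and appends the two missing end tags, giving <p>Hello <b>big</b></p>.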

Here is the implementation I came up with, in C#:

public static string TrimToLength(string input, int length)
{
  if (string.IsNullOrEmpty(input))
    return string.Empty;

  if (input.Length <= length)
    return input;

  bool inTag = false;
  int targetLength = 0;

  for (int i = 0; i < input.Length; i++)
  {
    char c = input[i];

    if (c == '>')
    {
      inTag = false;
      continue;
    }

    if (c == '<')
    {
      inTag = true;
      continue;
    }

    if (inTag || char.IsWhiteSpace(c))
    {
      continue;
    }

    targetLength++;

    if (targetLength == length)
    {
      return ConvertToXhtml(input.Substring(0, i + 1));
    }
  }

  return input;
}

And here are a few of the unit tests I wrote while developing it with TDD:

[Test]
public void Html_TrimReturnsEmptyStringWhenNullPassed()
{
  Assert.That(Html.TrimToLength(null, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsEmptyStringWhenEmptyPassed()
{
  Assert.That(Html.TrimToLength(string.Empty, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsUnmodifiedStringWhenSameAsLength()
{
  string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                  "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                  "<br/>" +
                  "In an attempt to make a bit more space in the office, kitchen, I";

  Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(source));
}

[Test]
public void Html_TrimWellFormedHtml()
{
  string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
             "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
             "<br/>" +
             "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
             "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>" +
             "</div>";

  string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                    "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                    "<br/>" +
                    "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

  Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(expected));
}

[Test]
public void Html_TrimMalformedHtml()
{
  string malformedHtml = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                         "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                         "<br/>" +
                         "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
                         "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>";

  string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
              "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
              "<br/>" +
              "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

  Assert.That(Html.TrimToLength(malformedHtml, 250), Is.EqualTo(expected));
}

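I realize this is quite a while after the original post date, but I ran into a similar issue and this is how I ended up solving it. My concern is the speed of regular expressions versus iterating through an array.

Also, if there is a space before and after an HTML tag, this does not fix that.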

private string HtmlTrimmer(string input, int len)
{
    if (string.IsNullOrEmpty(input))
        return string.Empty;
    if (input.Length <= len)
        return input;

    // this is necessary because regex "^" applies to the start of the string, not where you tell it to start from
    string inputCopy;
    string tag;

    string result = "";
    int strLen = 0;
    int strMarker = 0;
    int inputLength = input.Length;     

    Stack stack = new Stack(10);
    Regex text = new Regex("^[^<&]+");                
    Regex singleUseTag = new Regex("^<[^>]*?/>");            
    Regex specChar = new Regex("^&[^;]*?;");
    Regex htmlTag = new Regex("^<.*?>");

    while (strLen < len)
    {
        inputCopy = input.Substring(strMarker);
        //If the marker is at the end of the string OR
        //the sum of the remaining characters and those analyzed is less than the max length
        if (strMarker >= inputLength || (inputLength - strMarker) + strLen < len)
            break;

        //Match regular text
        result += text.Match(inputCopy,0,len-strLen);
        strLen += result.Length - strMarker;
        strMarker = result.Length;

        inputCopy = input.Substring(strMarker);
        if (singleUseTag.IsMatch(inputCopy))
            result += singleUseTag.Match(inputCopy);
        else if (specChar.IsMatch(inputCopy))
        {
            //think of &nbsp; as 1 character instead of 5
            result += specChar.Match(inputCopy);
            ++strLen;
        }
        else if (htmlTag.IsMatch(inputCopy))
        {
            tag = htmlTag.Match(inputCopy).ToString();
            //This only works if this is valid Markup...
            if(tag[1]=='/')         //Closing tag
                stack.Pop();
            else                    //not a closing tag
                stack.Push(tag);
            result += tag;
        }
        else    //Bad syntax
            result += input[strMarker];

        strMarker = result.Length;
    }

    while (stack.Count > 0)
    {
        tag = stack.Pop().ToString();
        result += tag.Insert(1, "/");
    }
    if (strLen == len)
        result += "...";
    return result;
}
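
The answer above counts an entity such as &nbsp; as one character instead of several. A small Java sketch of that counting rule is shown below. The class name VisibleLength and the simplified token pattern are assumptions made for illustration. It only measures the visible length and does not truncate anything.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VisibleLength {

    // A tag, or a character reference such as &nbsp; or &#8230;
    // (a simplified pattern, assumed here for illustration).
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|&#?[A-Za-z0-9]+;");

    /** Counts the characters a reader would see: tags count 0, entities count 1. */
    static int of(String html) {
        int count = 0;
        int pos = 0;
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            count += m.start() - pos;            // plain text before the token
            if (html.charAt(m.start()) == '&') {
                count++;                         // an entity displays as one character
            }
            pos = m.end();
        }
        return count + (html.length() - pos);    // trailing plain text
    }
}
```

Under these assumptions, VisibleLength.of("<b>x &amp; y</b>") is 5, the same as the length of the rendered text "x & y", while a bare ampersand as in "AT&T" is counted as ordinary text.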

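You can try the following npm package:

trim-html

It cuts off the text inside the HTML tags once the limit is reached, preserves the original HTML structure, removes the HTML tags that come after the limit, and closes any tags left open.

Wouldn't the fastest way be to use jQuery's text() method?

Example: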

<ul>
  <li>One</li>
  <li>Two</li>
  <li>Three</li>
</ul>

var text = $('ul').text();

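This would give the text variable the value OneTwoThree (plus whatever whitespace separates the list items in the markup). That lets you get the actual length of the text without the HTML included.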

License: CC-BY-SA with attribution

Not affiliated with StackOverflow