Using HtmlAgilityPack to divide a document

https://stackoverflow.com/questions/3517322

29-09-2019

Question

I am wondering whether this is possible.

My HTML looks like this:

<p>
  <font face="Georgia">
    <b>History</b><br>&nbsp; <br>Two of the polysaccharides used in the manufacture of...</font>
    <a title="PubMed" href="http://www.www.gov/pubmed/" target="_blank">
    <font face="Georgia">) and this web site for new development by...well as Self Affirmed Medical Food GRAS status.&nbsp; 
    </font>
</p>

<p>
  <font face="Georgia">[READMORE]</font>
</p>

<p><font face="Georgia"><br><strong>Proprietary Composition</strong><br>
   <br>The method in which soluble fibres are made into... REST OF ARTICLE...
</p>

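Yes, it is ugly HTML. It comes from a WYSIWYG editor, so I have little control over it.

What I want to do is search for [READMORE] in the document, remove any parent tags (in this case, the <font> and <p> tags) and replace them with a "read more" link, while wrapping the rest of the document (...rest of article...) in one giant <div>.

I am pretty sure HtmlAgilityPack will get me part of the way there, but I am still trying to figure out where to start.

So far, I am pretty sure I have to use htmlDoc.DocumentNode.SelectSingleNode(//p[text()="[READMORE]"]) or something similar. I am not very familiar with XPath.

In my documents, the [READMORE] marker may or may not be inside a nested font tag.

Also, in some cases it may not be inside a tag at all, but directly at the document root. In that case I can do a regular search and replace, which should be straightforward.

My ideal situation would be something like this (pseudocode):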

var node = SelectNodeContaining("[READMORE]");

node.Replace("link here");

node.RestOfDocument().Wrap("<div class='wrapper'>");

I know, I am dreaming... but I hope this makes sense.

Solution

Here is an XSLT solution:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match="p[descendant::text()[. = '[READMORE]']]">
  <a href="#ReadmoreWrapper">READMORE</a>
  <div class="wrapper" id="#ReadmoreWrapper">
   <xsl:apply-templates select="following-sibling::node()" mode="copy"/>
  </div>
 </xsl:template>

 <xsl:template match=
  "node()[ancestor::p[descendant::text()[. = '[READMORE]']]
         or
          preceding::p[descendant::text()[. = '[READMORE]']]
          ]
  "/>

  <xsl:template match="node()|@*" mode="copy">
      <xsl:copy>
       <xsl:apply-templates select="node()|@*" mode="copy"/>
      </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the following XML document:

<html>
<p>
  <font face="Georgia">
    <b>History</b><br/>&#xA0; <br/>Two of the polysaccharides used in the manufacture of...</font>
    <a title="PubMed" href="http://www.www.gov/pubmed/" target="_blank"/>
    <font face="Georgia">) and this web site for new development by...well as Self Affirmed Medical Food GRAS status.&#xA0;
    </font>
</p>

<p>
  <font face="Georgia">[READMORE]</font>
</p>

<p><font face="Georgia"><br/><strong>Proprietary Composition</strong><br/>
   <br/>The method in which soluble fibres are made into... REST OF ARTICLE...
   </font>
</p>

</html>

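the wanted result is produced (for the link to actually jump to the wrapper in a browser, the id value should be ReadmoreWrapper, without the leading #):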

<html>
    <p>
        <font face="Georgia"><b>History</b><br/>  <br/>Two of the polysaccharides used in the manufacture of...</font>
        <a title="PubMed" href="http://www.www.gov/pubmed/" target="_blank"/>
        <font face="Georgia">) and this web site for new development by...well as Self Affirmed Medical Food GRAS status. 
    </font>
    </p>
    <a href="#ReadmoreWrapper">READMORE</a>
    <div class="wrapper" id="#ReadmoreWrapper">
        <p>
            <font face="Georgia"><br/><strong>Proprietary Composition</strong><br/><br/>The method in which soluble fibres are made into... REST OF ARTICLE...
   </font>
        </p>
    </div>
</html>
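
The same idea can be sketched outside XSLT. The following is a minimal illustration in Python using only the standard library's xml.etree.ElementTree, not HtmlAgilityPack code and not part of the original answer. It assumes the input is already well-formed XML (as in the document above) and that the element holding [READMORE] is a direct child of the root. The function name split_at_marker is made up for this example.

```python
import xml.etree.ElementTree as ET

MARKER = "[READMORE]"


def split_at_marker(root, marker=MARKER):
    """Replace the top-level child that holds the marker with a link and
    wrap every following sibling in <div class="wrapper">.

    Returns True if the marker was found, False otherwise.
    """
    children = list(root)
    for index, child in enumerate(children):
        # itertext() walks the text of the element and all its descendants,
        # so the marker is found even inside nested <font> tags.
        if any(text.strip() == marker for text in child.itertext()):
            break
    else:
        return False  # marker not found: leave the tree untouched

    link = ET.Element("a", {"href": "#ReadmoreWrapper"})
    link.text = "READMORE"
    # The id has no leading '#', so href="#ReadmoreWrapper" points at it.
    wrapper = ET.Element("div", {"class": "wrapper", "id": "ReadmoreWrapper"})

    # Detach the marker element and everything after it ...
    for sibling in children[index:]:
        root.remove(sibling)
    # ... then re-attach the link and the wrapped remainder.
    wrapper.extend(children[index + 1:])
    root.append(link)
    root.append(wrapper)
    return True


doc = ET.fromstring(
    "<html>"
    "<p><font face='Georgia'>Intro text</font></p>"
    "<p><font face='Georgia'>[READMORE]</font></p>"
    "<p><font face='Georgia'>REST OF ARTICLE</font></p>"
    "</html>"
)
split_at_marker(doc)
print(ET.tostring(doc, encoding="unicode"))
```

Unlike the XSLT, this sketch compares the text after stripping surrounding whitespace, and it would need an HTML-tolerant parser in front of it to cope with the unclosed tags in the original WYSIWYG output.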

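Other tips

If I understand correctly, there is one thing you could try... it is the same thing we do when sending custom HTML mail:

Create a template HTML page with the static content.
Attach an identifier for the dynamic content, such as [ReadMore] as you mentioned, or {ReadMore}, or something similar.
Now read the template HTML file line by line and replace the identifier with the desired text.
Now save the whole string to a new HTML file, or do whatever else you want with it.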
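
The steps above could look roughly like the following Python sketch (standard library only; this is an illustration, not code from the answer). The function names fill_template and render_file, the sample template, and the #more link target are all made up for this example. For brevity it reads the whole file at once instead of line by line, which gives the same result for a simple placeholder.

```python
from pathlib import Path


def fill_template(template_text, values):
    """Return template_text with every placeholder replaced by its value."""
    result = template_text
    for placeholder, replacement in values.items():
        result = result.replace(placeholder, replacement)
    return result


def render_file(template_path, output_path, values):
    """Read a template file, fill in its placeholders, and save the result."""
    template_text = Path(template_path).read_text(encoding="utf-8")
    filled = fill_template(template_text, values)
    Path(output_path).write_text(filled, encoding="utf-8")


# Hypothetical template: static HTML with a [ReadMore] identifier.
template = "<p>Intro</p>\n[ReadMore]\n<p>Rest of article</p>\n"
page = fill_template(template, {"[ReadMore]": '<a href="#more">Read more</a>'})
print(page)
```

Plain string replacement only swaps the identifier itself. It does not remove surrounding <p> or <font> tags or wrap the rest of the document, so it suits the case where the marker sits directly in the text rather than inside markup that also has to go.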

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow