Question

I'm wondering if this is possible.

I have html like so:

<p>
  <font face="Georgia">
    <b>History</b><br>&nbsp; <br>Two of the polysaccharides used in the manufacture of...</font>
    <a title="PubMed" href="http://www.www.gov/pubmed/" target="_blank">
    <font face="Georgia">) and this web site for new development by...well as Self Affirmed Medical Food GRAS status.&nbsp; 
    </font>
</p>

<p>
  <font face="Georgia">[READMORE]</font>
</p>

<p><font face="Georgia"><br><strong>Proprietary Composition</strong><br>
   <br>The method in which soluble fibres are made into... REST OF ARTICLE...
</p>

Yes, it's ugly HTML; it comes from a WYSIWYG editor, so I have little control over it.

What I want to do is search for [READMORE] in the document, remove any parent tags (in this case, the <font> and <p> tags), and replace them with a "read more" link, while wrapping the rest of the document in a giant <div class="wrapper">...rest of article...</div>.

I'm pretty sure the HtmlAgilityPack will get me part of the way there, but I'm just trying to figure out where to start.

So far, I'm pretty sure that I have to use htmlDoc.DocumentNode.SelectSingleNode("//p[text()='[READMORE]']") or something similar. I'm not too familiar with XPath.

For my documents, the readmore may or may not be in a nested font tag.

Also, in some cases, it may not be in a tag at all, but rather at the document root. I can just do a regular search and replace in that case and it should be straightforward.

My ideal situation would be something like this (PSEUDOCODE)

var node = SelectNodeContaining("[READMORE]");

node.Replace("link here");

node.RestOfDocument().Wrap("<div class='wrapper'>");

I know, I'm dreaming... but I hope this makes sense.


Solution

Here is an XSLT solution:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match="p[descendant::text()[. = '[READMORE]']]">
  <a href="#ReadmoreWrapper">READMORE</a>
  <div class="wrapper" id="ReadmoreWrapper">
   <xsl:apply-templates select="following-sibling::node()" mode="copy"/>
  </div>
 </xsl:template>

 <xsl:template match=
  "node()[ancestor::p[descendant::text()[. = '[READMORE]']]
         or
          preceding::p[descendant::text()[. = '[READMORE]']]
          ]
  "/>

  <xsl:template match="node()|@*" mode="copy">
      <xsl:copy>
       <xsl:apply-templates select="node()|@*" mode="copy"/>
      </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the following XML document:

<html>
<p>
  <font face="Georgia">
    <b>History</b><br/>&#xA0; <br/>Two of the polysaccharides used in the manufacture of...</font>
    <a title="PubMed" href="http://www.www.gov/pubmed/" target="_blank"/>
    <font face="Georgia">) and this web site for new development by...well as Self Affirmed Medical Food GRAS status.&#xA0;
    </font>
</p>

<p>
  <font face="Georgia">[READMORE]</font>
</p>

<p><font face="Georgia"><br/><strong>Proprietary Composition</strong><br/>
   <br/>The method in which soluble fibres are made into... REST OF ARTICLE...
   </font>
</p>

</html>

the wanted result is produced:

<html>
    <p>
        <font face="Georgia"><b>History</b><br/>  <br/>Two of the polysaccharides used in the manufacture of...</font>
        <a title="PubMed" href="http://www.www.gov/pubmed/" target="_blank"/>
        <font face="Georgia">) and this web site for new development by...well as Self Affirmed Medical Food GRAS status. 
    </font>
    </p>
    <a href="#ReadmoreWrapper">READMORE</a>
    <div class="wrapper" id="ReadmoreWrapper">
        <p>
            <font face="Georgia"><br/><strong>Proprietary Composition</strong><br/><br/>The method in which soluble fibres are made into... REST OF ARTICLE...
   </font>
        </p>
    </div>
</html>
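
For comparison, the same tree manipulation can be sketched without XSLT. The following is only a rough Python analogue using the standard library's xml.etree.ElementTree, not HtmlAgilityPack code. It assumes the input is already well-formed XML (as in the document above) and that the [READMORE] marker sits somewhere inside a direct child of the root element. The function name wrap_after_marker is made up for this illustration. The wrapper's id is written without a leading "#" so that the link's "#ReadmoreWrapper" fragment resolves to it.

```python
import xml.etree.ElementTree as ET

MARKER = "[READMORE]"


def wrap_after_marker(root):
    """Swap the top-level child holding MARKER for a link and wrap later siblings.

    Hypothetical helper: returns True if the marker was found, False otherwise.
    """
    children = list(root)
    for index, child in enumerate(children):
        # itertext() walks all descendant text, so nested <font> tags are covered.
        if any(text.strip() == MARKER for text in child.itertext()):
            break
    else:
        return False  # no marker found; the tree is left unchanged

    marker_node = children[index]

    link = ET.Element("a", {"href": "#ReadmoreWrapper"})
    link.text = "READMORE"
    link.tail = marker_node.tail

    wrapper = ET.Element("div", {"class": "wrapper", "id": "ReadmoreWrapper"})
    for later in children[index + 1:]:
        root.remove(later)     # detach from the root...
        wrapper.append(later)  # ...and re-attach inside the wrapper

    root.remove(marker_node)
    root.insert(index, link)
    root.insert(index + 1, wrapper)
    return True


doc = ET.fromstring(
    "<html><p>Intro</p><p><font face='Georgia'>[READMORE]</font></p>"
    "<p>Rest of article</p></html>"
)
wrap_after_marker(doc)
print(ET.tostring(doc, encoding="unicode"))
```

Like the stylesheet, this drops the whole <p> that contains the marker, including the nested <font>. Real WYSIWYG output with unclosed <br> tags or &nbsp; entities would need to be converted to well-formed XML first, or handled by a forgiving HTML parser.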

OTHER TIPS

If I understand the question correctly, you can try the same approach that is used when sending custom HTML emails:

  1. Create a template of your HTML page containing the static content.
  2. Add identifiers for the dynamic content, such as [ReadMore], {ReadMore}, or something similar.
  3. Read the template HTML file line by line and replace the identifiers with the desired text.
  4. Save the resulting string to a new HTML file, or do whatever else you want with it.
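
The steps above can be sketched roughly as follows. This is a minimal Python illustration of plain placeholder substitution, not the asker's C# code; the function name fill_template, the sample template, and the "#more" link target are all made up for the example. Note that this only swaps the marker text and leaves any surrounding <p> or <font> tags in place.

```python
def fill_template(template_text, replacements):
    """Replace each placeholder with its value, one line at a time."""
    filled_lines = []
    for line in template_text.splitlines(keepends=True):
        for placeholder, value in replacements.items():
            line = line.replace(placeholder, value)
        filled_lines.append(line)
    return "".join(filled_lines)


# Hypothetical template with one dynamic identifier.
template = "<p>Intro</p>\n<p>[READMORE]</p>\n<p>Rest of article</p>\n"
page = fill_template(template, {"[READMORE]": '<a href="#more">Read more</a>'})
print(page)
```

The resulting string can then be written to a new file or sent on as needed.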
Licensed under: CC-BY-SA with attribution