Question

I'm using SgmlReader to convert HTML to XML. The output goes into an XmlDocument object, whose InnerText property I then use to extract the plain text from the website. I'm trying to get the text as clean as possible by removing any JavaScript. Looping through the XML and removing any <script type="text/javascript"> elements is easy enough, but I've hit a brick wall when jQuery code or styling isn't enclosed in any tags. Can anybody help me out?

Sample Code:

Step one: After downloading the HTML with the WebClient class, I save it to a file, then open the file with the TextReader class.

Step two: Create an SgmlReader instance and set its input stream to the text reader:

// setup SGMLReader
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
sgmlReader.InputStream = reader;

// create document
doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);

Step three: Once I have an XmlDocument, I use doc.InnerText to get my plain text.

Step four: I can easily remove JavaScript tags like so:

XmlNodeList nodes = document.GetElementsByTagName("text/javascript");

for (int i = nodes.Count - 1; i >= 0; i--)
{
    nodes[i].ParentNode.RemoveChild(nodes[i]);
}

Some stuff still slips through. Here's an example of the output for one particular website I'm scraping:

Criminal and Civil Enforcement | Fraud | Office of Inspector General | U.S. Department of Health and Human Services



#fancybox-right { 
right:-20px; 
} 
#fancybox-left { 
left:-20px; 
} 
#fancybox-right:hover span, #fancybox-right span 
#fancybox-right:hover span, #fancybox-right span { 
left:auto; 
right:0; 
} 
#fancybox-left:hover span, #fancybox-left span 
#fancybox-left:hover span, #fancybox-left span { 
right:auto; 
left:0; 
} 
#fancybox-overlay { 
/* background: url('/connections/images/wc-overlay.png'); */
/* background: url('/connections/images/banner.png') center center no-repeat; */
} 





$(document).ready(function(){

$("a[rel=photo-show]").fancybox({
'titlePosition' : 'over',
'overlayColor' : '#000',
'overlayOpacity' : 0.9
});

$(".title-under").fancybox({
'titlePosition' : 'outside',
'overlayColor' : '#000',
'overlayOpacity' : 0.9
}) 

}); 

That jQuery and styling needs to be removed.


Solution

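I just threw this together in LINQPad based on the HTML of this page, and it properly removes the script and style tags. Note that GetElementsByTagName matches on the element name, so it has to be called with "script" or "style"; "text/javascript" is the value of the type attribute, not a tag name, and matches nothing.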

void Main()
{
    string htmlPath = @"C:\Users\Jschubert\Desktop\html\test.html";
    var sgmlReader = new Sgml.SgmlReader();
    var stringReader = new StringReader(File.ReadAllText(htmlPath));

    sgmlReader.DocType = "HTML";
    sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
    sgmlReader.InputStream = stringReader;

    // create document
    var doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.XmlResolver = null;
    doc.Load(sgmlReader);

    // Collect every <script> and <style> element anywhere in the document.
    List<XmlNode> nodes = doc.GetElementsByTagName("script")
                          .Cast<XmlNode>().ToList();
    // The leading "//" searches the whole tree; a bare "script" would only
    // match direct children of the document node and find nothing.
    var byType = doc.SelectNodes("//script[@type = 'text/javascript']")
                          .Cast<XmlNode>().ToList();
    var style = doc.GetElementsByTagName("style").Cast<XmlNode>().ToList();
    nodes.AddRange(byType);
    nodes.AddRange(style);

    for (int i = nodes.Count - 1; i >= 0; i--)
    {
        // nodes and byType overlap, so skip anything already detached.
        if (nodes[i].ParentNode != null)
        {
            nodes[i].ParentNode.RemoveChild(nodes[i]);
        }
    }

    doc.DumpFormatted();

    stringReader.Close();
    sgmlReader.Close();
}

Casting to XmlNode to use the generic list is not ideal, but I did it for the sake of space and demonstration.

Also, you shouldn't need both
doc.GetElementsByTagName("script") and
doc.SelectNodes("//script[@type = 'text/javascript']").
Again, I did that for the sake of demonstration. Because the two lists overlap, the loop skips nodes that have already been removed.

If the page has other kinds of scripts and you only want to remove JavaScript, use the latter. If you're removing all script tags, use the former. Using both is harmless but redundant.
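For completeness, here is a minimal sketch of how the removal could be collapsed into a single XPath query and wrapped in a reusable helper. The names HtmlTextCleaner and StripScriptAndStyle are made up for illustration and are not part of SgmlReader or System.Xml. It assumes the elements are lowercase and in no XML namespace, which is what the CaseFolding.ToLower setup above typically produces; a document that declares the XHTML namespace would need an XmlNamespaceManager instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;

public static class HtmlTextCleaner
{
    // Hypothetical helper: removes every <script> and <style> element
    // from the document, then returns the text that is left.
    public static string StripScriptAndStyle(XmlDocument doc)
    {
        // "//" searches the whole tree and "|" is the XPath union operator,
        // so each matching element is returned once.
        // Copy the result into a list before modifying the document.
        List<XmlNode> unwanted = doc.SelectNodes("//script | //style")
                                    .Cast<XmlNode>()
                                    .ToList();

        foreach (XmlNode node in unwanted)
        {
            // A node that was nested inside an already removed element
            // still has that element as its parent, so this stays valid.
            node.ParentNode.RemoveChild(node);
        }

        return doc.InnerText;
    }
}
```

Calling HtmlTextCleaner.StripScriptAndStyle(doc) right after doc.Load(sgmlReader) would replace both the node collection and the removal loop, and its return value is the same string you would otherwise read from doc.InnerText.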

Licensed under: CC-BY-SA with attribution