Question

I have several thousand HTML invoices generated by ASP.NET (messy HTML) that I'm trying to parse and save into a database.

Basically like:

 foreach(var htmlDoc in HtmlFolder)
 {
   foreach(var inputBox in htmlDoc)
   { 
      //Make Collection of ID and Values Insert to DB
   }
 }  

From all the other questions I've read, the best tool for this type of problem is the HtmlAgilityPack; however, for the life of me I can't get the documentation .chm file to work. Any ideas on how I could accomplish this with or without the Agility Pack?

Thanks in advance

Solution

A newer alternative to HtmlAgilityPack is CsQuery. See this later question on its relative performance merits, but its use of CSS selectors can't be beat:

using System.Linq;
using CsQuery;

var doc = CQ.CreateDocumentFromFile(htmldoc); // load and parse the file
var fields = doc["input"]; // select the input fields with a CSS selector
var pairs = fields.Select(node => new Tuple<string, string>(node.Id, node.Value)); // collect (id, value) pairs

OTHER TIPS

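To get the CHM to work, you probably need to open the file's properties in Windows Explorer and unblock it: on the General tab, click the "Unblock" button (or tick the "Unblock" checkbox, depending on the Windows version) and apply.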

The HTML Agility Pack is quite easy to use once you know your way around LINQ to XML or XPath.

Basics you'll need to know:

//import the HtmlAgilityPack
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();

// Load your data
// -----------------------------
// Load doc from file:
doc.Load(pathToFile);

// OR

// Load doc from string:
doc.LoadHtml(contentsOfFile);
// -----------------------------

// Find what you're after
// -----------------------------
// Finding things using LINQ (needs "using System.Linq;")
var nodes = doc.DocumentNode.DescendantsAndSelf("input")
    .Where(node => !string.IsNullOrWhiteSpace(node.Id)
        && node.Attributes["value"] != null
        && !string.IsNullOrWhiteSpace(node.Attributes["value"].Value));

// OR

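// Finding things using XPath
// (both attributes must be present and non-blank; not(@id='') alone
// would also match inputs that have no id attribute at all)
var nodes = doc.DocumentNode
    .SelectNodes("//input[normalize-space(@id) != '' and normalize-space(@value) != '']");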
// -----------------------------


// looping through the nodes:
// the XPath interfaces can return null when no nodes are found
if (nodes != null) 
{ 
    foreach (var node in nodes)
    {
        var id = node.Id;
        var value = node.Attributes["value"].Value;
    }
}

The easiest way to add the Html Agility Pack to a project is through NuGet:

PM> Install-Package HtmlAgilityPack
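
Putting the pieces above together for the folder-of-invoices loop in the question, a minimal sketch could look like the following. The class and method names (InvoiceParser, ExtractInputs) and the "invoices" default folder are made up for illustration, and the Console.WriteLine call merely stands in for whatever database insert you use. It assumes every input you care about carries an id attribute; inputs without a value attribute are kept with an empty value.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

public static class InvoiceParser
{
    // Returns one (id, value) pair for every <input> that has a non-blank id.
    // A missing value attribute is reported as an empty string.
    public static List<KeyValuePair<string, string>> ExtractInputs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("input")
            .Where(node => !string.IsNullOrWhiteSpace(node.GetAttributeValue("id", "")))
            .Select(node => new KeyValuePair<string, string>(
                node.GetAttributeValue("id", ""),
                node.GetAttributeValue("value", "")))
            .ToList();
    }

    public static void Main(string[] args)
    {
        // Hypothetical default folder; pass the real one as the first argument.
        var folder = args.Length > 0 ? args[0] : "invoices";

        foreach (var path in Directory.EnumerateFiles(folder, "*.html"))
        {
            foreach (var pair in ExtractInputs(File.ReadAllText(path)))
            {
                // Placeholder: replace with the actual database insert.
                Console.WriteLine($"{Path.GetFileName(path)}: {pair.Key} = {pair.Value}");
            }
        }
    }
}
```

Keeping the parsing in a method that takes a string makes it easy to try on a single invoice before running it over the whole folder. Using Descendants with LINQ also sidesteps the null result that SelectNodes can return when nothing matches.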

Hah, looks like the ideal time to make a shameless plug of a library I wrote!

This should be rather easy to accomplish with this library (which is built on top of the Html Agility Pack, by the way!): https://github.com/amoerie/htmlbuilders (You can find the NuGet package here: https://www.nuget.org/packages/HtmlBuilders/ )

Code samples:

        const string html = "<div class='invoice'><input type='text' name='abc' value='123'/><input id='ohgood' type='text' name='def' value='456'/></div>";
        var htmlDocument = new HtmlDocument {OptionCheckSyntax = false}; // avoid exceptions when html is invalid
        htmlDocument.Load(new StringReader(html));
        var tag = HtmlTag.Parse(htmlDocument); // if there is a root tag
        var tags = HtmlTag.ParseAll(htmlDocument); // if there is no root tag

        // find looks recursively through the entire DOM tree
        var inputFields = tag.Find(t => string.Equals(t.TagName, "input"));

        foreach (var inputField in inputFields)
        {
            Console.WriteLine(inputField["type"]);
            Console.WriteLine(inputField["value"]);
            if(inputField.HasAttribute("id"))
                Console.WriteLine(inputField["id"]);
        }

Note that inputField[attribute] will throw a 'KeyNotFoundException' if that field does not have the specified attribute name. That's because HtmlTag implements and reuses IDictionary logic for its attributes.

Edit: If you're not running this code in a web environment, you'll need to add a reference to System.Web. That's because this library makes use of the HtmlString class which can be found in System.Web. Just choose 'Add reference' and then you can find it under 'Assemblies > Framework'

You can download the HtmlAgilityPack documentation CHM file from here.

If the CHM file's contents are not visible, uncheck the "Always ask before opening this file" checkbox in the security warning dialog that Windows shows when you open the file.

Note: This dialog appears for unsigned files.

Source: HtmlAgilityPack Documentation

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow