Parsing ASP Elements with HtmlAgilityPack

https://stackoverflow.com/questions/23457954

15-07-2023

Question

I have a basic ASPX page:

<%@ Page Language="C#" MasterPageFile="SomeMasterPage.master" AutoEventWireup="true" %>
<h1>My ASPX Page</h1>
<div class="content">
    <p>Some content goes here.</p>
</div>

Using HtmlAgilityPack, I want to get the first line from the ASPX page and get access to its attributes (Language, MasterPageFile, and AutoEventWireup). However, when I attempt to use HtmlAgilityPack to load the page's HTML, the first line is returned as a text node.

using HtmlAgilityPack;

public static class Program
{
    public static void Main(string[] args)
    {
        var parser = new Parser();
        parser.Parse("some-page.aspx");
    }
}

public class Parser
{
    public void Parse(string path)
    {
        HtmlDocument document = new HtmlDocument();
        document.Load(path);

        HtmlNode childNode = document.DocumentNode.ChildNodes[0]; 
        // childNode is an HtmlTextNode
    }
}

I realize that the opening ASPX line isn't, in fact, HTML, which is most likely why HtmlAgilityPack is returning it as a text node. Now, I can use this returned text to manually parse out the values from the attributes, but I would rather it be treated like a standard HTML node. Is there any way to teach HtmlAgilityPack to treat the top line as an HTML node?

Solution

I don't think there is a way to make HtmlAgilityPack read an invalid element as an HTML element. How about a little hack:

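// Get the text of the first node (the <%@ ... %> directive).
// "document" is the HtmlDocument loaded in the question's Parse method.
var firstNodeText = document.DocumentNode.ChildNodes[0].InnerHtml;

// Use simple string manipulation to turn the invalid element into a valid HTML element.
// In this example, <%@ .... %> becomes <_asp .... />
HtmlNode firstNode = HtmlNode.CreateNode(firstNodeText.Trim().Replace("<%@", "<_asp").Replace("%>", "/>"));

// The directive's attributes can then be read like those of any other node.
string language = firstNode.GetAttributeValue("Language", "");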
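For comparison, the manual route mentioned in the question (parsing the attribute values straight out of the directive text) could look roughly like the sketch below. It uses only System.Text.RegularExpressions rather than HtmlAgilityPack. The DirectiveParser class and ParseAttributes method are hypothetical names introduced for illustration, and the pattern assumes every attribute value is wrapped in double quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class DirectiveParser
{
    // Matches name="value" pairs such as Language="C#".
    // Assumption: values are always double-quoted.
    private static readonly Regex AttributePattern =
        new Regex("(?<name>[\\w:]+)\\s*=\\s*\"(?<value>[^\"]*)\"");

    // Returns the attributes of a directive such as <%@ Page ... %>,
    // keyed case-insensitively by attribute name.
    public static Dictionary<string, string> ParseAttributes(string directive)
    {
        var attributes = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match match in AttributePattern.Matches(directive))
        {
            attributes[match.Groups["name"].Value] = match.Groups["value"].Value;
        }
        return attributes;
    }
}
```

Passing the text of the first node to DirectiveParser.ParseAttributes would give a dictionary from which values such as Language or MasterPageFile can be looked up by name. The directive name itself (Page) has no value, so this pattern skips it.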

Licensed under: CC-BY-SA with attribution
