Parsing HTML Page into Parent-Child Object C#

https://stackoverflow.com//questions/23015735

21-12-2019

Question

I'm new to HTML parsing and need to parse an HTML page. Could you suggest an approach for parsing the following HTML?

HTML Code : http://notepad.cc/share/CFRURbrk3r

Each type of room has a list of sub-rooms, so I want to group them as parent and children in a list of objects, so that each child can be accessed later.

This is as far as I could get; it does not yet add anything to the objects. Also, besides Fizzler, is there any other parser I could use in this case?

var uricontent = File.ReadAllText("TestHtml/Bew.html"); 
var html = new HtmlDocument(); // with HTML Agility pack         
html.LoadHtml(uricontent);                      
var doc = html.DocumentNode;                      
var rooms = (from r in doc.QuerySelectorAll(".rates")                             
             from s in r.QuerySelectorAll(".rooms")                           
             from rd in r.QuerySelectorAll(".rate")                           
             select new 
             {                  
                Name = rd.QuerySelector(".rate-description").InnerText.CleanInnerText(), 
                Price = r.QuerySelector(".rate-price").InnerText.CleanInnerText(),
                RoomType = s.QuerySelector("tr td h2").InnerText.CleanInnerText()   
             }).ToArray();

Solution

Update:

Personally, I wouldn't use an array; I would use a List. A List lets you add particular nodes at particular positions and group them accordingly.

Then you could simply:

Loop (foreach)
Find
Sort
Select

That would let you filter through the content quickly, since each item is stored in the list.
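
As a rough illustration of the List idea above, here is a minimal sketch of a parent-child structure. The type names (RoomType, Rate, RoomDemo) and the sample data are made up for this example and do not come from the question's HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical child type: one rate offered for a room type.
public class Rate
{
    public string Name { get; set; }
    public decimal Price { get; set; }
}

// Hypothetical parent type: a room type that owns its rates.
public class RoomType
{
    public string Name { get; set; }
    public List<Rate> Rates { get; } = new List<Rate>();
}

public static class RoomDemo
{
    // Builds illustrative data; the names and prices are invented.
    public static List<RoomType> BuildSample()
    {
        var single = new RoomType { Name = "Single Room" };
        single.Rates.Add(new Rate { Name = "Flexible", Price = 90m });
        single.Rates.Add(new Rate { Name = "Non-refundable", Price = 75m });

        var twin = new RoomType { Name = "Twin Room" };
        twin.Rates.Add(new Rate { Name = "Flexible", Price = 120m });

        return new List<RoomType> { single, twin };
    }

    public static void Main()
    {
        List<RoomType> rooms = BuildSample();

        // Find: locate one parent by name.
        RoomType found = rooms.Find(r => r.Name == "Single Room");

        // Sort: order that parent's children by price, cheapest first.
        found.Rates.Sort((a, b) => a.Price.CompareTo(b.Price));

        // Select: flatten every child of every parent into one sequence.
        IEnumerable<string> lines = rooms.SelectMany(
            r => r.Rates.Select(rate => r.Name + " / " + rate.Name + ": " + rate.Price));

        // Loop: walk the result.
        foreach (string line in lines)
        {
            Console.WriteLine(line);
        }
    }
}
```

Because each parent holds its own List of children, the children stay reachable through the parent after parsing, which is what the question asks for.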

Update:

Another item I forgot to mention, the Html Agility Pack can do the following:

Grab a particular node / element.
Grab a parent and all of its child nodes / elements.

It can also grab remote or local pages.
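
A small sketch of those capabilities, assuming the HtmlAgilityPack NuGet package is referenced. The markup, the id and the class name are invented for illustration only.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class NodeDemo
{
    public static void Main()
    {
        // Illustrative markup; the id and class names are made up.
        const string html =
            "<html><body><div id='rooms'>" +
            "<h2>Single Room</h2>" +
            "<p class='rate'>Flexible</p><p class='rate'>Saver</p>" +
            "</div></body></html>";

        var document = new HtmlDocument();

        // Local content: LoadHtml takes a string, Load takes a file path.
        // For a remote page, new HtmlWeb().Load(url) returns an HtmlDocument.
        document.LoadHtml(html);

        // Grab one particular element; the result is null if nothing matches.
        HtmlNode parent = document.DocumentNode.SelectSingleNode("//div[@id='rooms']");
        if (parent == null)
        {
            return;
        }

        // Grab the <p> elements nested below that parent.
        foreach (HtmlNode child in parent.Descendants("p"))
        {
            Console.WriteLine(child.InnerText);
        }
    }
}
```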

I would download the Html Agility Pack from NuGet. It is powerful and robust, and it will more than likely make it easier to scrape the desired data. You can install it by following these steps:

Go to Tools
Go to NuGet Package Manager
Select Package Manager Console
If the Package Manager Console did not open, open it from the lower left of Visual Studio.
Type the following command: Install-Package HtmlAgilityPack

A great example can be found in this question.

The premise is simple:

HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();

// Load the HTML page from a file.
document.Load(filePath);

// To parse a string instead of a file, call document.LoadHtml(htmlString)
// (htmlString is a placeholder name).

if (document.DocumentNode != null)
{
     // SelectSingleNode returns null when no node matches the XPath.
     HtmlAgilityPack.HtmlNode bodyNode = document.DocumentNode.SelectSingleNode("//body");
     if (bodyNode != null)
     {
           // Do something with bodyNode.
     }
}

This example only shows the syntax, but with the HtmlAgilityPack it should be far easier to grab particular nodes out of the page and manipulate them accordingly.
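
To connect this to the parent-child grouping in the question, here is a hedged sketch. The linked HTML is not available, so the structure is an assumption: each element with class "rooms" is taken to contain an h2 with the room type plus one or more "rate" elements holding "rate-description" and "rate-price". The class names come from the question's selectors; RoomInfo, RateInfo and RoomParser are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public class RateInfo
{
    public string Name { get; set; }
    public string Price { get; set; }
}

public class RoomInfo
{
    public string RoomType { get; set; }
    public List<RateInfo> Rates { get; } = new List<RateInfo>();
}

public static class RoomParser
{
    // XPath test for "element carries this CSS class" (whole-token match,
    // so "rate" does not also match "rates" or "rate-price").
    private static string HasClass(string name)
    {
        return "contains(concat(' ', normalize-space(@class), ' '), ' " + name + " ')";
    }

    // Returns trimmed, entity-decoded text, or an empty string for a missing node.
    private static string Text(HtmlNode node)
    {
        return node == null ? string.Empty : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    public static List<RoomInfo> Parse(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);
        var result = new List<RoomInfo>();

        // SelectNodes returns null when nothing matches, so guard before looping.
        HtmlNodeCollection roomNodes =
            document.DocumentNode.SelectNodes("//*[" + HasClass("rooms") + "]");
        if (roomNodes == null)
        {
            return result;
        }

        foreach (HtmlNode roomNode in roomNodes)
        {
            var room = new RoomInfo { RoomType = Text(roomNode.SelectSingleNode(".//h2")) };

            // The leading "." keeps the search inside the current room node.
            HtmlNodeCollection rateNodes = roomNode.SelectNodes(".//*[" + HasClass("rate") + "]");
            if (rateNodes != null)
            {
                foreach (HtmlNode rateNode in rateNodes)
                {
                    room.Rates.Add(new RateInfo
                    {
                        Name = Text(rateNode.SelectSingleNode(".//*[" + HasClass("rate-description") + "]")),
                        Price = Text(rateNode.SelectSingleNode(".//*[" + HasClass("rate-price") + "]"))
                    });
                }
            }

            result.Add(room);
        }

        return result;
    }
}
```

Calling RoomParser.Parse(File.ReadAllText("TestHtml/Bew.html")) would then give one RoomInfo per room type with its rates attached, provided the real page matches the assumed structure; if it does not, the XPath expressions need adjusting.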

Hopefully this points you in a better direction.

Licensed under: CC-BY-SA with attribution
