How can I read an HTML file a Paragraph at a time?

https://stackoverflow.com/questions/21741488

10-10-2022

Question

I reckon it would be something like (pseudocode):

var pars = new List<string>();
string par;
while (not eof("Platypus.html"))
{
    par = getNextParagraph();
    pars.Add(par);
}

...where getNextParagraph() looks for the next "<p>" and continues until it finds "</p>", burning its bridges behind it ("cutting" the paragraph so that it is not found over and over again). Or some such.
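A minimal sketch of that "find <p>, read until </p>, move on" idea, using only string searches. The class and method names are made up for illustration, and it assumes well-formed, non-nested paragraphs with no ">" inside attribute values, so it will break on messier markup:

```csharp
using System;
using System.Collections.Generic;

public static class NaiveParagraphReader
{
    // Hypothetical helper: collects the raw markup between each <p ...> and the next </p>.
    public static List<string> GetParagraphs(string html)
    {
        var pars = new List<string>();
        int pos = 0; // everything before pos has already been consumed

        while (true)
        {
            int open = html.IndexOf("<p", pos, StringComparison.OrdinalIgnoreCase);
            if (open < 0) break;

            int afterName = open + 2;
            if (afterName >= html.Length) break;

            // Skip tags such as <pre> or <param> that merely start with "<p".
            char next = html[afterName];
            if (next != '>' && !char.IsWhiteSpace(next))
            {
                pos = afterName;
                continue;
            }

            // End of the opening tag, then the matching close tag.
            int tagEnd = html.IndexOf('>', afterName);
            if (tagEnd < 0) break;
            int close = html.IndexOf("</p>", tagEnd + 1, StringComparison.OrdinalIgnoreCase);
            if (close < 0) break;

            pars.Add(html.Substring(tagEnd + 1, close - tagEnd - 1));

            // Continue after this paragraph so it is not found again.
            pos = close + "</p>".Length;
        }

        return pars;
    }
}
```

It could be called as NaiveParagraphReader.GetParagraphs(System.IO.File.ReadAllText("Platypus.html")).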

Does anybody have insight on how exactly to do this / a better methodology?

UPDATE

I tried to use Aurelien Souchet's code.

I have the following usings:

using HtmlAgilityPack;
using HtmlDocument = System.Windows.Forms.HtmlDocument;

...but this code:

HtmlDocument doc = new HtmlDocument();

is rejected ("Cannot access private constructor 'HtmlDocument' here").

Also, both "doc.LoadHtml()" and "doc.DocumentNode" give the old "Cannot resolve symbol 'Bla'" error message.

UPDATE 2

Okay, I had to prepend "HtmlAgilityPack." to the type name (HtmlAgilityPack.HtmlDocument) so that the reference was no longer ambiguous.
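For illustration, the fix might look roughly like this, assuming the HtmlAgilityPack package is referenced. The class and method names are placeholders, not part of the original code:

```csharp
using HtmlAgilityPack;
// Alternative: keep the short name but point the alias at the right type:
// using HtmlDocument = HtmlAgilityPack.HtmlDocument;

public static class DisambiguationExample
{
    // Hypothetical helper: returns the text of the first <p>, or null if there is none.
    public static string FirstParagraphText(string html)
    {
        // Fully qualified, so System.Windows.Forms.HtmlDocument cannot be picked up.
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches.
        HtmlNode first = doc.DocumentNode.SelectSingleNode("//p");
        return first == null ? null : first.InnerText;
    }
}
```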

Solution

As people suggest in the comments, I think HtmlAgilityPack is the best choice: it's easy to use, and good examples and tutorials are easy to find.

Here is what I would write:

//Don't forget to add a reference to HtmlAgilityPack
using System.Collections.Generic;
using HtmlAgilityPack;

//Takes the HTML as a string and returns a list of strings
//holding the content of each paragraph.
public List<string> GetParagraphsListFromHtml(string sourceHtml)
{
   var pars = new List<string>();

   //First create an HtmlDocument
   HtmlDocument doc = new HtmlDocument();

   //Load the HTML (from a string)
   doc.LoadHtml(sourceHtml);

   //Select all the <p> nodes into an HtmlNodeCollection
   HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes(".//p");

   //SelectNodes returns null when nothing matches
   if (paragraphs == null)
   {
      return pars;
   }

   //Iterate over every node in the collection
   foreach (HtmlNode paragraph in paragraphs)
   {
      //Add the InnerText to the list
      //(or paragraph.InnerHtml, depending on what you want)
      pars.Add(paragraph.InnerText);
   }

   return pars;
}

This is just a basic example. If your HTML contains nested paragraphs, this code may not work as expected; it all depends on the HTML you are parsing and what you want to do with it.
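Since the question is about reading a file rather than a string, here is a sketch of the same idea that loads from a path. The class and method names are placeholders, and it assumes the HtmlAgilityPack package is referenced:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ParagraphFileReader
{
    // Hypothetical helper: reads an HTML file and returns the text of each <p>.
    public static List<string> ReadParagraphs(string path)
    {
        var doc = new HtmlDocument();
        doc.Load(path); // parse straight from the file instead of a string

        var pars = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//p");
        if (nodes == null)
        {
            return pars;
        }

        foreach (HtmlNode node in nodes)
        {
            // InnerText keeps entities such as &amp; encoded, so decode them and trim.
            pars.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }

        return pars;
    }
}
```

With the file name from the question, a call would look like ParagraphFileReader.ReadParagraphs("Platypus.html").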

Hope it helps!

Licensed under: CC-BY-SA with attribution