Question

good morning! i am using c# (.net framework 3.5 sp1) and want to parse the following piece of html via regex:

<h1>My caption</h1>
<p>Here will be some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

i need the following output:

  • group 1: content of h1
  • group 2: content of h1-following text
  • group 3-n: content of subcaptions + text

what i have atm:

<hr.*?/>
<h2.*?>(.*?)</h2>
([\W\S]*?)
<hr.*?/>

this only gives me every odd subcaption + content (e.g. 1, 3, ...), because the trailing <hr.*?/> consumes the separator that the next section would need to start its own match. for parsing the h1 caption i have another pattern (<h1.*?>(.*?)</h1>), which only gives me the caption but not the content - i'm fine with that atm.

does anybody have a hint/solution for me, or an alternative approach (e.g. parsing the html via a reader and assigning the parts that way)?

edit:
as some brought up HtmlAgilityPack, i got curious about this nice tool and managed to get the content of the <h1> tag.
but ... my problem is parsing the rest: the tags for the content may vary, from <p> to <div> and <ul>. atm it looks like i have to iterate over the whole document and parse it tag by tag ...? any hints?

Solution

You will really need an HTML parser for this.

OTHER TIPS

Don't use regex to parse HTML. Consider using the HTML Agility Pack.
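
For the varying content tags mentioned in the question's edit, one option is to walk the siblings of each heading and stop at the next separator, so it does not matter whether the content is a <p>, <div> or <ul>. The following is only a minimal sketch, assuming the HtmlAgilityPack assembly is referenced and that the headings, <hr> separators and content blocks are all siblings under the same parent, as in the sample. `SectionParser` and `ParseSections` are hypothetical names used for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack; // external library, assumed to be referenced

class SectionParser
{
    // Hypothetical helper: returns (caption, content) pairs,
    // one for the h1 and one for each h2, in document order.
    static List<KeyValuePair<string, string>> ParseSections(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sections = new List<KeyValuePair<string, string>>();
        HtmlNodeCollection headings = doc.DocumentNode.SelectNodes("//h1|//h2");
        if (headings == null)
            return sections; // SelectNodes returns null when nothing matches

        foreach (HtmlNode heading in headings)
        {
            var content = new StringBuilder();

            // Collect every following sibling until the next separator or heading.
            // Assumption: content is never nested deeper than the headings.
            for (HtmlNode n = heading.NextSibling; n != null; n = n.NextSibling)
            {
                if (n.Name == "hr" || n.Name == "h1" || n.Name == "h2")
                    break;
                content.Append(n.OuterHtml);
            }

            sections.Add(new KeyValuePair<string, string>(
                heading.InnerText.Trim(), content.ToString().Trim()));
        }
        return sections;
    }

    static void Main()
    {
        string html =
            "<h1>My caption</h1><p>Here will be some text</p>" +
            "<hr class=\"cs\" /><h2 id=\"x\">CaptionX</h2><p>Some text</p>" +
            "<hr class=\"cs\" /><h2 id=\"y\">CaptionY</h2><ul><li>Item</li></ul>";

        foreach (KeyValuePair<string, string> s in ParseSections(html))
            Console.WriteLine(s.Key + " => " + s.Value);
    }
}
```

The first pair corresponds to groups 1 and 2 from the question, and the remaining pairs to groups 3-n. If the real pages wrap the sections in extra containers, the sibling walk would need to be adapted.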

There are some possibilities:

REGEX - Fast but not reliable; it can't deal with malformed HTML.

HtmlAgilityPack - Good, but it has many memory leaks. If you only need to process a few files, that is not a problem.

SGMLReader - Really good, but there is a problem: sometimes it can't find the default namespace needed to select other nodes, and then it is impossible to parse the HTML.

http://developer.mindtouch.com/SgmlReader

Majestic-12 - Good, but not as fast as SGMLReader.

http://www.majestic12.co.uk/projects/html_parser.php

Example for SGMLReader (VB.NET):

' Assumes Imports System.Xml.Linq; vSource holds the HTML text
Dim sgmlReader As New Sgml.SgmlReader()
sgmlReader.DocType = "HTML"
sgmlReader.WhitespaceHandling = System.Xml.WhitespaceHandling.All
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower
sgmlReader.InputStream = New System.IO.StringReader(vSource)
Dim htmldoc As XDocument = XDocument.Load(sgmlReader)
Dim XNS As XNamespace

' This part can fail: sometimes the default namespace cannot be determined
Try
    XNS = htmldoc.Root.GetDefaultNamespace()
Catch
    XNS = "http://www.w3.org/1999/xhtml"
End Try
If XNS.NamespaceName.Trim() = "" Then
    XNS = "http://www.w3.org/1999/xhtml"
End If

' Use it with the LINQ commands, e.g. collect the text of all script elements
Dim Scripts As String = ""
For Each link In htmldoc.Descendants(XNS + "script")
    Scripts &= link.Value
Next

Majestic-12 works differently: you have to step through the tags one by one with a "Next" command. Example code ships with the DLL.

As others have mentioned, use the HtmlAgilityPack. However, if you like jQuery/CSS selectors, I just found a library that works on top of the HtmlAgilityPack called Fizzler: http://code.google.com/p/fizzler/ Using this you could find all <p> tags using:

var pTags = doc.DocumentNode.QuerySelectorAll("p").ToList();

Or find a specific div like <div id="myDiv"></div>:

var myDiv = doc.DocumentNode.QuerySelectorAll("#myDiv");

It can't get any easier than that!

Licensed under: CC-BY-SA with attribution