How to parse this piece of HTML?
21-09-2019
Question
Good morning! I am using C# (.NET Framework 3.5 SP1) and want to parse the following piece of HTML via regex:
<h1>My caption</h1>
<p>Here will be some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
I need the following output:
- group 1: content of the h1
- group 2: the text following the h1
- group 3-n: each subcaption plus its text
What I have at the moment:
<hr.*?/>
<h2.*?>(.*?)</h2>
([\W\S]*?)
<hr.*?/>
This gives me only every odd subcaption + content (e.g. 1, 3, ...), because the trailing <hr/> is consumed by the match and is no longer available to start the next one. For parsing the h1 caption I have another pattern (<h1.*?>(.*?)</h1>), which only gives me the caption but not the content; I'm fine with that for now.
Does anybody have a hint or solution for me, or an alternative approach (e.g. parsing the HTML via a reader and assigning the parts that way)?
Edit:
As some brought up the HtmlAgilityPack, I was curious about this nice tool. I managed to get the content of the <h1> tag.
But my problem is parsing the rest, because the tags for the content may vary: from <p> to <div> and <ul> ...
At the moment this seems to mean more or less iterating over the whole document and parsing it tag by tag?
Any hints?
Solution
You really need an HTML parser for this.
Other tips
Don't use regex to parse HTML. Consider using the HTML Agility Pack.
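To address the follow-up in the question (the content tags vary between <p>, <div>, <ul>, ...), here is a minimal sketch of the "walk the nodes" idea with the HTML Agility Pack. It assumes the headings and their content are direct children of the parsed fragment, as in the posted snippet; for a full page you would start from the <body> node instead of DocumentNode. The names SectionParser and Parse are made up for illustration, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

static class SectionParser
{
    // Hypothetical helper: returns one (caption, body HTML) pair
    // per <h1>/<h2>, in document order.
    public static List<KeyValuePair<string, string>> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sections = new List<KeyValuePair<string, string>>();
        string caption = null;
        var body = new StringBuilder();

        // Assumption: headings and content are top-level siblings.
        foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
        {
            string name = node.Name.ToLowerInvariant();
            if (name == "h1" || name == "h2")
            {
                // A new heading closes the section collected so far.
                if (caption != null)
                    sections.Add(new KeyValuePair<string, string>(caption, body.ToString().Trim()));
                caption = node.InnerText.Trim();
                body.Length = 0; // StringBuilder.Clear() does not exist in .NET 3.5
            }
            else if (name == "hr")
            {
                // Separator only, contributes nothing to the body.
            }
            else if (caption != null)
            {
                // Any other node (<p>, <div>, <ul>, text) belongs to the current section.
                body.Append(node.OuterHtml);
            }
        }

        if (caption != null)
            sections.Add(new KeyValuePair<string, string>(caption, body.ToString().Trim()));
        return sections;
    }
}
```

Each entry's Key is the caption text and its Value is the raw HTML that followed it, so it does not matter which tags the content uses.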
There are several options:
Regex - Fast but not reliable; it can't deal with malformed HTML.
HtmlAgilityPack - Good, but it has many memory leaks. If you only need to process a few files, that is no problem.
SGMLReader - Really good, but there is a problem: sometimes it can't find the default namespace needed to get the other nodes, and then it is impossible to parse the HTML.
http://developer.mindtouch.com/SgmlReader
Majestic-12 - Good, but not as fast as SGMLReader.
http://www.majestic12.co.uk/projects/html_parser.php
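If you stay with the regex option anyway, the "every odd subcaption" problem from the question can be avoided by not consuming the separator: a lookahead checks that an <hr> (or the end of the input) follows without making it part of the match. The following is only a sketch for the exact, well-formed markup shown in the question. It assumes every section is terminated by an <hr> or the end of the input, and the names RegexSections and Parse are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexSections
{
    // Group 1: heading level, group 2: caption, group 3: everything up to
    // (but not including) the next <hr> or the end of the input.
    static readonly Regex Section = new Regex(
        @"<h([12])[^>]*>(.*?)</h\1>(.*?)(?=<hr\b|\z)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Hypothetical helper: returns { caption, body } pairs in document order.
    public static List<string[]> Parse(string html)
    {
        var result = new List<string[]>();
        foreach (Match m in Section.Matches(html))
        {
            result.Add(new[] { m.Groups[2].Value.Trim(), m.Groups[3].Value.Trim() });
        }
        return result;
    }
}
```

Because the <hr> is only looked at and never consumed, the next match can start right at it, so no section is skipped. The reliability caveat still applies: nested or malformed markup will break this.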
Example for SGMLReader (VB.NET):
' Requires: Imports System.Xml.Linq
' vSource is assumed to hold the HTML source as a String
Dim sgmlReader As New Sgml.SgmlReader()
sgmlReader.DocType = "HTML"
sgmlReader.WhitespaceHandling = System.Xml.WhitespaceHandling.All
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower
sgmlReader.InputStream = New System.IO.StringReader(vSource)
Dim htmldoc As XDocument = XDocument.Load(sgmlReader)
' This part can be buggy: sometimes the default namespace cannot be found,
' so fall back to the XHTML namespace
Dim XNS As XNamespace
Try
    XNS = htmldoc.Root.GetDefaultNamespace()
Catch
    XNS = "http://www.w3.org/1999/xhtml"
End Try
If XNS.NamespaceName.Trim() = "" Then
    XNS = "http://www.w3.org/1999/xhtml"
End If
' Use it with the LINQ commands, e.g. collect the content of all <script> elements
Dim Scripts As String = ""
For Each link In htmldoc.Descendants(XNS + "script")
    Scripts &= link.Value
Next
Majestic-12 works differently: you have to walk through every tag with a "Next" command. Example code ships with the DLL.
As others have mentioned, use the HtmlAgilityPack. However, if you like jQuery/CSS selectors, I just found a library built on top of the HtmlAgilityPack called Fizzler:
http://code.google.com/p/fizzler/
With the Fizzler.Systems.HtmlAgilityPack namespace imported, you could find all <p> tags using:
var pTags = doc.DocumentNode.QuerySelectorAll("p").ToList();
Or find a specific div like <div id="myDiv"></div>:
var myDiv = doc.DocumentNode.QuerySelectorAll("#myDiv");
It can't get any easier than that!