C# - Best Approach to Parsing Webpage?

https://stackoverflow.com/questions/300252

08-07-2019

Question

I've saved an entire webpage's html to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this?

I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise surprise) it doesn't navigate a not-really-an-xml-document too well.

Are regular expressions the best way to achieve what I'm trying to accomplish?

Solution

Regular expressions are one way to do it, but they can be problematic.

Most HTML pages can't be parsed using standard XML techniques because, as you've found out, most don't validate.

You could spend the time trying to integrate HTML Tidy or a similar tool, but it would be much faster to just build the regex you need.

UPDATE

At the time of this update I've received 15 upvotes and 9 downvotes. I think that maybe people aren't reading the question or the comments on this answer. All the OP wanted to do was grab the href values. That's it. From that perspective, a simple regex is just fine. If the author had wanted to parse other items, there is no way I would recommend regex; as I stated at the beginning, it's problematic at best.
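For the narrow task of grabbing href values, a regex-based extraction might look like the sketch below. The class name `HrefRegexSketch`, the method name, and the pattern itself are illustrative assumptions, not code from the answer. The pattern only handles quoted values on `<a>` tags and will still be fooled by unusual markup (for example a `>` inside an earlier attribute value).

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: the names and the pattern are assumptions for illustration.
public static class HrefRegexSketch
{
    // Matches href="..." or href='...' inside an <a ...> tag, ignoring case.
    // The named group "url" holds the attribute value without its quotes.
    private static readonly Regex AnchorHref = new Regex(
        @"<a\b[^>]*?\shref\s*=\s*(?<q>[""'])(?<url>.*?)\k<q>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns every matched href value, in document order.
    public static List<string> ExtractHrefs(string html)
    {
        var hrefs = new List<string>();
        foreach (Match m in AnchorHref.Matches(html))
        {
            hrefs.Add(m.Groups["url"].Value);
        }
        return hrefs;
    }
}
```

Each value ends up as its own string in the returned list, so individual links can be stored or processed separately later. Unquoted href values are deliberately not matched here.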

OTHER TIPS

I can recommend the HTML Agility Pack. I've used it in a few cases where I needed to parse HTML and it works great. Once you load your HTML into it, you can use XPath expressions to query the document and get your anchor tags (as well as just about anything else in there).

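HtmlDocument yourDoc = new HtmlDocument();
yourDoc.LoadHtml(html); // html: the string holding the page source
var nodes = yourDoc.DocumentNode.SelectNodes("your_xpath"); // null when nothing matches
int someCount = nodes == null ? 0 : nodes.Count;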

For dealing with HTML of all shapes and sizes I prefer to use the HTML Agility Pack @ http://www.codeplex.com/htmlagilitypack. It lets you write XPaths against the nodes you want and get those returned in a collection.
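As a rough illustration of the XPath approach, the sketch below collects every href value from a page held in a string. It assumes the HtmlAgilityPack package is referenced by the project. The class and method names are hypothetical, and `//a[@href]` is just one XPath that fits the question.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper: requires a reference to the HtmlAgilityPack package.
public static class AgilityPackSketch
{
    public static List<string> ExtractHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of markup that is not well-formed

        var hrefs = new List<string>();

        // Select every <a> element that carries an href attribute.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return hrefs;
        }

        foreach (HtmlNode anchor in anchors)
        {
            hrefs.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return hrefs;
    }
}
```

The null check matters in practice: a page with no links would otherwise throw a NullReferenceException on the loop.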

Probably you want something like the Majestic parser: http://www.majestic12.co.uk/projects/html_parser.php

There are a few other options that can deal with flaky html, as well. The Html Agility Pack is worth a look, as someone else mentioned.

I don't think regexes are an ideal solution for HTML, since HTML is not a regular language. They'll probably produce an adequate, if imprecise, result; even deterministically identifying a URI is a messy problem.

It is always better, if possible, not to reinvent the wheel. Some good tools exist that either convert HTML to well-formed XML, or act as an XmlReader:

Here are three good tools:

1. TagSoup, an open-source, Java- and SAX-based tool developed by John Cowan. This is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML. Taggle is a commercial C++ port of TagSoup.
2. SgmlReader, a tool developed by Microsoft's Chris Lovett. SgmlReader is an XmlReader API over any SGML document (including built-in support for HTML). A command-line utility is also provided which outputs the well-formed XML result. Download the zip file including the standalone executable and the full source code: SgmlReader.zip
3. An outstanding achievement is the pure XSLT 2.0 Parser of HTML written by David Carlisle.

Reading its code would be a great learning exercise for every one of us.

From the description:

"d:htmlparse(string)
d:htmlparse(string,namespace,html-mode)

  The one argument form is equivalent to
  d:htmlparse(string,'http://www.w3.org/1999/xhtml',true())

  Parses the string as HTML and/or XML using some inbuilt heuristics to
  control implied opening and closing of elements.

  It doesn't have full knowledge of HTML DTD but does have full list of
  empty elements and full list of entity definitions. HTML entities, and
  decimal and hex character references are all accepted. Note html-entities
  are recognised even if html-mode=false().

  Element names are lowercased (if html-mode is true()) and placed into the
  namespace specified by the namespace parameter (which may be "" to denote
  no-namespace) unless the input has explicit namespace declarations, in
  which case these will be honoured.

  Attribute names are lowercased if html-mode=true()"

Read a more detailed description here.

Hope this helped.

Cheers,

Dimitre Novatchev.

I agree with Chris Lively: because HTML is often not very well formed, you are probably best off with a regular expression for this.

href=[\"\'](http:\/\/|\.\/|\/)?\w+(\.\w+)*(\/\w+(\.\w+)?)*(\/|\?\w*=\w*(&\w*=\w*)*)?[\"\']

This pattern, taken from RegExLib, should get you started.

You might have more luck using xml if you know or can fix the document to be at least well-formed. If you have good html (or rather, xhtml), the xml system in .Net should be able to handle it. Unfortunately, good html is extremely rare.
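If the input really is well-formed XHTML, the XML route can be sketched with LINQ to XML as below. The class and method names are hypothetical. Matching on the local name `a` is an assumption made here so the code works whether or not the document declares the XHTML namespace.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: only suitable for markup that is well-formed XML.
public static class XhtmlSketch
{
    public static List<string> ExtractHrefs(string xhtml)
    {
        // Throws System.Xml.XmlException if the markup is not well-formed.
        XDocument doc = XDocument.Parse(xhtml);

        return doc.Descendants()
                  // Compare local names so a default XHTML namespace does not matter.
                  .Where(e => e.Name.LocalName == "a")
                  // The cast yields null when the attribute is absent.
                  .Select(e => (string)e.Attribute("href"))
                  .Where(href => href != null)
                  .ToList();
    }
}
```

Typical real-world pages will fail at the `Parse` call, which is the behaviour described in the question. HTML named entities such as `&nbsp;` are also not defined in plain XML, so even tidy-looking pages can be rejected.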

On the other hand, regular expressions are really bad at parsing html. Fortunately, you don't need to handle a full html spec. All you need to worry about is parsing href= strings to get the url. Even this can be tricky, so I won't make an attempt at it right away. Instead I'll start by asking a few questions to try and establish a few ground rules. They basically all boil down to "How much do you know about the document?", but here goes:

Do you know if the "href" text will always be lower case?
Do you know if it will always use double quotes, single quotes, or nothing around the url?
Will it always be a valid URL, or do you need to account for things like '#', javascript statements, and the like?
Is it possible to work with a document where the content describes html features (i.e., href= could also be in the document and not belong to an anchor tag)?
What else can you tell us about the document?
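Whichever way the raw href values are extracted, the question about '#' and javascript values can be handled in a separate clean-up step. The sketch below is one possible approach under stated assumptions: the class name, the method name, and the decision to keep only http and https links are all illustrative choices, and `pageUri` is assumed to be the address the page was downloaded from.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical post-processing step for raw href strings.
public static class HrefFilterSketch
{
    public static List<Uri> ToAbsoluteLinks(IEnumerable<string> hrefs, Uri pageUri)
    {
        var links = new List<Uri>();
        foreach (string raw in hrefs)
        {
            string href = raw.Trim();

            // Skip empty values and same-page fragment links such as "#top".
            if (href.Length == 0 || href[0] == '#') continue;

            // Skip script pseudo-links such as "javascript:void(0)".
            if (href.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase)) continue;

            // Resolve relative values against the page address and keep
            // only http/https results (an assumption; this drops mailto: etc.).
            if (Uri.TryCreate(pageUri, href, out Uri absolute) &&
                (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                links.Add(absolute);
            }
        }
        return links;
    }
}
```

Keeping this step separate from extraction means the same filtering applies whether the values came from a regex or from a parser.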

I've linked some code here that will let you use "LINQ to HTML"...

Looking for C# HTML parser

Licensed under: CC-BY-SA with attribution
