How do you parse a web page and extract all the href links?

https://stackoverflow.com/questions/99279

01-07-2019

Question

I want to parse a web page in Groovy and extract all of the href links and the associated text with it.

If the page contained these links:

<a href="http://www.google.com">Google</a><br />
<a href="http://www.apple.com">Apple</a>

the output would be:

Google, http://www.google.com
Apple, http://www.apple.com

I'm looking for a Groovy answer, a.k.a. the easy way!

Solution

Assuming well-formed XHTML, slurp the XML, walk all the tags depth-first, keep the 'a' tags, and print the text and href of each.

input = """<html><body>
<a href = "http://www.hjsoft.com/">John</a>
<a href = "http://www.google.com/">Google</a>
<a href = "http://www.stackoverflow.com/">StackOverflow</a>
</body></html>"""

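// On Groovy 4+, XmlSlurper needs an explicit import: import groovy.xml.XmlSlurper
doc = new XmlSlurper().parseText(input)
// Walk every node depth-first, keep only the <a> elements, and print "text, href".
doc.depthFirst().findAll { it.name() == "a" }.each {
    println "${it.text()}, ${it.@href.text()}"
}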

OTHER TIPS

A quick Google search turned up a nice-looking possibility: TagSoup.
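As a rough sketch of how TagSoup could be combined with XmlSlurper: TagSoup's parser is a SAX XMLReader, so it can be handed to the XmlSlurper constructor to tidy sloppy HTML before it is slurped. The sample markup, the variable names, and the 1.2.1 version pulled in by @Grab are assumptions for illustration, and @Grab needs network access the first time it runs.

```groovy
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import groovy.xml.XmlSlurper   // Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is a default import)
import org.ccil.cowan.tagsoup.Parser

// Hypothetical sloppy page: unclosed <br> and <p>, and no closing </html>.
def page = '''<html><body>
<a href="http://www.google.com">Google</a><br>
<p><a href="http://www.apple.com">Apple</a>
</body>'''

// TagSoup repairs the markup while XmlSlurper reads it.
def doc = new XmlSlurper(new Parser()).parseText(page)

def links = doc.depthFirst().findAll { it.name() == 'a' }.collect {
    [text: it.text(), href: it.@href.text()]
}

links.each { println "${it.text}, ${it.href}" }
```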

I don't know Java, but I think XPath is far better than classic regular expressions for selecting one or more HTML elements.

It is also easier to write and to read.

<html>
   <body>
      <a href="1.html">1</a>
      <a href="2.html">2</a>
      <a href="3.html">3</a>
   </body>
</html>

With the HTML above, the expression "/html/body/a" selects all the a elements; "/html/body/a/@href" selects just their href attributes.

Here's a good step by step tutorial http://www.zvon.org/xxl/XPathTutorial/General/examples.html
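A minimal sketch of evaluating that expression from Groovy, assuming the page is well-formed and using the javax.xml.xpath classes that ship with the JDK. The variable names are illustrative only.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathConstants
import javax.xml.xpath.XPathFactory

// The sample page from above; it must be well-formed XML for this parser.
def page = '''<html>
   <body>
      <a href="1.html">1</a>
      <a href="2.html">2</a>
      <a href="3.html">3</a>
   </body>
</html>'''

def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
def dom = builder.parse(new ByteArrayInputStream(page.getBytes('UTF-8')))

// Evaluate the XPath expression and ask for a node set (a DOM NodeList).
def xpath = XPathFactory.newInstance().newXPath()
def anchors = xpath.evaluate('/html/body/a', dom, XPathConstants.NODESET)

// Turn each matching element into a small map of link text and href.
def links = (0..<anchors.length).collect { i ->
    def a = anchors.item(i)
    [text: a.textContent, href: a.getAttribute('href')]
}

links.each { println "${it.text}, ${it.href}" }
```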

Use XmlSlurper to parse the HTML as an XML document, then call findAll with a closure that selects the a tags. If you need a plain list, the list method on GPathResult returns the matches as one. You can then read each link's text with text() and its URL from the href attribute.
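One way this could look, as a sketch rather than a definitive version: the sample page and variable names are made up for illustration, and the page is assumed to be well-formed. The '**' property is GPath shorthand for depthFirst().

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is a default import)

// Hypothetical sample page with one link nested inside a paragraph.
def page = '''<html><body>
<a href="http://www.google.com">Google</a>
<p>See also <a href="http://www.apple.com">Apple</a></p>
</body></html>'''

def doc = new XmlSlurper().parseText(page)

// Visit every node depth-first and keep only the anchors, wherever they are nested.
def links = doc.'**'.findAll { it.name() == 'a' }.collect {
    [text: it.text(), href: it.@href.text()]
}

links.each { println "${it.text}, ${it.href}" }
```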

Try a regular expression. Something like this should work:

// Each match yields the whole match followed by the two capture groups.
(html =~ /<a\s[^>]*href=["'](.*?)["'][^>]*>(.*?)<\/a>/).each { match, url, text ->
    // do something with url and text
}

Take a look at Groovy - Tutorial 4 - Regular expressions basics and Anchor Tag Regular Expression Breaking.

Parsing with XmlSlurper only works if the HTML is well-formed.

If your HTML page has tags that are not well-formed, use a regex to parse the page instead.

Ex: <a href="www.google.com">

Here, the 'a' tag is not closed, so the markup is not well-formed.

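new URL(url).eachLine { line ->
    // (?i) matches <a href="..."> in any letter case; every link on the line is visited
    (line =~ /(?i)<a\s[^>]*href="(.*?)"/).each { match, href ->
        // process href
    }
}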
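To make the difference concrete, here is a small sketch run against an in-memory string instead of a URL. It assumes a made-up page containing the unclosed tag from the example: XmlSlurper rejects it, while the regex still finds the href.

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is a default import)
import org.xml.sax.SAXParseException

// Hypothetical page in which the <a> tag is never closed.
def broken = '<html><body><a href="www.google.com"></body></html>'

// The XML parser refuses markup that is not well-formed.
def parsed = true
try {
    new XmlSlurper().parseText(broken)
} catch (SAXParseException e) {
    parsed = false
}

// The regex only looks at the opening tag, so the missing </a> does not matter.
def hrefs = (broken =~ /(?i)<a\s[^>]*href="(.*?)"/).collect { match, href -> href }

println "XmlSlurper parsed it: ${parsed}"
println "hrefs found by regex: ${hrefs}"
```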

HTML parser + regular expressions. Any language would do, though I'd say Perl is the fastest solution.

Licensed under: CC-BY-SA with attribution