How do I parse a web page and extract all the href links?

https://stackoverflow.com/questions/99279

01-07-2019

Question

I want to parse a web page in Groovy and extract all of the href links along with their associated text.

If the page contained these links:

<a href="http://www.google.com">Google</a><br />
<a href="http://www.apple.com">Apple</a>

the output would be:

Google, http://www.google.com<br />
Apple, http://www.apple.com

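I'm looking for a Groovy answer, a.k.a. the easiest way!

Solution

Assuming well-formed XHTML: slurp the XML, collect all the tags, find the "a" tags, and print out the href and the text.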

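// On Groovy 4 or later, add "import groovy.xml.XmlSlurper" at the top of the script.
input = """<html><body>
<a href = "http://www.hjsoft.com/">John</a>
<a href = "http://www.google.com/">Google</a>
<a href = "http://www.stackoverflow.com/">StackOverflow</a>
</body></html>"""

// Parse the markup; this throws if the input is not well-formed XML.
doc = new XmlSlurper().parseText(input)

// Walk every node depth-first, keep the <a> elements, and print "text, href".
doc.depthFirst().collect { it }.findAll { it.name() == "a" }.each {
    println "${it.text()}, ${it.@href.text()}"
}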

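Other tips

A quick Google search turned up a promising-looking possibility: TagSoup.

I don't know Java, but I think XPath is far better than classic regular expressions for getting one (or more) HTML elements.

It is also easier to write and to read.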

<html>
   <body>
      <a href="1.html">1</a>
      <a href="2.html">2</a>
      <a href="3.html">3</a>
   </body>
</html>

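For the HTML above, the expression "/html/body/a" selects all of the a elements, that is, every link together with its href attribute.

Here is a good step-by-step tutorial: http://www.zvon.org/xxl/XPathTutorial/General/examples.html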
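As a rough illustration of the XPath suggestion, here is a minimal Groovy sketch that runs "/html/body/a" against the sample page above using the XPath API bundled with the JDK (javax.xml.xpath). The variable names (html, nodes, links) are made up for this example, and the approach assumes the markup is well-formed XML; real-world HTML often is not.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathConstants
import javax.xml.xpath.XPathFactory

// Sample page from the answer above; it must be well-formed XML for this parser.
def html = '''<html>
   <body>
      <a href="1.html">1</a>
      <a href="2.html">2</a>
      <a href="3.html">3</a>
   </body>
</html>'''

// Parse the markup into a W3C DOM document.
def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
def document = builder.parse(new ByteArrayInputStream(html.getBytes('UTF-8')))

// Evaluate the XPath expression and ask for the whole node set.
def xpath = XPathFactory.newInstance().newXPath()
def nodes = xpath.evaluate('/html/body/a', document, XPathConstants.NODESET)

// Build a "text, href" string for each matching <a> element.
def links = (0..<nodes.length).collect { i ->
    def a = nodes.item(i)
    "${a.textContent}, ${a.getAttribute('href')}".toString()
}

links.each { println it }
```

Changing the expression to "//a" would match anchors anywhere in the document rather than only direct children of body.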

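Use XmlSlurper to parse the HTML as an XML document, then use the find method with an appropriate closure to select the a tags (findAll if you want every match rather than only the first), and then use the list method on the GPathResult to get a list of the tags. You should then be able to extract the text as a child of the GPathResult.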
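A minimal sketch of that idea, using the two links from the question. It is only one way to write it: instead of find plus list(), it uses the GPath shorthand '**' (depth-first traversal) with findAll, which returns a plain list directly. The import assumes Groovy 3 or later, where XmlSlurper lives in groovy.xml; the names page, anchors, and links are invented for the example.

```groovy
import groovy.xml.XmlSlurper  // assumption: Groovy 3+; older versions auto-import groovy.util.XmlSlurper

// The two links from the question, wrapped in a well-formed document.
def page = '''<html><body>
<a href="http://www.google.com">Google</a><br />
<a href="http://www.apple.com">Apple</a>
</body></html>'''

def doc = new XmlSlurper().parseText(page)

// '**' walks the whole tree depth-first; findAll keeps only the <a> nodes.
def anchors = doc.'**'.findAll { it.name() == 'a' }

// text() is the link text, @href is the attribute value.
def links = anchors.collect { "${it.text()}, ${it.@href.text()}".toString() }

links.each { println it }
```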

Try a regular expression. Something like this should work:

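// Accepts single- or double-quoted href values; [^>]* keeps the match inside one tag.
(html =~ /<a[^>]*href=["'](.*?)["'][^>]*>(.*?)<\/a>/).each { match, url, text ->
    // match is the whole anchor; do something with url and text
}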

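Take a look at Groovy - Tutorial 4 - Regular expressions basics and Anchor Tag Regular Expression Breaking.

Parsing with XmlSlurper works only if the HTML is well-formed.

If your HTML page has tags that are not well-formed, use a regular expression to parse the page instead.

Example: <a href="www.google.com">

Here the "a" tag is never closed, so the markup is not well-formed.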

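// url is assumed to hold the address of the page to scan.
new URL(url).eachLine { line ->
    // (?i) matches <a href> and <A HREF> alike; group 1 is the quoted link target.
    (line =~ /(?i)<a\b[^>]*\bhref="(.*?)"/).each { match, href ->
        // process href
    }
}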
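To try the regex approach without fetching a live page, the sketch below applies a similar pattern to an in-memory string. The fragment is hypothetical: its first anchor is deliberately left unclosed, which an XML parser would reject. The pattern is only an illustration and will miss cases such as unquoted or single-quoted href values.

```groovy
// Hypothetical fragment: the first <a> is never closed, so it is not well-formed.
def page = '''<p><a href="http://www.google.com">Google
<A HREF="http://www.apple.com">Apple</A></p>'''

// (?i) ignores case, [^>]* skips any other attributes, group 1 captures the quoted URL.
def hrefs = (page =~ /(?i)<a\b[^>]*\bhref\s*=\s*"([^"]*)"/).collect { match -> match[1] }

hrefs.each { println it }
```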

An HTML parser plus regular expressions. Any language can do this, though I would say Perl is the quickest solution.

Licensed under: CC-BY-SA with attribution
