Using XmlSlurper: how to select sub-elements while iterating over a GPathResult

https://stackoverflow.com/questions/1675542

16-09-2019

Question

I am writing an HTML parser that uses TagSoup to pass a well-formed structure to XmlSlurper.

Here is the generalised code:

def htmlText = """
<html>
<body>
<div id="divId" class="divclass">
<h2>Heading 2</h2>
<ol>
<li><h3><a class="box" href="#href1">href1 link text</a> <span>extra stuff</span></h3><address>Here is the address<span>Telephone number: <strong>telephone</strong></span></address></li>
<li><h3><a class="box" href="#href2">href2 link text</a> <span>extra stuff</span></h3><address>Here is another address<span>Another telephone: <strong>0845 1111111</strong></span></address></li>
</ol>
</div>
</body>
</html>
"""     

def html = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText( htmlText );

html.'**'.grep { it.@class == 'divclass' }.ol.li.each { linkItem ->
    def link = linkItem.h3.a.@href
    def address = linkItem.address.text()
    println "$link: $address\n"
}

I expected the each to give me every 'li' in turn, so that I could retrieve the corresponding href and address details. Instead, I get this output:

#href1#href2: Here is the addressTelephone number: telephoneHere is another addressAnother telephone: 0845 1111111

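I have checked various examples on the web, and they either deal with XML or are one-liner examples such as "retrieve all links from this file". It seems that the it.h3.a.@href expression collects every href in the text, even though I am passing it a reference to the parent 'li' node.

Can you let me know:

Why I am getting the output shown
How I can retrieve the href/address pair for each 'li' item

Thanks.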

Solution

Replace grep with find:

html.'**'.find { it.@class == 'divclass' }.ol.li.each { linkItem ->
    def link = linkItem.h3.a.@href
    def address = linkItem.address.text()
    println "$link: $address\n"
}

Then you will get

#href1: Here is the addressTelephone number: telephone

#href2: Here is another addressAnother telephone: 0845 1111111

grep returns an ArrayList, but find returns a NodeChild:

println html.'**'.grep { it.@class == 'divclass' }.getClass()
println html.'**'.find { it.@class == 'divclass' }.getClass()

Results in:

class java.util.ArrayList
class groovy.util.slurpersupport.NodeChild

So if you want to keep grep, you can nest another each like this to make it work:

html.'**'.grep { it.@class == 'divclass' }.ol.li.each {
    it.each { linkItem ->
        def link = linkItem.h3.a.@href
        def address = linkItem.address.text()
        println "$link: $address\n"
    }
}

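Long story short: in your case, use find rather than grep.

Other tips

This was a tricky one. When there is only one element with class='divclass', the previous answer is certainly fine. If grep returns multiple results, however, a find() that yields a single result is not the answer. The observation that the result is an ArrayList is correct. Inserting an outer .each() loop provides a GPathResult in the closure parameter div. From there the drill-down can continue with the expected result.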
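
The difference between the two calls can be checked without TagSoup. The following is a minimal sketch, assuming a well-formed XML stand-in with the same element structure as the question's markup and the stock XmlSlurper parser; the sample text and variable names are made up for illustration.

```groovy
// Groovy 3+ needs this import; on Groovy 2.x drop it,
// because groovy.util.XmlSlurper is imported by default there.
import groovy.xml.XmlSlurper

// Assumption: a well-formed stand-in for what TagSoup would hand over
def xml = '''<html><body>
<div id="divId" class="divclass">
<ol>
<li><h3><a class="box" href="#href1">one</a></h3><address>Address 1</address></li>
<li><h3><a class="box" href="#href2">two</a></h3><address>Address 2</address></li>
</ol>
</div>
</body></html>'''

def html = new XmlSlurper().parseText(xml)

// grep collects every match into a plain list; find stops at the first match
def viaGrep = html.'**'.grep { it.@class == 'divclass' }
def viaFind = html.'**'.find { it.@class == 'divclass' }

assert viaGrep instanceof List
assert !(viaFind instanceof List)

// Starting from the single node, each 'li' is visited on its own
def pairs = viaFind.ol.li.collect { li ->
    [li.h3.a.@href.text(), li.address.text()]
}
pairs.each { println "${it[0]}: ${it[1]}" }
```

Collecting the pairs into a list first, rather than printing inside the loop, makes it easy to assert on the result.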

html."**".grep { it.@class == 'divclass' }.each { div -> div.ol.li.each { linkItem ->
   def link = linkItem.h3.a.@href
   def address = linkItem.address.text()
   println "$link: $address\n"
}}
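
To see why the outer loop matters, here is a small sketch with two matching div elements. It assumes well-formed XML and the stock XmlSlurper parser instead of TagSoup, and the markup is invented for the example.

```groovy
// Groovy 3+ needs this import; on Groovy 2.x drop it.
import groovy.xml.XmlSlurper

// Assumption: two divs share the class, unlike the markup in the question
def xml = '''<html><body>
<div class="divclass"><ol>
<li><h3><a href="#a1">a1</a></h3><address>First</address></li>
<li><h3><a href="#a2">a2</a></h3><address>Second</address></li>
</ol></div>
<div class="divclass"><ol>
<li><h3><a href="#b1">b1</a></h3><address>Third</address></li>
</ol></div>
</body></html>'''

def html = new XmlSlurper().parseText(xml)

// Outer loop: one GPathResult per matching div; inner loop: one per li
def results = []
html.'**'.grep { it.@class == 'divclass' }.each { div ->
    div.ol.li.each { linkItem ->
        results << "${linkItem.h3.a.@href}: ${linkItem.address.text()}".toString()
    }
}
results.each { println it }

// find only ever sees the first div, so the third item would be missed
def firstOnly = html.'**'.find { it.@class == 'divclass' }
assert firstOnly.ol.li.size() == 2
```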

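The behaviour of the original code deserves a little more explanation. When a property is accessed on a List in Groovy, you get a new list (of the same size) containing that property of each element in the list. The list found by grep() has only one entry. We then get one entry for the property ol, which is fine. Next we get the result of ol.li for that entry. It is again a list of size() == 1, but this time its single entry has size() == 2. We can apply the outer loop there and get the same result, if we want to: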

html."**".grep { it.@class == 'divclass' }.ol.li.each { it.each { linkItem ->
   def link = linkItem.h3.a.@href
   def address = linkItem.address
   println "$link: $address\n"
}}
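
The list behaviour described above can be reproduced without any XML. This is a minimal sketch; the Item class and its fields are hypothetical and exist only for the illustration.

```groovy
// Hypothetical class, used only to show property access on a list
class Item {
    String name
    List<String> tags
}

def items = [new Item(name: 'only', tags: ['x', 'y'])]

// Property access on a list yields a list of that property, one entry per element
def names = items.name
def tags = items.tags

assert names == ['only']
// The result is not flattened: one entry, which itself holds two values
assert tags.size() == 1
assert tags[0].size() == 2

// Calling each on the result therefore runs once, not twice
def visits = 0
tags.each { visits++ }
assert visits == 1
```

This mirrors the situation in the question: the loop body ran once, with a single entry that stood for both li elements.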

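For any GPathResult that represents multiple nodes, we get the concatenation of all their text. That is what the original output shows, first for @href and then for address.

I believe the previous answers were correct at the time of writing, for the versions used. However, I am using HTTPBuilder 0.7.1 and Grails 2.4.4 with Groovy 2.3.7, and there is one big issue: HTML element names are converted to uppercase. It appears that this is due to NekoHTML being used under the hood: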
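
The concatenation can be observed directly. A short sketch, assuming well-formed XML and the stock XmlSlurper parser; the element contents are made up.

```groovy
// Groovy 3+ needs this import; on Groovy 2.x drop it.
import groovy.xml.XmlSlurper

def root = new XmlSlurper().parseText('''<ol>
<li><a href="#href1">one</a><address>A1</address></li>
<li><a href="#href2">two</a><address>A2</address></li>
</ol>''')

// Both expressions span the two li elements at once
def allLinks = root.li.a.@href
def allAddresses = root.li.address

assert root.li.size() == 2
assert allLinks.text() == '#href1#href2'
assert allAddresses.text() == 'A1A2'

// Interpolating such a result into a string gives the same joined text
def joined = "$allLinks: $allAddresses".toString()
assert joined == '#href1#href2: A1A2'

// Iterating over the li elements first keeps each pair separate
def perItem = root.li.collect { "${it.a.@href}: ${it.address}".toString() }
assert perItem == ['#href1: A1', '#href2: A2']
```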

http://nekohtml.sourceforge.net/faq.html#uppercase

As a result, the solution in the accepted answer must be written as:

html.'**'.find { it.@class == 'divclass' }.OL.LI.each { linkItem ->
    def link = linkItem.H3.A.@href
    def address = linkItem.ADDRESS.text()
    println "$link: $address\n"
}

This was very frustrating to debug; I hope it helps someone.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow
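
If the same code has to work with parsers that differ in element-name case, one option is to compare names case-insensitively instead of hard-coding OL or ol. The following is only a sketch: childrenNamed and addresses are hypothetical helpers, not part of XmlSlurper, HTTPBuilder or NekoHTML, and the uppercase markup is a hand-written stand-in for what an uppercasing parser would produce.

```groovy
// Groovy 3+ needs this import; on Groovy 2.x drop it.
import groovy.xml.XmlSlurper

// Hypothetical helper: direct children of node whose name matches, ignoring case
def childrenNamed = { node, String wanted ->
    node.children().findAll { it.name().equalsIgnoreCase(wanted) }
}

// Hypothetical helper: walk ol -> li -> address under the given root
def addresses = { root ->
    def out = []
    childrenNamed(root, 'ol').each { ol ->
        childrenNamed(ol, 'li').each { li ->
            childrenNamed(li, 'address').each { out << it.text() }
        }
    }
    out
}

// Assumption: stand-ins for uppercased and lowercased parser output
def upper = new XmlSlurper().parseText(
    '<DIV><OL><LI><ADDRESS>One</ADDRESS></LI><LI><ADDRESS>Two</ADDRESS></LI></OL></DIV>')
def lower = new XmlSlurper().parseText(
    '<div><ol><li><address>One</address></li><li><address>Two</address></li></ol></div>')

def fromUpper = addresses(upper)
def fromLower = addresses(lower)
println fromUpper
println fromLower
```

This trades the compact GPath syntax for explicit child filtering, so it is probably only worth it when the parser cannot be pinned down.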