Using XmlSlurper: How to select sub-elements while iterating over a GPathResult

https://stackoverflow.com/questions/1675542

16-09-2019

Question

I am writing an HTML parser that uses TagSoup to pass a well-formed structure to XmlSlurper.

Here is the generalised code:

def htmlText = """
<html>
<body>
<div id="divId" class="divclass">
<h2>Heading 2</h2>
<ol>
<li><h3><a class="box" href="#href1">href1 link text</a> <span>extra stuff</span></h3><address>Here is the address<span>Telephone number: <strong>telephone</strong></span></address></li>
<li><h3><a class="box" href="#href2">href2 link text</a> <span>extra stuff</span></h3><address>Here is another address<span>Another telephone: <strong>0845 1111111</strong></span></address></li>
</ol>
</div>
</body>
</html>
"""     

def html = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText( htmlText );

html.'**'.grep { it.@class == 'divclass' }.ol.li.each { linkItem ->
    def link = linkItem.h3.a.@href
    def address = linkItem.address.text()
    println "$link: $address\n"
}

What I expect is for the each to select each 'li' in turn, so that I can retrieve the corresponding href and address details. Instead, I get this output:

#href1#href2: Here is the addressTelephone number: telephoneHere is another addressAnother telephone: 0845 1111111

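I have checked various examples on the web, and they either deal with XML or are one-liner examples such as "retrieve all links from this file". It seems that the it.h3.a.@href expression collects every href in the text, even though I am passing it a reference to the parent 'li' node.

Can you tell me:

Why I am getting the output shown
How I can retrieve the href/address pair for each 'li' item

Thanks.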

Solution

Replace grep with find:

html.'**'.find { it.@class == 'divclass' }.ol.li.each { linkItem ->
    def link = linkItem.h3.a.@href
    def address = linkItem.address.text()
    println "$link: $address\n"
}

Then you will get:

#href1: Here is the addressTelephone number: telephone

#href2: Here is another addressAnother telephone: 0845 1111111

grep returns an ArrayList, but find returns a NodeChild:

println html.'**'.grep { it.@class == 'divclass' }.getClass()
println html.'**'.find { it.@class == 'divclass' }.getClass()

Result:

class java.util.ArrayList
class groovy.util.slurpersupport.NodeChild
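
The practical consequence of the two return types can be seen by comparing sizes. The following is a minimal sketch, not the asker's exact setup: it assumes the plain XmlSlurper parser on well-formed markup instead of TagSoup, and a shortened sample document.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on Groovy 2.x remove this import (groovy.util.XmlSlurper is imported by default)

// Shortened, well-formed stand-in for the HTML in the question (assumption).
def xml = '''
<html><body>
<div id="divId" class="divclass">
<ol>
<li><h3><a href="#href1">one</a></h3></li>
<li><h3><a href="#href2">two</a></h3></li>
</ol>
</div>
</body></html>
'''
def html = new XmlSlurper().parseText(xml)

// grep yields a List, so .ol.li is applied to each list element and wrapped in a new List.
def viaGrep = html.'**'.grep { it.@class == 'divclass' }.ol.li
// find yields a single node, so .ol.li is an ordinary GPathResult over both <li> elements.
def viaFind = html.'**'.find { it.@class == 'divclass' }.ol.li

println viaGrep.size()   // number of list entries, one per matching div
println viaFind.size()   // number of <li> nodes
```

With grep, each therefore runs once and receives both 'li' nodes at the same time; with find, it runs once per 'li'.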

So if you want to keep using grep, you can nest another each, like this, to make it work:

html.'**'.grep { it.@class == 'divclass' }.ol.li.each {
    it.each { linkItem ->
        def link = linkItem.h3.a.@href
        def address = linkItem.address.text()
        println "$link: $address\n"
    }
}

Long story short: in your case, use find rather than grep.
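
If the pairs are needed for further processing rather than printing, one option is to collect them into a list of maps. This is only a sketch under assumptions: it uses the plain XmlSlurper parser on a shortened, well-formed sample, and the map keys href and address are names chosen for illustration.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on Groovy 2.x remove this import (groovy.util.XmlSlurper is imported by default)

// Shortened, well-formed stand-in for the HTML in the question (assumption).
def html = new XmlSlurper().parseText('''
<html><body>
<div id="divId" class="divclass">
<ol>
<li><h3><a href="#href1">href1 link text</a></h3><address>Here is the address</address></li>
<li><h3><a href="#href2">href2 link text</a></h3><address>Here is another address</address></li>
</ol>
</div>
</body></html>
''')

// find returns a single node, so collect visits one <li> per iteration.
def div = html.'**'.find { it.@class == 'divclass' }
def pairs = div.ol.li.collect { li ->
    // 'href' and 'address' are illustrative key names, not part of any API.
    [href: li.h3.a.@href.text(), address: li.address.text()]
}

pairs.each { println "${it.href}: ${it.address}" }
```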

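Other tips

This is a tricky one. When there is only one element with class='divclass', the previous answer is certainly fine. If grep returned multiple results, however, then find(), which gives a single result, is not the answer. Pointing out that the result is an ArrayList is correct. Inserting an outer .each() loop provides a GPathResult in the closure parameter div. From there the drill-down can continue with the expected result.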

html."**".grep { it.@class == 'divclass' }.each { div -> div.ol.li.each { linkItem ->
   def link = linkItem.h3.a.@href
   def address = linkItem.address.text()
   println "$link: $address\n"
}}

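The behaviour of the original code deserves a bit more explanation as well. When a property is accessed on a List in Groovy, you get a new list (of the same size) containing that property of each element in the list. The list found by grep() has a single entry. We then get one entry for the property ol, which is fine. Next we get the result of ol.li for that entry. It is again a list with size() == 1, but this time its entry has size() == 2. We can apply the outer loop at that point and get the same result, if we want to: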

html."**".grep { it.@class == 'divclass' }.ol.li.each { it.each { linkItem ->
   def link = linkItem.h3.a.@href
   def address = linkItem.address
   println "$link: $address\n"
}}

On a GPathResult that represents multiple nodes, we get the concatenation of all their text. That is the original result: first for @href, then for address.
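
Both effects can be reproduced in isolation. The sketch below is illustrative only: it assumes the plain XmlSlurper parser, and the people list and the bare ol document are made-up inputs.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on Groovy 2.x remove this import (groovy.util.XmlSlurper is imported by default)

// 1. Property access on a List is applied to every element and returns a list of the same size.
def people = [[name: 'Ann'], [name: 'Bob']]   // made-up data
println people.name

// 2. The same rule applied to a one-element list holding an <ol> node.
def ol = new XmlSlurper().parseText('''
<ol>
<li><h3><a href="#href1">one</a></h3></li>
<li><h3><a href="#href2">two</a></h3></li>
</ol>
''')
def wrapped = [ol].li          // List with one entry; that entry covers both <li> nodes
println wrapped.size()
println wrapped[0].size()

// 3. text() on a GPathResult spanning several nodes concatenates their values.
def joined = wrapped[0].h3.a.@href.text()
println joined
```

The last value is the same run-together string of hrefs that appears in the question's output.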

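I believe the previous answers were all correct for the versions in use at the time of writing. However, I am using HTTPBuilder 0.7.1 and Grails 2.4.4 with Groovy 2.3.7, and there is one big catch: the HTML element names are converted to uppercase. This appears to be caused by NekoHTML, which is used under the hood: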

http://nekohtml.sourceforge.net/faq.html#uppercase

Because of this, the solution in the accepted answer has to be written as:

html.'**'.find { it.@class == 'divclass' }.OL.LI.each { linkItem ->
    def link = linkItem.H3.A.@href
    def address = linkItem.ADDRESS.text()
    println "$link: $address\n"
}
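
If the same code has to cope with either casing, one possible workaround is to match tag names case-insensitively. This is a sketch only: byTag is a hypothetical helper, not part of XmlSlurper, HTTPBuilder or NekoHTML, and the upper-case document below is a hand-written stand-in parsed with the plain XmlSlurper parser rather than real NekoHTML output.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on Groovy 2.x remove this import (groovy.util.XmlSlurper is imported by default)

// Hand-written stand-in for parser output with upper-cased element names (assumption).
def root = new XmlSlurper().parseText('''
<HTML><BODY>
<DIV class="divclass">
<OL>
<LI><H3><A href="#href1">one</A></H3><ADDRESS>addr one</ADDRESS></LI>
<LI><H3><A href="#href2">two</A></H3><ADDRESS>addr two</ADDRESS></LI>
</OL>
</DIV>
</BODY></HTML>
''')

// Hypothetical helper: all nodes at or below `node` whose tag name matches, ignoring case.
def byTag = { node, String tag -> node.'**'.findAll { it.name().equalsIgnoreCase(tag) } }

def div = root.'**'.find { it.@class == 'divclass' }
def pairs = byTag(div, 'li').collect { li ->
    def link = byTag(li, 'a')[0].@href.text()
    def address = byTag(li, 'address')[0].text()
    "${link}: ${address}".toString()
}

pairs.each { println it }
```

Note that the helper searches all descendants rather than direct children, so it is looser than the OL.LI path above and would also pick up nested lists.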

This was very frustrating to debug; I hope it helps someone.

License: CC-BY-SA with attribution

Not affiliated with StackOverflow