Question

DOCUMENT: http://en.wikiquote.org/wiki/The_Matrix

I want to get all quotes (//ul/li) of the first section (Neo's quotes).

I cannot use //ul[1]/li because on some Wikiquote pages the quotes of a section are split across several lists, like this:

<h2><span class="mw-headline" id="Neo">Neo</span></h2>  

<ul>
 <li> First quote </li>
</ul> 

<ul>
 <li> Second quote </li>
</ul> 

<h2><span class="mw-headline" id="dont wanna this">Useless</span></h2>  

instead of a single list:

<ul>
     <li> First quote </li>
     <li> Second quote </li>
</ul>

I've tried this to get the heading of the first section:

(//*[@id='mw-content-text']/ul/preceding-sibling::h2/span[@class='mw-headline'])[1]

but I am having trouble getting only the quotes of the first section. Can you help me?


Solution

Use:

(//h2[span/@id='Neo'])[1]/following-sibling::ul
  [count(.
        |
         (//h2[span/@id='Neo'])[1]
            /following-sibling::h2[1]
              /preceding-sibling::ul
         )
  =
   count((//h2[span/@id='Neo'])[1]
            /following-sibling::h2[1]
              /preceding-sibling::ul
         )
  ]
   /li

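This selects all li children of the ul elements that lie between the first h2 whose span child has an id attribute with the value "Neo" and the next h2 sibling. Note that the expression relies on a following h2 existing: for the last section of a page the second node-set is empty and nothing is selected.

To select the quotations for the second such h2, simply replace 1 with 2 in the expression above.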

Do this for all numbers: 1,2, ..., count(//h2[span/@id='Neo'])
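
For readers who prefer to see the same selection done step by step, here is a minimal Ruby sketch that walks the siblings of the heading instead of evaluating the XPath expression. It assumes the REXML library is available, and the helper name quotes_for_section and the sample markup (including the extra "Other" section) are illustrative assumptions, not part of the original answer. The section id is interpolated into the query as-is, so it must not contain a single quote.

```ruby
require 'rexml/document'

# Sample markup mirroring the structure from the question (an assumption,
# not the live Wikiquote page).
SAMPLE_HTML = <<~XML
  <html>
   <h2><span class="mw-headline" id="Neo">Neo</span></h2>
   <ul>
    <li> First quote </li>
   </ul>
   <ul>
    <li> Second quote </li>
   </ul>
   <h2><span class="mw-headline" id="Other">Useless</span></h2>
   <ul>
    <li> Unwanted quote </li>
   </ul>
  </html>
XML

# Hypothetical helper: collect the text of every li inside the ul siblings
# that sit between the heading with the given span id and the next h2.
def quotes_for_section(xml, section_id)
  doc = REXML::Document.new(xml)
  span = doc.elements["//span[@id='#{section_id}']"]
  return [] if span.nil?

  quotes = []
  node = span.parent.next_element   # first element sibling after the h2
  while node && node.name != 'h2'   # stop at the next section heading
    if node.name == 'ul'
      node.elements.each('li') { |li| quotes << li.text.to_s.strip }
    end
    node = node.next_element
  end
  quotes
end

p quotes_for_section(SAMPLE_HTML, 'Neo')
```

Unlike the XPath expression, this loop also works for the last section of a page, because it simply runs until there are no more siblings.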

XSLT-based verification:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "(//h2[span/@id='Neo'])[1]/following-sibling::ul
      [count(.
            |
             (//h2[span/@id='Neo'])[1]
                /following-sibling::h2[1]
                  /preceding-sibling::ul
             )
      =
       count((//h2[span/@id='Neo'])[1]
                /following-sibling::h2[1]
                  /preceding-sibling::ul
             )
      ]
        /li

   "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied on the provided XML document:

<html>
 <h2><span class="mw-headline" id="Neo">Neo</span></h2>

 <ul>
  <li> First quote </li>
 </ul>

 <ul>
  <li> Second quote </li>
 </ul>

 <h2><span class="mw-headline" id="dont wanna this">Useless</span></h2>
</html>

the XPath expression is evaluated, and the selected nodes are copied to the output:

<li> First quote </li>
<li> Second quote </li>

Explanation:

This follows from the Kayessian (by Dr. Michael Kay) formula for intersection of two node-sets:

$ns1[count(.|$ns2) = count($ns2)]

The above selects exactly the nodes that belong to both the node-set $ns1 and the node-set $ns2.

So, we substitute $ns1 with the nodeset consisting of all following siblings ul of the h2 of interest. We substitute $ns2 with the nodeset consisting of all preceding siblings ul of the h2 that is the immediate (1st) following sibling of the h2 of interest.

The intersection of these two nodesets contains exactly all ul elements that are wanted.
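
The counting trick can be modeled outside XPath as well. The following Ruby sketch is only an analogy: plain strings stand in for nodes, Array#| plays the role of the XPath union operator, and the names kayessian_intersection, following_uls and preceding_uls are hypothetical.

```ruby
# Model of $ns1[count(.|$ns2) = count($ns2)] using plain Ruby arrays.
# A node of ns1 is kept only if adding it to ns2 does not make ns2 larger,
# which is true exactly when the node is already a member of ns2.
def kayessian_intersection(ns1, ns2)
  set2 = ns2.uniq # node-sets contain no duplicates
  ns1.select { |node| ([node] | set2).size == set2.size }
end

# Hypothetical stand-ins for nodes in document order.
following_uls = %w[ul1 ul2 ul3 ul4] # ul siblings after the h2 of interest
preceding_uls = %w[ul0 ul1 ul2]     # ul siblings before the next h2

p kayessian_intersection(following_uls, preceding_uls)
```

Here ul0 belongs to an earlier section and ul3, ul4 to later ones, so only ul1 and ul2 survive the filter.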


Update: In a comment the OP states that he only knows that he wants the results to be from the first section -- the string "Neo" isn't known.

Here is the modified solution:

(//h2[span/@id=$vSectionId])[1]
            /following-sibling::ul
  [count(.
        |
         (//h2[span/@id=$vSectionId])[1]
            /following-sibling::h2[1]
              /preceding-sibling::ul
         )
  =
   count((//h2[span/@id=$vSectionId])[1]
            /following-sibling::h2[1]
              /preceding-sibling::ul
         )
  ]
    /li

The variable $vSectionId must be obtained as the string value of the following XPath expression:

  substring(//div[h2='Contents']
              /following-sibling::ul[1]
                 /li[1]/a/@href,
            2)

Here we are getting the wanted id from the href of the a in the first Table Of Contents entry, and skipping the first character "#".
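
As a small aside, XPath string positions are 1-based, so substring(s, 2) keeps everything from the second character onward. A rough Ruby equivalent might look like this; the helper name section_id_from_href is hypothetical.

```ruby
# Ruby strings are 0-based, so XPath's substring(s, 2) corresponds to s[1..-1].
# to_s covers the empty string, for which s[1..-1] returns nil.
def section_id_from_href(href)
  href[1..-1].to_s
end

p section_id_from_href('#Neo')
```

Like the XPath version, this drops the first character unconditionally, so it assumes the href really is a fragment link starting with "#".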

Here, again, is an XSLT-based verification:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:variable name="vSectionId" select=
 "substring(//div[h2='Contents']
                      /following-sibling::ul[1]
                         /li[1]/a/@href,
                    2)
 "/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "(//h2[span/@id=$vSectionId])[1]
                /following-sibling::ul
      [count(.
            |
             (//h2[span/@id=$vSectionId])[1]
                /following-sibling::h2[1]
                  /preceding-sibling::ul
             )
      =
       count((//h2[span/@id=$vSectionId])[1]
                /following-sibling::h2[1]
                  /preceding-sibling::ul
             )
      ]
        /li

   "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied on the complete XML document that is at: http://en.wikiquote.org/wiki/The_Matrix, the result of applying these two XPath expressions (substituting the result of the first in the second, then evaluating the second expression) is the wanted, correct one:

<li>I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change. I don't know the future. I didn't come here to tell you how this is going to end. I came here to tell you how it's going to begin. I'm going to hang up this phone, and then I'm going to show these people what you don't want them to see. I'm going to show them a world … without you. A world without rules and controls, without borders or boundaries; a world where anything is possible. Where we go from there is a choice I leave to you.</li>
<li>Whoa.</li>
<li>I know kung-fu.</li>
<li>Yeah. Well, that sounds like a pretty good deal. But I think I may have a better one. How about, I give you the finger [He does] and you give me my phone call.</li>
<li>Guns.. lots of guns...</li>
<li>There is no spoon.</li>
<li>My name...is Neo!</li>

Other tips

Using the API will make it MUCH easier to parse. Here's a query that will pull the first section:

http://en.wikiquote.org/w/api.php?action=parse&page=The_Matrix&section=1&prop=wikitext

Output:

<?xml version="1.0"?>
<api>
  <parse title="The Matrix">
    <wikitext xml:space="preserve">== Neo ==
[[File:The.Matrix.glmatrix.2.png|thumb|right|Unfortunately, no one can be ''told'' what The Matrix is. You have to see it for yourself.]]
[[Image:Arty spoon.jpg|thumb|right|Do not try to bend the spoon — that's impossible. Instead, only try to realize the truth: there is no spoon.]]

* I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change. I don't know the future. I didn't come here to tell you how this is going to end. I came here to tell you how it's going to begin. I'm going to hang up this phone, and then I'm going to show these people what you don't want them to see. I'm going to show them a world … without you. A world without rules and controls, without borders or boundaries; a world where anything is possible. Where we go from there is a choice I leave to you.

* Whoa.
* I know kung-fu.

* Yeah. Well, that sounds like a pretty good deal. But I think I may have a better one. How about, I give you the finger [He does] and you give me my phone call.

* Guns.. lots of guns...

* There is no spoon. 

* My name...is Neo!</wikitext>
  </parse>
</api>

Here's one way to parse this (using HTTParty):

require 'httparty'

class Wikiquote
  include HTTParty
  base_uri 'en.wikiquote.org/w/'

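  # Fetch section 1 of the given page as wikitext and return its bulleted lines.
  def self.get_quotes(page)
    url = "/api.php?action=parse&page=#{page}&section=1&prop=wikitext&format=xml"
    # Identify the client with a User-Agent header.
    headers = {"User-Agent" => "Wikiquote scraper 1.0"}
    content = get(url, headers: headers)['api']['parse']['wikitext']['__content__']
    # Each quote is a wikitext list item, i.e. a line starting with "* ".
    content.scan(/^\* (.*)$/).flatten
  end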
end

Usage:

Wikiquote.get_quotes("The_Matrix")

Output:

["I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change. I don't know the future. I didn't come here to tell you how this is going to end. I came here to tell you how it's going to begin. I'm going to hang up this phone, and then I'm going to show these people what you don't want them to see. I'm going to show them a world … without you. A world without rules and controls, without borders or boundaries; a world where anything is possible. Where we go from there is a choice I leave to you.",
 "Whoa.",
 "I know kung-fu.",
 "Yeah. Well, that sounds like a pretty good deal. But I think I may have a better one. How about, I give you the finger [He does] and you give me my phone call.",
 "Guns.. lots of guns...",
 "There is no spoon. ",
 "My name...is Neo!"]

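The extraction step of the HTTParty answer can be tried without any network access. The sketch below applies the same regular expression to a hard-coded, abbreviated sample of the wikitext; the names SAMPLE_WIKITEXT and extract_quotes are hypothetical, and the added strip call is an optional tweak that removes the trailing space visible in "There is no spoon. " above.

```ruby
# Abbreviated sample of the API's wikitext output (shortened for illustration).
SAMPLE_WIKITEXT = [
  "== Neo ==",
  "[[Image:Arty spoon.jpg|thumb|right|There is no spoon.]]",
  "",
  "* Whoa.",
  "* I know kung-fu.",
  "",
  "* There is no spoon. "
].join("\n")

# Keep only the lines that start with "* " and drop surrounding whitespace.
def extract_quotes(wikitext)
  wikitext.scan(/^\* (.*)$/).flatten.map(&:strip)
end

p extract_quotes(SAMPLE_WIKITEXT)
```

The heading and the image line are skipped because neither starts with "* ".
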
I suggest //ul[preceding-sibling::h2[1][span/@id = 'Neo']]/li. Or, if the id attribute is not present or not relevant for the search, then based on the answer in a comment I think you want

(//h2[span[contains(@class, 'mw-headline')]])[1]/following-sibling::ul
   [1 = count(preceding-sibling::h2[1] | (//h2[span[contains(@class, 'mw-headline')]])[1])]/li

See "XPath axis, get all following nodes until" for an explanation. I hope I have managed to close all brackets and braces correctly; I don't have time to test it right now.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow