Scraping XML with Python module BeautifulSoup, need a specific tag in the tree

https://stackoverflow.com/questions/22287023

11-06-2023

Question

I've been working on this Python script for a while, and I'm trying to scrape the duration and distance tags that sit directly under the leg tag. The problem is that each step tag, which is itself a child of the leg tag, also contains its own duration and distance tags, so when I scrape the data those nested tags are returned as well. The XML is as follows:

<DirectionsResponse>
    <route>
        <leg>
            <step>...</step>
            <step>
                <start_location>
                    <lat>38.9096855</lat>
                    <lng>-77.0435397</lng>
                </start_location>
                <duration>
                    <text>1 min</text>
                </duration>
                <distance>
                    <text>39 ft</text>
                </distance>
            </step>
            <duration>
                <text>2 hours 19 mins</text>
            </duration>
            <distance>
                <text>7.1 mi</text>
            </distance>
        </leg>
    </route>
</DirectionsResponse>

Here is the Python script I'm using:

import urllib
from BeautifulSoup import BeautifulSoup

url = 'https://www.somexmlgenerator.com/directions/xml?somejscript'
res = urllib.urlopen(url)
html = res.read()

soup = BeautifulSoup(html)
soup.prettify()
leg = soup.findAll('leg')

for eachleg in leg:
    another_duration = eachleg('duration')
    print eachleg

As I mentioned, I've been at this a while and have tried using lxml as well, but I'm having difficulty scraping the XML with it since the XML is dynamically generated. I've instead taken the approach of scraping the XML as HTML, but I'm definitely open to other suggestions, as I am still quite a novice!

Solution

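With BeautifulSoup (use version 4, imported as bs4), pass recursive=False to find_all and find so that only direct children are matched. Note that attribute access such as leg.duration is shorthand for a recursive find, so it returns the first duration anywhere under leg, which here is the one inside a step:

from bs4 import BeautifulSoup

# The 'xml' parser requires lxml to be installed.
soup = BeautifulSoup(..., 'xml')

for leg in soup.route.find_all('leg', recursive=False):
    # recursive=False restricts each search to direct children of <leg>
    duration = leg.find('duration', recursive=False).text.strip()
    distance = leg.find('distance', recursive=False).text.strip()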

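Or with a CSS selector, where the child combinator (>) matches only leg tags directly under route:

for leg in soup.select('route > leg'):
    # leg.duration would match the step's duration first, so search direct children only
    duration = leg.find('duration', recursive=False).text.strip()
    distance = leg.find('distance', recursive=False).text.strip()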

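With lxml, you can use XPath. Each / step selects direct children only, so the duration and distance tags inside step are never matched:

from lxml import etree
root = etree.fromstring(html)  # html: the raw XML read from the response, as in the question
durations = root.xpath('/DirectionsResponse/route/leg/duration/text/text()')
distances = root.xpath('/DirectionsResponse/route/leg/distance/text/text()')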
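If installing lxml or bs4 is not an option, the same direct-child idea can be tried with the standard library. The sketch below swaps in xml.etree.ElementTree, which is not what the answer above uses, and runs against a trimmed copy of the question's XML with the elided first step left out. The names SAMPLE and leg_totals are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question (the elided first <step> is omitted).
SAMPLE = """\
<DirectionsResponse>
  <route>
    <leg>
      <step>
        <duration><text>1 min</text></duration>
        <distance><text>39 ft</text></distance>
      </step>
      <duration><text>2 hours 19 mins</text></duration>
      <distance><text>7.1 mi</text></distance>
    </leg>
  </route>
</DirectionsResponse>
"""


def leg_totals(xml_text):
    """Return a (duration, distance) pair per <leg>, ignoring <step> children."""
    root = ET.fromstring(xml_text)  # root is the <DirectionsResponse> element
    totals = []
    # "route/leg" matches only <leg> elements that are direct children of <route>.
    for leg in root.findall("route/leg"):
        # "duration/text" follows direct children only, so the tags
        # nested inside <step> are never considered.
        duration = leg.findtext("duration/text", default="").strip()
        distance = leg.findtext("distance/text", default="").strip()
        totals.append((duration, distance))
    return totals


print(leg_totals(SAMPLE))  # prints [('2 hours 19 mins', '7.1 mi')]
```

ElementTree paths without a leading .// descend one level per step, which is what keeps the step-level values out of the result.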

Licensed under: CC-BY-SA with attribution