ElementTree XML Parsing just returns sitemap.org?

https://stackoverflow.com/questions/21997332

16-10-2022

Question

I looked around for a simple explanation of where I'm going wrong with this but couldn't find one. The following excerpt of code:

import time, threading, urllib2, os
import xml.etree.ElementTree as ET

save_path = '/Users/sampeka/Desktop/Programming/SilkySpider/Data'
bloomberg_site_map = urllib2.urlopen('http://www.bloomberg.com/sitemap_news.xml').read()
reuters_site_map = urllib2.urlopen('http://www.reuters.com/sitemap_news_index.xml').read()

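def saveXmlFile(data,name):
    # The with-block closes the file even if write() fails, and avoids a
    # NameError during cleanup when open() itself raises.
    abs_path = os.path.abspath(save_path)
    with open(os.path.join(abs_path, name),'w') as open_file:
        open_file.write(data)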

class ParseXML:

    def __init__(self,xml_file):
        self.xml_file = xml_file

    def printStuff(self):
        tree = ET.parse(self.xml_file)
        root = tree.getroot()
        for child in root:
            print child.tag, child.attrib


saveXmlFile(bloomberg_site_map,'Bloomberg Site Map.xml')
ParseXML(save_path+'/Bloomberg Site Map.xml').printStuff()

prints this several times:

{http://www.sitemaps.org/schemas/sitemap/0.9}url
{http://www.sitemaps.org/schemas/sitemap/0.9}url
{http://www.sitemaps.org/schemas/sitemap/0.9}url
{http://www.sitemaps.org/schemas/sitemap/0.9}url
{http://www.sitemaps.org/schemas/sitemap/0.9}url

The XML is being saved correctly so I must just be missing something simple. Could somebody explain why this just gets stuck on this? Thanks a lot for the help.

Solution

Your code iterates over the direct children of the XML root element. Your XML document (I looked at the Bloomberg one) has this structure:

<urlset ...>
  <url ...>
    ...
  </url>
  <url ...>
    ...
  </url>
  ...
</urlset>

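The output is therefore the list of url elements, one line per element. The {http://www.sitemaps.org/schemas/sitemap/0.9} prefix is not an error: it is how ElementTree writes a namespace-qualified tag, with the namespace URI in braces followed by the local name (url).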

You haven't stated what output you would like to get. However, you most likely need to either iterate through every XML element recursively or use XPath to extract specific parts of the document.
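The recursive option can be sketched with the standard library alone. This is a minimal illustration, not the asker's actual data: SAMPLE is a made-up miniature sitemap standing in for the saved Bloomberg file, and local_name and walk are hypothetical helper names. It relies on Element.iter(), which visits the element and all of its descendants, and is written to run on both Python 2.7 and Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature sitemap, standing in for the saved Bloomberg file.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>http://example.com/a</loc>
    <news:news>
      <news:publication_date>2014-02-24</news:publication_date>
    </news:news>
  </url>
</urlset>"""


def local_name(tag):
    # ElementTree spells a qualified tag as "{namespace-uri}local";
    # keep only the part after the closing brace.
    return tag.rsplit('}', 1)[-1]


def walk(root):
    # Element.iter() yields root itself, then every descendant in document order.
    return [(local_name(el.tag), (el.text or '').strip()) for el in root.iter()]


root = ET.fromstring(SAMPLE)
for name, text in walk(root):
    print('%s %s' % (name, text))
```

In the question's class, the same loop could replace "for child in root" with "for child in root.iter()" to reach loc and the news fields instead of stopping at url.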

Example: to extract publication_date fields:

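import lxml.etree

# Drop-in body for ParseXML.printStuff; self.xml_file is the saved sitemap path.
tree = lxml.etree.parse(self.xml_file)
root = tree.getroot()
# Match publication_date elements in the Google News sitemap namespace, at any depth.
for pd in root.xpath("//*[local-name()='publication_date' and namespace-uri()='http://www.google.com/schemas/sitemap-news/0.9']"):
    print(pd.text)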
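If installing lxml is not an option, roughly the same extraction can be done with xml.etree.ElementTree, which supports a limited XPath subset plus a prefix-to-URI map. The sketch below is an assumption-laden illustration: SAMPLE and its dates are invented, the prefix names in NS are arbitrary labels, and publication_dates is a hypothetical helper. It should run on both Python 2.7 and Python 3.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; the prefixes are arbitrary labels chosen for this sketch,
# only the URIs have to match the document.
NS = {
    'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9',
    'news': 'http://www.google.com/schemas/sitemap-news/0.9',
}

# Hypothetical two-entry sitemap with invented dates.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>http://example.com/a</loc>
    <news:news>
      <news:publication_date>2014-02-24</news:publication_date>
    </news:news>
  </url>
  <url>
    <loc>http://example.com/b</loc>
    <news:news>
      <news:publication_date>2014-02-25</news:publication_date>
    </news:news>
  </url>
</urlset>"""


def publication_dates(root):
    # ".//" searches all descendants; the prefix is resolved through NS.
    return [el.text for el in root.findall('.//news:publication_date', NS)]


root = ET.fromstring(SAMPLE)
for date in publication_dates(root):
    print(date)
```

For the saved file from the question, ET.parse(self.xml_file).getroot() would take the place of ET.fromstring(SAMPLE).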

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow