Question

I have a problem. I understand what needs to be done, but I don't know the syntax or the approach. I have huge XML files. I need to open every file, search for a string in a tag's value, and return true if it is found. The same tag occurs multiple times. Here is one occurrence of the tag in the XML file.

<ulink xlink:type="simple"
xlink:href="urn:x-xxx:r2:reg-doc:*-*:*:*?title=XXX"
xlink:title="XXX" xmlns:xlink="http://www.w3.org/1999/xlink"
>XXX</ulink>.</p>

NOTE: There are many such tags in a single file. I need to read the "xlink:title" content of all of them and compare it with my string. If it is found, I need to print that. Here is the code I tried.

from xml.dom.minidom import parse, parseString
import os, stat
import sys
def shahul(dir):   
    for r,d,f in os.walk(dir):
        for files in f:
            if files.endswith(".xml"):
                dom=parse(os.path.join(r, files));
                ref=dom.getElementsByTagName('ulink')
                link=ref[0].attributes['xlink:title'].value
                if "mystring" in link:
                    found=True
                    break
                print (files, found, sep='\t')

shahul("location")

NOTE: In the above code I used link=ref[0].attributes['xlink:title'].value. Does that mean only the first occurrence of the ulink tag? If I want to store the content of all occurrences of the ulink tag, what should I do?

Is the IndexError due to the fact that there are multiple tags with the same name? Or is it not able to save all the entries under link? Please guide me. Thanks.

Solution

You can do this:

dom=parse(os.path.join(r, files))
ref=dom.getElementsByTagName('ulink')
for n in ref:
    attr = n.getAttributeNode('xlink:title')
    if attr:
        link = attr.nodeValue.strip()
        print(link)

This finds all elements named ulink and returns them as a list of nodes. For each node in that list, it looks up the xlink:title attribute, takes its value, and prints it. Instead of the print, you can put your if condition.
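As a rough sketch of swapping the print for the if condition, the snippet below collects every xlink:title value and then checks whether any of them contains the search string. The helper names (ulink_titles, contains_title) and the SAMPLE document are made up for illustration and are not part of the question's code.

```python
from xml.dom.minidom import parseString

def ulink_titles(xml_text):
    """Return the xlink:title value of every <ulink> element that has one."""
    dom = parseString(xml_text)
    titles = []
    for node in dom.getElementsByTagName('ulink'):
        attr = node.getAttributeNode('xlink:title')  # None if the attribute is missing
        if attr:
            titles.append(attr.nodeValue.strip())
    return titles

def contains_title(xml_text, needle):
    """True if any <ulink> title contains the search string."""
    return any(needle in title for title in ulink_titles(xml_text))

# Hypothetical sample document, loosely modelled on the tag in the question
SAMPLE = (
    '<p xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<ulink xlink:type="simple" xlink:title="first mystring doc">A</ulink>'
    '<ulink xlink:type="simple" xlink:title="second doc">B</ulink>'
    '<ulink>no title here</ulink>'
    '</p>'
)

print(ulink_titles(SAMPLE))
print(contains_title(SAMPLE, 'mystring'))
```

Keeping the collection step separate from the comparison also answers the "store the content of all occurrences" part of the question: the list returned by ulink_titles holds every title found.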

Other tips

An IndexError on ref[0] tells you that the list is empty, not that there are multiple occurrences of the tag you are looking for. To process all the tags found, loop over them:

refs = dom.getElementsByTagName('ulink')
for ref in refs:
    # use ref, e.g. read its title attribute ('' if it is missing)
    title = ref.getAttribute('xlink:title')

The loop will simply not run if refs is empty.
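Applied to the directory walk from the question, this could look roughly like the sketch below. The function names (file_has_title, scan) are invented here, and returning a dict of results is just one possible choice. A file without any ulink element yields False instead of raising IndexError, because the loop body never runs.

```python
import os
from xml.dom.minidom import parse

def file_has_title(path, needle):
    """True if any <ulink> in the file has an xlink:title containing needle."""
    dom = parse(path)
    try:
        for ref in dom.getElementsByTagName('ulink'):
            # getAttribute returns '' when the attribute is absent
            if needle in ref.getAttribute('xlink:title'):
                return True
        # Also reached when the file has no <ulink> elements at all
        return False
    finally:
        dom.unlink()  # release the DOM tree, useful for large files

def scan(directory, needle):
    """Map the path of each .xml file under directory to True or False."""
    results = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.endswith('.xml'):
                path = os.path.join(root, name)
                results[path] = file_has_title(path, needle)
    return results
```

A call such as scan("location", "mystring") would then give one entry per XML file, which can be printed in the same tab-separated form the question uses.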

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow