Question

I'm working on a python utility to search and present the full path to a record in a very big config file, stored as xml file. The file size can be 12M and can hold 294460 lines, and can grow to much larger size.

Here is an example(simplified):

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<root version="1.1.1">
  <record path="">
    <record path="path1">
      <field name="some_name1" value="1234"/>
      <record path="path2">
        <field name="0" value="abcd0"/>
        <field name="1" value="abcd1"/>
        <field name="2" value="abcd2"/>
        <field name="28" value="abcd28"/>
        <field name="29" value="abcd29"/>
      </record>
    </record>
    <record path="pathx">
      <record path="pathy">
        <record path="pathz">
        </record>
        <record path="pathv">
          <record path="pathw">
            <field name="some_name1" value="yes"/>
            <field name="some_name2" value="2084"/>
            <field name="some_buffer_name" value="14"/>
            <record path="cache_value">
              <field name="some_name7000" value="12"/>
            </record>
    </record>
        <record path="path_something">
          <field name="key_word1" value="8"/>
          <field name="key_word2" value="9"/>
          <field name="key_word3" value="10"/>
          <field name="key5" value="1"/>
          <field name="key6" value="1"/>
          <field name="key7" value="yes"/>
    </record>
  </record>
</root>

I'm interested to run on the file and file all the nodes that hold the search string in the path or field attribute of the node. Because the node "type" (or node name) can be or record or field, and thus the name of the attribute can change from path to name.

I used minidom to parse and search in the xml, but my code is taking too much resources and much too much time.

This is what I wrote: xml_file_location is the location of the file string_to_search is the string that I search for in the XML file and the path to that node that I found is stored in the node of type: record , in an attribute named: path, and this is what I print to the user.

with open(xml_file_location, 'r') as inF:
# search the xml file for lines with the string for search
    for line in inF:
        if string_to_search in line:
            found_counter = found_counter + 1
            node_type = line.strip(" ").split(" ")[0]
            node_type = re.sub('[^A-Za-z0-9]+', '', node_type)
            node_attr = line.strip(" ").split(" ")[1]
            node_value = re.sub('[^A-Za-z0-9_]+', '', node_attr.split("=")[1])
            node_attr = re.sub('[^A-Za-z0-9]+', '', node_attr.split("=")[0])
            #print node_type
            #print node_attr
            #print node_value
            if node_type in lines_dict:
                if not node_attr in lines_dict[node_type]:
                    lines_dict[node_type][node_attr] = [nome_value]
                elif not node_value in lines_dict[node_type][node_attr]:
                        lines_dict[node_type][node_attr].append(node_value)
            else:
                lines_dict[node_type] = {}
                lines_dict[node_type][node_attr] = [node_value]

print "Found: %s strings in the xml file" %found_counter

#pp = pprint.PrettyPrinter(indent=4)
#pp.pprint(lines_dict)

print "Parsing the xml file"
dom = parse(xml_file_location)

print "Locating the full path"

for node_type in lines_dict:
# for all types of node
    elements = dom.getElementsByTagName(node_type)
    # create the elements for those nodes
    for node in elements:
    # go over all nodes in the elements
        if node.hasAttribute(node_attr):
            if node.getAttribute(node_attr) in lines_dict[node_type][node_attr]:
            # check if the attribute appears in the lines dict 
                result = node.getAttribute(node_attr) # holds the path
                parent = node.parentNode       # create a pointer to point on the parent node
                while parent.getAttribute("path") != "":
                # while didn't reach the root of the conf - a record that has an empty path attribute
                    result = parent.getAttribute("path") + "." + result     # add the path of the parent to the full path
                    parent = parent.parentNode                              # advance the parent pointer
                print
                print "Found: %s" %node.toprettyxml().split("\n")[0]
                print "Path:  %s" %result

for example: I will search for: abcd1 the utility will print the full path: path1.path2 or I will search for: pathw and the utility will return: pathx.pathy.pathv

I understand that it's very inefficient, I go over all nodes in the config and compare them to what I put in the list_dic in the simple string search.

I tried to use external modules to do it, but without success

I'm looking for a more efficient way to do this kind of search and I very appreciate the help.

Was it helpful?

Solution 2

This is the solution I got using xpath and lxml:

root = LXML.parse(xml_file_location) elements_list = root.xpath(".//*[@*[contains(., $text)]]", text = string_to_search )

The element list will catch all nodes that their attribute value holds the string.

using lxml getpaerent() method I get ask each node "who is your parent"

this is the code:

root = LXML.parse(xml_file_location)
elements_list = root.xpath(".//*[@*[contains(., $text)]]", text = string_to_search )
print "Found: %s strings in the xml file" %len(elements_list)
for node in elements_list:
    print "\nFound:\n%s" %LXML.tostring(node).split("\n")[0]
    parent = node.getparent()
    if node.tag == "record":
        path = node.get("path") # nodes of type: "record" hold the attribute: "path"
    elif node.tag == "field":
        path = node.get("name") # nodes of type: "field" hold the attribute: "name"
    else:
        print "unclear node type adding empty string"
        path = ""
    full_path = path
    while parent.get("path") != "":
        parent_path = parent.get("path")
        full_path = parent_path + "." + full_path
        parent = parent.getparent()
    print "Full path: %s" %full_path

OTHER TIPS

As @Martijn Pieters said, use ElementTree, which is in stdlib since Python 2.5 -- you're using "with" so I assume you're on 2.6+. It's easy enough to learn and its mental model is very close to the DOM.

The old-school alternative is SAX parsing, which has had a module in stdlib since forever: basically you specify callbacks to execute whenever the parser encounters opening or closing tags. It's a bit unnatural (it forces you to think in terms of text-processing, rather than XML logical structures), but can be very efficient.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top