Question

I have to parse an XML file with a large number of string values. For example:

<value>Foo</value>
<value>Bar</value>
<value>Baz</value>
<value>Foo</value>

Some of them are equal. There are multiple recurring strings, not just one as in the example above. I would therefore like to detect such values and link them with XLink: give one instance of each recurring string an ID (it doesn't have to be the first one, and I can use UUIDs) and have the remaining instances refer to it, like this:

<value id="D5494447-A010-4F81-9DDA-E5DFFBD616FF">Foo</value>
<value>Bar</value>
<value>Baz</value>
<value href="#D5494447-A010-4F81-9DDA-E5DFFBD616FF"/>

I am new to XLink, so perhaps the above doesn't make sense. If it is not possible, an alternative is to create a dictionary containing such values:

{'D5494447-A010-4F81-9DDA-E5DFFBD616FF' : 'Foo'}

And then somehow put them in the XML. What is the simplest way to achieve either of these? I don't care about efficiency as long as the method is correct and simple to implement: I am a Python beginner and not a computer scientist, and computational complexity is not an issue here. Parsing and writing XML is not a problem either (I figured it out with lxml), so the question is only about detecting recurring strings and linking them.


Solution

One way is to maintain a dict (a mapping from arbitrary keys to values) of all strings that you have seen before. So, let's assume you are at the point where you have the value in a variable val, and that there is a dict valdict that's initially empty. The code you would need is something like this:

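import uuid
from xml.sax.saxutils import escape

if val in valdict:  # We have seen this value before: link to it
    print('<value href="#%s"/>' % valdict[val])
else:               # First occurrence: assign an ID and remember it
    valdict[val] = str(uuid.uuid4()).upper()
    # escape() protects characters such as & and < inside the text
    print('<value id="%s">%s</value>' % (valdict[val], escape(val)))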

I wouldn't really recommend this simplistic method for forming the XML itself, but it sounds like you are already well-prepared to handle that side of things.
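If you would rather modify the parsed tree than print strings, the same dict idea can be sketched as below. This is only one possible arrangement, not the definitive one: the function name link_recurring and the wrapper element <values> are made up for the example, and plain id/href attributes are used exactly as in the question rather than namespaced XLink attributes. It uses the standard library's xml.etree.ElementTree so that it runs without extra packages; lxml.etree follows the same ElementTree API for the calls used here. Unlike the snippet above, it counts the strings first, so that values occurring only once (Bar and Baz) are left untouched.

```python
import uuid
from collections import Counter
import xml.etree.ElementTree as ET


def link_recurring(root, tag="value"):
    """Give one element per recurring string an id; turn the others into href links.

    Returns a dict mapping each recurring string to the UUID assigned to it.
    """
    elements = list(root.iter(tag))
    # First pass: count how often each string occurs.
    counts = Counter(el.text for el in elements)
    ids = {}  # recurring string -> UUID of the element that keeps the text
    # Second pass: annotate only the strings that occur more than once.
    for el in elements:
        text = el.text
        if text is None or counts[text] < 2:
            continue  # empty or unique value: leave it alone
        if text in ids:
            # Later occurrence: drop the text and point at the first one.
            el.text = None
            el.set("href", "#" + ids[text])
        else:
            # First occurrence: keep the text and give it an ID.
            ids[text] = str(uuid.uuid4()).upper()
            el.set("id", ids[text])
    return ids


# Hypothetical input mirroring the question, wrapped in a single root element.
root = ET.fromstring(
    "<values><value>Foo</value><value>Bar</value>"
    "<value>Baz</value><value>Foo</value></values>"
)
ids = link_recurring(root)
print(ET.tostring(root, encoding="unicode"))
```

The returned dict is the reverse of the dictionary in the question (string to UUID rather than UUID to string); if you need the question's form, {v: k for k, v in ids.items()} gives it.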

Licensed under: CC-BY-SA with attribution