Question

I have a 5 MB XML flat structure that I want to access its data later. I use XOM Parser in Java to parse the XML and I don't want to loop on the whole Tree every time I want to retrieve data as it takes a while because of the file size.

The XML looks like this

<TypeDesc Type="Person" Id="1" PKey="X0" xml:lang="EN" ShDes="t1" LongDes="test 1"/>
<TypeDesc Type="Person" Id="2" PKey="X1" xml:lang="EN" ShDes="t2" LongDes="test 2"/>
<TypeDesc Type="Person" Id="3" PKey="X3" xml:lang="EN" ShDes="t3" LongDes="test 2"/>
...
<TypeDesc Type="Person" Id="n" PKey="PAYMN" xml:lang="EN" ShDes="PAYMN" LongDes="payment"/>
<TypeDesc Type="Student" Id="1" PKey="X0" xml:lang="EN" ShDes="t1" LongDes="good"/>
<TypeDesc Type="Student" Id="2" PKey="X1" xml:lang="EN" ShDes="t2" LongDes="bad"/>
<TypeDesc Type="Student" Id="3" PKey="X3" xml:lang="EN" ShDes="t3" LongDes="fair"/>
...
<TypeDesc Type="Student" Id="n" PKey="PAYMN" xml:lang="EN" ShDes="PAYMN" LongDes="fair"/>

In my LOGIC I want to retrieve the longDes of the Node if PKEY = SOMESTUFF AND Type = OtherStuff

Looping on the whole thing and retrieving the longDes if other attributes are satisfied is very expensive.

How can I store my Data so that I can access them in O(1) instead of O(n) so that I do loop on the whole XML for one time and access the data structure for later iterations.

Was it helpful?

Solution 2

I used a hash table to store data. Constructed a hash table for each type. The key of each hash table is concatenation of all attributes I want to check with and the stored value is what I want to retrieve. It is very efficient and close to O(1)

OTHER TIPS

You are unlikely to find a constant-time lookup procedure for satisfying this in its current form. Moreover, is constant-time lookup a specific requirement or are you making that up as part of a blinkered viewpoint of your project's status/setup? A.K.A. "the XY problem". The best you're likely to find is an O(n log n) or O(log n) algorithm; see the Big O Cheatsheet

I recommend you review existing frameworks that will enable parsing of this structure:

  1. Xstream
  2. JAXB
  3. XML Beans

If you're happy with XOM, don't bother moving across, but I believe you need to consider the structure of data when you're searching, such as using an index, or store it in an efficient form -- e.g. a Prefix Tree/Trie -- and then serialize that to disk/storage so that re-parsing is quicker though an obvious space/time tradeoff?

On top of this, does your data have to be in XML? Can you convert it to another format? Such as Protocol Buffers, or placing the data in a database (either SQL or NoSQL), though this may be overkill depending on what you're doing?

I'd also ask myself the following questions:

  1. How do I get given this data? Am I losing information that may aid lookup?
  2. Will an efficient search algorithm aid here?
  3. Is this data sorted? Can I sort it efficiently so that subsequent lookups are more efficient?
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top