Question

I have a need to hold a very large number of XMLs in memory (most probably will use Oracle Coherence as distributed cache). The expectation is to hold in memory 100,000 XMLs. These XMLs are quite big - approx. 250KB each. These XMLs are requested by other systems - they ask for only part of the XML which is relevant to them. Additionally, they will ask to make changes to the content of the XMLs. The load will be about 300 such requests per minute, distributed more or less evenly between retrievals and updates. An important note is that the XMLs are not structured, so I won't have an XSD for them, but I do have the algorithm to extract and update the XMLs.

My question is what will yield better performance: Keeping the XMLs in memory as they are, and making all the extraction of data from them and the updates by using XQuery or even using coded procedures, or to transform the XMLs into objects, manipulate them in code, and then transform them back to XMLs when they are requested by other systems?

Was it helpful?

Solution

You have 100,000 documents at 250 KB each. That makes roughly 24 GB of raw data. If you put that in memory and want to be able to process, filter, or update it, you will have an additional blow-up factor of, say, 10. You then end up with a required memory capacity of around 240 GB.
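The back-of-the-envelope estimate above can be written out explicitly (the blow-up factor of 10 is a rough, pessimistic assumption for DOM-style in-memory representations, not a measured value):

```java
public class MemoryEstimate {
    public static void main(String[] args) {
        long docs = 100_000;
        long bytesPerDoc = 250 * 1024; // 250 KB per document
        double rawGb = docs * bytesPerDoc / (1024.0 * 1024 * 1024);
        // In-memory XML trees typically inflate the raw size considerably;
        // a factor of 10 is an assumed ballpark, not a universal constant.
        double blowUp = 10.0;
        System.out.printf("raw: %.1f GB, in-memory: %.0f GB%n",
                rawGb, rawGb * blowUp);
        // prints: raw: 23.8 GB, in-memory: 238 GB
    }
}
```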

So, if you have enough memory available, that is of course the best place to hold it. But you need a fallback strategy (what happens if the number of documents grows beyond memory?), and it becomes even more complicated if you don't want to lose updates: what happens if the machine fails? If you update in memory, when do you flush updates out to disk? And there are even more things to think about.

Yet, to answer your second question: transform into objects or not? Most people are tempted to transform XML into objects using PHP, Ruby, Java, .NET, or the like, and even to store XML in SQL databases. If you want an honest answer: don't do it unless you have plenty of time and money to waste. Objects introduce a large overhead of additional analysis, design, parsing, marshalling, testing, maintenance, and so on. In fact, this removes the flexibility of XML completely, and I see this constantly underestimated. From my experience, working directly with XML and XQuery saves you around 80% on average of the effort for the things I've listed above.

Also, if you force flexible XML data into objects, you will face a nightmare if your data structures evolve.
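To make the "query the XML directly, no object mapping" approach concrete, here is a minimal sketch using the JDK's built-in XPath support (a stand-in for XQuery; the document structure and path expressions are made up for illustration). It extracts just the fragment a caller asked for, applies an update in place, and serializes the whole document back:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XmlSliceDemo {
    public static void main(String[] args) throws Exception {
        // A schema-less document; its structure is discovered at query time.
        String xml = "<order><customer><name>Acme</name></customer>"
                   + "<total>100</total></order>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPathFactory xpf = XPathFactory.newInstance();

        // Extract only the fragment that was requested -- no object mapping.
        Node name = (Node) xpf.newXPath()
                .evaluate("/order/customer/name", doc, XPathConstants.NODE);
        System.out.println(name.getTextContent()); // prints: Acme

        // Update the tree in place, then serialize the document back out.
        Node total = (Node) xpf.newXPath()
                .evaluate("/order/total", doc, XPathConstants.NODE);
        total.setTextContent("120");

        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        System.out.println(out.toString().contains("<total>120</total>"));
    }
}
```

Note how nothing here depends on an XSD or a fixed class hierarchy: if the documents evolve, only the path expressions change, not a generated object model.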

You might want to check out 28msec's Scalable Database for flexible data, which is a PaaS in the cloud. There you get everything you need out of the box (including load balancing, auto-recovery, persistence management, replication, backups, automatic failover, scaling in and out, elasticity, memory management, sharding, ...).

This is only my personal opinion, but maybe it contributes at least a few more aspects to consider for your problem.

OTHER TIPS

My guess is that it will be faster in memory (if you have enough room). But as with all performance questions, this comes with a big "it depends". You need to profile the actual usage.
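A crude way to start profiling is to wall-clock-time each candidate approach (extract via XPath vs. round-trip through objects) under a representative workload. This is only a sketch; for decisions that matter, a proper harness such as JMH is the better tool, since naive timing loops are distorted by JIT warm-up and GC:

```java
public class ProfileSketch {
    // Crude wall-clock timing of a repeated unit of work, in milliseconds.
    static long timeMs(int iterations, Runnable work) {
        work.run(); // warm up once so class loading isn't measured
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            work.run();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Placeholder workload; substitute your real extraction/update logic.
        long elapsed = timeMs(1_000, () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100; i++) {
                sb.append("<x/>");
            }
            sb.toString();
        });
        System.out.println("elapsed ms: " + elapsed);
    }
}
```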

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow