Question

I have a problem...

I need to store a daily barrage of about 3,000 mid-sized XML documents (100 to 200 data elements each).

The data is somewhat unstable in the sense that the schema changes from time to time, and the changes are not announced with enough advance notice; they have to be dealt with retroactively, on an emergency "hotfix" basis.

The consumption pattern for the data involves both a website and some simple analytics (some averages and pie charts).

MongoDB seems like a great solution except for one problem; it requires converting between XML and JSON. I would prefer to store the XML documents as they arrive, untouched, and shift any intelligent processing to the consumer of the data. That way any bugs in the data-loading code will not cause permanent damage. Bugs in the consumer(s) are always harmless since you can fix and re-run without permanent data loss.
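A minimal sketch of that pattern, assuming the documents arrive as files in an inbox directory (the directory layout and function names here are made up for illustration):

```python
import shutil
import time
import xml.etree.ElementTree as ET
from pathlib import Path

INBOX = Path("inbox")      # hypothetical: where the ~3,000 daily XML files land
ARCHIVE = Path("archive")  # append-only store of the raw, untouched documents

def ingest():
    """Copy each incoming file verbatim; no parsing happens here,
    so a loader bug cannot corrupt what is stored."""
    ARCHIVE.mkdir(exist_ok=True)
    for src in INBOX.glob("*.xml"):
        # A timestamp in the name preserves every arrival, including re-deliveries.
        shutil.copy2(src, ARCHIVE / f"{time.strftime('%Y%m%dT%H%M%S')}-{src.name}")

def consume():
    """All intelligent processing lives here; if it turns out to be buggy,
    fix it and re-run over the untouched archive."""
    for doc in ARCHIVE.glob("*.xml"):
        root = ET.parse(doc).getroot()
        # ... extract fields, compute averages, feed the website, etc.
```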

I don't really need "massively parallel" processing capabilities. It's about 4GB of data, which fits comfortably in memory on a 64-bit server.

I have eliminated Cassandra from consideration (due to its complex setup) and CouchDB (due to its lack of familiar features such as indexing, which I will need initially because of my RDBMS ways of thinking).

So finally here's my actual question...

Is it worthwhile to look into native XML databases, which are not as mature as MongoDB, or should I bite the bullet, convert all the XML to JSON as it arrives, and just use MongoDB?


Solution

You may have a look at BaseX (Basex.org), which comes with a built-in XQuery processor and Lucene text indexing.
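As a rough sketch of getting the raw documents in untouched, assuming a local BaseX HTTP server with default settings (port 8984, REST endpoint at /rest, default admin credentials); the database name `docs` is made up:

```python
import requests
from pathlib import Path

BASE = "http://localhost:8984/rest/docs"  # "docs" is a hypothetical database name
AUTH = ("admin", "admin")                 # BaseX defaults; change for real use

requests.put(BASE, auth=AUTH)  # PUT on /rest/<db> creates the database

# Store each XML file verbatim as its own resource; BaseX indexes it
# on the way in, with no JSON conversion step anywhere.
for path in Path("inbox").glob("*.xml"):
    with open(path, "rb") as f:
        requests.put(f"{BASE}/{path.name}", data=f, auth=AUTH)
```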

Other tips

That Data Volume is Small

If there is no need for parallel data processing, there is no need for MongoDB. Especially with small amounts of data like 4GB, the overhead of distributing the work can easily exceed the actual evaluation effort.

4GB / 60k nodes is not large for XML databases, either. After some time getting into it, you will come to see XQuery as a great tool for XML document analysis.
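For instance, the simple analytics from the question (averages and pie charts) are one-liners in XQuery. A sketch against the hypothetical `docs` database above, with `price` and `category` as made-up element names, sent through BaseX's REST interface:

```python
import requests

# Average across every stored document; <price> is a made-up element name.
avg_query = "avg(collection('docs')//price/number())"

# Per-category counts for a pie chart; <category> is likewise made up.
pie_query = (
    "for $c in distinct-values(collection('docs')//category) "
    "return <slice name='{$c}'>"
    "{count(collection('docs')//category[. = $c])}</slice>"
)

for q in (avg_query, pie_query):
    r = requests.get("http://localhost:8984/rest",
                     params={"query": q}, auth=("admin", "admin"))
    print(r.text)
```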

Is it Really?

Or do you receive 4GB daily and have to evaluate it together with all the data you have already stored? Then you will eventually reach a volume you can no longer store and process on one machine, and distributing the work will become necessary. Not within days or weeks, but a single year already brings you well over 1TB (4GB × 365 ≈ 1.4TB).

Converting to JSON

What does your input look like? Does it adhere to any schema, or even resemble tabular data? MongoDB's capabilities for analyzing semi-structured data are far weaker than what XML databases provide. On the other hand, if you only want to pull a few fields at well-defined paths, and you can analyze one input file after the other, MongoDB probably will not suffer much.
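If the input really is that regular, the conversion can stay trivial. A sketch under that assumption; the element names (`id`, `price`, `category`) and the database/collection names are made up:

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from pymongo import MongoClient

coll = MongoClient()["mydb"]["docs"]  # hypothetical database/collection names

for path in Path("inbox").glob("*.xml"):
    root = ET.parse(path).getroot()
    # Pull only a few fields at well-defined paths and ignore the rest,
    # so schema changes elsewhere in the documents cannot break the load.
    coll.insert_one({
        "source": path.name,
        "id": root.findtext("id"),
        "price": float(root.findtext("price") or 0),
        "category": root.findtext("category"),
    })

# The averages and pie-chart counts then become simple aggregations:
avg = list(coll.aggregate([{"$group": {"_id": None, "avg": {"$avg": "$price"}}}]))
pie = list(coll.aggregate([{"$group": {"_id": "$category", "n": {"$sum": 1}}}]))
```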

Carrying XML into the Cloud

If you want to combine an XML database's capabilities for analyzing the data with a NoSQL system's capabilities for distributing the work, you could run the XML database on top of that system.

BaseX is moving to the cloud with exactly the capabilities you need, but it will probably still take some time for that feature to become production-ready.
