Question

My application creates pieces of data that, in xml, would look like this:

<resource url="someurl">
   <term>
      <name>somename</name>
      <frequency>somenumber</frequency>
   </term>    
   ...
   ...
   ...
</resource>

This is how I'm storing these "resources" now: one resource per XML file, with as many "term" elements per resource as needed. The problem is that I'll need to generate about 2 million of these resources. I've generated almost 500,000 so far, and my Mac isn't very happy about it. So my question is: how should I store this data?

  • A database? That would be hard, because the structure of the data isn't fixed...
  • Maybe merge some resources into larger XML files?
  • ...?

I don't need to change the data once it's created. Right now I'm accessing a specific resource by the name of that resource's file.

Any suggestions are greatly appreciated!


Solution

Not all databases are relational. Have a look at MongoDB, for example. It stores your data as JSON-like documents, similar to your resources.

An example using the shell:

$ mongo
> db.resources.save({url: "someurl", 
                     terms: [{name: "name1", frequency: 17.0},
                             {name: "name2", frequency: 42.0}]})
> db.resources.find()
{"_id" :  ObjectId( "4b00884b3a77b8b2fa3a8f77"), 
 "url" : "someurl" , 
 "terms" : [{"name" : "name1" , "frequency" : 17},
            {"name" : "name2" , "frequency" : 42}]}

OTHER TIPS

If you can't predict how your data is going to be organized, CouchDB (http://couchdb.apache.org/) may be interesting for you. It is a schema-less database.

Anyway, XML may not be the best choice for a large amount of data.

Maybe JSON or YAML would work out better? They need less space and are easier to parse (however, I have no experience using those formats at a larger scale, so I may be wrong).
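To get a rough feel for the size difference, here is a minimal sketch that serializes one resource both ways using only the Python standard library. The sample values and the helper names to_xml and to_json are made up for illustration, and a single tiny resource says nothing definitive about 2 million real ones.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical resource, shaped like the one in the question.
resource = {
    "url": "someurl",
    "terms": [
        {"name": "name1", "frequency": 17},
        {"name": "name2", "frequency": 42},
    ],
}

def to_xml(res):
    """Serialize the resource using the XML layout from the question."""
    root = ET.Element("resource", url=res["url"])
    for term in res["terms"]:
        t = ET.SubElement(root, "term")
        ET.SubElement(t, "name").text = term["name"]
        ET.SubElement(t, "frequency").text = str(term["frequency"])
    return ET.tostring(root, encoding="unicode")

def to_json(res):
    """Serialize the same resource as compact JSON (no extra whitespace)."""
    return json.dumps(res, separators=(",", ":"))

xml_text = to_xml(resource)
json_text = to_json(resource)

# Print both lengths in characters so they can be compared.
print("XML: ", len(xml_text))
print("JSON:", len(json_text))
```

YAML is left out only because the standard library has no YAML module.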

You should definitely put several resources in each XML file, but only if you expect to need all of those resources together at the same time. If you only need to send a handful of resources to someone, keep generating the individual XML files.

Even in that situation, you could keep the large XML file and generate the smaller ones on demand from the original dataset.
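One way this could look, as a sketch only: merge the resources under a wrapper element and stream the big file to pull out a single resource when it is asked for. The wrapper element name "resources", the file name, and the helpers write_merged and extract_resource are assumptions for illustration, not anything from the question.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def write_merged(path, resources):
    """Write many resources into one XML file under a <resources> root."""
    root = ET.Element("resources")
    for url, terms in resources.items():
        res = ET.SubElement(root, "resource", url=url)
        for name, frequency in terms:
            term = ET.SubElement(res, "term")
            ET.SubElement(term, "name").text = name
            ET.SubElement(term, "frequency").text = str(frequency)
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

def extract_resource(path, url):
    """Stream the merged file; return one <resource> as an XML string, or None."""
    with open(path, "rb") as f:
        for _event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == "resource":
                if elem.get("url") == url:
                    return ET.tostring(elem, encoding="unicode")
                elem.clear()  # drop the children of resources we skip
    return None

# Hypothetical data: two resources merged into one file.
merged = os.path.join(tempfile.mkdtemp(), "resources.xml")
write_merged(merged, {
    "url1": [("name1", 17)],
    "url2": [("name2", 42), ("name3", 7)],
})

single = extract_resource(merged, "url2")
print(single)
```

Note that every lookup scans the file from the start, so this fits "hand over a few resources now and then" better than frequent random access.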

Using a database like SQLite3 would allow you to have faster seek times and easier manipulation of the data, using SQL syntax.
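As a rough sketch of what that could look like with Python's built-in sqlite3 module: one table for resources and one for terms, so a resource can have any number of terms without the schema changing. The table names, column names, and helper functions below are assumptions made for illustration.

```python
import sqlite3

def create_schema(conn):
    """Create one table for resources and one for their terms."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS resource (
            id  INTEGER PRIMARY KEY,
            url TEXT NOT NULL UNIQUE
        );
        CREATE TABLE IF NOT EXISTS term (
            resource_id INTEGER NOT NULL REFERENCES resource(id),
            name        TEXT NOT NULL,
            frequency   REAL NOT NULL
        );
        CREATE INDEX IF NOT EXISTS term_by_resource ON term(resource_id);
    """)

def add_resource(conn, url, terms):
    """Insert a resource and its (name, frequency) pairs in one transaction."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO resource (url) VALUES (?)", (url,))
        conn.executemany(
            "INSERT INTO term (resource_id, name, frequency) VALUES (?, ?, ?)",
            [(cur.lastrowid, name, freq) for name, freq in terms],
        )

def get_terms(conn, url):
    """Return the (name, frequency) rows for one resource, in insertion order."""
    rows = conn.execute(
        "SELECT t.name, t.frequency FROM term t "
        "JOIN resource r ON r.id = t.resource_id "
        "WHERE r.url = ? ORDER BY t.rowid",
        (url,),
    )
    return rows.fetchall()

# In-memory database for the demo; pass a file path to keep the data on disk.
conn = sqlite3.connect(":memory:")
create_schema(conn)
add_resource(conn, "someurl", [("name1", 17), ("name2", 42)])
print(get_terms(conn, "someurl"))
```

The UNIQUE constraint on url gives an index for looking a resource up by its URL, which takes over the role the file name plays in the current setup.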

Licensed under: CC-BY-SA with attribution