Question

I have a large set of strings that I'm using for natural language processing research, and I'd like a nice way to store it in Python.

I could use pickle, but loading the entire list into memory would then be an impossibility (I believe), as it's about 10 GB large, and I don't have that much main memory. Currently I have the list stored with the shelve library... The shelf is indexed by strings, "0", "1", ..., "n" which is a bit clunky.

Are there nicer ways to store such an object in a single file, and still have random (ish) access to it?

It may be that the best option is to split it into multiple lists.

Thanks!


Solution

Depending on how you intend to get at the data, SQLite3 might be the best approach. SQLite3 is excellent at random access to relational data; if your data is not very relational, it may make less sense. (Even if all you have is an 'id' number and then your string, I think SQLite3 would work well as the underlying storage for your strings.)

If you can find a way to group your strings according to how you'd use them (say, some of your sentences have implied objects or subjects and you'd like to study those specifically; or you group by the source of the strings, whether formal, informal, or hyperinformal), then partitioning the data could shrink your 'working set' significantly and potentially improve the throughput of your research drastically. But if you intend truly random access, then one big pile might be best.
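As a rough sketch of the partitioning idea, the strings could carry a label column that is indexed, so a query touches only one group at a time. The table name `sentences`, the `register` column, the sample rows, and the helper `working_set` are all made up for illustration; the in-memory database stands in for a file on disk.

```python
import sqlite3

# Hypothetical schema: each string carries a "register" label used to partition the data.
conn = sqlite3.connect(":memory:")  # pass a file path instead for real data
conn.execute(
    "CREATE TABLE sentences ("
    "id INTEGER PRIMARY KEY, register TEXT NOT NULL, value TEXT NOT NULL)"
)
# The index lets SQLite find one partition without scanning every row.
conn.execute("CREATE INDEX idx_sentences_register ON sentences (register)")

rows = [
    ("formal", "We hereby acknowledge receipt of your letter."),
    ("informal", "Got your note, thanks!"),
    ("formal", "Please find the report attached."),
    ("hyperinformal", "k thx"),
]
conn.executemany("INSERT INTO sentences (register, value) VALUES (?, ?)", rows)
conn.commit()


def working_set(conn, register):
    """Yield only the strings in one partition, one row at a time."""
    cursor = conn.execute(
        "SELECT value FROM sentences WHERE register = ? ORDER BY id", (register,)
    )
    for (value,) in cursor:
        yield value


formal = list(working_set(conn, "formal"))
```

Because `working_set` is a generator, only the rows of the requested partition are read, and they are never all held in memory unless the caller collects them.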

Hope this helps.

OTHER TIPS

You could consider using a database; maybe a sentence or string table with one row for each string.

With the help of an object-relational mapper (e.g. SQLAlchemy) you could have an object-oriented view of the data and iterate over the strings, or work with larger subsets of your data sequentially (if that's applicable to your task).
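Sequential access in manageable subsets does not strictly need an ORM. The sketch below uses the standard-library `sqlite3` module rather than SQLAlchemy, and assumes a hypothetical table `strings(idx, value)`; `iter_batches` is an illustrative helper, not part of any library.

```python
import sqlite3


def iter_batches(conn, batch_size=1000):
    """Yield lists of (idx, value) rows, at most batch_size rows at a time."""
    cursor = conn.execute("SELECT idx, value FROM strings ORDER BY idx")
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


# Small in-memory demo; a real data set would live in a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strings (idx INTEGER PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO strings (idx, value) VALUES (?, ?)",
    ((i, f"sentence {i}") for i in range(5)),
)
conn.commit()

batch_sizes = [len(batch) for batch in iter_batches(conn, batch_size=2)]
```

Only one batch is held in Python at a time, so memory use is bounded by `batch_size` rather than by the size of the whole table.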

Furthermore, you could store additional data for each sentence to gain more fine-grained control over the sets of items you want to work with.

I would say use shelve (which sits on top of a dbm-style backend such as Berkeley DB) or SQLite3.
I would go with SQLite3. For a simple list, a table like CREATE TABLE list(idx int primary key, value text); should be enough.
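To avoid the clunky string keys from the question, a table like this can be wrapped in a small read-mostly class that behaves like a list. This is only a sketch: the class name `StringStore` and its `extend` method are hypothetical, and it assumes indices are contiguous from 0 with no deletions. It declares the key as `INTEGER PRIMARY KEY`, which SQLite treats as an alias for the rowid, so lookups by index go straight to the row.

```python
import sqlite3
from collections.abc import Sequence


class StringStore(Sequence):
    """List-like, disk-backed view of a table of strings (illustrative sketch)."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS list (idx INTEGER PRIMARY KEY, value TEXT)"
        )

    def __len__(self):
        (count,) = self.conn.execute("SELECT COUNT(*) FROM list").fetchone()
        return count

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("only integer indices are supported in this sketch")
        if index < 0:
            index += len(self)  # support negative indices like a list
        row = self.conn.execute(
            "SELECT value FROM list WHERE idx = ?", (index,)
        ).fetchone()
        if row is None:
            raise IndexError(index)
        return row[0]

    def extend(self, strings):
        """Append strings, numbering them after the current last index."""
        start = len(self)
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(
                "INSERT INTO list (idx, value) VALUES (?, ?)",
                ((start + i, s) for i, s in enumerate(strings)),
            )


store = StringStore(":memory:")  # pass a file path to persist the data
store.extend(["first sentence", "second sentence", "third sentence"])
second = store[1]
last = store[-1]
```

Each `store[i]` issues one indexed query, so only the requested string is loaded into memory.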

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow