Question

I have a huge database of forum data. I need to extract corpora from the database for NLP purposes. The extracting step has parameters (for example FTS queries), and I'd like to save the corpus with the parameter metadata on the file system.

Some corpora will be dozens of megabytes in size. What is the best way of saving a file with its metadata, so that I can read the metadata without loading the entire file?

I am using the following technologies, which might be relevant: PyQt, Postgres, Python, NLTK.

Some notes:

  1. I want the corpus to be divorced from a heavyweight database.
  2. I'd prefer not to use SQLite, as the metadata is very simple in structure.
  3. Pickling doesn't allow partial unserialization from what I can tell.
  4. I'd prefer not to have a separate metadata file.
  5. I have experience with Protocol Buffers, but again it seems far too heavy-handed.

I guess I could pickle the metadata to a string and have the first line of the file represent the metadata. That seems like the simplest approach, provided the pickled metadata can be kept to a single ASCII-safe line (see the sketch below).
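
To make that concrete, here is a rough sketch of what I have in mind (pickle output can span multiple lines, so base64-encoding the header keeps it to one ASCII line; the file name and metadata fields are just examples):

```python
import base64
import pickle

def write_corpus(path, metadata, text):
    # Base64-encode the pickled metadata so the header stays on one line.
    header = base64.b64encode(pickle.dumps(metadata)).decode("ascii")
    with open(path, "w", encoding="utf-8") as f:
        f.write(header + "\n")
        f.write(text)

def read_metadata(path):
    # Only the first line is read; the corpus body is never loaded.
    with open(path, encoding="utf-8") as f:
        return pickle.loads(base64.b64decode(f.readline()))

# write_corpus("forum_2013.txt", {"fts_query": "python & nltk"}, corpus_text)
# read_metadata("forum_2013.txt")  # -> {"fts_query": "python & nltk"}
```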

Solution

In NLTK terminology, a "corpus" is the whole collection and can consist of multiple files. It sounds like you can store each forum session (what you call a "corpus") in a separate file, using a structured format that allows you to store metadata at the beginning of the file.

The NLTK generally uses XML for this purpose, but it's not hard to roll your own corpus reader that reads a file header and then defers to PlaintextCorpusReader, or whatever standard reader best fits your file format. If you use XML, you'll also have to extend XMLCorpusReader and provide methods like sents(), words(), etc.
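
As a minimal sketch of that idea (without the full NLTK corpus-reader machinery): a small wrapper that reads a one-line JSON header and hands the rest of the file to NLTK's tokenizers. It assumes the punkt models are installed, and the class and field names are only illustrative:

```python
import json
import nltk  # word_tokenize/sent_tokenize need the 'punkt' models

class HeaderedCorpus:
    def __init__(self, path):
        self.path = path
        with open(path, encoding="utf-8") as f:
            # Read only the first line; the corpus body stays on disk.
            self.meta = json.loads(f.readline())

    def text(self):
        with open(self.path, encoding="utf-8") as f:
            f.readline()  # skip the metadata header
            return f.read()

    def words(self):
        return nltk.word_tokenize(self.text())

    def sents(self):
        return [nltk.word_tokenize(s) for s in nltk.sent_tokenize(self.text())]
```

A real reader would subclass NLTK's CorpusReader and stream the file rather than reading it whole, but the header-then-defer structure is the same.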

OTHER TIPS

Why not add a JSON header to your corpus file? Or any other kind of structured format; the YAML front matter in Jekyll posts comes to mind.
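
A sketch of what that could look like, with JSON between Jekyll-style `---` delimiters so the header can be parsed without touching the corpus body (the delimiter and field names are made up for illustration):

```python
import json

DELIM = "---\n"

def dump_with_front_matter(path, metadata, text):
    with open(path, "w", encoding="utf-8") as f:
        f.write(DELIM)
        f.write(json.dumps(metadata, indent=2) + "\n")
        f.write(DELIM)
        f.write(text)

def load_front_matter(path):
    # Read lines only up to the closing delimiter; the body is never loaded.
    with open(path, encoding="utf-8") as f:
        assert f.readline() == DELIM
        header_lines = []
        for line in f:
            if line == DELIM:
                break
            header_lines.append(line)
        return json.loads("".join(header_lines))
```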

Licensed under: CC-BY-SA with attribution