“Cloning” a corpus in NLTK?

https://stackoverflow.com/questions/10874994

12-06-2021

Question

I'm attempting to create my own corpus in NLTK. I've been reading some of the documentation on this and it seems rather complicated... All I want to do is "clone" the movie reviews corpus but with my own text. Now, I know I can just replace the files in the movie reviews corpus with my own, but that limits me to working with just one such corpus at a time (i.e., I'd have to keep swapping files). Is there any way I could just clone the movie reviews corpus?

Thanks, Alex

Solution

The movie reviews are read with the CategorizedPlaintextCorpusReader class. Use it directly to load your corpus. The following should work for an exact copy of the movies corpus:

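from nltk.corpus.reader import CategorizedPlaintextCorpusReader
# path_to_your_reviews: the directory that contains your neg/ and pos/ subfolders
mr = CategorizedPlaintextCorpusReader(path_to_your_reviews, r'(?!\.).*\.txt',
        cat_pattern=r'(neg|pos)/.*')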

Whatever the parenthesized group in cat_pattern matches becomes the category: in this case, neg or pos. If your corpus has different categories (e.g., movie genres rather than positive/negative evaluations), change the directory structure and adjust the cat_pattern parameter to match.
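To see how the pattern picks out a category, here is a rough standalone sketch using only the re module. It is an illustration of the idea, not NLTK's own code: category_of is a hypothetical helper, the file ids are made up, and returning None for a non-matching file is a choice made for this example only.

```python
import re

# The category pattern from the answer; group 1 is the category name.
CAT_PATTERN = r'(neg|pos)/.*'

def category_of(file_id, cat_pattern=CAT_PATTERN):
    """Hypothetical helper: return the text captured by group 1, or None."""
    match = re.match(cat_pattern, file_id)
    return match.group(1) if match else None

# Made-up file ids, written relative to the corpus root.
for file_id in ['neg/cv000.txt', 'pos/cv001.txt', 'notes/readme.txt']:
    print(file_id, '->', category_of(file_id))

# A genre-based layout only needs a different group, for example:
GENRE_PATTERN = r'(comedy|horror|drama)/.*'
print(category_of('horror/alien.txt', GENRE_PATTERN))
```

The point is that the category comes from the file's path, so renaming the top-level folders and the alternatives in the group is all a genre-based corpus should need.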

PS. For categorized corpora with a different structure, NLTK offers several ways to specify the categories (the cat_pattern, cat_map, and cat_file arguments); read the documentation of CategorizedPlaintextCorpusReader.
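As one example of a different structure, the sketch below loads a flat directory (no category subfolders) and assigns categories explicitly through the cat_map argument. The file names, texts, and genre labels are invented for illustration, and the corpus is written to a temporary directory so the snippet is self-contained.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Build a tiny throwaway corpus with a flat layout (no neg/ or pos/ folders).
# The file names and texts are made up for this example.
root = tempfile.mkdtemp()
texts = {
    'alien.txt': 'A tense and frightening film.',
    'airplane.txt': 'A very silly and funny film.',
}
for name, text in texts.items():
    with open(os.path.join(root, name), 'w', encoding='utf-8') as f:
        f.write(text)

# cat_map assigns categories explicitly: file id -> list of categories.
genres = CategorizedPlaintextCorpusReader(
    root,
    r'(?!\.).*\.txt',
    cat_map={'alien.txt': ['horror'], 'airplane.txt': ['comedy']},
)

print(genres.categories())
print(genres.fileids(categories='horror'))
print(genres.words(categories='comedy')[:3])
```

A mapping like this suits a small corpus; for a large one, the directory-based cat_pattern approach is probably less work to maintain.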

OTHER TIPS

Why don't you define a new corpus by copying the definition of movie_reviews in nltk.corpus? You can do this as many times as you want with new directories: copy the directory structure and replace the files.
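A minimal sketch of that approach, assuming the new corpus lives in a corpora/my_reviews folder inside a directory on nltk.data.path. The name my_reviews, the temporary data directory, and the sample texts are all made up for this example; the reader class and patterns are the ones quoted in the accepted answer.

```python
import os
import tempfile

import nltk
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.corpus.util import LazyCorpusLoader

# Create a throwaway NLTK data directory containing corpora/my_reviews/{neg,pos}.
# "my_reviews" and the sample texts are made up for this example.
data_dir = tempfile.mkdtemp()
samples = {'neg': 'A dull and tedious film.', 'pos': 'A bright and charming film.'}
for category, text in samples.items():
    folder = os.path.join(data_dir, 'corpora', 'my_reviews', category)
    os.makedirs(folder)
    with open(os.path.join(folder, 'review1.txt'), 'w', encoding='ascii') as f:
        f.write(text)

# Let NLTK search the new data directory.
nltk.data.path.append(data_dir)

# Same reader class and patterns as movie_reviews, under a different corpus name.
my_reviews = LazyCorpusLoader(
    'my_reviews',
    CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt',
    cat_pattern=r'(neg|pos)/.*',
)

print(my_reviews.categories())
print(my_reviews.fileids(categories='pos'))
```

Each additional corpus would get its own folder name and its own loader, so several of them can be used side by side without swapping files.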

Licensed under: CC-BY-SA with attribution
