Question

I'm attempting to create my own corpus in NLTK. I've been reading some of the documentation on this and it seems rather complicated... all I want to do is "clone" the movie reviews corpus, but with my own text. Now, I know I can just replace the files in the movie reviews corpus with my own... but that limits me to working with just one such corpus at a time (i.e., I'd have to keep swapping files). Is there any way I could just clone the movie reviews corpus?

Thanks, Alex

Solution

The movie reviews are read with the CategorizedPlaintextCorpusReader class. Use it directly to load your own corpus. The following should work for a corpus laid out exactly like movie_reviews (a neg/ and a pos/ subdirectory of .txt files):

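from nltk.corpus.reader import CategorizedPlaintextCorpusReader
# path_to_your_reviews: root directory that contains your neg/ and pos/ folders
mr = CategorizedPlaintextCorpusReader(path_to_your_reviews, r'(?!\.).*\.txt',
        cat_pattern=r'(neg|pos)/.*')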

Whatever matches the parenthesized group inside cat_pattern becomes the category: in this case, neg or pos. If your corpus has different categories (e.g., movie genres rather than positive/negative evaluations), change the directory structure and adjust the cat_pattern parameter to match.
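To make the genre case concrete, here is a minimal sketch, assuming a corpus with comedy/ and drama/ subdirectories. The file names, sample texts, and helper functions (build_toy_corpus, summarize) are made up for illustration; only CategorizedPlaintextCorpusReader and its categories() and fileids() methods come from NLTK.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader


def build_toy_corpus(root):
    """Write a few hypothetical review files under root/<genre>/."""
    samples = {
        "comedy/c1.txt": "A funny film.",
        "drama/d1.txt": "A slow, sad story.",
        "drama/d2.txt": "Heavy but rewarding.",
    }
    for rel_path, text in samples.items():
        path = os.path.join(root, *rel_path.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)


def summarize(root):
    """Load the corpus and map each category to its file ids."""
    reader = CategorizedPlaintextCorpusReader(
        root,
        r"(?!\.).*\.txt",                 # same file pattern as movie_reviews
        cat_pattern=r"(comedy|drama)/.*",  # group 1 is the category name
    )
    return {cat: reader.fileids(categories=cat) for cat in reader.categories()}


with tempfile.TemporaryDirectory() as root:
    build_toy_corpus(root)
    summary = summarize(root)

print(summary)
```

Because each reader is just an object bound to its own root directory, you can create as many of these as you like side by side, which avoids swapping files in and out of the original movie_reviews folder.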

PS. For categorized corpora with a different structure, NLTK offers several ways to specify the categories (cat_pattern, cat_map, or cat_file); read the documentation of CategorizedPlaintextCorpusReader.

Other tips

Why don't you define a new corpus by copying the definition of movie_reviews in nltk.corpus? You can do this as many times as you want with new directories: copy the directory structure and replace the files.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow