Question

I have a corpus of several hundred documents and I am using NLTK's PlaintextCorpusReader to process them. The only problem is that I need to handle one file at a time in a for loop so I can compute the similarity of these documents.

If I initialize the reader like this: corpusReader = PlaintextCorpusReader(root, fileids=".*"), it just consumes all the documents at once, and I can't find a way to iterate over files instead of tokens.

One solution could be to initialize a new corpusReader for each file, iterate over its tokens, and then create yet another reader for the next file, but I don't think that is a very efficient way to process such a large amount of data.

Thanks for any advice :)

Solution

Ask the corpus for a list of its files and request the text one file at a time, like this:

for fname in corpusReader.fileids():
    # batch_pos_tag was renamed to pos_tag_sents in NLTK 3
    tagged = nltk.pos_tag_sents(corpusReader.sents(fname))
    with open("tagged/" + fname, "w") as out:
        for sent in tagged:
            out.write(" ".join(word + "/" + tag for word, tag in sent) + "\n")
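Since the original goal was document similarity, here is a minimal self-contained sketch of the same per-file pattern: fileids() drives the loop, and words(fname) restricts the reader to a single document. The two-document corpus, the file names, and the Jaccard metric are illustrative assumptions, not part of the original question; PlaintextCorpusReader.words() uses its default tokenizer, so no extra NLTK data downloads are needed.

```python
import os
import tempfile
from nltk.corpus import PlaintextCorpusReader

# Build a tiny throwaway corpus (two hypothetical documents)
root = tempfile.mkdtemp()
docs = {"a.txt": "the cat sat on the mat",
        "b.txt": "the cat lay on the rug"}
for name, text in docs.items():
    with open(os.path.join(root, name), "w") as f:
        f.write(text)

corpusReader = PlaintextCorpusReader(root, fileids=".*")

# One token set per file: fileids() lets us visit documents individually
token_sets = {fname: set(corpusReader.words(fname))
              for fname in corpusReader.fileids()}

def jaccard(a, b):
    """Jaccard similarity of two token sets (an example metric)."""
    return len(a & b) / len(a | b)

sim = jaccard(token_sets["a.txt"], token_sets["b.txt"])
```

Any pairwise metric can be dropped in place of jaccard(); the point is only that each call to words(fname) touches a single file rather than the whole corpus.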
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow