I have a corpus of several hundred documents and I am using NLTK's PlaintextCorpusReader to process them. The only problem is that I need to handle one file at a time in a for loop so I can compute the similarity of these documents.

If I initialize the reader like this: corpusReader = PlaintextCorpusReader(root, fileids=".*"), it just consumes all the documents at once, and I can't find a way to iterate over files instead of tokens.

One solution would be to initialize a new corpusReader for each file, iterate over its tokens, and then create another reader for the next file, but I don't think that is a very efficient way to process such a large amount of data.

Thanks for any advice :)


Solution

Ask the corpus for a list of its files and request the text one file at a time, like this:

import nltk

for fname in corpusReader.fileids():
    # nltk.batch_pos_tag was renamed to nltk.pos_tag_sents in NLTK 3
    tagged = nltk.pos_tag_sents(corpusReader.sents(fname))
    with open("tagged/" + fname, "w") as out:
        for sent in tagged:
            out.write(" ".join(word + "/" + tag for word, tag in sent) + "\n")
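The same per-file iteration also covers the asker's original goal of comparing documents: fetch each file's tokens via fileids() and compare the files pairwise. Below is a minimal sketch; the jaccard and pairwise_similarity helper names are my own, and Jaccard similarity on token sets is just one illustrative metric, not something the answer prescribes.

from itertools import combinations

def jaccard(tokens_a, tokens_b):
    # Jaccard similarity: size of the token-set intersection over the union.
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def pairwise_similarity(reader):
    # Yield (fileid_1, fileid_2, score) for every pair of corpus files,
    # pulling tokens one file at a time with reader.words(fname).
    for f1, f2 in combinations(reader.fileids(), 2):
        yield f1, f2, jaccard(reader.words(f1), reader.words(f2))

Because reader.words(fname) restricts the reader to a single file, this avoids rebuilding a PlaintextCorpusReader per document.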
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow