How can I count words in an NLTK PlaintextCorpus faster?

Question

I have a set of documents, and I want to return a list of tuples, where each tuple contains the date of a given document and the number of times a given search term appears in that document. My code (below) works, but it is slow, and I'm a n00b. Are there obvious ways to make this faster? Any help would be much appreciated, mostly so that I can learn better coding, but also so that I can finish this project sooner!
import nltk
from nltk.corpus import PlaintextCorpusReader

def searchText(searchword):
    counts = []
    corpus_root = 'some_dir'
    wordlists = PlaintextCorpusReader(corpus_root, '.*')
    for id in wordlists.fileids():
        # Filenames embed the date as YYYYMMDD at characters 4-12
        date = id[4:12]
        month = date[-4:-2]
        day = date[-2:]
        year = date[:4]
        raw = wordlists.raw(id)
        tokens = nltk.word_tokenize(raw)
        text = nltk.Text(tokens)
        count = text.count(searchword)
        counts.append((month, day, year, count))
    return counts
Solution

If all you want is a frequency count for a word, you don't need to create an nltk.Text object, or even go through nltk.PlaintextCorpusReader. Instead, go straight to nltk.FreqDist.
files = list_of_files
fd = nltk.FreqDist()
for file in files:
    with open(file) as f:
        text = f.read().lower()          # read the file; lowercase for case-insensitive counts
        for sent in nltk.sent_tokenize(text):
            for word in nltk.word_tokenize(sent):
                fd[word] += 1            # was fd.inc(word) in NLTK 2.x
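To get back to the original goal of one (month, day, year, count) tuple per document, you can build one FreqDist per file and look up the search word in it. The following is a minimal sketch under the question's assumptions: filenames encode the date as YYYYMMDD at characters 4-12, and list_of_files / searchword are placeholders for your actual inputs.

import os
import nltk

def search_counts(searchword, list_of_files):
    counts = []
    for path in list_of_files:
        fd = nltk.FreqDist()
        with open(path) as f:
            text = f.read().lower()
        # Tokenization is the expensive step; it happens once per file here
        for sent in nltk.sent_tokenize(text):
            for word in nltk.word_tokenize(sent):
                fd[word] += 1
        name = os.path.basename(path)
        date = name[4:12]                          # YYYYMMDD, as in the question's code
        month, day, year = date[4:6], date[6:8], date[:4]
        counts.append((month, day, year, fd[searchword.lower()]))
    return counts

Since FreqDist behaves like a Counter, fd[searchword.lower()] returns 0 when the word is absent rather than raising a KeyError.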
Or, if you don't want to do any further analysis, just use a dict.
files = list_of_files
fd = {}
for file in files:
    with open(file) as f:
        text = f.read().lower()
        for sent in nltk.sent_tokenize(text):
            for word in nltk.word_tokenize(sent):
                try:
                    fd[word] = fd[word] + 1
                except KeyError:
                    fd[word] = 1
Both of these could be made more efficient with generator expressions, but I've used for-loops here for readability.
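For reference, a generator-expression version might look like the sketch below. It is an assumption-laden rewrite rather than part of the original answer: it relies on FreqDist accepting any iterable of tokens (in NLTK 3 it is a subclass of collections.Counter), and it leaves the file objects to be closed by the garbage collector, which is one reason the explicit loops above are easier to read.

import nltk

files = list_of_files   # placeholder, as above

# Feed a single generator over all files straight into FreqDist
fd = nltk.FreqDist(
    word
    for file in files
    for sent in nltk.sent_tokenize(open(file).read().lower())
    for word in nltk.word_tokenize(sent)
)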