Problem

I have been examining different sources on the web and have tried various methods, but could only find how to count the frequency of unique words, not unique phrases. The code I have so far is as follows:

import collections
import re

wanted = set(['inflation', 'gold', 'bank'])
cnt = collections.Counter()
words = re.findall(r'\w+', open('02.2003.BenBernanke.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)

If possible, I would also like to count the number of times the phrases 'central bank' and 'high inflation' are used in this text. I appreciate any suggestions or guidance you can give.

Solution

First of all, this is how I would generate the cnt that you do (to reduce memory overhead):

def findWords(filepath):
    with open(filepath) as infile:
        for line in infile:
            # yield the words of each line one at a time,
            # so the whole file never has to be held in memory
            yield from re.findall(r'\w+', line.lower())

cnt = collections.Counter(findWords('02.2003.BenBernanke.txt'))

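Since cnt now holds a count for every word in the file, the totals for the words you care about can simply be read off afterwards; a minimal sketch:

wanted = {'inflation', 'gold', 'bank'}
print({word: cnt[word] for word in wanted})
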
Now, on to your question about phrases:

from itertools import tee

phrases = {'central bank', 'high inflation'}
# two copies of the word stream; advancing the second by one word
# makes zip() produce consecutive word pairs
fw1, fw2 = tee(findWords('02.2003.BenBernanke.txt'))
next(fw2)
for w1, w2 in zip(fw1, fw2):
    phrase = ' '.join([w1, w2])
    if phrase in phrases:
        cnt[phrase] += 1

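If you later need phrases longer than two words, the tee/zip pairing stops scaling nicely; here is a minimal sliding-window sketch building on findWords() above (countPhrases is a hypothetical helper, not part of the answer's code):

from collections import Counter, deque

def countPhrases(filepath, phrases):
    phrase_words = [tuple(p.split()) for p in phrases]
    longest = max(len(p) for p in phrase_words)
    window = deque(maxlen=longest)   # keeps only the most recent words
    counts = Counter()
    for word in findWords(filepath):
        window.append(word)
        for p in phrase_words:
            # compare the trailing len(p) words of the window to the phrase
            if len(window) >= len(p) and tuple(window)[-len(p):] == p:
                counts[' '.join(p)] += 1
    return counts

print(countPhrases('02.2003.BenBernanke.txt',
                   {'central bank', 'high inflation'}))
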
Hope this helps

Other tips

To count literal occurrences of a couple of phrases in a small file:

with open("input_text.txt") as file:
    text = file.read()
n = text.count("high inflation rate")

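The same approach extends to several phrases at once; note that str.count is a literal, case-sensitive substring match and only counts non-overlapping occurrences. A minimal sketch:

phrase_counts = {phrase: text.count(phrase)
                 for phrase in ("central bank", "high inflation")}
print(phrase_counts)
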
There is an nltk.collocations module that provides tools to identify words that often appear consecutively:

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder

# run nltk.download() if there are files missing
# `text` is the file content read in the previous snippet
words = [word.casefold() for sentence in sent_tokenize(text)
         for word in word_tokenize(sentence)]
words_fd = nltk.FreqDist(words)
bigram_fd = nltk.FreqDist(nltk.bigrams(words))
finder = BigramCollocationFinder(words_fd, bigram_fd)
bigram_measures = nltk.collocations.BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 5))
print(finder.score_ngrams(bigram_measures.raw_freq))

# the finder can also be constructed from the words directly
finder = TrigramCollocationFinder.from_words(words)
# drop n-grams containing any word outside the question's `wanted` set
finder.apply_word_filter(lambda w: w not in wanted)
# top n results by raw frequency
trigram_measures = nltk.collocations.TrigramAssocMeasures()
print(sorted(finder.nbest(trigram_measures.raw_freq, 2)))

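If you only need raw counts for specific two-word phrases rather than collocation scores, the bigram frequency distribution built above already contains them; a minimal sketch, assuming the words list from the previous snippet:

for phrase in ('central bank', 'high inflation'):
    print(phrase, bigram_fd[tuple(phrase.split())])
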
Assuming the file is not huge, this is the easiest way:

phrases = {"central bank", "high inflation"}
cnt = collections.Counter()
for w1, w2 in zip(words, words[1:]):
    phrase = w1 + " " + w2
    if phrase in phrases:
        cnt[phrase] += 1
print(cnt)
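
The same zip pattern extends to longer phrases by adding further shifted slices; a minimal sketch for three-word phrases (the three_word_phrases set is illustrative):

three_word_phrases = {"high inflation rate"}
for w1, w2, w3 in zip(words, words[1:], words[2:]):
    phrase = " ".join((w1, w2, w3))
    if phrase in three_word_phrases:
        cnt[phrase] += 1
print(cnt)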