Question

How can I find the frequency of each of these annotations (author, year, lang), and also the frequency of occurrence of their unigrams, bigrams, trigrams, ... n-grams? For example, given:

"<author>James Parker</author><year>2008</year><lang>English</lang>"
"<author>Van Wie</author><year>2002</year>"
"<year>2012</year><lang>English</lang>"
"<year>2002</year><lang>French</lang>"


import pandas as pd

file = 'file.csv'
df = pd.read_csv(file)
lines = df['query']
for line in lines:
    # calculate tag frequency
    # calculate frequencies of unigram, bigram, trigram, ... n-gram tags
    pass

Expected output:

author: 3, year: 4, lang: 3
trigram: author, year, lang: 1
bigram: author, year: 1
bigram: year, lang: 2

Solution

If I'm reading this correctly, you only count one n-gram per line, so the line

"<author>James Parker</author><year>2008</year><lang>English</lang>" 

has one trigram and three unigrams. You don't need every sub-combination of tags for each line.
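To make that reading concrete, here is a minimal sketch for the single line quoted above. It assumes well-formed input and captures only the names of opening tags; the variable names are illustrative, not part of the original question.

```python
import re

line = "<author>James Parker</author><year>2008</year><lang>English</lang>"

# Capture the names of opening tags only; closing tags start with "</"
# and therefore never match.
tags = re.findall(r"<([^/>]+)>", line)

unigrams = tags        # each tag on the line counts once as a unigram
ngram = tuple(tags)    # the whole line contributes exactly one n-gram

print(unigrams)  # ['author', 'year', 'lang']
print(ngram)     # ('author', 'year', 'lang')
```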

The simplest way to count this is to use a dictionary, keyed by either a tag name or a tuple of tag names, to store the counts. That gives you a single pass over the input and should scale well with the number of lines. I use a regular expression to pull out the opening tag of each pair (this means the input has to be well formed), then index into the counter first by each tag name and then by the n-tuple formed from the line's sorted tag names.

import collections
import re

string = """<author>James Parker</author><year>2008</year><lang>English</lang>
<author>Van Wie</author><year>2002</year>
<year>2012</year><lang>English</lang>
<year>2002</year><lang>French</lang>"""

strings = string.split("\n")
counter = collections.Counter()

# match opening tags only, e.g. "<author>"; closing tags contain "/" and are skipped
tag_re = r"<[^/>]*>"
for s in strings:
    tags = re.findall(tag_re, s)
    # sort so the same set of tags always yields the same tuple
    tags.sort()
    # count each tag by name
    for tag in tags:
        counter[tag] += 1
    # count the line's full tag tuple as its n-gram
    ngram = tuple(tags)
    counter[ngram] += 1

print(counter)

This prints the following (the tags are sorted, so each n-gram key lists its tag names alphabetically):

Counter({'<year>': 4, '<lang>': 3, '<author>': 2, ('<lang>', '<year>'): 2, ('<author>', '<lang>', '<year>'): 1, ('<author>', '<year>'): 1})
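If the output should look more like the format in the question, one possible variation is sketched below. It is not the code above: it keeps the tags in document order instead of sorting them, stores bare tag names without angle brackets, keeps unigrams and n-grams in separate counters, and skips the n-gram count for lines with a single tag. The names `count_tags` and `TAG_RE` are hypothetical, and the input is still assumed to be well formed.

```python
import collections
import re

# Opening tags only; the group captures the bare tag name.
TAG_RE = re.compile(r"<([^/>]+)>")


def count_tags(lines):
    """Return (unigram_counts, ngram_counts) for an iterable of tagged strings."""
    unigrams = collections.Counter()
    ngrams = collections.Counter()
    for line in lines:
        tags = TAG_RE.findall(line)  # document order, not sorted
        unigrams.update(tags)        # one count per tag on the line
        if len(tags) > 1:            # assumption: a lone tag is only a unigram
            ngrams[tuple(tags)] += 1
    return unigrams, ngrams


sample = [
    "<author>James Parker</author><year>2008</year><lang>English</lang>",
    "<author>Van Wie</author><year>2002</year>",
    "<year>2012</year><lang>English</lang>",
    "<year>2002</year><lang>French</lang>",
]

unigrams, ngrams = count_tags(sample)
for tag, n in unigrams.items():
    print(f"{tag}: {n}")
for gram, n in ngrams.items():
    print(f"{len(gram)}-gram: {', '.join(gram)}: {n}")
```

With the DataFrame from the question, the same function could presumably be called as `count_tags(df['query'].dropna())`, assuming the `query` column holds strings like the sample lines.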
Licensed under CC-BY-SA with attribution
Not affiliated with StackOverflow