Question

How can I find the frequency of each of these annotations (author, year, lang), and also the frequency of occurrence of their unigrams, bigrams, trigrams, ... n-grams? For example, given:

"<author>James Parker</author><year>2008</year><lang>English</lang>"
"<author>Van Wie</author><year>2002</year>"
"<year>2012</year><lang>English</lang>"
"<year>2002</year><lang>French</lang>"


import pandas as pd

file = 'file.csv'
df = pd.read_csv(file)
lines = df['query']
for line in lines:
    # calculate tag frequency
    # calculate frequencies of unigram, bigram, trigram, ... n-gram tags
    pass

Expected output:

author: 2, year: 4, lang: 3

trigram: author, year, lang: 1
bigram: author, year: 1
bigram: year, lang: 2

Solution

If I'm reading this correctly, you only count 1 ngram per line, so the line

"<author>James Parker</author><year>2008</year><lang>English</lang>" 

has a trigram and 3 unigrams. You don't need all combinations for each line.

The simplest way to count this is to use a dictionary keyed by the tag or by a tuple of tags to store the counts. That gives you a single pass over the input and should scale well with the number of lines. I use a regular expression to pull out the opening tag of each element (this means the input has to be well formed), then increment the counter once per tag name and once for the n-tuple formed by the sorted tag names found on that line.

import collections
import re

string = """<author>James Parker</author><year>2008</year><lang>English</lang>
<author>Van Wie</author><year>2002</year>
<year>2012</year><lang>English</lang>
<year>2002</year><lang>French</lang>"""

strings = string.split("\n")
counter = collections.Counter()

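# match opening tags only: "<", then anything except "/" or ">", then ">"
tag_re = r"<[^/>]*>"
for s in strings:
    tags = re.findall(tag_re, s)
    # sort so the n-gram key does not depend on tag order within a line
    tags.sort()
    # count each tag on its own
    for tag in tags:
        counter[tag] += 1
    # count the line's full tag combination as one n-gram
    ngram = tuple(tags)
    counter[ngram] += 1

print(counter)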

This prints:

Counter({'<year>': 4, '<lang>': 3, '<author>': 2, ('<lang>', '<year>'): 2, ('<author>', '<lang>', '<year>'): 1, ('<author>', '<year>'): 1})
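
If you would rather have output closer to the format in the question (bare tag names, n-grams in the order the tags appear on the line), here is a minimal sketch of a variant. It is an adaptation, not the code above: it uses a capture group to drop the angle brackets, it does not sort the tags, and it keeps tag counts and n-gram counts in two separate counters. The names `count_tags` and `TAG_NAME_RE` are made up for this example.

```python
import collections
import re

# Capture the name inside each opening tag, e.g. "author" from "<author>".
# Closing tags such as "</author>" are skipped because "/" is excluded.
TAG_NAME_RE = re.compile(r"<([^/>]+)>")


def count_tags(lines):
    """Return (per-tag counts, per-line n-gram counts) for an iterable of strings."""
    tag_counts = collections.Counter()
    ngram_counts = collections.Counter()
    for line in lines:
        names = TAG_NAME_RE.findall(line)
        if not names:
            continue  # nothing to count on a line without tags
        tag_counts.update(names)
        # One n-gram per line, keeping the tags in order of appearance.
        ngram_counts[tuple(names)] += 1
    return tag_counts, ngram_counts


lines = [
    "<author>James Parker</author><year>2008</year><lang>English</lang>",
    "<author>Van Wie</author><year>2002</year>",
    "<year>2012</year><lang>English</lang>",
    "<year>2002</year><lang>French</lang>",
]

tag_counts, ngram_counts = count_tags(lines)
print(dict(tag_counts))
for ngram, n in ngram_counts.items():
    print(f"{len(ngram)}-gram {', '.join(ngram)}: {n}")
```

With the pandas setup from the question, passing `df['query'].dropna()` instead of the hard-coded list should work, assuming every remaining value in that column is a string. Because the tags are not sorted here, two lines with the same tags in a different order are counted as different n-grams.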
Licensed under: CC-BY-SA with attribution