How to find the term frequency of particular sets of tags in a document

Question

If I'm reading this correctly, you count only one n-gram per line (the full set of tags on that line), so the line

"<author>James Parker</author><year>2008</year><lang>English</lang>"

contributes one trigram and three unigrams. You don't need every sub-combination of tags for each line.

The simplest way to count this is to use a dictionary keyed by either a single tag or a tuple of tags, and store the count as the value. That takes a single pass over the input and should scale well with the number of input lines. I use a regular expression to pull out the opening tag of each element (this means the input has to be well formed). Each tag name is then counted on its own, and the sorted tuple of all tag names on the line is counted once as the n-gram. Sorting means the same set of tags always maps to the same key, whatever order the tags appear in.

import collections
import re

string = """<author>James Parker</author><year>2008</year><lang>English</lang>
<author>Van Wie</author><year>2002</year>
<year>2012</year><lang>English</lang>
<year>2002</year><lang>French</lang>"""

strings = string.split("\n")
counter = collections.Counter()

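# match opening tags only, e.g. "<year>" but not "</year>"
tag_re = r"<[^/>]*>"
for s in strings:
    tags = re.findall(tag_re, s)
    # sort so the same set of tags always yields the same tuple
    tags.sort()
    # count each tag on its own
    for tag in tags:
        counter[tag] += 1
    # count the whole line's tags as one n-gram
    ngram = tuple(tags)
    counter[ngram] += 1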

print(counter)

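This prints the following (the tuples are in sorted order because of the sort above; entries with equal counts may appear in a different order depending on the Python version):

Counter({'<year>': 4, '<lang>': 3, '<author>': 2, ('<lang>', '<year>'): 2, ('<author>', '<lang>', '<year>'): 1, ('<author>', '<year>'): 1})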
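
To get the frequency of one particular set of tags back out, you have to build the lookup key the same way the counter keys were built, i.e. as a sorted tuple. Here is a minimal sketch of that, assuming Python 3; the helper names `count_tag_sets` and `tag_set_frequency` are made up for illustration and are not part of any library. Unlike the loop above, this version skips lines with no tags rather than counting an empty tuple.

```python
import collections
import re

# Matches opening tags only, e.g. "<year>" but not "</year>".
TAG_RE = re.compile(r"<[^/>]*>")


def count_tag_sets(lines):
    """Hypothetical helper: count each opening tag and each line's sorted tag tuple."""
    counter = collections.Counter()
    for line in lines:
        tags = sorted(TAG_RE.findall(line))
        if not tags:
            continue  # assumption: lines without tags are ignored
        counter.update(tags)           # one count per individual tag
        counter[tuple(tags)] += 1      # one count for the whole line's tag set
    return counter


def tag_set_frequency(counter, *names):
    """Hypothetical helper: number of lines whose tags are exactly `names`, in any order."""
    key = tuple(sorted("<%s>" % name for name in names))
    return counter[key]


lines = [
    "<author>James Parker</author><year>2008</year><lang>English</lang>",
    "<author>Van Wie</author><year>2002</year>",
    "<year>2012</year><lang>English</lang>",
    "<year>2002</year><lang>French</lang>",
]
counter = count_tag_sets(lines)

print(tag_set_frequency(counter, "year", "lang"))            # 2
print(tag_set_frequency(counter, "lang", "year", "author"))  # 1
print(counter["<year>"])                                     # 4
```

Note that `tag_set_frequency(counter, "year")` looks up the one-element tuple `('<year>',)`, i.e. lines that contain only a year tag, which is 0 for this input. The plain string key `'<year>'` is the count of the tag across all lines.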