Question

I have this code:

text = open("tags.txt", "r")
mylist = []
metalist = []

for line in text:
    mylist.append(line)

    if len(mylist) == 5:
        metalist.append(mylist)
        mylist.pop(0)

This opens a text file with one POS tag per line. It adds the first 5 POS tags to mylist, which is then appended to metalist. It then moves down one line and builds the next sequence of 5 POS tags. The text file has about 110k tags in total. I need to find the most common POS tag sequences in metalist. I tried using collections.Counter, but lists are not hashable. What is the best way to approach this?


Solution

As mentioned in one of the comments, you can use a tuple of tags instead of a list. Tuples are hashable, so they work as keys with the Counter class in the collections module. Here's how to do that using the list-based approach of the code in your question, along with a few optimizations, since you have to process a large number of POS tags:

from collections import Counter

GROUP_SIZE = 5
counter = Counter()
mylist = []

with open("tags.txt", "r") as tagfile:
    tags = (line.strip() for line in tagfile)
    try:
        # Pre-fill the window with the first GROUP_SIZE-1 tags.
        while len(mylist) < GROUP_SIZE-1:
            mylist.append(next(tags))
    except StopIteration:
        pass

    for tag in tags:   # main loop
        mylist.append(tag)            # window now holds GROUP_SIZE tags
        counter[tuple(mylist)] += 1   # tuples are hashable, lists are not
        mylist.pop(0)                 # drop the oldest tag

if len(counter) < 1:
    print('too few tags in file')
else:
    for tags, count in counter.most_common(10):  # top 10
        print('{}, count = {:,d}'.format(list(tags), count))
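
To see why switching to tuples matters, here is a minimal sketch. The sample windows are made up for illustration and are not taken from tags.txt. It shows that Counter rejects lists as keys but accepts tuples with the same contents:

```python
from collections import Counter

# Hypothetical sample windows; real ones would be built from tags.txt.
windows = [["DT", "NN", "VB"], ["DT", "NN", "VB"], ["NN", "VB", "DT"]]

try:
    Counter(windows)              # lists are mutable, hence unhashable
    list_error = None
except TypeError as exc:
    list_error = exc              # Counter could not hash a list

# Converting each window to a tuple makes it usable as a Counter key.
counts = Counter(tuple(w) for w in windows)
top_window, top_count = counts.most_common(1)[0]
print(list_error)
print(top_window, top_count)
```

The conversion with tuple() copies the current contents of the window, so later changes to the list do not affect what was counted.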

It would be even better to use a deque from the collections module instead of a list for the sliding window, because a deque has efficient O(1) appends and pops at either end, whereas popping from the front of a list is O(n).

In addition, since Python 2.6 deques support a maxlen parameter, which eliminates the need to explicitly pop() elements once the desired size has been reached: appending to a full deque automatically discards the item at the opposite end. Here's an even more efficient version based on that:

from collections import Counter, deque

GROUP_SIZE = 5
counter = Counter()
mydeque = deque(maxlen=GROUP_SIZE)

with open("tags.txt", "r") as tagfile:
    tags = (line.strip() for line in tagfile)
    try:
        # Pre-fill the window with the first GROUP_SIZE-1 tags.
        while len(mydeque) < GROUP_SIZE-1:
            mydeque.append(next(tags))
    except StopIteration:
        pass

    for tag in tags:   # main loop
        mydeque.append(tag)            # a full deque drops its oldest tag
        counter[tuple(mydeque)] += 1

if len(counter) < 1:
    print('too few tags in file')
else:
    for tags, count in counter.most_common(10):  # top 10
        print('{}, count = {:,d}'.format(list(tags), count))
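
If you want to try the idea without a file, the same sliding-window logic can be wrapped in a function that accepts any iterable of tags. This is only a sketch: the name most_common_windows, its parameters, and the sample data are all hypothetical, and itertools.islice is swapped in for the try/except pre-fill loop.

```python
from collections import Counter, deque
from itertools import islice


def most_common_windows(tags, size=5, top=10):
    """Count every run of `size` consecutive tags and return the `top` most frequent."""
    tags = iter(tags)
    # Pre-fill with the first size-1 tags; islice simply stops early on short input.
    window = deque(islice(tags, size - 1), maxlen=size)
    counter = Counter()
    for tag in tags:
        window.append(tag)            # maxlen discards the oldest tag when full
        counter[tuple(window)] += 1   # tuple() snapshots the current window
    return counter.most_common(top)


# Made-up sample data, using windows of 3 to keep the example small.
sample = ["DT", "NN", "VB", "DT", "NN", "VB", "DT"]
result = most_common_windows(sample, size=3, top=2)
print(result)
```

For the file in the question, you could call it as most_common_windows((line.strip() for line in tagfile), size=5) inside the with block. An input with fewer than size tags yields an empty list, which corresponds to the 'too few tags in file' case above.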
Licensed under: CC-BY-SA with attribution