As mentioned in one of the comments, you can simply use a tuple of tags instead of a list of them, which will work with the Counter class in the collections module. Here's how to do that using the list-based approach of the code in your question, along with a few optimizations, since you have to process a large number of POS tags:
from collections import Counter

GROUP_SIZE = 5
counter = Counter()
mylist = []
with open("tags.txt", "r") as tagfile:
    tags = (line.strip() for line in tagfile)
    # prime the window with the first GROUP_SIZE-1 tags
    try:
        while len(mylist) < GROUP_SIZE-1:
            mylist.append(next(tags))  # next() works in Python 2.6+ and 3
    except StopIteration:
        pass

    for tag in tags:  # main loop
        mylist.append(tag)  # window now holds GROUP_SIZE tags
        counter.update((tuple(mylist),))  # count the window as a single key
        mylist.pop(0)  # drop the oldest tag, back to GROUP_SIZE-1

if len(counter) < 1:
    print('too few tags in file')
else:
    for tags, count in counter.most_common(10):  # top 10
        print('{}, count = {:,d}'.format(list(tags), count))
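As a side note on why the tuple conversion is needed: Counter is a dict subclass, so its keys must be hashable, which a list is not. The short sketch below illustrates this with made-up tag values (they are not from your file). It also shows why the tuple is wrapped in another one-element tuple: update() iterates over its argument, so passing the window directly would count each tag separately instead of counting the group.

```python
from collections import Counter

counter = Counter()
window = ["DT", "NN", "VBZ"]  # made-up POS tags, for illustration only

# A list cannot be used as a Counter key because it is unhashable.
list_rejected = False
try:
    counter.update((window,))
except TypeError:
    list_rejected = True

# A tuple of strings is hashable, so the whole group becomes one key.
counter.update((tuple(window),))
counter[tuple(window)] += 1  # equivalent way to add one to the same key

# Without the outer one-element tuple, each tag is counted on its own.
per_tag = Counter()
per_tag.update(tuple(window))
```

Here counter ends up with a single key for the whole group, while per_tag has one key per tag.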
However, it would be even better to use a deque from the collections module instead of a list for what you're doing, because deques have very efficient, O(1), appends and pops at either end, whereas popping from the front of a list is O(n). In addition, since Python 2.6 they support a maxlen parameter, which eliminates the need to explicitly pop() elements off the other end once the desired size has been reached. So here's an even more efficient version based on them:
from collections import Counter, deque

GROUP_SIZE = 5
counter = Counter()
mydeque = deque(maxlen=GROUP_SIZE)
with open("tags.txt", "r") as tagfile:
    tags = (line.strip() for line in tagfile)
    # prime the window with the first GROUP_SIZE-1 tags
    try:
        while len(mydeque) < GROUP_SIZE-1:
            mydeque.append(next(tags))  # next() works in Python 2.6+ and 3
    except StopIteration:
        pass

    for tag in tags:  # main loop
        mydeque.append(tag)  # maxlen discards the oldest tag automatically
        counter.update((tuple(mydeque),))

if len(counter) < 1:
    print('too few tags in file')
else:
    for tags, count in counter.most_common(10):  # top 10
        print('{}, count = {:,d}'.format(list(tags), count))
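If you want to try the sliding-window logic without a file, the same idea can be wrapped in a small function that accepts any iterable. This is only a sketch: the name count_windows and the sample tags are hypothetical, and itertools.islice is used here in place of the try/while priming loop above.

```python
from collections import Counter, deque
from itertools import islice


def count_windows(tags, size):
    """Count every run of `size` consecutive items in `tags` (hypothetical helper)."""
    tags = iter(tags)
    # Prime the window with the first size-1 items (fewer if the input is short).
    window = deque(islice(tags, size - 1), maxlen=size)
    counter = Counter()
    for tag in tags:
        window.append(tag)  # once full, maxlen drops the oldest item
        counter[tuple(window)] += 1
    return counter


# Made-up sample data, for illustration only.
sample = ["DT", "NN", "VBZ", "DT", "NN", "VBZ", "DT"]
counts = count_windows(sample, 3)
too_short = count_windows(["DT", "NN"], 3)  # fewer items than the window size
```

With the file from your question, you could pass the generator of stripped lines as the first argument and GROUP_SIZE as the second. An input shorter than the window size simply yields an empty Counter, which corresponds to the 'too few tags in file' case.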