Python - cleaning data to run apriori algorithm

https://stackoverflow.com/questions/16511279

21-04-2022
Question

I have a master list of all words used in a set of articles, and I'm trying to count how often each word in the master list occurs within each article. I then want to build association rules on the data. For example, my data might look like this:

master_wordlist = ['dog', 'cat', 'hat', 'bat', 'big']
article_a = ['dog', 'cat', 'dog', 'big']
article_b = ['dog', 'hat', 'big', 'big', 'big']

I need to get my data into this format:

Article        dog    cat    hat    bat    big
article_a      2      1      0      0      1
article_b      1      0      1      0      3

I'm struggling to make this transformation. I've been playing around with nltk, but I can't figure out how to get counts that include, as zeros, the master-list words that don't occur in an article. Any help would be greatly appreciated!

Solution

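You can use collections.Counter here. A Counter returns 0 for any key it has never seen, so words that are missing from an article still get a count: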

from collections import Counter
master_wordlist = ['dog', 'cat', 'hat', 'bat', 'big']
article_a = ['dog', 'cat', 'dog', 'big']
article_b = ['dog', 'hat', 'big', 'big', 'big']

c_a = Counter(article_a)
c_b = Counter(article_b)

# One count per master word, in master_wordlist order (Python 3 print syntax)
print([c_a[x] for x in master_wordlist])
print([c_b[x] for x in master_wordlist])

output:

[2, 1, 0, 0, 1]
[1, 0, 1, 0, 3]
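
To get all the way to the table layout from the question, the same idea can be wrapped in a loop over the articles. The following is only a sketch: the `articles` dict and the `count_rows` helper are hypothetical names that do not appear in the original post, and the column widths are arbitrary. It also derives a 0/1 presence row per article, on the assumption that the association-rule step works on whether a word occurs rather than how often.

```python
from collections import Counter

master_wordlist = ['dog', 'cat', 'hat', 'bat', 'big']

# Hypothetical container (not in the original post): article name -> list of words
articles = {
    'article_a': ['dog', 'cat', 'dog', 'big'],
    'article_b': ['dog', 'hat', 'big', 'big', 'big'],
}

def count_rows(articles, wordlist):
    """Return {article name: [count of each word in wordlist, in order]}."""
    rows = {}
    for name, words in articles.items():
        counts = Counter(words)  # missing words look up as 0
        rows[name] = [counts[w] for w in wordlist]
    return rows

rows = count_rows(articles, master_wordlist)

# Print a header line, then one line of counts per article
header = '{:<15}'.format('Article') + ''.join('{:<7}'.format(w) for w in master_wordlist)
print(header)
for name, counts in rows.items():
    print('{:<15}'.format(name) + ''.join('{:<7}'.format(c) for c in counts))

# Assumption: the rule-mining step wants presence/absence, so collapse counts to 0/1
presence = {name: [int(c > 0) for c in counts] for name, counts in rows.items()}
```

If the counts themselves matter for the later analysis, `rows` can be used directly and the `presence` step skipped.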

Licensed under: CC-BY-SA with attribution
