Question

I have implemented the Apriori algorithm for mining frequent itemsets. It works fine for sample data, but when I tried to run it on the retail dataset available at http://fimi.ua.ac.be/data/retail.dat, which is around 3 MB of data with 88k transactions and 1600 unique items, it takes around 29 hours. I searched for the cause of the performance hit and found that the candidate itemset generation is what takes most of the time. Can anybody help me improve the performance, or is this normal behaviour for the algorithm?

Solution

What algorithm are you using, and what are your thresholds?

If you have n 1-itemsets that satisfy your minimum support, Apriori (and many others) must consider n*(n-1)/2 candidate 2-itemsets. This of course gets rather expensive. In Apriori, the 2-itemset step is often the largest and most expensive one (depending on your settings, 3-itemsets may be worse, but usually you don't get that far...).
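
To get a feeling for how fast this blows up, here is a minimal Java sketch (the 1-itemset counts 56, 185, 587 and 1857 are the ones reported in the runs further below):

public class CandidateCount {
    public static void main(String[] args) {
        // frequent 1-itemset counts from the minSupport=1000/500/250/100 runs below
        int[] frequentOneItemsets = {56, 185, 587, 1857};
        for (int n : frequentOneItemsets) {
            long candidates = (long) n * (n - 1) / 2;   // n*(n-1)/2 candidate 2-itemsets
            System.out.println(n + " frequent 1-itemsets -> " + candidates + " 2-itemset candidates");
        }
    }
}

This prints 1540, 17020, 171991 and 1723296 candidates, respectively.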

Depending on your data characteristics, Eclat may perform better - or worse. FP-growth is very clever, but seems to be really tricky to implement right and efficiently.

There is also a myriad of variants; some are probabilistic and may be substantially faster.

But in essence, implementation and data set / parameter settings matter. One approach may be better than another on the same data set, and implementations can easily have a 10x performance difference. This is particularly true for APRIORI, where many people don't fully understand the pruning tricks used and end up doing far too much work.
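
To make the pruning trick concrete, here is a minimal sketch (not taken from any particular implementation): a candidate k-itemset can be discarded without ever counting its support if any of its (k-1)-subsets is not frequent.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Apriori pruning rule: a candidate of size k can only be frequent if every
// one of its (k-1)-subsets is frequent. Itemsets are sorted integer lists;
// frequentSubsets holds the frequent (k-1)-itemsets of the previous level.
class AprioriPrune {
    static boolean survivesPruning(List<Integer> candidate, Set<List<Integer>> frequentSubsets) {
        for (int i = 0; i < candidate.size(); i++) {
            List<Integer> subset = new ArrayList<>(candidate);
            subset.remove(i);                       // drop one item to get a (k-1)-subset
            if (!frequentSubsets.contains(subset)) {
                return false;                       // an infrequent subset -> prune the candidate
            }
        }
        return true;
    }
}

Skipping this check does not change the result, but it means counting the support of many candidates that could have been rejected for free.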

For your example data set (which unfortunately is entirely unhelpful without a dictionary that explains the numbers), my APRIORI finishes in about 1 minute on a low-end Atom CPU with minSupport=1000. The 4-itemsets found were:

32, 38, 39, 48: 1236
32, 39, 41, 48: 1646
36, 38, 39, 48: 1080
38, 39, 41, 48: 1991
38, 39, 48, 110: 1031
38, 39, 48, 170: 1193

But as mentioned before, the parameters matter a lot with APRIORI. APRIORI scales well in the number of transactions (which may be more than fits into main memory), but it suffers from a large number of candidates, so you need to set minSupport high enough.

With minSupport=1000:

1-frequentItemsets (56)
2-frequentItemsets (49)
3-frequentItemsets (24)
4-frequentItemsets (6)
APRIORI runtime: 55401 ms

With minSupport=500:

1-frequentItemsets (185)
2-frequentItemsets (191)
3-frequentItemsets (79)
4-frequentItemsets (13)
5-frequentItemsets (0)
APRIORI runtime: 136594 ms

As you can see, the runtime more than doubled. The reason is that the number of frequent 1-itemsets grew from 56 to 185, yielding many more 2-itemset candidates (17,020 instead of 1,540). At the 3- and 4-itemset levels, the difference is not so big anymore. But overall, two minutes of runtime on a really low-end CPU isn't bad yet.

Decreasing minimum support to 250:

1-frequentItemsets (587)
2-frequentItemsets (640)
3-frequentItemsets (273)
4-frequentItemsets (54)
5-frequentItemsets (4)
APRIORI runtime: 954984 ms

Now runtime starts exploding. Almost 16 minutes to find the 5-itemsets:

32, 38, 39, 41, 48: 448
36, 38, 39, 41, 48: 334
38, 39, 41, 48, 110: 346
38, 39, 41, 48, 170: 413

As you can see, the 38, 39, 41, 48 4-itemset plays a key role in this data set (but we already found this in the first run). We can now also give the confidence for 38, 39, 41, 48 -> 32, which will be the most confident rule involving 5 items: approximately 22.5% of the transactions involving the first four also involved item 32. But given that AFAICT the meaning of the numbers in this data set is unknown, we do not know whether this rule is actually interesting in practice or just a toy exercise.
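
For reference, that confidence follows directly from the supports listed above:

confidence(38, 39, 41, 48 -> 32) = support(32, 38, 39, 41, 48) / support(38, 39, 41, 48)
                                 = 448 / 1991
                                 ≈ 0.225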

Going to minSupport=100, runtime further grows:

1-frequentItemsets (1857)
2-frequentItemsets (2785)
3-frequentItemsets (1475)
4-frequentItemsets (306)
5-frequentItemsets (28)
6-frequentItemsets (0)
APRIORI runtime: 8256507 ms

That is over two hours, and most of this time is spent on the 2-itemsets: there are about 1.7 million candidates, out of which 2785 were frequent. For the 3-itemsets, one can roughly estimate that there will be only a few thousand candidates, not millions anymore. I have some plans to further improve the implementation by switching between two code paths depending on the number of candidates. (Update: with a simpler modification, I further reduced the runtime by a factor of 20. So yes, implementation matters; it can easily make a factor of 100 to 1000 or more, for example when APRIORI pruning is not fully implemented.)

If I ever find time to read up on the Eclat algorithm and reimplement it, I can update this with results. I believe it will be much faster here, and so will FP-growth.

OTHER TIPS

There is a pretty efficient algorithm, proposed by Karp, Papadimitriou and Shenker, that finds candidates in a single pass over the data (it was basically designed for stream processing), in order to find items that have a frequency of at least theta, for any theta in (0,1).

The algorithm at a high level:

  • Collect elements into PF, counting their appearances
  • Whenever 1/θ distinct elements are encountered, decrease all counters by 1 and remove from PF those whose counters are zero
  • Output the elements in PF that survive this process

The algorithm yields at most 1/Theta candidates, and it has no false negatives (it does not miss any frequent item), but it does have some false positives (some candidates are not frequent).

For simplicity, assume 1/Theta is an integer.

Pseudo Code:

PF = {}  // empty dictionary of counters
for each element e:
    if PF.contains(e):
        PF[e] = PF[e] + 1
    else:
        PF[e] = 1
        if PF.size() == 1/Theta:
            for each element k in PF:
                PF[k] = PF[k] - 1
                if PF[k] == 0:
                    remove k from PF
When done, all elements in PF are your candidates.
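
A minimal runnable sketch of the same procedure in Java (assuming the data arrives as a plain int array of items; theta is a parameter, and 1/theta is assumed to be an integer):

import java.util.HashMap;
import java.util.Map;

// Single-pass candidate finder sketched above: keeps at most 1/theta counters.
// The surviving keys are the candidates (false positives possible, no false negatives).
class FrequencyCandidates {
    static Map<Integer, Integer> candidates(int[] stream, double theta) {
        int capacity = (int) (1.0 / theta);      // assumes 1/theta is an integer
        Map<Integer, Integer> pf = new HashMap<>();
        for (int e : stream) {
            pf.merge(e, 1, Integer::sum);        // count the element
            if (pf.size() == capacity) {
                // decrement every counter and evict those that drop to zero
                pf.replaceAll((k, v) -> v - 1);
                pf.values().removeIf(v -> v == 0);
            }
        }
        return pf;
    }
}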

It depends on your implementation. There are many optimizations that can be made when implementing Apriori.

First, if you read the original Apriori paper by Agrawal & Srikant, you will notice that they suggest using a special hash tree to count the support of candidates efficiently. If you want to see how this implementation works, there is a version of Apriori implemented with a HashTree in the SPMF open-source data mining library. It is named AprioriHT.

Another optimization, named AprioriTID, avoids scanning the database multiple times. Each itemset is annotated with the set of ids of the transactions containing it (its tid set). When a candidate is generated by combining two itemsets A and B, the support of A U B can then be counted directly, without scanning the database, by intersecting the tid sets of A and B. As a further optimization, tid sets can be implemented as bit vectors.
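
As a sketch of the tid-set idea (assuming transactions are numbered 0..n-1 and each itemset carries a BitSet marking the transactions that contain it):

import java.util.BitSet;

// AprioriTID-style support counting: the support of A U B is the cardinality of
// the intersection of the tid sets of A and B, so no database scan is needed.
class TidSetSupport {
    static int supportOfUnion(BitSet tidsOfA, BitSet tidsOfB) {
        BitSet intersection = (BitSet) tidsOfA.clone();
        intersection.and(tidsOfB);               // transactions containing both A and B
        return intersection.cardinality();
    }
}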

Another optimization was proposed in the Pascal algorithm: use generator itemsets, a concept from Formal Concept Analysis, to infer the support of some itemsets without scanning the database.

Another optimization is to sort the itemsets of each level in Apriori in lexicographic order, and also to sort the items within each itemset in lexicographic order. When generating candidates, you can then use this ordering to avoid comparing every itemset with every other itemset, as in the sketch below.
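
A minimal sketch of that join (assuming each frequent k-itemset is a sorted int array and the list of k-itemsets is sorted lexicographically): two k-itemsets are merged into a (k+1)-candidate only when they share their first k-1 items, so only neighbouring itemsets need to be compared.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Lexicographic candidate generation: frequent k-itemsets sharing the same
// (k-1)-item prefix are joined into (k+1)-candidates.
class AprioriJoin {
    static List<int[]> generateCandidates(List<int[]> frequentK) {
        List<int[]> candidates = new ArrayList<>();
        for (int i = 0; i < frequentK.size(); i++) {
            for (int j = i + 1; j < frequentK.size(); j++) {
                int[] a = frequentK.get(i), b = frequentK.get(j);
                int k = a.length;
                // joinable only if the first k-1 items agree
                if (!Arrays.equals(Arrays.copyOf(a, k - 1), Arrays.copyOf(b, k - 1))) {
                    break;  // sorted order: later itemsets cannot share this prefix either
                }
                int[] candidate = Arrays.copyOf(a, k + 1);
                candidate[k] = b[k - 1];         // append the differing last item of b
                candidates.add(candidate);
            }
        }
        return candidates;
    }
}

Each candidate produced this way still goes through the subset-pruning check before its support is counted.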

There are also other optimizations. On the FIMI website, there are various implementations with different optimizations.

If you want to look at an optimized version, you can have a look at my implementations in the SPMF data mining library in Java. On the same website, you will also find implementations of faster algorithms such as FPGrowth.

Licensed under: CC-BY-SA with attribution