Question

As far as I know, the development of algorithms to solve the Frequent Pattern Mining (FPM) problem has passed through a few main checkpoints. First, Agrawal et al. formalized the problem in 1993 and proposed the Apriori algorithm shortly afterwards (1994). The algorithm was able to prune away part of the 2^n - 1 candidate sets (the powerset) by maintaining them in a lattice and exploiting the fact that a set can only be frequent if all of its subsets are. A drawback of the approach was the need to re-read the database to compute the frequency of each expanded set.
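
To make the candidate-generation-and-rescan cycle concrete, here is a minimal, unoptimized Python sketch of the level-wise idea (my own toy code with made-up names, not the original implementation):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori sketch: generate candidates of size k from the
    frequent itemsets of size k-1, then rescan the transactions to count."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items in one pass over the database.
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    frequent = {s: c for s, c in frequent.items() if c >= min_support}
    all_frequent, k = dict(frequent), 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets and keep only
        # candidates whose every (k-1)-subset is frequent (downward closure).
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Support counting requires another full pass over the database.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

print(apriori([["a", "b", "c"], ["a", "b"], ["a", "c"]], min_support=2))
```

Every pass of the while loop corresponds to one full scan of the database, which is exactly the drawback mentioned above.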

Later, in 1997, Zaki et al. proposed the Eclat algorithm, which stores the frequency information of each set inside the lattice. This is done by attaching to each node of the lattice the set of transaction IDs (the tidset) of the transactions containing the items on the path from the root to that node. The main contribution is that one does not have to re-read the entire dataset to obtain the frequency of each set, but the memory required to keep this data structure may exceed the size of the dataset itself.
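
A small sketch of that vertical (tidset) representation, assuming a toy item-to-transaction-id mapping of my own; extending an itemset only intersects tidsets, so the database is never rescanned:

```python
def eclat(tidsets, min_support, prefix=frozenset(), out=None):
    """Eclat-style vertical mining sketch: each item(set) carries its tidset;
    extending an itemset intersects tidsets instead of rescanning the data.
    `tidsets` maps a single item -> set of transaction ids containing it."""
    if out is None:
        out = {}
    items = sorted(tidsets, key=str)
    for i, item in enumerate(items):
        tids = tidsets[item]
        if len(tids) < min_support:
            continue
        itemset = prefix | {item}
        out[itemset] = len(tids)
        # Build the conditional vertical database for the remaining items.
        suffix = {other: tids & tidsets[other] for other in items[i + 1:]}
        eclat(suffix, min_support, itemset, out)
    return out

# Hypothetical toy input: item -> transaction ids that contain it.
vertical = {"a": {0, 1, 2}, "b": {0, 2}, "c": {1, 2, 3}}
print(eclat(vertical, min_support=2))
```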

In 2000, Han et al. proposed an algorithm named FPGrowth, along with a prefix-tree data structure named FPTree. The algorithm was able to provide significant data compression, while also guaranteeing that only frequent itemsets would be yielded (without candidate itemset generation). This was done mainly by sorting the items of each transaction in decreasing order of frequency, so that the most frequent items sit near the root and are shared by many transactions, which keeps the number of repeated nodes in the tree small. Since the support can only decrease while descending the tree, the algorithm is able to prune non-frequent itemsets.
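
A rough Python sketch of the FPTree construction step, using a simple node class of my own invention; the mining (conditional-tree) part is omitted:

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fptree(transactions, min_support):
    """FP-tree construction sketch: keep only frequent items, sort each
    transaction by descending global frequency, and insert it into a prefix
    tree so that shared prefixes are stored (and counted) only once."""
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_support}
    root = FPNode(None, None)
    header = defaultdict(list)  # item -> nodes, used later to mine the tree
    for t in transactions:
        kept = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in kept:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header
```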

Edit:

As far as I know, this may be considered a state-of-the-art algorithm, but I'd like to know about other proposed solutions. What other algorithms for FPM are considered "state of the art"? What is the intuition and main contribution of such algorithms?

Is the FPGrowth algorithm still considered "state of the art" in frequent pattern mining? If not, what algorithm(s) may extract frequent itemsets from large datasets more efficiently?


Solution

State of the art as in: used in practice, or worked on in theory?

APRIORI is used everywhere, except in developing new frequent itemset algorithms. It's easy to implement, and easy to reuse in very different domains. You'll find hundreds of APRIORI implementations of varying quality. And it's easy to get APRIORI wrong, actually.

FPgrowth is much harder to implement, but also much more interesting. So from an academic point of view, everybody tries to improve FPgrowth - getting work based on APRIORI accepted will be very hard by now.

If you have a good implementation, every algorithm has its good and its bad situations in my opinion. A good APRIORI implementation will only need to scan the database k times to find all frequent itemsets of length k. In particular, if your data fits into main memory, this is cheap. What can kill APRIORI is too many frequent 2-itemsets (in particular when you don't use a trie or similar acceleration techniques). It works best on large data with a low number of frequent itemsets.
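
As a toy illustration of where the memory goes: counting all candidate 2-itemsets takes a single scan, but the counting structure grows roughly quadratically in the number of frequent items (the helper name and layout below are made up, not from any particular implementation):

```python
from collections import Counter
from itertools import combinations

def count_pairs(transactions, frequent_items):
    """One database scan that counts all candidate 2-itemsets at once.
    The counter can blow up when many items are frequent, which is the
    situation described above as killing APRIORI."""
    counts = Counter()
    for t in transactions:
        kept = sorted(i for i in t if i in frequent_items)
        counts.update(combinations(kept, 2))
    return counts
```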

Eclat works on columns, but it needs to read each column much more often. There is some work on diffsets to reduce this cost. If your data does not fit into main memory, Eclat probably suffers more than Apriori. By going depth first, it is also able to return a first interesting result much earlier than Apriori, and you can use these results to adjust parameters, so you need fewer iterations to find good parameters. But by design, it cannot exploit pruning as neatly as Apriori does.
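
A tiny made-up example of the diffset idea (dEclat): store only the transaction ids an extension loses relative to its prefix, instead of the full tidset:

```python
def diffset_example():
    """Toy illustration of diffsets; item names and tidsets are made up."""
    tids_a = {0, 1, 2, 3, 4}        # tidset of prefix {a}, support 5
    tids_ab = {0, 1, 3}             # tidset of {a, b}, support 3
    diff_ab = tids_a - tids_ab      # diffset of {a, b} w.r.t. {a} -> {2, 4}
    support_ab = len(tids_a) - len(diff_ab)
    assert support_ab == len(tids_ab)
    # Diffsets are small when tidsets overlap heavily (dense data), so less
    # column data has to be read and stored than with plain tidset intersection.
    return diff_ab, support_ab

print(diffset_example())
```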

FPGrowth compresses the data set into the tree. This works best when you have lots of duplicate records. You could probably reap quite some gains for Apriori and Eclat too if you presort your data and merge duplicates into weighted vectors. FPGrowth does this at an extreme level. The drawback is that the implementation is much harder, and once the tree does not fit into memory anymore it becomes a mess to handle.
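
A minimal sketch of that presorting/merging trick, assuming the usual list-of-transactions input (the helper is hypothetical, not part of any of the algorithms above):

```python
from collections import Counter

def merge_duplicates(transactions):
    """Collapse identical transactions into (itemset, weight) pairs so that
    Apriori/Eclat process each distinct record once and add its weight to
    the support count instead of counting it repeatedly."""
    weighted = Counter(frozenset(t) for t in transactions)
    return list(weighted.items())

data = [["a", "b"], ["a", "b"], ["a", "c"]]
print(merge_duplicates(data))
# e.g. [(frozenset({'a', 'b'}), 2), (frozenset({'a', 'c'}), 1)]
```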

As for performance results and benchmarks - don't trust them. There are so many things to implement incorrectly. Try 10 different implementations, and you get 10 very different performance results. In particular for APRIORI, I have the impression that most implementations are broken in the sense of missing some of the main contributions of APRIORI... and of those that have these parts right, the quality of optimizations varies a lot.

There are actually even papers on how to implement these algorithms efficiently:

Christian Borgelt, "Efficient Implementations of Apriori and Eclat", Workshop of Frequent Item Set Mining Implementations (FIMI 2003, Melbourne, FL, USA).

You may also want to read these surveys on this domain:

  • Bart Goethals, "Survey on Frequent Pattern Mining", Univ. of Helsinki, 2003.

  • Ferenc Bodon, "A Survey on Frequent Itemset Mining", Technical Report, Budapest University of Technology and Economics, 2006.

  • Christian Borgelt, "Frequent Item Set Mining", Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(6):437-456, 2012.

OTHER TIPS

Most of the recent frequent pattern mining approaches that I've seen in the literature are based on optimizing FPGrowth. I have to admit, I haven't seen many developments in the FPM literature in many years.

This wikibook highlights many of the variants on FPGrowth that are out there.
