MachineX: Frequent Itemset generation with the FP-Growth algorithm

Table of contents

Reading Time: 4 minutes

In our previous blog, MachineX: Understanding FP-Tree construction, we discussed the FP-Tree and its construction. In this blog, we will be discussing the FP-Growth algorithm, which uses FP-Tree to extract frequent itemsets in the given dataset.

FP-growth is an algorithm that generates frequent itemsets from an FP-tree by exploring the tree in a bottom-up fashion. We will be picking up the example we used in the previous blog while constructing the FP-Tree.

The final FP-Tree that was constructed in our previous blog is as shown in the below figure.

The algorithm will iterate the header table in a reverse order. So, first of all, it will find all the frequent items ending in p, then m, b, a, c, and finally f. Since every transaction is mapped onto a path in the FP-tree, we can derive the frequent itemsets ending with a particular item, say p, by examining only the paths containing node p. These paths can be accessed rapidly using the pointers associated with node p.

FP-growth finds all the frequent itemsets ending with a particular suffix by employing a divide-and-conquer strategy to split the problem into smaller subproblems. For example, suppose we are interested in finding all frequent itemsets ending in p. To do this, we must first check whether the itemset {p} itself is frequent. If it is frequent, we consider the subproblem of finding frequent itemsets ending in mp, followed by bp, ap, cp and fp. In turn, each of these subproblems is further decomposed into smaller subproblems. By merging the solutions obtained from the subproblems, all the frequent itemsets ending in p can be found. This divide-and-conquer approach is the key strategy employed by the FP-growth algorithm.

So, let’s dive into the steps to follow to find the frequent itemsets in the FP-Tree.

Step 1 – Header table would be iterated in a reverse order, so first, those frequent itemsets would be searched for which end with the item p. To do that, we will gather all the paths ending in node p. Now, the thing to remember here is that header table already consists of only frequent items, so p itself is frequent and we can expect itemsets ending with p to be frequent as well. These paths are known as the prefix paths. The below figure shows the prefix paths for node p.

Step 2 – Next step would be to update the support count of the nodes to only represent those paths which contain node p. For example, {f: 4, c: 3, a: 3, m: 2, p: 2} contains many paths without node p like {f, c, a}, so we have to update the support counts. We do this by adding the support count of node p to all of its parent nodes till the root node. Once the paths are updated with new support counts, we will eliminate all those items whose support count is less than the minimum support count, in this case, 3. The support count of items will be calculated by adding all the support counts of nodes containing that item in the prefix paths. This will give us a conditional FP-Tree. Conditional FP-Tree for node p is as shown below.

Support for c is 3, which is equal to the minimum support threshold provided. As we can conclude from the above conditional FP-Tree, {c, p} becomes a frequent itemset. Following this procedure, and recursively generating conditional FP-Trees and prefix paths, we get all the following patterns –

{p – 3}, {c, p – 3}, {m – 3}, {f, m – 3}, {c, f, m – 3}, {c, m – 3}, {a, m – 3}, {f, a, m – 3}, {c, f, a, m – 3}, {c, a, m – 3}, {b – 3}, {a – 3}, { f, a – 3}, {c, f, a – 3}, {c, a – 3}, {f – 4}, {c, f – 3}, {c – 4}

Above curly braces consists of itemset and support separated by a hyphen.

FP-growth is an interesting algorithm because it illustrates how a compact representation of the transaction data set helps to efficiently generate frequent itemsets. In addition, for certain transaction data sets, FP-growth outperforms the standard Apriori algorithm by several orders of magnitude. The run-time performance of FP-growth depends on the compaction factor of the dataset. If the resulting conditional FP-trees are very bushy (in the worst case, a full prefix tree), then the performance of the algorithm degrades significantly because it has to generate a large number of subproblems and merge the results returned by each subproblem.

From the next blog, we will be diving into how to extract association rules from the extracted frequent items. So, stay tuned!

References –