MachineX: Total Support Tree for Association Rule Generation

Table of contents

Reading Time: 5 minutes

In our previous blogs on Association Rule Learning, we have seen the FP-Tree and the FP-Growth algorithm. We also generated the frequent itemsets using FP-Growth. But a problem arises when we try to mine the association rules out of these frequent itemsets. Generally, the number of frequent itemsets is massive and to run an algorithm on them becomes very memory inefficient. So, to store these frequent itemsets efficiently, we use Total Support Trees or T-Trees.

What is Total Support Tree?

Total Support Tree is a compressed set enumeration tree. Well, that didn’t clear up much. Ok, let’s break that down into two terms, compressed and set enumeration tree. So, set enumeration tree is nothing but just another name for a well-known data structure Trie. Many of you might already know about it, but those of you who don’t, a trie is mostly used for storing strings in a memory efficient way, and in the same context, is a tree-like data structure wherein the nodes of the tree store the entire alphabet, and strings/words can be retrieved by traversing down a branch or path of the tree.

So, total support tree is just like a trie, but with a difference. Instead of creating multiple links for different nodes at the same level, it instead stores all the nodes of the same level in one array. This approach reduces the usage of links thus giving us compression. There is one last thing which is special about the total support tree, and that is that when we store an itemset in it, we store it in a frequency ascending order. This might be a little tricky to understand. So let’s understand this through an example, and while at it, we will also see how to construct a Total Support Tree. So, let’s dive into an example.

For example purpose, we will be taking a very small dataset so that the concept can be understood clearly. Our dataset is going to be – {{1, 3, 4}, {2, 4, 5}, {2, 4, 6}}. So, let’s go step by step. We will first be inserting {1, 3, 4}.

Step 1 – Rearrange the items in the itemset according to their frequency descending order. Below table shows us the single items in the dataset and their frequency –

So after rearranging, our itemset will become – {4, 1, 3}, since 4 is the most frequent with a frequency of 3, then 1 and 3 are equally frequent.

Step 2 – We will now initialize our T-Tree. Initially, our T-Tree would just contain a root node which is just a node with all the default values. It would have an array as its child and the number of places in that array would be equal to the number of items in our dataset, in this case, 6. It would look something like this –

Step 3 – Now we will add our frequency descending itemset, i.e., {4, 1, 3} that we obtained in Step 1 into our T-Tree. We will insert the itemset in a reverse fashion, i.e., first, we will insert 3. The array cell we choose to insert 3 depends on its position in the table that we saw before. In that table, the position of 3 is 3. So we will insert 3 at position 3. Now, we already know that the next item we get to insert in the T-Tree will obviously be more frequent than the one we just inserted. This is because we have arranged our itemset in a frequency descending order and are then traversing it in a reverse fashion. So, the next item will always have to be more frequent and thus will have a position lesser than the current item’s position in the previously shown table (from hereon referred to as header table). Because of this reason, the array cells in the child of the node with the item 3 will only have three places, as the maximum position that the new item can acquire is 2 (0-indexed). In the case where the item set was not in this order, or in any random order, we would not have been able to make an assumption like this, and we had to make an array of size equal to the total number of items in the header table. Thus, using the former approach, we achieve a very good amount of compression, and thus the name compressed set enumeration tree. So, now let’s look at how our T-Tree looks after item 3 is inserted into it.

Step 4 – We do not initialize the child array of a newly inserted node until we get a new element to insert. Thus the value null here. Now, we get the item 1 to insert. So, we will create an array of size 3, because item 3 was inserted at position 3. After the array is created, we will insert item 1 at the position which is given to us by the header table, i.e., 2.

Step 5 – Now we will insert the next item in our itemset, i.e., 4. So, again we will create a child array for item 1 with a size of 2.

We have successfully inserted the itemset in our T-Tree. Please note that at this point, we will also store the support related to the itemset at node 4, and for every other node, we will store the support value of 0, to signify that this is not a frequent itemset, while the itemset ending at 4 is since it will be having a support value. Obviously, as discussed in our previous blogs, we will be provided with a support threshold to decide whether the itemset is frequent or not.

After the insertion of all the three itemsets, we would get a T-Tree which is shown below:

This is our final T-Tree. This concludes our total support tree construction.

After it’s construction, we need to extract rules out of it, which is known as rule generation. We will see how to do that in our next blog. So, stay tuned!

References –