MachineX: Layman guide to Association Rule Learning


Association rule learning is one of the most common techniques in data mining as well as machine learning. Its most familiar use, which you have almost certainly come across, is in the recommendation systems of e-shops like Amazon, Flipkart, etc.

Association rule learning is a technique to uncover the relationships between various items, elements, or more generally, variables in a very large database. In the e-shop example above, these are the relationships between different items on the website. In other words, association rule learning tells us how likely a user who buys one book is to buy another, where the two books are related because other users have bought them both. An example makes this clearer. Suppose you want to learn Scala, so you decide to go to Amazon and buy Scala Cookbook. When you open up its page and scroll down a little, you see this –


All the books in the above picture are ‘recommendations’ for the user who is currently viewing Scala Cookbook. As you can see, the ‘Frequently bought together’ section consists of the package, or itemset, that a lot of users have bought together, and the ‘Customers who bought this item also bought’ section consists of items that users have bought individually before or after buying Scala Cookbook. This is made possible by a very large database and association rule learning. So, as you have probably already figured out, association rule learning is basically about finding rules that associate different variables in a database.

From Wikipedia –

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.

Now the definition of association rule learning is clear, but what are these ‘measures of interestingness’? Let’s talk about that now, using an example.

Suppose a dataset exists such as the one below –


Transactions here are the sets of items bought by different users, and Books are the books in each set. Don’t confuse a transaction here with the everyday meaning of the word, in which one transaction is a single instance of purchase by a user. Here, one transaction represents a set of books bought by different people, maybe at different points in time or maybe at the same time.

Now, looking at the data we can easily deduce itemsets like –

{Learning Scala, Learning Spark}
{Programming in Scala, Hadoop: The Definitive Guide}

… and many more. But these itemsets in themselves don’t tell us much. For example, we can say that Learning Scala and Learning Spark are generally bought together, and similarly, that Programming in Scala and Hadoop: The Definitive Guide are bought together. But in a very large database, we aren’t interested in every itemset that can be mined from the data, only in those that are of some kind of interest, whether from a business perspective or some other perspective. This is where the measures of interestingness come in, which are discussed below.


Support tells us how frequent an item, or an itemset, is across all of the data. It basically tells us how popular an itemset is in the given dataset. For example, in the dataset above, we can calculate the support of Learning Spark by taking the number of transactions in which it occurs and dividing it by the total number of transactions.

Support{Learning Spark} = 4/5
Support{Programming in Scala} = 2/5
Support{Learning Spark, Programming in Scala} = 1/5

Support tells us how important or interesting an itemset is, based on its number of occurrences. This is an important measure because real data can contain millions or even billions of records, and examining every itemset is pointless. If, among millions of purchases, a single user buys Programming in Scala together with a cookbook, that itemset is of no interest to us.
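To make the calculation concrete, here is a minimal Python sketch of support. The transaction list is a hypothetical one, made up only so that it reproduces the support values quoted above (4/5, 2/5 and 1/5); it is not necessarily the dataset shown in the original table. The function name `support` is likewise just an illustrative choice.

```python
from fractions import Fraction

# Hypothetical transactions (an assumption), chosen only to match
# the support values quoted in the text.
transactions = [
    {"Learning Scala", "Learning Spark"},
    {"Programming in Scala", "Hadoop: The Definitive Guide"},
    {"Learning Spark", "Programming in Scala"},
    {"Learning Scala", "Learning Spark", "Hadoop: The Definitive Guide"},
    {"Learning Spark"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


print(support({"Learning Spark"}, transactions))                          # 4/5
print(support({"Programming in Scala"}, transactions))                    # 2/5
print(support({"Learning Spark", "Programming in Scala"}, transactions))  # 1/5
```

Using `Fraction` keeps the results exact, so they can be compared directly with the hand-calculated values.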

But support alone isn’t enough. Although support is important, it doesn’t by itself give us the rules that are needed to actually take advantage of this large data. Until now we have only looked at the itemsets that are mined from the given dataset, but a rule is something different. It is not just a collection of books bought together by a user; it also tells us how those books are related. For example, the above dataset contains the itemset {Learning Spark, Programming in Scala}. Looking at it, we cannot tell whether people who buy Learning Spark also buy Programming in Scala, or whether it is the other way round. For this purpose, we will look at another measure, i.e., confidence.


A rule consists of two parts – antecedent and consequent. For example, in the rule Learning Spark -> Programming in Scala, Learning Spark is the antecedent and Programming in Scala is the consequent. Confidence tells us how likely the consequent is when the antecedent has occurred. For the above rule, that means how likely someone is to buy Programming in Scala when they have already bought Learning Spark.

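Confidence is calculated using support values. For the rule Learning Spark -> Programming in Scala, the confidence is calculated as follows –

Confidence{Learning Spark -> Programming in Scala} = Support{Learning Spark, Programming in Scala} / Support{Learning Spark} = (1/5) / (4/5)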

which calculates to 1/4, that is 25%, whereas if we swap the antecedent and consequent in this rule, we get 50%. This means that there is a 25% chance that a user who has bought Learning Spark will also purchase Programming in Scala, but a 50% chance that a user who has bought Programming in Scala will also buy Learning Spark.

But there is still one problem with confidence. In this example, we got 25% for {Learning Spark -> Programming in Scala}, while we got 50% for the other way round. This happened because Programming in Scala isn’t very popular in the dataset, with a support of only 2/5, while Learning Spark has a support of 4/5. So the items aren’t really related to each other. If an item is frequent in a dataset, then there’s a high probability that a transaction containing a less frequent item will also contain the more frequent one, thus inflating the confidence. To avoid such fluke rules, we can divide the support of the itemset by the product of the supports of the individual items in it. This is known as lift.
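The two confidence figures above can be checked with a short Python sketch. As before, the transaction list is hypothetical, constructed only to match the support values quoted in this post, and the helper names `support` and `confidence` are illustrative rather than taken from any library.

```python
from fractions import Fraction

# Hypothetical transactions (an assumption), chosen only to match
# the support values quoted in the text.
transactions = [
    {"Learning Scala", "Learning Spark"},
    {"Programming in Scala", "Hadoop: The Definitive Guide"},
    {"Learning Spark", "Programming in Scala"},
    {"Learning Scala", "Learning Spark", "Hadoop: The Definitive Guide"},
    {"Learning Spark"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return Fraction(sum(1 for t in transactions if itemset <= t), len(transactions))


def confidence(antecedent, consequent):
    """Support of the combined itemset divided by the support of the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


print(confidence({"Learning Spark"}, {"Programming in Scala"}))  # 1/4
print(confidence({"Programming in Scala"}, {"Learning Spark"}))  # 1/2
```

Note that confidence is not symmetric: swapping the antecedent and consequent changes the denominator, which is exactly why the two directions give 25% and 50%.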


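Lift tells us how likely the consequent is when the antecedent has already occurred, taking into account the support of both the antecedent and the consequent. For the above example, we can calculate lift as follows –

Lift{Learning Spark -> Programming in Scala} = Support{Learning Spark, Programming in Scala} / (Support{Learning Spark} × Support{Programming in Scala}) = (1/5) / ((4/5) × (2/5))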

which gives us 5/8. A lift of less than one means that if the antecedent has occurred, the consequent is less likely to occur than it would be otherwise. A lift of one means that the antecedent and consequent are independent of each other. And a lift of more than one means that if the antecedent occurs, the consequent is more likely to occur as well. So, a value of 5/8 indicates that this rule is a fluke. If we instead consider the rule {Learning Scala -> Learning Spark}, we get a lift of 5/4, which is greater than one, meaning that this rule is an interesting one.
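The lift values of 5/8 and 5/4 can be reproduced with the sketch below. The transaction list is again a hypothetical one: it is built to match the supports and lifts quoted in this post, and in particular it assumes that every transaction containing Learning Scala also contains Learning Spark, which is what a lift of 5/4 implies given a support of 4/5 for Learning Spark. The function names are illustrative.

```python
from fractions import Fraction

# Hypothetical transactions (an assumption), chosen only to match
# the support and lift values quoted in the text.
transactions = [
    {"Learning Scala", "Learning Spark"},
    {"Programming in Scala", "Hadoop: The Definitive Guide"},
    {"Learning Spark", "Programming in Scala"},
    {"Learning Scala", "Learning Spark", "Hadoop: The Definitive Guide"},
    {"Learning Spark"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return Fraction(sum(1 for t in transactions if itemset <= t), len(transactions))


def lift(antecedent, consequent):
    """Joint support divided by the product of the individual supports."""
    joint = support(set(antecedent) | set(consequent))
    return joint / (support(antecedent) * support(consequent))


print(lift({"Learning Spark"}, {"Programming in Scala"}))  # 5/8
print(lift({"Learning Scala"}, {"Learning Spark"}))        # 5/4
```

Unlike confidence, lift is symmetric: swapping the antecedent and consequent leaves both the numerator and the denominator unchanged.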

Using these measures, various algorithms, such as Apriori and FP-growth, have been implemented to find association rules in a database. In my next blog, I will be discussing the implementation of association rule learning. So stay tuned.
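Until then, here is a rough idea of how the three measures fit together. It is a brute-force Python sketch that checks every single-item rule A -> B, and it is not Apriori or FP-growth, which exist precisely to avoid this kind of exhaustive search. The transactions are hypothetical (built to match the numbers in this post), and the minimum support threshold of 2/5 is an arbitrary assumption for illustration.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical transactions (an assumption), chosen only to match
# the support, confidence and lift values quoted in the text.
transactions = [
    {"Learning Scala", "Learning Spark"},
    {"Programming in Scala", "Hadoop: The Definitive Guide"},
    {"Learning Spark", "Programming in Scala"},
    {"Learning Scala", "Learning Spark", "Hadoop: The Definitive Guide"},
    {"Learning Spark"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return Fraction(sum(1 for t in transactions if itemset <= t), len(transactions))


def pair_rules(min_support=Fraction(2, 5)):
    """Brute-force every rule A -> B between single items whose
    joint support is at least `min_support` (an assumed threshold)."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        joint = support({a, b})
        if joint < min_support:
            continue  # too rare to be interesting
        confidence = joint / support({a})
        lift = joint / (support({a}) * support({b}))
        rules.append((a, b, joint, confidence, lift))
    # Highest lift first
    return sorted(rules, key=lambda rule: rule[4], reverse=True)


for a, b, s, c, l in pair_rules():
    print(f"{a} -> {b}: support={s}, confidence={c}, lift={l}")
```

On this toy data, only the pair {Learning Scala, Learning Spark} clears the threshold, so the sketch reports just the two rules between those books.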

Thanks for reading!

References –

1. Association Rule Learning – KDnuggets
2. Association Rule Learning – Wikipedia
3. What is association rule learning? – TechTarget



Written by 

Akshansh Jain is a Software Consultant with more than 1 year of experience. He is familiar with Java but also has knowledge of various other programming languages such as Scala, HTML and C++. He is also familiar with different Web Technologies and Android programming. He is a passionate programmer and always eager to learn new technologies & apply them in respective projects.