Association Rules Generation from Frequent Itemsets

Function to generate association rules from frequent itemsets

from mlxtend.frequent_patterns import association_rules

Overview

Rule generation is a common task in the mining of frequent patterns. An association rule is an implication expression of the form X -> Y, where X and Y are disjoint itemsets [1]. A more concrete example based on consumer behaviour would be {Diapers} -> {Beer}, suggesting that people who buy diapers are also likely to buy beer. To evaluate the "interest" of such an association rule, different metrics have been developed. The current implementation makes use of the confidence and lift metrics, among the others described below.

Metrics

The currently supported metrics for evaluating association rules and setting selection thresholds are listed below. Given a rule "A -> C", A stands for antecedent and C stands for consequent.

'support':

  • introduced in [3]

The support metric is defined for itemsets, not association rules: it is the proportion of transactions in the database that contain the itemset. For a rule A->C, the table produced by association_rules reports three such values: 'antecedent support' (the support of A), 'consequent support' (the support of C), and 'support' (the support of the combined itemset A+C). Typically, support is used to measure the abundance or frequency (often interpreted as significance or importance) of an itemset in a database. We refer to an itemset as a "frequent itemset" if its support is at least as large as a specified minimum-support threshold. Note that due to the downward closure property, all subsets of a frequent itemset are also frequent.
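The definition of support and the downward closure property can be checked by hand. The following is a minimal sketch in plain Python; the `support` helper is a hypothetical name used for illustration only (it is not part of mlxtend), and the baskets are the ones from Example 1 in this document.

```python
from itertools import combinations

# Toy transactions (the baskets used in Example 1), stored as sets.
transactions = [
    {"Milk", "Onion", "Nutmeg", "Kidney Beans", "Eggs", "Yogurt"},
    {"Dill", "Onion", "Nutmeg", "Kidney Beans", "Eggs", "Yogurt"},
    {"Milk", "Apple", "Kidney Beans", "Eggs"},
    {"Milk", "Unicorn", "Corn", "Kidney Beans", "Yogurt"},
    {"Corn", "Onion", "Kidney Beans", "Ice cream", "Eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


itemset = {"Eggs", "Kidney Beans", "Onion"}
s = support(itemset, transactions)

# Downward closure: every non-empty proper subset is at least as frequent.
subset_supports = {
    frozenset(sub): support(sub, transactions)
    for k in range(1, len(itemset))
    for sub in combinations(sorted(itemset), k)
}
closure_holds = all(v >= s for v in subset_supports.values())
print(s, closure_holds)  # prints: 0.6 True
```

The three-item set appears in 3 of the 5 baskets, and each of its six proper subsets appears at least as often, which is what the downward closure property predicts.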

'confidence':

  • introduced in [3]

The confidence of a rule A->C is the probability of seeing the consequent in a transaction given that it also contains the antecedent. Note that the metric is not symmetric, i.e. it is directed; for instance, the confidence for A->C is generally different from the confidence for C->A. The confidence is 1 (maximal) for a rule A->C if every transaction that contains the antecedent also contains the consequent.
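To make the asymmetry concrete, here is a small sketch that computes confidence in both directions from raw transaction counts. The counts are read off the toy dataset in Example 1; the `confidence` helper is a hypothetical illustration, not the library's implementation.

```python
# Transaction counts from the toy dataset in Example 1 (5 baskets in total).
n_eggs = 4            # baskets containing Eggs
n_onion = 3           # baskets containing Onion
n_eggs_and_onion = 3  # baskets containing both Eggs and Onion


def confidence(n_both, n_antecedent):
    """Estimate P(consequent | antecedent) as a ratio of counts.

    Assumes the antecedent occurs at least once (n_antecedent > 0).
    """
    return n_both / n_antecedent


eggs_to_onion = confidence(n_eggs_and_onion, n_eggs)   # 3 of 4 egg baskets
onion_to_eggs = confidence(n_eggs_and_onion, n_onion)  # 3 of 3 onion baskets
print(eggs_to_onion, onion_to_eggs)  # prints: 0.75 1.0
```

Every basket with Onion also contains Eggs, so Onion->Eggs reaches the maximal confidence of 1, while Eggs->Onion only reaches 0.75.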

'lift':

  • introduced in [4]

The lift metric is commonly used to measure how much more often the antecedent and consequent of a rule A->C occur together than we would expect if they were statistically independent. If A and C are independent, the lift score will be exactly 1.
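As a rough illustration, lift can be computed from three support values. The sketch below uses exact fractions to avoid floating-point noise; the `lift` helper is a hypothetical name for this example, and the supports match the toy dataset in Example 1.

```python
from fractions import Fraction


def lift(support_a, support_c, support_ac):
    """lift(A->C) = confidence(A->C) / support(C)."""
    confidence = support_ac / support_a
    return confidence / support_c


# Supports as exact fractions, matching the toy dataset in Example 1.
eggs, onion, beans = Fraction(4, 5), Fraction(3, 5), Fraction(1)
eggs_onion = Fraction(3, 5)  # support of {Eggs, Onion}
eggs_beans = Fraction(4, 5)  # support of {Eggs, Kidney Beans}

lift_eggs_onion = lift(eggs, onion, eggs_onion)  # co-occur more than chance
lift_eggs_beans = lift(eggs, beans, eggs_beans)  # beans are in every basket
print(float(lift_eggs_onion), float(lift_eggs_beans))  # prints: 1.25 1.0
```

Because Kidney Beans appear in every basket, knowing that a basket contains Eggs adds no information, and the lift is exactly 1. Unlike confidence, lift gives the same value for A->C and C->A, since it reduces to support(A+C) / (support(A) * support(C)).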

'leverage':

  • introduced in [5]

Leverage computes the difference between the observed frequency of A and C appearing together and the frequency that would be expected if A and C were independent. A leverage value of 0 indicates independence.
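The sign of the leverage tells whether two itemsets co-occur more or less often than independence would predict. The following sketch shows one case of each kind; `leverage` is a hypothetical helper, and the third set of supports is made up for illustration rather than taken from the example dataset.

```python
from fractions import Fraction


def leverage(support_a, support_c, support_ac):
    """leverage(A->C) = support(A+C) - support(A) * support(C)."""
    return support_ac - support_a * support_c


# Eggs -> Onion, with supports from the toy dataset in Example 1.
positive = leverage(Fraction(4, 5), Fraction(3, 5), Fraction(3, 5))
# Eggs -> Kidney Beans: Kidney Beans appear in every basket.
zero = leverage(Fraction(4, 5), Fraction(1), Fraction(4, 5))
# Hypothetical supports (not from the dataset): items that avoid each other.
negative = leverage(Fraction(1, 2), Fraction(1, 2), Fraction(1, 5))
print(float(positive), float(zero), float(negative))  # prints: 0.12 0.0 -0.05
```

The first value, 0.12, agrees with the leverage column for Eggs -> Onion in Example 1.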

'conviction':

  • introduced in [6]

A high conviction value means that the consequent is highly dependent on the antecedent. For instance, in the case of a perfect confidence score, the denominator becomes 0 (due to 1 - 1), for which the conviction score is defined as 'inf'. Similar to lift, if items are independent, the conviction is 1.
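The special case for perfect confidence can be sketched as follows. This is an illustrative helper under the formula given in the API section of this document, not the library's own code; the input values correspond to rules from Example 1.

```python
import math
from fractions import Fraction


def conviction(support_c, confidence):
    """conviction(A->C) = (1 - support(C)) / (1 - confidence(A->C))."""
    if confidence == 1:
        # Perfect confidence: the denominator would be 0, so return inf.
        return math.inf
    return (1 - support_c) / (1 - confidence)


eggs_to_onion = conviction(Fraction(3, 5), Fraction(3, 4))
onion_to_eggs = conviction(Fraction(4, 5), Fraction(1))
beans_to_eggs = conviction(Fraction(4, 5), Fraction(4, 5))
print(float(eggs_to_onion), onion_to_eggs, float(beans_to_eggs))
# prints: 1.6 inf 1.0
```

Kidney Beans -> Eggs has a confidence equal to the support of Eggs, which is the independent case, so its conviction is 1.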

References

[1] Tan, Steinbach, Kumar. Introduction to Data Mining. Pearson New International Edition. Harlow: Pearson Education Ltd., 2014. (pp. 327-414).

[2] Michael Hahsler, http://michael.hahsler.net/research/association_rules/measures.html

[3] R. Agrawal, T. Imielinski, and A. Swami. Mining associations between sets of items in large databases. In Proc. of the ACM SIGMOD Int'l Conference on Management of Data, pages 207-216, Washington D.C., May 1993

[4] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data

[5] Piatetsky-Shapiro, G., Discovery, analysis, and presentation of strong rules. Knowledge Discovery in Databases, 1991: p. 229-248.

[6] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, pages 255-264, Tucson, Arizona, USA, May 1997

Example 1

The association_rules function takes DataFrames of frequent itemsets as produced by the apriori function in mlxtend.frequent_patterns. To demonstrate the usage of the association_rules function, we first create a pandas DataFrame of frequent itemsets as generated by the apriori function:

import pandas as pd
from mlxtend.preprocessing import OnehotTransactions
from mlxtend.frequent_patterns import apriori


dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

oht = OnehotTransactions()
oht_ary = oht.fit(dataset).transform(dataset)
df = pd.DataFrame(oht_ary, columns=oht.columns_)
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)

frequent_itemsets
    support  itemsets
0   0.8      [Eggs]
1   1.0      [Kidney Beans]
2   0.6      [Milk]
3   0.6      [Onion]
4   0.6      [Yogurt]
5   0.8      [Eggs, Kidney Beans]
6   0.6      [Eggs, Onion]
7   0.6      [Kidney Beans, Milk]
8   0.6      [Kidney Beans, Onion]
9   0.6      [Kidney Beans, Yogurt]
10  0.6      [Eggs, Kidney Beans, Onion]

The association_rules() function allows you to (1) specify your metric of interest and (2) the corresponding threshold. Currently implemented measures are support, confidence, lift, leverage, and conviction. Let's say you are interested in rules derived from the frequent itemsets only if the level of confidence is at least 70 percent (min_threshold=0.7):

from mlxtend.frequent_patterns import association_rules

association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
    antecedants            consequents            antecedent support  consequent support  support  confidence  lift  leverage  conviction
0   (Eggs)                 (Kidney Beans)         0.8                 1.0                 0.8      1.00        1.00  0.00      inf
1   (Kidney Beans)         (Eggs)                 1.0                 0.8                 0.8      0.80        1.00  0.00      1.000000
2   (Eggs)                 (Onion)                0.8                 0.6                 0.6      0.75        1.25  0.12      1.600000
3   (Onion)                (Eggs)                 0.6                 0.8                 0.6      1.00        1.25  0.12      inf
4   (Milk)                 (Kidney Beans)         0.6                 1.0                 0.6      1.00        1.00  0.00      inf
5   (Onion)                (Kidney Beans)         0.6                 1.0                 0.6      1.00        1.00  0.00      inf
6   (Yogurt)               (Kidney Beans)         0.6                 1.0                 0.6      1.00        1.00  0.00      inf
7   (Eggs, Onion)          (Kidney Beans)         0.6                 1.0                 0.6      1.00        1.00  0.00      inf
8   (Eggs, Kidney Beans)   (Onion)                0.8                 0.6                 0.6      0.75        1.25  0.12      1.600000
9   (Kidney Beans, Onion)  (Eggs)                 0.6                 0.8                 0.6      1.00        1.25  0.12      inf
10  (Eggs)                 (Kidney Beans, Onion)  0.8                 0.6                 0.6      0.75        1.25  0.12      1.600000
11  (Onion)                (Eggs, Kidney Beans)   0.6                 0.8                 0.6      1.00        1.25  0.12      inf

Example 2

If you are interested in rules fulfilling a different interest metric, you can simply adjust the parameters. E.g. if you are interested only in rules that have a lift score of >= 1.2, you would do the following:

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules
    antecedants            consequents            antecedent support  consequent support  support  confidence  lift  leverage  conviction
0   (Eggs)                 (Onion)                0.8                 0.6                 0.6      0.75        1.25  0.12      1.600000
1   (Onion)                (Eggs)                 0.6                 0.8                 0.6      1.00        1.25  0.12      inf
2   (Eggs, Kidney Beans)   (Onion)                0.8                 0.6                 0.6      0.75        1.25  0.12      1.600000
3   (Kidney Beans, Onion)  (Eggs)                 0.6                 0.8                 0.6      1.00        1.25  0.12      inf
4   (Eggs)                 (Kidney Beans, Onion)  0.8                 0.6                 0.6      0.75        1.25  0.12      1.600000
5   (Onion)                (Eggs, Kidney Beans)   0.6                 0.8                 0.6      1.00        1.25  0.12      inf

Pandas DataFrames make it easy to filter the results further. Let's say we are only interested in rules that satisfy the following criteria:

  1. at least 2 antecedents
  2. a confidence > 0.75
  3. a lift score > 1.2

We could compute the antecedent length as follows:

rules["antecedant_len"] = rules["antecedants"].apply(lambda x: len(x))
rules
    antecedants            consequents            antecedent support  consequent support  support  confidence  lift  leverage  conviction  antecedant_len
0   (Eggs)                 (Onion)                0.8                 0.6                 0.6      0.75        1.25  0.12      1.600000    1
1   (Onion)                (Eggs)                 0.6                 0.8                 0.6      1.00        1.25  0.12      inf         1
2   (Eggs, Kidney Beans)   (Onion)                0.8                 0.6                 0.6      0.75        1.25  0.12      1.600000    2
3   (Kidney Beans, Onion)  (Eggs)                 0.6                 0.8                 0.6      1.00        1.25  0.12      inf         2
4   (Eggs)                 (Kidney Beans, Onion)  0.8                 0.6                 0.6      0.75        1.25  0.12      1.600000    1
5   (Onion)                (Eggs, Kidney Beans)   0.6                 0.8                 0.6      1.00        1.25  0.12      inf         1

Then, we can use pandas' selection syntax as shown below:

rules[ (rules['antecedant_len'] >= 2) &
       (rules['confidence'] > 0.75) &
       (rules['lift'] > 1.2) ]
    antecedants            consequents            antecedent support  consequent support  support  confidence  lift  leverage  conviction  antecedant_len
3   (Kidney Beans, Onion)  (Eggs)                 0.6                 0.8                 0.6      1.0         1.25  0.12      inf         2

API

association_rules(df, metric='confidence', min_threshold=0.8)

Generates a DataFrame of association rules including the metrics 'support', 'confidence', 'lift', 'leverage', and 'conviction'.

Parameters

  • df : pandas DataFrame

    pandas DataFrame of frequent itemsets with columns ['support', 'itemsets']

  • metric : string (default: 'confidence')

    Metric to evaluate if a rule is of interest. Supported metrics are 'support', 'confidence', 'lift', 'leverage', and 'conviction'. These metrics are computed as follows:

      - support(A->C) = support(A+C) [aka 'support'], range: [0, 1]
      - confidence(A->C) = support(A+C) / support(A), range: [0, 1]
      - lift(A->C) = confidence(A->C) / support(C), range: [0, inf]
      - leverage(A->C) = support(A->C) - support(A)*support(C), range: [-1, 1]
      - conviction(A->C) = [1 - support(C)] / [1 - confidence(A->C)], range: [0, inf]

  • min_threshold : float (default: 0.8)

    Minimal threshold for the evaluation metric to decide whether a candidate rule is of interest.

Returns

pandas DataFrame with columns "antecedants" and "consequents" that store the itemsets of each rule, plus the scoring columns "antecedent support", "consequent support", "support", "confidence", "lift", "leverage", "conviction", for all rules for which metric(rule) >= min_threshold.
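The selection rule metric(rule) >= min_threshold can be reproduced by hand for the confidence metric. The sketch below is a plain-Python approximation, not a description of how association_rules is implemented: the `candidate_rules` helper is hypothetical, and the supports are copied from the frequent itemsets in Example 1 as exact fractions.

```python
from fractions import Fraction as F
from itertools import combinations

# Frequent itemsets and their supports from Example 1, as exact fractions.
supports = {
    frozenset({"Eggs"}): F(4, 5),
    frozenset({"Kidney Beans"}): F(1),
    frozenset({"Milk"}): F(3, 5),
    frozenset({"Onion"}): F(3, 5),
    frozenset({"Yogurt"}): F(3, 5),
    frozenset({"Eggs", "Kidney Beans"}): F(4, 5),
    frozenset({"Eggs", "Onion"}): F(3, 5),
    frozenset({"Kidney Beans", "Milk"}): F(3, 5),
    frozenset({"Kidney Beans", "Onion"}): F(3, 5),
    frozenset({"Kidney Beans", "Yogurt"}): F(3, 5),
    frozenset({"Eggs", "Kidney Beans", "Onion"}): F(3, 5),
}


def candidate_rules(supports):
    """Yield (antecedent, consequent, confidence) for each split of each itemset."""
    for itemset, support_ac in supports.items():
        # Single-item sets give an empty range and therefore no rules.
        for k in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), k):
                antecedent = frozenset(combo)
                consequent = itemset - antecedent
                # Downward closure: the antecedent is itself a frequent itemset.
                yield antecedent, consequent, support_ac / supports[antecedent]


min_threshold = F(7, 10)
rules = [r for r in candidate_rules(supports) if r[2] >= min_threshold]
print(len(rules))  # prints: 12
```

Of the 16 candidate rules, 12 reach a confidence of at least 0.7, which matches the number of rows returned in Example 1. The rule (Kidney Beans) -> (Eggs, Onion), for example, is dropped because its confidence is only 0.6.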