Association rules

association_rules: Association rules generation from frequent itemsets

Function to generate association rules from frequent itemsets

from mlxtend.frequent_patterns import association_rules

Overview

Rule generation is a common task in the mining of frequent patterns. An association rule is an implication expression of the form $X \rightarrow Y$ , where $X$ and $Y$ are disjoint itemsets [1]. A more concrete example based on consumer behaviour would be $\{Diapers\} \rightarrow \{Beer\}$ suggesting that people who buy diapers are also likely to buy beer. To evaluate the "interest" of such an association rule, different metrics have been developed. The metrics supported by the current implementation are described below.
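To make the definition concrete, the following plain-Python sketch enumerates every way of splitting one itemset into a non-empty antecedent and a non-empty, disjoint consequent. The helper name candidate_rules is hypothetical and is not part of mlxtend; it only illustrates the form $X \rightarrow Y$ and does not compute any metric.

```python
from itertools import combinations


def candidate_rules(itemset):
    """Yield (antecedent, consequent) pairs that split `itemset`
    into two non-empty, disjoint parts."""
    items = sorted(itemset)
    # Antecedent sizes 1 .. len-1 leave at least one item for the consequent.
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            antecedent = frozenset(antecedent)
            yield antecedent, frozenset(items) - antecedent


rules = list(candidate_rules({"Kidney Beans", "Onion", "Eggs"}))
for antecedent, consequent in rules:
    print(sorted(antecedent), "->", sorted(consequent))
```

An itemset with k items yields 2^k - 2 such splits, so the three-item set above produces six candidate rules.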

Metrics

The currently supported metrics for evaluating association rules and setting selection thresholds are listed below. Given a rule "A -> C", A stands for antecedent and C stands for consequent.

'support':

$\text{support}(A\rightarrow C) = \text{support}(A \cup C), \;\;\; \text{range: } [0, 1]$

introduced in [3]

The support metric is defined for itemsets, not association rules. The table produced by the association rule mining algorithm contains three different support metrics: 'antecedent support', 'consequent support', and 'support'. Here, 'antecedent support' computes the proportion of transactions that contain the antecedent A, and 'consequent support' computes the support for the itemset of the consequent C. The 'support' metric then computes the support of the combined itemset A $\cup$ C.

Typically, support is used to measure the abundance or frequency (often interpreted as significance or importance) of an itemset in a database. We refer to an itemset as a "frequent itemset" if its support is at least as large as a specified minimum-support threshold. Note that in general, due to the downward closure property, all subsets of a frequent itemset are also frequent.
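As a rough illustration of the definition, support can be computed directly from a list of transactions. The sketch below uses plain Python sets and the five toy grocery transactions from the examples on this page; the function name support is chosen here for illustration and is not an mlxtend API.

```python
transactions = [
    {"Milk", "Onion", "Nutmeg", "Kidney Beans", "Eggs", "Yogurt"},
    {"Dill", "Onion", "Nutmeg", "Kidney Beans", "Eggs", "Yogurt"},
    {"Milk", "Apple", "Kidney Beans", "Eggs"},
    {"Milk", "Unicorn", "Corn", "Kidney Beans", "Yogurt"},
    {"Corn", "Onion", "Kidney Beans", "Ice cream", "Eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


print(support({"Kidney Beans"}, transactions))   # 1.0
print(support({"Onion", "Eggs"}, transactions))  # 0.6

# Downward closure: a subset is at least as frequent as its superset.
print(support({"Onion"}, transactions) >= support({"Onion", "Eggs"}, transactions))  # True
```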

'confidence':

$\text{confidence}(A\rightarrow C) = \frac{\text{support}(A\rightarrow C)}{\text{support}(A)}, \;\;\; \text{range: } [0, 1]$

introduced in [3]

The confidence of a rule A->C is the probability of seeing the consequent in a transaction given that it also contains the antecedent. Note that the metric is not symmetric, i.e., it is directed; for instance, the confidence for A->C is generally different from the confidence for C->A. The confidence is 1 (maximal) for a rule A->C if the consequent occurs in every transaction that contains the antecedent.
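The asymmetry can be checked with a few lines of arithmetic. The sketch below assumes the support values of Onion (0.6), Eggs (0.8), and the combined itemset (0.6) from the toy data used in the examples on this page; the function name confidence is illustrative only.

```python
import math


def confidence(support_ac, support_a):
    """confidence(A -> C) = support(A u C) / support(A)."""
    return support_ac / support_a


support_onion = 0.6
support_eggs = 0.8
support_onion_eggs = 0.6

onion_to_eggs = confidence(support_onion_eggs, support_onion)
eggs_to_onion = confidence(support_onion_eggs, support_eggs)

print(onion_to_eggs)  # every Onion transaction also contains Eggs
print(eggs_to_onion)  # but only three out of four Eggs transactions contain Onion
print(math.isclose(onion_to_eggs, eggs_to_onion))  # False
```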

'lift':

$\text{lift}(A\rightarrow C) = \frac{\text{confidence}(A\rightarrow C)}{\text{support}(C)}, \;\;\; \text{range: } [0, \infty]$

introduced in [4]

The lift metric is commonly used to measure how much more often the antecedent and consequent of a rule A->C occur together than we would expect if they were statistically independent. If A and C are independent, the Lift score will be exactly 1.

'leverage':

$\text{leverage}(A\rightarrow C) = \text{support}(A\rightarrow C) - \text{support}(A) \times \text{support}(C), \;\;\; \text{range: } [-1, 1]$

introduced in [5]

Leverage computes the difference between the observed frequency of A and C appearing together and the frequency that would be expected if A and C were independent. A leverage value of 0 indicates independence.

'conviction':

$\text{conviction}(A\rightarrow C) = \frac{1 - \text{support}(C)}{1 - \text{confidence}(A\rightarrow C)}, \;\;\; \text{range: } [0, \infty]$

introduced in [6]

A high conviction value means that the consequent is highly dependent on the antecedent. For instance, in the case of a perfect confidence score, the denominator becomes 0 (due to 1 - 1), for which the conviction score is defined as 'inf'. Similar to lift, if items are independent, the conviction is 1.
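The lift, leverage, and conviction formulas above can be written out from the three support values alone. The following is a minimal sketch, not the mlxtend implementation; the function names and the s_a, s_c, s_ac argument names (support of A, of C, and of the combined itemset) are chosen here for illustration.

```python
import math


def lift(s_a, s_c, s_ac):
    """confidence(A -> C) / support(C)."""
    return (s_ac / s_a) / s_c


def leverage(s_a, s_c, s_ac):
    """Observed joint support minus the joint support expected under independence."""
    return s_ac - s_a * s_c


def conviction(s_a, s_c, s_ac):
    """(1 - support(C)) / (1 - confidence(A -> C)); 'inf' for perfect confidence."""
    conf = s_ac / s_a
    if math.isclose(conf, 1.0):
        return math.inf
    return (1 - s_c) / (1 - conf)


# Eggs -> Onion in the toy data: support(A)=0.8, support(C)=0.6, support(A u C)=0.6
print(lift(0.8, 0.6, 0.6), leverage(0.8, 0.6, 0.6), conviction(0.8, 0.6, 0.6))

# Independent items: support(A u C) equals support(A) * support(C)
print(lift(0.5, 0.4, 0.2), leverage(0.5, 0.4, 0.2), conviction(0.5, 0.4, 0.2))
```

For the independent case, lift and conviction come out at 1 and leverage at 0, which matches the reference values stated for each metric.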

'zhangs_metric':

$\text{zhangs metric}(A\rightarrow C) = \frac{\text{confidence}(A\rightarrow C) - \text{confidence}(A'\rightarrow C)}{Max[ \text{confidence}(A\rightarrow C) , \text{confidence}(A'\rightarrow C)]}, \;\;\; \text{range: } [-1, 1]$

introduced in [7]

Measures both association and dissociation. Value ranges between -1 and 1. A positive value (>0) indicates association, and a negative value indicates dissociation.
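The term $\text{confidence}(A'\rightarrow C)$ refers to transactions that do not contain A. It can be derived from the same three supports, since support(A') = 1 - support(A) and the support of "C without A" is support(C) - support(A u C). The sketch below follows that derivation; returning 0 in the degenerate cases (A occurs in every transaction, or both confidences are 0) is an assumption of this sketch, and the function is not the mlxtend implementation.

```python
import math


def zhangs_metric(s_a, s_c, s_ac):
    """Zhang's metric from support(A), support(C), and support(A u C)."""
    conf_a = s_ac / s_a
    not_a = 1 - s_a  # support(A')
    if not_a == 0:
        # Assumption: with no transactions lacking A, report 0.
        return 0.0
    conf_not_a = (s_c - s_ac) / not_a  # confidence(A' -> C)
    denom = max(conf_a, conf_not_a)
    if denom == 0:
        # Assumption: C never occurs, so there is nothing to compare.
        return 0.0
    return (conf_a - conf_not_a) / denom


print(zhangs_metric(0.6, 0.8, 0.6))  # Onion -> Eggs
print(zhangs_metric(0.8, 0.6, 0.6))  # Eggs -> Onion
print(zhangs_metric(1.0, 0.8, 0.8))  # Kidney Beans -> Eggs (degenerate case)
```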

'jaccard':

$\text{jaccard}(A\rightarrow C) = \frac{\text{support}(A\rightarrow C)}{\text{support}(A) + \text{support}(C) - \text{support}(A\rightarrow C)}, \;\;\; \text{range: } [0, 1]$

introduced in [8]

Measures similarity between A and C. Value ranges between 0 and 1. A value of 0 indicates complete dissimilarity, and a value of 1 indicates complete similarity.

'certainty':

$\text{certainty}(A\rightarrow C) = \frac{\text{confidence}(A\rightarrow C) - \text{support}(C)}{1 - \text{support}(C)}, \;\;\; \text{range: } [-1, 1]$

introduced in [9]

Measures the certainty between A and C. Value ranges from -1 to 1, where 0 indicates independence.

'kulczynski':

$\text{Kulczynski}(A\rightarrow C) = \frac{1}{2}\left(\frac{\text{support}(A\rightarrow C)}{\text{support}(A)} + \frac{\text{support}(A\rightarrow C)}{\text{support}(C)}\right), \;\;\; \text{range: } [0, 1]$

introduced in [10]

Measures the association between A and C. Value ranges from 0 to 1. Rules near 0 or 1 are considered negatively or positively associated, respectively. Rules near 0.5 are considered to be uninteresting.
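The Jaccard, certainty, and Kulczynski formulas can likewise be sketched from the three supports. The function and argument names below are illustrative and not part of mlxtend. Returning 0 for certainty when support(C) is 1 is an assumption, chosen to mirror the 0.000 certainty values shown in the example tables for consequents with support 1.0.

```python
import math


def jaccard(s_a, s_c, s_ac):
    """support(A u C) divided by the share of transactions containing A or C."""
    return s_ac / (s_a + s_c - s_ac)


def certainty(s_a, s_c, s_ac):
    """(confidence(A -> C) - support(C)) / (1 - support(C))."""
    denom = 1 - s_c
    if denom == 0:
        # Assumption: a consequent present in every transaction gives 0.
        return 0.0
    return (s_ac / s_a - s_c) / denom


def kulczynski(s_a, s_c, s_ac):
    """Average of confidence(A -> C) and confidence(C -> A)."""
    return 0.5 * (s_ac / s_a + s_ac / s_c)


# Onion -> Eggs in the toy data: support(A)=0.6, support(C)=0.8, support(A u C)=0.6
s_a, s_c, s_ac = 0.6, 0.8, 0.6
print(jaccard(s_a, s_c, s_ac), certainty(s_a, s_c, s_ac), kulczynski(s_a, s_c, s_ac))
```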

Generating association rules in the presence of missing values

The FP-Growth and FP-Max implementations already handle missing information in the input, and the corresponding association rules can now be generated as well. As before, the supports are taken from the frequent-itemset algorithm, and the remaining metrics are re-formulated on top of them. The implementation again uses the so-called "disabled" array derived from the original DataFrame, in which null values are replaced with ones and all other entries with NaNs. For the association rules to be meaningful, a count is kept for each of the following cases: a null value in the antecedent, a null value in the consequent, a null value in the combination of both, a NaN in the consequent while all antecedent items are present, and vice versa. Following [11], the metrics are re-defined below:

'support':

$\text{Support}(A\rightarrow C) = \frac{|B_{AC}|}{|B| - |\text{Dis}(AC)|}, \;\;\; \text{range: } [0, 1]$

where $|B_{AC}|$ is the number of transactions in which both A and C occur, $|B|$ is the total number of transactions, and $|\text{Dis}(AC)|$ is the number of transactions with a NaN in either A or C, since

$\text{Dis}(AC) = \text{Dis}(A)\cup\text{Dis}(C)$

'confidence':

$\text{Confidence}(A\rightarrow C) = \frac{|B_{AC}|}{|B_{A}| - |\text{Dis}(C)\cap B_{A}|}, \;\;\; \text{range: } [0, 1]$

where $|\text{Dis}(C)\cap B_{A}|$ is the number of transactions that have a NaN in C and in which A occurs. In the code, this formula has been re-arranged using the supports obtained from the algorithm and is formulated as sAC*(num_itemsets - disAC) / (sA*(num_itemsets - disA) - dis_int), where sAC*(num_itemsets - disAC) is the number of transactions in which both A and C occur, sA*(num_itemsets - disA) is the number of transactions in which A occurs, and dis_int is the term mentioned above.

'representativity':

$\text{Representativity}(A) = \frac{|B| - |\text{Dis}(A)|}{|B|}, \;\;\; \text{range: } [0, 1]$

introduced in [11]

A new metric introduced in [11] that essentially represents how much information is present in itemset A across all the transactions in the database.

The rest of the metrics are derived according to re-formulated support and confidence metrics, while their formulas are kept identical as before but given the "new" support and confidence.
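The counts in the re-defined formulas can be illustrated on a tiny table in plain Python. The sketch below is one reading of the definitions above and not the mlxtend implementation: each transaction is a dict mapping an item to True, False, or None (missing), a transaction counts as "disabled" for an itemset if any of its items is missing, and all names (rows, present, disabled, and so on) are hypothetical.

```python
rows = [
    {"a": True, "c": True},
    {"a": True, "c": None},
    {"a": True, "c": False},
    {"a": None, "c": True},
    {"a": False, "c": True},
]


def present(row, items):
    """True if every item of the itemset is observed and present."""
    return all(row[i] is True for i in items)


def disabled(row, items):
    """True if at least one item of the itemset is missing."""
    return any(row[i] is None for i in items)


def support(rows, A, C):
    b_ac = sum(present(r, A | C) for r in rows)
    dis_ac = sum(disabled(r, A | C) for r in rows)  # Dis(A) u Dis(C)
    return b_ac / (len(rows) - dis_ac)


def confidence(rows, A, C):
    b_ac = sum(present(r, A | C) for r in rows)
    b_a = sum(present(r, A) for r in rows)
    dis_c_in_a = sum(present(r, A) and disabled(r, C) for r in rows)
    return b_ac / (b_a - dis_c_in_a)


def representativity(rows, A):
    dis_a = sum(disabled(r, A) for r in rows)
    return (len(rows) - dis_a) / len(rows)


A, C = {"a"}, {"c"}
print(support(rows, A, C))        # 1 / (5 - 2)
print(confidence(rows, A, C))     # 1 / (3 - 1)
print(representativity(rows, A))  # (5 - 1) / 5
```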

References

[1] Tan, Steinbach, Kumar. Introduction to Data Mining. Pearson New International Edition. Harlow: Pearson Education Ltd., 2014. (pp. 327-414).

[2] Michael Hahsler, https://michael.hahsler.net/research/association_rules/measures.html

[3] R. Agrawal, T. Imielinski, and A. Swami. Mining associations between sets of items in large databases. In Proc. of the ACM SIGMOD Int'l Conference on Management of Data, pages 207-216, Washington D.C., May 1993

[4] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data

[5] Piatetsky-Shapiro, G., Discovery, analysis, and presentation of strong rules. Knowledge Discovery in Databases, 1991: p. 229-248.

[6] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, pages 255-264, Tucson, Arizona, USA, May 1997

[7] Xiaowei Yan, Chengqi Zhang & Shichao Zhang (2009). Confidence Metrics for Association Rule Mining. Applied Artificial Intelligence, 23:8, 713-737. https://www.tandfonline.com/doi/pdf/10.1080/08839510903208062

[8] Pang-Ning Tan, Vipin Kumar, Jaideep Srivastava. Selecting the right objective measure for association analysis. Information Systems, Volume 29, Issue 4, 2004, Pages 293-313.

[9] Berzal Fernando, Blanco Ignacio, Sánchez Daniel, Vila, María-Amparo. Measuring the accuracy and interest of association rules: A new framework. Intelligent Data Analysis, Volume 6, no. 3, 2002, Pages 221-235.

[10] Wu, T., Chen, Y., Han, J. Re-examination of interestingness measures in pattern mining: a unified framework. Data Min Knowl Disc 21, 371–397 (2010). https://doi.org/10.1007/s10618-009-0161-2.

[11] Ragel, A. and Crémilleux, B., 1998. "Treatment of missing values for association rules". In Research and Development in Knowledge Discovery and Data Mining: Second Pacific-Asia Conference, PAKDD-98 Melbourne, Australia, April 15–17, 1998 Proceedings 2 (pp. 258-270). Springer Berlin Heidelberg.

Example 1 -- Generating Association Rules from Frequent Itemsets

The association_rules function takes DataFrames of frequent itemsets as produced by the apriori, fpgrowth, or fpmax functions in mlxtend.frequent_patterns. To demonstrate the usage of the association_rules function, we first create a pandas DataFrame of frequent itemsets as generated by the fpgrowth function:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpmax, fpgrowth


dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
### alternatively:
#frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
#frequent_itemsets = fpmax(df, min_support=0.6, use_colnames=True)

frequent_itemsets

	support	itemsets
0	1.0	(Kidney Beans)
1	0.8	(Eggs)
2	0.6	(Yogurt)
3	0.6	(Milk)
4	0.6	(Onion)
5	0.8	(Kidney Beans, Eggs)
6	0.6	(Kidney Beans, Yogurt)
7	0.6	(Kidney Beans, Milk)
8	0.6	(Onion, Eggs)
9	0.6	(Kidney Beans, Onion)
10	0.6	(Kidney Beans, Onion, Eggs)

The association_rules() function allows you to specify (1) your metric of interest and (2) the corresponding threshold. The supported metrics are described in the Metrics section above. Let's say you are interested in rules derived from the frequent itemsets only if the level of confidence is at or above the 70 percent threshold (min_threshold=0.7):

from mlxtend.frequent_patterns import association_rules

association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7, num_itemsets=len(df.index))

/home/marcelo/anaconda3/envs/analysis/lib/python3.10/site-packages/mlxtend/frequent_patterns/association_rules.py:182: RuntimeWarning: invalid value encountered in divide
  cert_metric = np.where(certainty_denom == 0, 0, certainty_num / certainty_denom)

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	representativity	leverage	conviction	zhangs_metric	jaccard	certainty	kulczynski
0	(Kidney Beans)	(Eggs)	1.0	0.8	0.8	0.80	1.00	1.0	0.00	1.0	0.0	0.80	0.000	0.900
1	(Eggs)	(Kidney Beans)	0.8	1.0	0.8	1.00	1.00	1.0	0.00	inf	0.0	0.80	0.000	0.900
2	(Yogurt)	(Kidney Beans)	0.6	1.0	0.6	1.00	1.00	1.0	0.00	inf	0.0	0.60	0.000	0.800
3	(Milk)	(Kidney Beans)	0.6	1.0	0.6	1.00	1.00	1.0	0.00	inf	0.0	0.60	0.000	0.800
4	(Onion)	(Eggs)	0.6	0.8	0.6	1.00	1.25	1.0	0.12	inf	0.5	0.75	1.000	0.875
5	(Eggs)	(Onion)	0.8	0.6	0.6	0.75	1.25	1.0	0.12	1.6	1.0	0.75	0.375	0.875
6	(Onion)	(Kidney Beans)	0.6	1.0	0.6	1.00	1.00	1.0	0.00	inf	0.0	0.60	0.000	0.800
7	(Kidney Beans, Onion)	(Eggs)	0.6	0.8	0.6	1.00	1.25	1.0	0.12	inf	0.5	0.75	1.000	0.875
8	(Kidney Beans, Eggs)	(Onion)	0.8	0.6	0.6	0.75	1.25	1.0	0.12	1.6	1.0	0.75	0.375	0.875
9	(Onion, Eggs)	(Kidney Beans)	0.6	1.0	0.6	1.00	1.00	1.0	0.00	inf	0.0	0.60	0.000	0.800
10	(Onion)	(Kidney Beans, Eggs)	0.6	0.8	0.6	1.00	1.25	1.0	0.12	inf	0.5	0.75	1.000	0.875
11	(Eggs)	(Kidney Beans, Onion)	0.8	0.6	0.6	0.75	1.25	1.0	0.12	1.6	1.0	0.75	0.375	0.875

Example 2 -- Rule Generation and Selection Criteria

If you are interested in rules according to a different metric of interest, you can simply adjust the metric and min_threshold arguments. For example, if you are only interested in rules that have a lift score of >= 1.2, you would do the following:

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2, num_itemsets=len(df.index))
rules

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	representativity	leverage	conviction	zhangs_metric	jaccard	certainty	kulczynski
0	(Onion)	(Eggs)	0.6	0.8	0.6	1.00	1.25	1.0	0.12	inf	0.5	0.75	1.000	0.875
1	(Eggs)	(Onion)	0.8	0.6	0.6	0.75	1.25	1.0	0.12	1.6	1.0	0.75	0.375	0.875
2	(Kidney Beans, Onion)	(Eggs)	0.6	0.8	0.6	1.00	1.25	1.0	0.12	inf	0.5	0.75	1.000	0.875
3	(Kidney Beans, Eggs)	(Onion)	0.8	0.6	0.6	0.75	1.25	1.0	0.12	1.6	1.0	0.75	0.375	0.875
4	(Onion)	(Kidney Beans, Eggs)	0.6	0.8	0.6	1.00	1.25	1.0	0.12	inf	0.5	0.75	1.000	0.875
5	(Eggs)	(Kidney Beans, Onion)	0.8	0.6	0.6	0.75	1.25	1.0	0.12	1.6	1.0	0.75	0.375	0.875

Pandas DataFrames make it easy to filter the results further. Let's say we are only interested in rules that satisfy the following criteria:

at least 2 antecedents
a confidence > 0.75
a lift score > 1.2

We could compute the antecedent length as follows:

rules["antecedent_len"] = rules["antecedents"].apply(lambda x: len(x))
rules

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	representativity	leverage	conviction	zhangs_metric	jaccard	certainty	kulczynski	antecedent_len
0	(Onion)	(Eggs)	0.6	0.8	0.6	1.00	1.25	1.0	0.12	inf	0.5	0.75	1.000	0.875	1
1	(Eggs)	(Onion)	0.8	0.6	0.6	0.75	1.25	1.0	0.12	1.6	1.0	0.75	0.375	0.875	1
2	(Kidney Beans, Onion)	(Eggs)	0.6	0.8	0.6	1.00	1.25	1.0	0.12	inf	0.5	0.75	1.000	0.875	2
3	(Kidney Beans, Eggs)	(Onion)	0.8	0.6	0.6	0.75	1.25	1.0	0.12	1.6	1.0	0.75	0.375	0.875	2
4	(Onion)	(Kidney Beans, Eggs)	0.6	0.8	0.6	1.00	1.25	1.0	0.12	inf	0.5	0.75	1.000	0.875	1
5	(Eggs)	(Kidney Beans, Onion)	0.8	0.6	0.6	0.75	1.25	1.0	0.12	1.6	1.0	0.75	0.375	0.875	1

Then, we can use pandas' selection syntax as shown below:

rules[ (rules['antecedent_len'] >= 2) &
       (rules['confidence'] > 0.75) &
       (rules['lift'] > 1.2) ]

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	representativity	leverage	conviction	zhangs_metric	jaccard	certainty	kulczynski	antecedent_len
2	(Kidney Beans, Onion)	(Eggs)	0.6	0.8	0.6	1.0	1.25	1.0	0.12	inf	0.5	0.75	1.0	0.875	2

Similarly, using the Pandas API, we can select entries based on the "antecedents" or "consequents" columns:

rules[rules['antecedents'] == {'Eggs', 'Kidney Beans'}]

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	representativity	leverage	conviction	zhangs_metric	jaccard	certainty	kulczynski	antecedent_len
3	(Kidney Beans, Eggs)	(Onion)	0.8	0.6	0.6	0.75	1.25	1.0	0.12	1.6	1.0	0.75	0.375	0.875	2

Frozensets

Note that the entries in the "itemsets" column are of type frozenset, which is a built-in Python type that is similar to a Python set but immutable, which makes it more efficient for certain query or comparison operations (https://docs.python.org/3.6/library/stdtypes.html#frozenset). Since frozensets are sets, the item order does not matter. I.e., the query

rules[rules['antecedents'] == {'Eggs', 'Kidney Beans'}]

is equivalent to any of the following three

rules[rules['antecedents'] == {'Kidney Beans', 'Eggs'}]
rules[rules['antecedents'] == frozenset(('Eggs', 'Kidney Beans'))]
rules[rules['antecedents'] == frozenset(('Kidney Beans', 'Eggs'))]
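The order-independence comes from Python's own set semantics and can be checked without pandas. The short sketch below uses only built-in types; the variable names are illustrative.

```python
a = frozenset(("Eggs", "Kidney Beans"))
b = frozenset(("Kidney Beans", "Eggs"))

print(a == b)                         # True: item order does not matter
print(a == {"Kidney Beans", "Eggs"})  # True: a frozenset compares equal to a set with the same items

# Unlike a set, a frozenset is hashable and can serve as a dictionary key.
labels = {a: "first rule"}
print(labels[b])                      # first rule

# A frozenset is immutable, so it has no add() method.
print(hasattr(a, "add"))              # False
```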

Example 3 -- Frequent Itemsets with Incomplete Antecedent and Consequent Information

Most metrics computed by association_rules depend on the consequent and antecedent support score of a given rule provided in the frequent itemset input DataFrame. Consider the following example:

import pandas as pd

# Use a name other than `dict` to avoid shadowing the Python built-in.
data = {'itemsets': [['177', '176'], ['177', '179'],
                     ['176', '178'], ['176', '179'],
                     ['93', '100'], ['177', '178'],
                     ['177', '176', '178']],
        'support': [0.253623, 0.253623, 0.217391,
                    0.217391, 0.181159, 0.108696, 0.108696]}

freq_itemsets = pd.DataFrame(data)
freq_itemsets

	itemsets	support
0	[177, 176]	0.253623
1	[177, 179]	0.253623
2	[176, 178]	0.217391
3	[176, 179]	0.217391
4	[93, 100]	0.181159
5	[177, 178]	0.108696
6	[177, 176, 178]	0.108696

Note that this is a "cropped" DataFrame that doesn't contain the support values of the item subsets. This can create problems if we want to compute the association rule metrics for, e.g., 176 => 177.

For example, the confidence is computed as

$\text{confidence}(A\rightarrow C) = \frac{\text{support}(A\rightarrow C)}{\text{support}(A)}, \;\;\; \text{range: } [0, 1]$

But we do not have $\text{support}(A)$ . All we know about "A"'s support is that it is at least 0.253623.
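One way to see which support values are missing is to list, for each itemset, the proper subsets that do not appear in the table. The sketch below does this in plain Python for the itemsets above; missing_subsets is a hypothetical helper and not an mlxtend function.

```python
from itertools import combinations

itemsets = [["177", "176"], ["177", "179"],
            ["176", "178"], ["176", "179"],
            ["93", "100"], ["177", "178"],
            ["177", "176", "178"]]

# Itemsets whose support is available in the table.
known = {frozenset(s) for s in itemsets}


def missing_subsets(itemset, known):
    """Return the non-empty proper subsets of `itemset` that are not in `known`."""
    items = sorted(itemset)
    missing = []
    for size in range(1, len(items)):
        for subset in combinations(items, size):
            if frozenset(subset) not in known:
                missing.append(frozenset(subset))
    return missing


# Both single-item subsets are absent, so neither 176 => 177
# nor 177 => 176 can be given a confidence value.
print(missing_subsets(["177", "176"], known))

# For the three-item set, the pairs are known but the single items are not.
print(len(missing_subsets(["177", "176", "178"], known)))  # 3
```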

In these scenarios, where not all metrics can be computed due to incomplete input DataFrames, you can use the support_only=True option, which only computes the support column of a given rule, since that does not require as much information:

$\text{support}(A\rightarrow C) = \text{support}(A \cup C), \;\;\; \text{range: } [0, 1]$

NaN values will be assigned to all other metric columns:

from mlxtend.frequent_patterns import association_rules

res = association_rules(freq_itemsets, support_only=True, min_threshold=0.1, num_itemsets=0)
res

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	representativity	leverage	conviction	zhangs_metric	jaccard	certainty	kulczynski
0	(176)	(177)	NaN	NaN	0.253623	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	(177)	(176)	NaN	NaN	0.253623	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	(179)	(177)	NaN	NaN	0.253623	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	(177)	(179)	NaN	NaN	0.253623	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	(176)	(178)	NaN	NaN	0.217391	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	(178)	(176)	NaN	NaN	0.217391	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6	(176)	(179)	NaN	NaN	0.217391	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
7	(179)	(176)	NaN	NaN	0.217391	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8	(100)	(93)	NaN	NaN	0.181159	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9	(93)	(100)	NaN	NaN	0.181159	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10	(178)	(177)	NaN	NaN	0.108696	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
11	(177)	(178)	NaN	NaN	0.108696	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
12	(176, 178)	(177)	NaN	NaN	0.108696	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
13	(176, 177)	(178)	NaN	NaN	0.108696	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
14	(178, 177)	(176)	NaN	NaN	0.108696	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
15	(176)	(178, 177)	NaN	NaN	0.108696	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
16	(178)	(176, 177)	NaN	NaN	0.108696	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
17	(177)	(176, 178)	NaN	NaN	0.108696	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

To clean up the representation, you may want to do the following:

res = res[['antecedents', 'consequents', 'support']]
res

	antecedents	consequents	support
0	(176)	(177)	0.253623
1	(177)	(176)	0.253623
2	(179)	(177)	0.253623
3	(177)	(179)	0.253623
4	(176)	(178)	0.217391
5	(178)	(176)	0.217391
6	(176)	(179)	0.217391
7	(179)	(176)	0.217391
8	(100)	(93)	0.181159
9	(93)	(100)	0.181159
10	(178)	(177)	0.108696
11	(177)	(178)	0.108696
12	(176, 178)	(177)	0.108696
13	(176, 177)	(178)	0.108696
14	(178, 177)	(176)	0.108696
15	(176)	(178, 177)	0.108696
16	(178)	(176, 177)	0.108696
17	(177)	(176, 178)	0.108696

Example 4 -- Pruning Association Rules

There is no specific API for pruning. Instead, the pandas API can be used on the resulting data frame to remove individual rows. E.g., suppose we have the following rules:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpmax, fpgrowth
from mlxtend.frequent_patterns import association_rules


dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2, num_itemsets=len(df.index))
rules

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	representativity	leverage	conviction	zhangs_metric	jaccard	certainty	kulczynski
0	(Onion)	(Eggs)	0.6	0.8	0.6	1.00	1.25	1.0	0.12	inf	0.5	0.75	1.000	0.875
1	(Eggs)	(Onion)	0.8	0.6	0.6	0.75	1.25	1.0	0.12	1.6	1.0	0.75	0.375	0.875
2	(Kidney Beans, Onion)	(Eggs)	0.6	0.8	0.6	1.00	1.25	1.0	0.12	inf	0.5	0.75	1.000	0.875
3	(Kidney Beans, Eggs)	(Onion)	0.8	0.6	0.6	0.75	1.25	1.0	0.12	1.6	1.0	0.75	0.375	0.875
4	(Onion)	(Kidney Beans, Eggs)	0.6	0.8	0.6	1.00	1.25	1.0	0.12	inf	0.5	0.75	1.000	0.875
5	(Eggs)	(Kidney Beans, Onion)	0.8	0.6	0.6	0.75	1.25	1.0	0.12	1.6	1.0	0.75	0.375	0.875

and we want to remove the rule "(Onion, Kidney Beans) -> (Eggs)". In order to do this, we can define selection masks and remove this row as follows:

antecedent_sele = rules['antecedents'] == frozenset({'Onion', 'Kidney Beans'}) # or  frozenset({'Kidney Beans', 'Onion'})
consequent_sele = rules['consequents'] == frozenset({'Eggs'})
final_sele = (antecedent_sele & consequent_sele)

rules.loc[ ~final_sele ]

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	representativity	leverage	conviction	zhangs_metric	jaccard	certainty	kulczynski
0	(Onion)	(Eggs)	0.6	0.8	0.6	1.00	1.25	1.0	0.12	inf	0.5	0.75	1.000	0.875
1	(Eggs)	(Onion)	0.8	0.6	0.6	0.75	1.25	1.0	0.12	1.6	1.0	0.75	0.375	0.875
3	(Kidney Beans, Eggs)	(Onion)	0.8	0.6	0.6	0.75	1.25	1.0	0.12	1.6	1.0	0.75	0.375	0.875
4	(Onion)	(Kidney Beans, Eggs)	0.6	0.8	0.6	1.00	1.25	1.0	0.12	inf	0.5	0.75	1.000	0.875
5	(Eggs)	(Kidney Beans, Onion)	0.8	0.6	0.6	0.75	1.25	1.0	0.12	1.6	1.0	0.75	0.375	0.875

Example 5 -- Generating Association Rules from data with missing information

import pandas as pd
import numpy as np
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpmax, fpgrowth
from mlxtend.frequent_patterns import association_rules


dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

rows, columns = df.shape
idx = np.random.randint(0, rows, 10)
col = np.random.randint(0, columns, 10)

for i in range(10):
    df.iloc[idx[i], col[i]] = np.nan

df

/tmp/ipykernel_34953/2823279667.py:23: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.
  df.iloc[idx[i], col[i]] = np.nan

	Apple	Corn	Dill	Eggs	Ice cream	Kidney Beans	Milk	Nutmeg	Onion	Unicorn	Yogurt
0	False	False	False	True	False	True	True	True	True	False	NaN
1	False	False	True	True	False	NaN	NaN	True	NaN	False	NaN
2	True	False	False	True	False	True	True	False	False	False	False
3	False	True	False	False	False	True	True	False	False	True	True
4	False	NaN	False	True	NaN	True	False	False	NaN	NaN	False

The example below repeats the workflow above for a dataset with missing values. The function still allows you to specify (1) your metric of interest and (2) the corresponding threshold. In addition, null_values=True must now be set both in fpgrowth/fpmax and in association_rules, and the original df and its size must be passed to association_rules as parameters. We try the example below using metric="confidence" and min_threshold=0.8.

frequent_itemsets = fpgrowth(df, min_support=0.6, null_values = True, use_colnames=True)
# frequent_itemsets = fpmax(df, min_support=0.6, null_values = True, use_colnames=True)
rules = association_rules(frequent_itemsets, len(df), df, null_values = True, metric="confidence", min_threshold=0.8)
rules

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	representativity	leverage	conviction	zhangs_metric	jaccard	certainty	kulczynski
0	(Eggs)	(Kidney Beans)	0.80	1.0	0.75	1.0	1.0	0.8	-0.05	inf	-0.25	0.714286	0.0	0.875
1	(Milk)	(Kidney Beans)	0.75	1.0	0.75	1.0	1.0	0.8	0.00	inf	0.00	0.750000	0.0	0.875
