Frequent Itemsets via Apriori Algorithm

Apriori function to extract frequent itemsets for association rule mining

from mlxtend.frequent_patterns import apriori

Overview

Apriori is a popular algorithm [1] for extracting frequent itemsets, with applications in association rule learning. The Apriori algorithm was designed to operate on databases containing transactions, such as purchases by customers of a store. An itemset is considered "frequent" if it meets a user-specified support threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is defined as a set of items that occur together in at least 50% of all transactions in the database.
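
As a rough illustration of the support threshold described above, here is a minimal pure-Python sketch. The toy transactions and the helper name support are assumptions made for this example and are not part of mlxtend.

```python
# Hypothetical toy transactions; each transaction is a set of item names.
transactions = [
    {"Milk", "Eggs", "Bread"},
    {"Milk", "Eggs"},
    {"Bread", "Butter"},
    {"Milk", "Eggs", "Butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


min_support = 0.5  # user-specified threshold (50%)
for candidate in [{"Milk"}, {"Milk", "Eggs"}, {"Bread", "Butter"}]:
    s = support(candidate, transactions)
    label = "frequent" if s >= min_support else "not frequent"
    print(sorted(candidate), s, label)
```

Here {Milk, Eggs} occurs in 3 of 4 transactions (support 0.75) and counts as frequent, while {Bread, Butter} occurs in only 1 of 4 (support 0.25) and does not.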

References

[1] Agrawal, Rakesh, and Ramakrishnan Srikant. "Fast algorithms for mining association rules." Proc. 20th Int. Conf. Very Large Data Bases (VLDB). Vol. 1215. 1994.

Example 1

The apriori function expects data in a one-hot encoded pandas DataFrame. Suppose we have the following transaction data:

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

We can transform it into the right format via the OnehotTransactions encoder as follows:

import pandas as pd
from mlxtend.preprocessing import OnehotTransactions

oht = OnehotTransactions()
oht_ary = oht.fit(dataset).transform(dataset)
df = pd.DataFrame(oht_ary, columns=oht.columns_)
df
   Apple  Corn  Dill  Eggs  Ice cream  Kidney Beans  Milk  Nutmeg  Onion  Unicorn  Yogurt
0      0     0     0     1          0             1     1       1      1        0       1
1      0     0     1     1          0             1     0       1      1        0       1
2      1     0     0     1          0             1     1       0      0        0       0
3      0     1     0     0          0             1     1       0      0        1       1
4      0     1     0     1          1             1     0       0      1        0       0
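
To make the encoding step more concrete, the following is a minimal pure-Python sketch of what a one-hot encoding of the transactions amounts to. It is not the OnehotTransactions implementation; the names columns and one_hot are introduced here for illustration only.

```python
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

# The sorted unique item names define the column order.
columns = sorted({item for transaction in dataset for item in transaction})

# One row per transaction: 1 if the item occurs in it, 0 otherwise.
# A repeated item (such as 'Onion' in the last transaction) still yields a single 1.
one_hot = [[int(item in transaction) for item in columns]
           for transaction in dataset]

print(columns)
for row in one_hot:
    print(row)
```

Each row of one_hot corresponds to one transaction, and each position corresponds to one entry of columns, which is the layout shown in the table above.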

Now, let us return the items and itemsets with at least 60% support:

from mlxtend.frequent_patterns import apriori

apriori(df, min_support=0.6)
    support  itemsets
0       0.8  [3]
1       1.0  [5]
2       0.6  [6]
3       0.6  [8]
4       0.6  [10]
5       0.8  [3, 5]
6       0.6  [3, 8]
7       0.6  [5, 6]
8       0.6  [5, 8]
9       0.6  [5, 10]
10      0.6  [3, 5, 8]

By default, apriori returns the column indices of the items, which may be useful in downstream operations such as association rule mining. For better readability, we can set use_colnames=True to convert these integer values into the respective item names:

apriori(df, min_support=0.6, use_colnames=True)
    support  itemsets
0       0.8  [Eggs]
1       1.0  [Kidney Beans]
2       0.6  [Milk]
3       0.6  [Onion]
4       0.6  [Yogurt]
5       0.8  [Eggs, Kidney Beans]
6       0.6  [Eggs, Onion]
7       0.6  [Kidney Beans, Milk]
8       0.6  [Kidney Beans, Onion]
9       0.6  [Kidney Beans, Yogurt]
10      0.6  [Eggs, Kidney Beans, Onion]
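
For intuition about how such a result can be derived, below is a minimal level-wise sketch in plain Python. It is not mlxtend's implementation: the function name apriori_sketch, the helper keep_frequent, and the return format (a dict mapping frozensets to support values) are assumptions made for illustration. It relies on the property that every subset of a frequent itemset must itself be frequent.

```python
from itertools import combinations

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions containing every item of `itemset`.
        return sum(1 for t in transactions if itemset <= t) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items.
    current = keep_frequent({frozenset([i]) for t in transactions for i in t})
    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        pruned = {c for c in joined
                  if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = keep_frequent(pruned)
    return frequent


result = apriori_sketch(dataset, min_support=0.6)
for itemset, s in sorted(result.items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(s, sorted(itemset))
```

On this dataset the sketch finds the same 11 itemsets and support values as listed above. The pruning step is what keeps the search manageable: a candidate such as {Eggs, Kidney Beans, Milk} is discarded without counting, because its subset {Eggs, Milk} is not frequent.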

Example 2

The advantage of working with pandas DataFrames is that we can use its convenient features to filter the results. For instance, let's assume we are only interested in itemsets of length 2 that have a support of at least 80 percent. First, we create the frequent itemsets via apriori and add a new column that stores the length of each itemset:

frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets
    support  itemsets                     length
0       0.8  [Eggs]                       1
1       1.0  [Kidney Beans]               1
2       0.6  [Milk]                       1
3       0.6  [Onion]                      1
4       0.6  [Yogurt]                     1
5       0.8  [Eggs, Kidney Beans]         2
6       0.6  [Eggs, Onion]                2
7       0.6  [Kidney Beans, Milk]         2
8       0.6  [Kidney Beans, Onion]        2
9       0.6  [Kidney Beans, Yogurt]       2
10      0.6  [Eggs, Kidney Beans, Onion]  3

Then, we can select the results that satisfy our desired criteria as follows:

frequent_itemsets[ (frequent_itemsets['length'] == 2) &
                   (frequent_itemsets['support'] >= 0.8) ]
   support  itemsets              length
5      0.8  [Eggs, Kidney Beans]  2

API

apriori(df, min_support=0.5, use_colnames=False, max_len=None)

Get frequent itemsets from a one-hot DataFrame

Parameters

  • df : pandas DataFrame

    pandas DataFrame in one-hot encoded format. For example:

       Apple  Bananas  Beer  Chicken  Milk  Rice
    0      1        0     1        1     0     1
    1      1        0     1        0     0     1
    2      1        0     1        0     0     0
    3      1        1     0        0     0     0
    4      0        0     1        1     1     1
    5      0        0     1        0     1     1
    6      0        0     1        0     1     0
    7      1        1     0        0     0     0

  • min_support : float (default: 0.5)

    A float between 0 and 1 for the minimum support of the itemsets returned. The support is computed as the fraction transactions_where_item(s)_occur / total_transactions.

  • use_colnames : bool (default: False)

    If True, uses the DataFrame's column names in the returned DataFrame instead of column indices.

  • max_len : int (default: None)

    Maximum length of the itemsets generated. If None (default), all possible itemset lengths (under the apriori condition) are evaluated.

Returns

pandas DataFrame with columns ['support', 'itemsets'] containing all itemsets whose support is >= min_support and whose length is <= max_len (if max_len is not None).