fpgrowth: Frequent itemsets via the FP-growth algorithm
Function implementing FP-Growth to extract frequent itemsets for association rule mining
from mlxtend.frequent_patterns import fpgrowth
Overview
FP-Growth [1] is an algorithm for extracting frequent itemsets with applications in association rule learning that emerged as a popular alternative to the established Apriori algorithm [2].
In general, the algorithm has been designed to operate on databases containing transactions, such as purchases by customers of a store. An itemset is considered as "frequent" if it meets a user-specified support threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is defined as a set of items that occur together in at least 50% of all transactions in the database.
In particular, and what makes it different from the Apriori frequent pattern mining algorithm, FP-Growth is an frequent pattern mining algorithm that does not require candidate generation. Internally, it uses a so-called FP-tree (frequent pattern tree) datastrucure without generating the candidate sets explicitly, which makes it particularly attractive for large datasets.
A new feature is implemented in this algorithm, which is the sub-case when the input contains missing information [3]. The same structure and logic of the algorithm is kept, while "ignoring" the missing values in the data. That gives a more realistic indication of the frequency of existence in the items/itemsets that are generated from the algorithm. The support is computed differently where for a single item, the cardinality of null values is deducted from the cardinality of all transactions in the database. For the case of an itemset, of more than one elements, the cardinality of null values in at least one item in them itemset is deducted from the cardinality of all transactions in the database.
References
[1] Han, Jiawei, Jian Pei, Yiwen Yin, and Runying Mao. "Mining frequent patterns without candidate generation. "A frequent-pattern tree approach." Data mining and knowledge discovery 8, no. 1 (2004): 53-87.
[2] Agrawal, Rakesh, and Ramakrishnan Srikant. "Fast algorithms for mining association rules." Proc. 20th int. conf. very large data bases, VLDB. Vol. 1215. 1994.
[3] Ragel, A. and Crémilleux, B., 1998. "Treatment of missing values for association rules". In Research and Development in Knowledge Discovery and Data Mining: Second Pacific-Asia Conference, PAKDD-98 Melbourne, Australia, April 15–17, 1998 Proceedings 2 (pp. 258-270). Springer Berlin Heidelberg.
Related
Example 1 -- Generating Frequent Itemsets
The fpgrowth function expects data in a one-hot encoded pandas DataFrame.
Suppose we have the following transaction data:
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
We can transform it into the right format via the TransactionEncoder as follows:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df
| Apple | Corn | Dill | Eggs | Ice cream | Kidney Beans | Milk | Nutmeg | Onion | Unicorn | Yogurt | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | True | False | True | True | True | True | False | True |
| 1 | False | False | True | True | False | True | False | True | True | False | True |
| 2 | True | False | False | True | False | True | True | False | False | False | False |
| 3 | False | True | False | False | False | True | True | False | False | True | True |
| 4 | False | True | False | True | True | True | False | False | True | False | False |
Now, let us return the items and itemsets with at least 60% support:
from mlxtend.frequent_patterns import fpgrowth
fpgrowth(df, min_support=0.6)
| support | itemsets | |
|---|---|---|
| 0 | 1.0 | (5) |
| 1 | 0.8 | (3) |
| 2 | 0.6 | (10) |
| 3 | 0.6 | (8) |
| 4 | 0.6 | (6) |
| 5 | 0.8 | (3, 5) |
| 6 | 0.6 | (10, 5) |
| 7 | 0.6 | (8, 3) |
| 8 | 0.6 | (8, 5) |
| 9 | 0.6 | (8, 3, 5) |
| 10 | 0.6 | (5, 6) |
By default, fpgrowth returns the column indices of the items, which may be useful in downstream operations such as association rule mining. For better readability, we can set use_colnames=True to convert these integer values into the respective item names:
fpgrowth(df, min_support=0.6, use_colnames=True)
| support | itemsets | |
|---|---|---|
| 0 | 1.0 | (Kidney Beans) |
| 1 | 0.8 | (Eggs) |
| 2 | 0.6 | (Yogurt) |
| 3 | 0.6 | (Onion) |
| 4 | 0.6 | (Milk) |
| 5 | 0.8 | (Eggs, Kidney Beans) |
| 6 | 0.6 | (Yogurt, Kidney Beans) |
| 7 | 0.6 | (Eggs, Onion) |
| 8 | 0.6 | (Onion, Kidney Beans) |
| 9 | 0.6 | (Eggs, Onion, Kidney Beans) |
| 10 | 0.6 | (Milk, Kidney Beans) |
The example below implements the algorithm when there is missing information from the data, by arbitrarily removing datapoints from the original dataset.
import numpy as np
from mlxtend.frequent_patterns import fpgrowth
rows, columns = df.shape
idx = np.random.randint(0, rows, 10)
col = np.random.randint(0, columns, 10)
for i in range(10):
df.iloc[idx[i], col[i]] = np.nan
df
C:\Users\User\AppData\Local\Temp\ipykernel_1940\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.
df.iloc[idx[i], col[i]] = np.nan
C:\Users\User\AppData\Local\Temp\ipykernel_1940\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.
df.iloc[idx[i], col[i]] = np.nan
C:\Users\User\AppData\Local\Temp\ipykernel_1940\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.
df.iloc[idx[i], col[i]] = np.nan
C:\Users\User\AppData\Local\Temp\ipykernel_1940\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.
df.iloc[idx[i], col[i]] = np.nan
C:\Users\User\AppData\Local\Temp\ipykernel_1940\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.
df.iloc[idx[i], col[i]] = np.nan
C:\Users\User\AppData\Local\Temp\ipykernel_1940\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.
df.iloc[idx[i], col[i]] = np.nan
C:\Users\User\AppData\Local\Temp\ipykernel_1940\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.
df.iloc[idx[i], col[i]] = np.nan
| Apple | Corn | Dill | Eggs | Ice cream | Kidney Beans | Milk | Nutmeg | Onion | Unicorn | Yogurt | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | True | False | True | True | True | True | NaN | NaN |
| 1 | False | NaN | True | True | False | True | False | True | True | False | True |
| 2 | True | False | False | True | False | True | True | False | False | False | False |
| 3 | False | True | False | False | NaN | NaN | True | NaN | False | NaN | True |
| 4 | False | True | False | True | NaN | True | False | False | NaN | False | False |
The same function as above is applied by setting null_values=True with at least 60% support:
fpgrowth(df, min_support=0.6, null_values = True, use_colnames=True)
| support | itemsets | |
|---|---|---|
| 0 | 1.0 | (Kidney Beans) |
| 1 | 0.8 | (Eggs) |
| 2 | 0.6 | (Milk) |
| 3 | 1.0 | (Eggs, Kidney Beans) |
Example 2 -- Apriori versus FPGrowth
Since FP-Growth doesn't require creating candidate sets explicitly, it can be magnitudes faster than the alternative Apriori algorithm. For instance, the following cells compare the performance of the Apriori algorithm to the performance of FP-Growth -- even in this very simple toy dataset scenario, FP-Growth is about 5 times faster.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
from mlxtend.frequent_patterns import apriori
%timeit -n 100 -r 10 apriori(df, min_support=0.6)
850 µs ± 39.3 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit -n 100 -r 10 apriori(df, min_support=0.6, low_memory=True)
941 µs ± 30.6 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
from mlxtend.frequent_patterns import fpgrowth
%timeit -n 100 -r 10 fpgrowth(df, min_support=0.6)
320 µs ± 9.21 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
More Examples
Please note that since the fpgrowth function is a drop-in replacement for apriori, it comes with the same set of function arguments and return arguments. Thus, for more examples, please see the apriori documentation.
API
fpgrowth(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0)
Get frequent itemsets from a one-hot DataFrame
Parameters
-
df: pandas DataFramepandas DataFrame the encoded format. Also supports DataFrames with sparse data; for more info, please see https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#sparse-data-structures.
Please note that the old pandas SparseDataFrame format is no longer supported in mlxtend >= 0.17.2.
The allowed values are either 0/1 or True/False. For example,
Apple Bananas Beer Chicken Milk Rice 0 True False True True False True 1 True False True False False True 2 True False True False False False 3 True True False False False False 4 False False True True True True 5 False False True False True True 6 False False True False True False 7 True True False False False False -
min_support: float (default: 0.5)A float between 0 and 1 for minimum support of the itemsets returned. The support is computed as the fraction transactions_where_item(s)_occur / total_transactions.
-
use_colnames: bool (default: False)If true, uses the DataFrames' column names in the returned DataFrame instead of column indices.
-
max_len: int (default: None)Maximum length of the itemsets generated. If
None(default) all possible itemsets lengths are evaluated. -
verbose: int (default: 0)Shows the stages of conditional tree generation.
Returns
pandas DataFrame with columns ['support', 'itemsets'] of all itemsets
that are >= min_support and < than max_len
(if max_len is not None).
Each itemset in the 'itemsets' column is of type frozenset,
which is a Python built-in type that behaves similarly to
sets except that it is immutable
(For more info, see
https://docs.python.org/3.6/library/stdtypes.html#frozenset).
Examples
For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/fpgrowth/