fpmax: Maximal itemsets via the FP-Max algorithm
Function implementing FP-Max to extract maximal itemsets for association rule mining
from mlxtend.frequent_patterns import fpmax
Overview
The Apriori algorithm is among the first and most popular algorithms for frequent itemset generation (frequent itemsets are then used for association rule mining). However, the runtime of Apriori can be quite large, especially for datasets with a large number of unique items, as the runtime grows exponentially depending on the number of unique items.
In contrast to Apriori, FP-Growth is a frequent pattern generation algorithm that inserts items into a pattern search tree, which allows it to have a linear increase in runtime with respect to the number of unique items or entries.
FP-Max is a variant of FP-Growth, which focuses on obtaining maximal itemsets. An itemset X is said to maximal if X is frequent and there exists no frequent super-pattern containing X. In other words, a frequent pattern X cannot be sub-pattern of larger frequent pattern to qualify for the definition maximal itemset.
References
- [1] Grahne, G., & Zhu, J. (2003, November). Efficiently using prefix-trees in mining frequent itemsets. In FIMI (Vol. 90).
Related
Example 1 -- Maximal Itemsets
The fpmax
function expects data in a one-hot encoded pandas DataFrame.
Suppose we have the following transaction data:
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
We can transform it into the right format via the TransactionEncoder
as follows:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df
Apple | Corn | Dill | Eggs | Ice cream | Kidney Beans | Milk | Nutmeg | Onion | Unicorn | Yogurt | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | True | False | True | True | True | True | False | True |
1 | False | False | True | True | False | True | False | True | True | False | True |
2 | True | False | False | True | False | True | True | False | False | False | False |
3 | False | True | False | False | False | True | True | False | False | True | True |
4 | False | True | False | True | True | True | False | False | True | False | False |
Now, let us return the items and itemsets with at least 60% support:
from mlxtend.frequent_patterns import fpmax
fpmax(df, min_support=0.6)
support | itemsets | |
---|---|---|
0 | 0.6 | (5, 6) |
1 | 0.6 | (8, 3, 5) |
2 | 0.6 | (10, 5) |
By default, fpmax
returns the column indices of the items, which may be useful in downstream operations such as association rule mining. For better readability, we can set use_colnames=True
to convert these integer values into the respective item names:
fpmax(df, min_support=0.6, use_colnames=True)
support | itemsets | |
---|---|---|
0 | 0.6 | (Kidney Beans, Milk) |
1 | 0.6 | (Onion, Eggs, Kidney Beans) |
2 | 0.6 | (Kidney Beans, Yogurt) |
More Examples
Please note that since the fpmax
function is a drop-in replacement for fpgrowth
and apriori
, it comes with the same set of function arguments and return arguments. Thus, for more examples, please see the apriori
documentation.
API
fpmax(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0)
Get maximal frequent itemsets from a one-hot DataFrame
Parameters
-
df
: pandas DataFramepandas DataFrame the encoded format. Also supports DataFrames with sparse data; for more info, please see (https://pandas.pydata.org/pandas-docs/stable/ user_guide/sparse.html#sparse-data-structures)
Please note that the old pandas SparseDataFrame format is no longer supported in mlxtend >= 0.17.2.
The allowed values are either 0/1 or True/False. For example,
Apple Bananas Beer Chicken Milk Rice
0 True False True True False True
1 True False True False False True
2 True False True False False False
3 True True False False False False
4 False False True True True True
5 False False True False True True
6 False False True False True False
7 True True False False False False
-
min_support
: float (default: 0.5)A float between 0 and 1 for minimum support of the itemsets returned. The support is computed as the fraction transactions_where_item(s)_occur / total_transactions.
-
use_colnames
: bool (default: False)If true, uses the DataFrames' column names in the returned DataFrame instead of column indices.
-
max_len
: int (default: None)Given the set of all maximal itemsets, return those that are less than
max_len
. IfNone
(default) all possible itemsets lengths are evaluated. -
verbose
: int (default: 0)Shows the stages of conditional tree generation.
Returns
pandas DataFrame with columns ['support', 'itemsets'] of all maximal
itemsets that are >= min_support
and < than max_len
(if max_len
is not None).
Each itemset in the 'itemsets' column is of type frozenset
,
which is a Python built-in type that behaves similarly to
sets except that it is immutable
(For more info, see
https://docs.python.org/3.6/library/stdtypes.html#frozenset).
Examples
For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/fpmax/