Fpmax - mlxtend

fpmax: Maximal itemsets via the FP-Max algorithm

Function implementing FP-Max to extract maximal itemsets for association rule mining

from mlxtend.frequent_patterns import fpmax

Overview

The Apriori algorithm is among the first and most popular algorithms for frequent itemset generation (frequent itemsets are then used for association rule mining). However, the runtime of Apriori can be quite large, especially for datasets with a large number of unique items, as the runtime grows exponentially depending on the number of unique items.

In contrast to Apriori, FP-Growth is a frequent pattern generation algorithm that inserts items into a pattern search tree, which allows it to have a linear increase in runtime with respect to the number of unique items or entries.

FP-Max is a variant of FP-Growth, which focuses on obtaining maximal itemsets. An itemset X is said to maximal if X is frequent and there exists no frequent super-pattern containing X. In other words, a frequent pattern X cannot be sub-pattern of larger frequent pattern to qualify for the definition maximal itemset.

Same as in FP-Growth, a new feature is implemented in this algorithm, which is the sub-case when the input contains missing information [2]. The same structure and logic of the algorithm is kept, while "ignoring" the missing values in the data. That gives a more realistic indication of the frequency of existence in the items/itemsets that are generated from the algorithm. The support is computed differently where for a single item, the cardinality of null values is deducted from the cardinality of all transactions in the database. For the case of an itemset, of more than one elements, the cardinality of null values in at least one item in them itemset is deducted from the cardinality of all transactions in the database.

References

[1] Grahne, G., & Zhu, J. (2003, November). Efficiently using prefix-trees in mining frequent itemsets. In FIMI (Vol. 90).
[2] Ragel, A. and Crémilleux, B., 1998. "Treatment of missing values for association rules". In Research and Development in Knowledge Discovery and Data Mining: Second Pacific-Asia Conference, PAKDD-98 Melbourne, Australia, April 15–17, 1998 Proceedings 2 (pp. 258-270). Springer Berlin Heidelberg.

Example 1 -- Maximal Itemsets

The fpmax function expects data in a one-hot encoded pandas DataFrame. Suppose we have the following transaction data:

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

We can transform it into the right format via the TransactionEncoder as follows:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df

	Apple	Corn	Dill	Eggs	Ice cream	Kidney Beans	Milk	Nutmeg	Onion	Unicorn	Yogurt
0	False	False	False	True	False	True	True	True	True	False	True
1	False	False	True	True	False	True	False	True	True	False	True
2	True	False	False	True	False	True	True	False	False	False	False
3	False	True	False	False	False	True	True	False	False	True	True
4	False	True	False	True	True	True	False	False	True	False	False

Now, let us return the items and itemsets with at least 60% support:

from mlxtend.frequent_patterns import fpmax

fpmax(df, min_support=0.6)

	support	itemsets
0	0.6	(5, 6)
1	0.6	(8, 3, 5)
2	0.6	(10, 5)

By default, fpmax returns the column indices of the items, which may be useful in downstream operations such as association rule mining. For better readability, we can set use_colnames=True to convert these integer values into the respective item names:

fpmax(df, min_support=0.6, use_colnames=True)

	support	itemsets
0	0.6	(Kidney Beans, Milk)
1	0.6	(Onion, Eggs, Kidney Beans)
2	0.6	(Kidney Beans, Yogurt)

The example below implements the algorithm when there is missing information from the data, by arbitrarily removing datapoints from the original dataset.

import numpy as np
from mlxtend.frequent_patterns import fpmax

rows, columns = df.shape
idx = np.random.randint(0, rows, 10)
col = np.random.randint(0, columns, 10)

for i in range(10):
    df.iloc[idx[i], col[i]] = np.nan

df

C:\Users\User\AppData\Local\Temp\ipykernel_2788\2505244757.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.
  df.iloc[idx[i], col[i]] = np.nan
C:\Users\User\AppData\Local\Temp\ipykernel_2788\2505244757.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.
  df.iloc[idx[i], col[i]] = np.nan
C:\Users\User\AppData\Local\Temp\ipykernel_2788\2505244757.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.
  df.iloc[idx[i], col[i]] = np.nan
C:\Users\User\AppData\Local\Temp\ipykernel_2788\2505244757.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.
  df.iloc[idx[i], col[i]] = np.nan
C:\Users\User\AppData\Local\Temp\ipykernel_2788\2505244757.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.
  df.iloc[idx[i], col[i]] = np.nan
C:\Users\User\AppData\Local\Temp\ipykernel_2788\2505244757.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.
  df.iloc[idx[i], col[i]] = np.nan
C:\Users\User\AppData\Local\Temp\ipykernel_2788\2505244757.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.
  df.iloc[idx[i], col[i]] = np.nan

	Apple	Corn	Dill	Eggs	Ice cream	Kidney Beans	Milk	Nutmeg	Onion	Unicorn	Yogurt
0	False	False	False	True	False	True	NaN	True	True	False	True
1	False	NaN	True	True	NaN	True	False	True	True	False	NaN
2	NaN	False	False	True	False	True	NaN	False	NaN	NaN	False
3	False	True	False	False	False	True	True	False	False	True	True
4	False	True	False	True	True	True	False	False	NaN	False	False

The same function as above is applied by setting null_values=True with at least 60% support:

fpmax(df, min_support=0.6, null_values = True, use_colnames=True)

	support	itemsets
0	0.666667	(Eggs, Kidney Beans, Onion)

More Examples

Please note that since the fpmax function is a drop-in replacement for fpgrowth and apriori, it comes with the same set of function arguments and return arguments. Thus, for more examples, please see the apriori documentation.

API

fpmax(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0)

Get maximal frequent itemsets from a one-hot DataFrame

Parameters

df : pandas DataFrame

pandas DataFrame the encoded format. Also supports DataFrames with sparse data; for more info, please see (https://pandas.pydata.org/pandas-docs/stable/ user_guide/sparse.html#sparse-data-structures)

Please note that the old pandas SparseDataFrame format is no longer supported in mlxtend >= 0.17.2.

The allowed values are either 0/1 or True/False. For example,

    Apple  Bananas   Beer  Chicken   Milk   Rice
    0   True    False   True     True  False   True
    1   True    False   True    False  False   True
    2   True    False   True    False  False  False
    3   True     True  False    False  False  False
    4  False    False   True     True   True   True
    5  False    False   True    False   True   True
    6  False    False   True    False   True  False
    7   True     True  False    False  False  False

min_support : float (default: 0.5)

A float between 0 and 1 for minimum support of the itemsets returned. The support is computed as the fraction transactions_where_item(s)_occur / total_transactions.
use_colnames : bool (default: False)

If true, uses the DataFrames' column names in the returned DataFrame instead of column indices.
max_len : int (default: None)

Given the set of all maximal itemsets, return those that are less than max_len. If None (default) all possible itemsets lengths are evaluated.
verbose : int (default: 0)

Shows the stages of conditional tree generation.

Returns

pandas DataFrame with columns ['support', 'itemsets'] of all maximal itemsets that are >= min_support and < than max_len (if max_len is not None). Each itemset in the 'itemsets' column is of type frozenset, which is a Python built-in type that behaves similarly to sets except that it is immutable (For more info, see https://docs.python.org/3.6/library/stdtypes.html#frozenset).

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/fpmax/

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search

fpmax: Maximal itemsets via the FP-Max algorithm

Overview

References

Related

Example 1 -- Maximal Itemsets

More Examples

API