hmine
: Frequent itemsets via the H-mine algorithm
Hmine function to extract frequent itemsets for association rule mining
from mlxtend.frequent_patterns import hmine
Overview
H-mine [1] (memory-based hyperstructure mining of frequent patterns) is a data mining algorithm used for frequent itemset mining -- the process of finding frequently occurring patterns in large transactional datasets.
H-mine is an improvement over the Apriori and FP-Growth algorithms, offering better performance in terms of time and space complexity. It achieves this by using the H-struct data structure and a more efficient search space traversal method.
H-mine improves upon the FP-Growth algorithm by introducing a novel data structure called the H-struct. The H-struct is a hybrid data structure that combines the benefits of both horizontal and vertical data layouts, making it more efficient for frequent itemset mining.
A distinct feature of this method is that it has very limited and precisely predictable space overhead and runs really fast in memory-based settings. Moreover, it can be scaled up to very large databases by database partitioning, and when the data set becomes dense, (conditional) FP-trees can be constructed dynamically as part of the mining process.
References
[1] Pei J, Han J, Lu H, Nishio S, Tang S and Yang D, "H-Mine: Fast and space-preserving frequent pattern mining in large databases." IIE Transactions, Vol. 39, pp. 593–605, 2007.
Related
Example 1 -- Generating Frequent Itemsets
The hmine
function expects data in a one-hot encoded pandas DataFrame.
Suppose we have the following transaction data:
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
We can transform it into the right format via the TransactionEncoder
as follows:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df
Apple | Corn | Dill | Eggs | Ice cream | Kidney Beans | Milk | Nutmeg | Onion | Unicorn | Yogurt | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | True | False | True | True | True | True | False | True |
1 | False | False | True | True | False | True | False | True | True | False | True |
2 | True | False | False | True | False | True | True | False | False | False | False |
3 | False | True | False | False | False | True | True | False | False | True | True |
4 | False | True | False | True | True | True | False | False | True | False | False |
Now, let us return the items and itemsets with at least 60% support:
from mlxtend.frequent_patterns import hmine
hmine(df, min_support=0.6)
support | itemsets | |
---|---|---|
0 | 0.8 | (3) |
1 | 0.8 | (3, 5) |
2 | 0.6 | (8, 3, 5) |
3 | 0.6 | (8, 3) |
4 | 1.0 | (5) |
5 | 0.6 | (5, 6) |
6 | 0.6 | (8, 5) |
7 | 0.6 | (10, 5) |
8 | 0.6 | (6) |
9 | 0.6 | (8) |
10 | 0.6 | (10) |
By default, hmine
returns the column indices of the items, which may be useful in downstream operations such as association rule mining. For better readability, we can set use_colnames=True
to convert these integer values into the respective item names:
hmine(df, min_support=0.6, use_colnames=True)
support | itemsets | |
---|---|---|
0 | 0.8 | (Eggs) |
1 | 0.8 | (Eggs, Kidney Beans) |
2 | 0.6 | (Eggs, Kidney Beans, Onion) |
3 | 0.6 | (Eggs, Onion) |
4 | 1.0 | (Kidney Beans) |
5 | 0.6 | (Milk, Kidney Beans) |
6 | 0.6 | (Kidney Beans, Onion) |
7 | 0.6 | (Yogurt, Kidney Beans) |
8 | 0.6 | (Milk) |
9 | 0.6 | (Onion) |
10 | 0.6 | (Yogurt) |
Example 2 -- H-Mine versus Apriori and FP-Growth
Since the hmine
algorithm is a memory-based algorithm, it can be magnitudes faster than the alternative Apriori algorithm for large datasets. However, it can be much slower than the FP-Growth algorithm. In the following example, we compare the performance of hmine
with the apriori
and fpgrowth
algorithms on a small dataset.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
from mlxtend.frequent_patterns import apriori
%timeit -n 100 -r 10 apriori(df, min_support=0.6, use_colnames=True)
3.41 ms ± 584 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit -n 100 -r 10 apriori(df, min_support=0.6, use_colnames=True, low_memory=True)
3.36 ms ± 404 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
from mlxtend.frequent_patterns import fpgrowth
%timeit -n 100 -r 10 fpgrowth(df, min_support=0.6, use_colnames=True)
1.18 ms ± 76.7 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
from mlxtend.frequent_patterns import hmine
%timeit -n 100 -r 10 hmine(df, min_support=0.6, use_colnames=True)
1.44 ms ± 94.8 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
Example 3 -- Working with Sparse Representations
To save memory, you may want to represent your transaction data in the sparse format. This is especially useful if you have lots of products and small transactions.
oht_ary = te.fit(dataset).transform(dataset, sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary, columns=te.columns_)
sparse_df
Apple | Corn | Dill | Eggs | Ice cream | Kidney Beans | Milk | Nutmeg | Onion | Unicorn | Yogurt | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 1 | 0 | True | 1 | 1 | 1 | 0 | 1 |
1 | 0 | 0 | 1 | 1 | 0 | True | 0 | 1 | 1 | 0 | 1 |
2 | 1 | 0 | 0 | 1 | 0 | True | 1 | 0 | 0 | 0 | 0 |
3 | 0 | 1 | 0 | 0 | 0 | True | 1 | 0 | 0 | 1 | 1 |
4 | 0 | 1 | 0 | 1 | 1 | True | 0 | 0 | 1 | 0 | 0 |
hmine(sparse_df, min_support=0.6, use_colnames=True, verbose=1)
2 itemset(s) from the suffixes on item(s) (Eggs)
1 itemset(s) from the suffixes on item(s) (Eggs, Kidney Beans)
0 itemset(s) from the suffixes on item(s) (Eggs, Kidney Beans, Onion)
0 itemset(s) from the suffixes on item(s) (Eggs, Onion)
3 itemset(s) from the suffixes on item(s) (Kidney Beans)
0 itemset(s) from the suffixes on item(s) (Kidney Beans, Milk)
0 itemset(s) from the suffixes on item(s) (Kidney Beans, Onion)
0 itemset(s) from the suffixes on item(s) (Kidney Beans, Yogurt)
0 itemset(s) from the suffixes on item(s) (Milk)
0 itemset(s) from the suffixes on item(s) (Onion)
0 itemset(s) from the suffixes on item(s) (Yogurt)
support | itemsets | |
---|---|---|
0 | 0.8 | (Eggs) |
1 | 0.8 | (Eggs, Kidney Beans) |
2 | 0.6 | (Eggs, Kidney Beans, Onion) |
3 | 0.6 | (Eggs, Onion) |
4 | 1.0 | (Kidney Beans) |
5 | 0.6 | (Milk, Kidney Beans) |
6 | 0.6 | (Kidney Beans, Onion) |
7 | 0.6 | (Yogurt, Kidney Beans) |
8 | 0.6 | (Milk) |
9 | 0.6 | (Onion) |
10 | 0.6 | (Yogurt) |
More Examples
Please note that since the hmine
function is a drop-in replacement for apriori
and fpgrowth
, it comes with the same set of function arguments and return arguments. Thus, for more examples, please see the apriori
documentation.
API
hmine(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0) -> pandas.core.frame.DataFrame
Get frequent itemsets from a one-hot DataFrame
Parameters
-
df
: pandas DataFramepandas DataFrame the encoded format. Also supports DataFrames with sparse data; for more info, please see https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#sparse-data-structures.
Please note that the old pandas SparseDataFrame format is no longer supported in mlxtend >= 0.17.2.
The allowed values are either 0/1 or True/False. For example,
Apple Bananas Beer Chicken Milk Rice 0 True False True True False True 1 True False True False False True 2 True False True False False False 3 True True False False False False 4 False False True True True True 5 False False True False True True 6 False False True False True False 7 True True False False False False
-
min_support
: float (default: 0.5)A float between 0 and 1 for minimum support of the itemsets returned. The support is computed as the fraction transactions_where_item(s)_occur / total_transactions.
-
use_colnames
: bool (default: False)If true, uses the DataFrames' column names in the returned DataFrame instead of column indices.
-
max_len
: int (default: None)Maximum length of the itemsets generated. If
None
(default) all possible itemsets lengths are evaluated. -
verbose
: int (default: 0)Shows the stages of conditional tree generation.
Returns
pandas DataFrame with columns ['support', 'itemsets'] of all itemsets
that are >= min_support
and < than max_len
(if max_len
is not None).
Each itemset in the 'itemsets' column is of type frozenset
,
which is a Python built-in type that behaves similarly to
sets except that it is immutable
(For more info, see
https://docs.python.org/3.6/library/stdtypes.html#frozenset).
Examples
For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/hmine/
ython