One-Hot Encoding of Transaction Data

One-hot encoder class for transaction data in Python lists

from mlxtend.preprocessing import OnehotTransactions

Overview

Encodes database transaction data in the form of a Python list of lists into a one-hot encoded NumPy integer array.

Example 1

Suppose we have the following transaction data:

from mlxtend.preprocessing import OnehotTransactions

dataset = [['Apple', 'Beer', 'Rice', 'Chicken'],
           ['Apple', 'Beer', 'Rice'],
           ['Apple', 'Beer'],
           ['Apple', 'Bananas'],
           ['Milk', 'Beer', 'Rice', 'Chicken'],
           ['Milk', 'Beer', 'Rice'],
           ['Milk', 'Beer'],
           ['Apple', 'Bananas']]

Using a OnehotTransactions object, we can transform this dataset into a one-hot encoded format suitable for typical machine learning APIs. Via the fit method, the OnehotTransactions encoder learns the unique labels in the dataset, and via the transform method, it transforms the input dataset (a Python list of lists) into a one-hot encoded NumPy integer array:

oht = OnehotTransactions()
oht_ary = oht.fit(dataset).transform(dataset)
oht_ary
array([[1, 0, 1, 1, 0, 1],
       [1, 0, 1, 0, 0, 1],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 1],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 1, 0, 1, 0],
       [1, 1, 0, 0, 0, 0]])
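
To make the two steps concrete, the following is a minimal pure-Python sketch of the logic described above. It is not mlxtend's implementation: the helper names learn_columns and encode are hypothetical, and nested lists stand in for the NumPy array.

```python
# Hypothetical helpers illustrating the fit/transform idea; not mlxtend code.

def learn_columns(transactions):
    """Collect the unique items across all transactions, sorted alphabetically."""
    unique_items = set()
    for transaction in transactions:
        unique_items.update(transaction)
    return sorted(unique_items)


def encode(transactions, columns):
    """Build one row per transaction with a 1 wherever the item is present."""
    rows = []
    for transaction in transactions:
        present = set(transaction)
        rows.append([1 if item in present else 0 for item in columns])
    return rows


dataset = [['Apple', 'Beer', 'Rice', 'Chicken'],
           ['Apple', 'Beer', 'Rice'],
           ['Apple', 'Beer'],
           ['Apple', 'Bananas'],
           ['Milk', 'Beer', 'Rice', 'Chicken'],
           ['Milk', 'Beer', 'Rice'],
           ['Milk', 'Beer'],
           ['Apple', 'Bananas']]

columns = learn_columns(dataset)    # analogous to fit
encoded = encode(dataset, columns)  # analogous to transform

print(columns)
for row in encoded:
    print(row)
```

Each row has one entry per learned label, so the row length equals the number of unique items and the number of rows equals the number of transactions.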

After fitting, the unique column names that correspond to the data array shown above can be accessed via the columns_ attribute:

oht.columns_
['Apple', 'Bananas', 'Beer', 'Chicken', 'Milk', 'Rice']
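
Because the i-th name in columns_ labels the i-th column of the array, the two can be paired up directly. The sketch below is an illustration only: it copies the array and the column list from above into plain Python lists (the variable names oht_rows and item_counts are made up for this example) and counts how many transactions contain each item.

```python
# Values copied from the example above; plain lists stand in for the NumPy array.
columns = ['Apple', 'Bananas', 'Beer', 'Chicken', 'Milk', 'Rice']
oht_rows = [[1, 0, 1, 1, 0, 1],
            [1, 0, 1, 0, 0, 1],
            [1, 0, 1, 0, 0, 0],
            [1, 1, 0, 0, 0, 0],
            [0, 0, 1, 1, 1, 1],
            [0, 0, 1, 0, 1, 1],
            [0, 0, 1, 0, 1, 0],
            [1, 1, 0, 0, 0, 0]]

# zip(*oht_rows) iterates over the data column by column,
# so each column can be paired with its label and summed.
item_counts = {name: sum(column)
               for name, column in zip(columns, zip(*oht_rows))}

print(item_counts)
```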

For our convenience, we can turn the one-hot encoded array into a pandas DataFrame:

import pandas as pd

pd.DataFrame(oht_ary, columns=oht.columns_)
   Apple  Bananas  Beer  Chicken  Milk  Rice
0      1        0     1        1     0     1
1      1        0     1        0     0     1
2      1        0     1        0     0     0
3      1        1     0        0     0     0
4      0        0     1        1     1     1
5      0        0     1        0     1     1
6      0        0     1        0     1     0
7      1        1     0        0     0     0

If desired, we can turn the one-hot encoded array back into a transaction list of lists via the inverse_transform method:

first4 = oht_ary[:4]
oht.inverse_transform(first4)
[['Apple', 'Beer', 'Chicken', 'Rice'],
 ['Apple', 'Beer', 'Rice'],
 ['Apple', 'Beer'],
 ['Apple', 'Bananas']]
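
The inverse mapping can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not mlxtend's code: decode is a hypothetical helper, and nested lists stand in for the NumPy array.

```python
# Hypothetical helper illustrating the inverse mapping; not mlxtend code.

def decode(onehot_rows, columns):
    """Map each row back to the names of the columns that hold a nonzero value."""
    return [[name for name, flag in zip(columns, row) if flag]
            for row in onehot_rows]


columns = ['Apple', 'Bananas', 'Beer', 'Chicken', 'Milk', 'Rice']
first4 = [[1, 0, 1, 1, 0, 1],
          [1, 0, 1, 0, 0, 1],
          [1, 0, 1, 0, 0, 0],
          [1, 1, 0, 0, 0, 0]]

decoded = decode(first4, columns)
print(decoded)
```

Items come back in column order, which is why the first transaction reads ['Apple', 'Beer', 'Chicken', 'Rice'] rather than in its original order ['Apple', 'Beer', 'Rice', 'Chicken']. The one-hot encoding records only which items are present, so the original item order cannot be recovered from it.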

API

OnehotTransactions()

One-hot encoder class for transaction data in Python lists

Parameters

None

Attributes

  • columns_ : list

    List of unique names in the X input list of lists

Methods


fit(X)

Learn unique column names from the transaction list of lists.

Parameters

  • X : list of lists

    A Python list of lists, where the outer list stores the n transactions and the inner list stores the items in each transaction.

    For example, [['Apple', 'Beer', 'Rice', 'Chicken'], ['Apple', 'Beer', 'Rice'], ['Apple', 'Beer'], ['Apple', 'Bananas'], ['Milk', 'Beer', 'Rice', 'Chicken'], ['Milk', 'Beer', 'Rice'], ['Milk', 'Beer'], ['Apple', 'Bananas']]


fit_transform(X)

Fit a OnehotTransactions encoder and transform a dataset.
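
Example 1 chains fit(dataset).transform(dataset), which implies that fit returns the encoder itself; fit_transform bundles the two calls. The sketch below shows that relationship with a hypothetical stand-in class, SketchEncoder, written in pure Python. It is not mlxtend's implementation and returns nested lists rather than a NumPy array.

```python
class SketchEncoder:
    """Hypothetical stand-in for the encoder; not mlxtend's implementation."""

    def fit(self, X):
        # Learn the sorted unique item names; return self so calls can be chained.
        self.columns_ = sorted({item for transaction in X for item in transaction})
        return self

    def transform(self, X):
        # Encode each transaction against the learned columns.
        rows = []
        for transaction in X:
            present = set(transaction)
            rows.append([1 if item in present else 0 for item in self.columns_])
        return rows

    def fit_transform(self, X):
        # Shorthand for fitting and then transforming the same data.
        return self.fit(X).transform(X)


dataset = [['Apple', 'Beer'], ['Milk', 'Beer', 'Rice']]

chained = SketchEncoder().fit(dataset).transform(dataset)
combined = SketchEncoder().fit_transform(dataset)

print(combined)
print(chained == combined)
```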


inverse_transform(onehot)

Transforms a one-hot encoded NumPy array back into transactions.

Parameters

  • onehot : NumPy array [n_transactions, n_unique_items]

    The NumPy one-hot encoded integer array of the input transactions, where the columns represent the unique items found in the input array in alphabetical order.

    For example, array([[1, 0, 1, 1, 0, 1], [1, 0, 1, 0, 0, 1], [1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 1, 1], [0, 0, 1, 0, 1, 1], [0, 0, 1, 0, 1, 0], [1, 1, 0, 0, 0, 0]]).

    The corresponding column labels are available as self.columns_, e.g., ['Apple', 'Bananas', 'Beer', 'Chicken', 'Milk', 'Rice'].

Returns

  • X : list of lists

    A Python list of lists, where the outer list stores the n transactions and the inner list stores the items in each transaction. Items within each transaction are listed in column order.

    For example, [['Apple', 'Beer', 'Chicken', 'Rice'], ['Apple', 'Beer', 'Rice'], ['Apple', 'Beer'], ['Apple', 'Bananas'], ['Beer', 'Chicken', 'Milk', 'Rice'], ['Beer', 'Milk', 'Rice'], ['Beer', 'Milk'], ['Apple', 'Bananas']]


transform(X)

Transform transactions into a one-hot encoded NumPy array.

Parameters

  • X : list of lists

    A Python list of lists, where the outer list stores the n transactions and the inner list stores the items in each transaction.

    For example, [['Apple', 'Beer', 'Rice', 'Chicken'], ['Apple', 'Beer', 'Rice'], ['Apple', 'Beer'], ['Apple', 'Bananas'], ['Milk', 'Beer', 'Rice', 'Chicken'], ['Milk', 'Beer', 'Rice'], ['Milk', 'Beer'], ['Apple', 'Bananas']]

Returns

  • onehot : NumPy array [n_transactions, n_unique_items]

    The NumPy one-hot encoded integer array of the input transactions, where the columns represent the unique items found in the input array in alphabetical order.

    For example, array([[1, 0, 1, 1, 0, 1], [1, 0, 1, 0, 0, 1], [1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 1, 1], [0, 0, 1, 0, 1, 1], [0, 0, 1, 0, 1, 0], [1, 1, 0, 0, 0, 0]]).

    The corresponding column labels are available as self.columns_, e.g., ['Apple', 'Bananas', 'Beer', 'Chicken', 'Milk', 'Rice'].