ColumnSelector

Implementation of a column selector class for scikit-learn pipelines.

from mlxtend.feature_selection import ColumnSelector

Overview

The ColumnSelector can be used for "manual" feature selection, e.g., as part of a grid search via a scikit-learn pipeline.
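
To make the idea concrete, the following is a rough, hypothetical sketch of what such a column-selecting transformer boils down to. It is not mlxtend's actual implementation (which offers more conveniences), but it illustrates why the class plugs naturally into scikit-learn pipelines:

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np


class SimpleColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical minimal column selector, for illustration only;
    use mlxtend.feature_selection.ColumnSelector in practice."""

    def __init__(self, cols=None):
        self.cols = cols

    def fit(self, X, y=None):
        # Nothing is learned from the data; the selection is fixed by `cols`.
        return self

    def transform(self, X, y=None):
        # Return all columns if `cols` is None, otherwise the requested subset.
        X = np.asarray(X)
        return X if self.cols is None else X[:, list(self.cols)]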

Example 1 - Fitting an Estimator on a Feature Subset

Load a simple benchmark dataset:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

The ColumnSelector is a simple transformer class that selects specific columns (features) from a dataset. For instance, calling the transform method returns a reduced dataset that contains only two features (here, the first two features, selected via the indices 0 and 1, respectively):

from mlxtend.feature_selection import ColumnSelector

col_selector = ColumnSelector(cols=(0, 1))
# col_selector.fit(X) # optional, does not do anything
col_selector.transform(X).shape
(150, 2)
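
Equivalently, fit and transform can be combined into a single fit_transform call (see the fit_transform method in the API section below); the result should be the same two-column slice:

col_selector.fit_transform(X).shape
(150, 2)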

Similarly, we can use the ColumnSelector as part of a scikit-learn Pipeline:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline


pipe = make_pipeline(StandardScaler(),
                     ColumnSelector(cols=(0, 1)),
                     KNeighborsClassifier())

pipe.fit(X, y)
pipe.score(X, y)
0.83999999999999997
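
Note that the score above is computed on the same data the pipeline was fit on. For a less optimistic estimate, the whole pipeline can also be evaluated with cross-validation, e.g., via scikit-learn's cross_val_score (a short sketch):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of the scaler + selector + kNN pipeline
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())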

Example 2 - Feature Selection via GridSearch

Example 1 showed a simple usage example of the ColumnSelector; however, selecting columns from a dataset is trivial and does not strictly require a dedicated transformer class, since we could have achieved the same result via

classifier = KNeighborsClassifier()  # e.g., the kNN estimator used above
classifier.fit(X[:, :2], y)
classifier.score(X[:, :2], y)

The ColumnSelector becomes genuinely useful when feature selection is part of a grid search, as shown in this example.

Load a simple benchmark dataset:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

Create all possible combinations:

from itertools import combinations

all_comb = []
for size in range(1, 5):
    all_comb += list(combinations(range(X.shape[1]), r=size))
print(all_comb)
[(0,), (1,), (2,), (3,), (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3), (0, 1, 2, 3)]
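
As a quick sanity check, the number of candidate subsets matches the 2^4 - 1 = 15 non-empty subsets of the four iris features:

print(len(all_comb))
15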

Feature and model selection via grid search:

from mlxtend.feature_selection import ColumnSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(),
                     ColumnSelector(),
                     KNeighborsClassifier())

param_grid = {'columnselector__cols': all_comb,
              'kneighborsclassifier__n_neighbors': list(range(1, 11))}

grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
grid.fit(X, y)
print('Best parameters:', grid.best_params_)
print('Best performance:', grid.best_score_)
Best parameters: {'columnselector__cols': (2, 3), 'kneighborsclassifier__n_neighbors': 1}
Best performance: 0.98
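
Because GridSearchCV refits the best parameter combination on the full dataset by default (refit=True), the fitted search object can be used directly. The sketch below assumes the grid fitted above and inspects the selected columns of the best pipeline before predicting on a few samples:

best_pipe = grid.best_estimator_
print(best_pipe.named_steps['columnselector'].cols)  # selected feature indices
print(best_pipe.predict(X[:5]))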

API

ColumnSelector(cols=None)

Object for selecting specific columns from a data set.

Notes

All estimators should specify all the parameters that can be set at the class level in their __init__ as explicit keyword arguments (no *args or **kwargs).

Methods


fit(X, y=None)

Mock method. Does nothing.

Parameters

  • X : {array-like, sparse matrix}, shape = [n_samples, n_features]

    Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y : array-like, shape = [n_samples] (default: None)

Returns

self


fit_transform(X, y=None)

Return a slice of the input array.

Parameters

  • X : {array-like, sparse matrix}, shape = [n_samples, n_features]

    Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y : array-like, shape = [n_samples] (default: None)

Returns

  • X_slice : shape = [n_samples, k_features]

    Subset of the feature space where k_features <= n_features


get_params(deep=True)

Get parameters for this estimator.

Parameters

  • deep : boolean, optional

    If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

  • params : mapping of string to any

    Parameter names mapped to their values.
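
For instance, calling get_params on the selector instantiated in Example 1 returns its constructor parameters (the exact dictionary depends on the installed version's signature):

col_selector.get_params()  # e.g. {'cols': (0, 1)} for the signature shown above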


set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Returns

self
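
To illustrate the <component>__<parameter> convention, the ColumnSelector step inside the pipeline from Example 2 can be updated through the pipeline's own set_params (make_pipeline names the step 'columnselector'):

pipe.set_params(columnselector__cols=(2, 3))
pipe.get_params()['columnselector__cols']
(2, 3)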


transform(X, y=None)

Return a slice of the input array.

Parameters

  • X : {array-like, sparse matrix}, shape = [n_samples, n_features]

    Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y : array-like, shape = [n_samples] (default: None)

Returns

  • X_slice : shape = [n_samples, k_features]

    Subset of the feature space where k_features <= n_features