ColumnSelector

Implementation of a column selector class for scikit-learn pipelines.

from mlxtend.feature_selection import ColumnSelector

Overview

The ColumnSelector can be used for "manual" feature selection, e.g., as part of a grid search via a scikit-learn pipeline.

Example 1 - Fitting an Estimator on a Feature Subset

Load a simple benchmark dataset:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

The ColumnSelector is a simple transformer class that selects specific columns (features) from a dataset. For instance, calling the transform method returns a reduced dataset that contains only two features (here, the first two features, selected via the indices 0 and 1):

from mlxtend.feature_selection import ColumnSelector

col_selector = ColumnSelector(cols=(0, 1))
# col_selector.fit(X) # optional, does not do anything
col_selector.transform(X).shape
(150, 2)

ColumnSelector works with both NumPy arrays and pandas DataFrames. For a DataFrame, columns can be selected by name:

import pandas as pd

iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
col_selector = ColumnSelector(cols=("sepal length (cm)", "sepal width (cm)"))
col_selector.transform(iris_df).shape
(150, 2)
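
Conceptually, the selection shown above amounts to plain column indexing. The following is a minimal sketch of that idea in NumPy, not mlxtend's actual implementation: the class name MiniColumnSelector is hypothetical, and the handling of drop_axis (flattening the result when a single column is selected) is an assumption based on the parameter name in the API signature.

```python
import numpy as np


class MiniColumnSelector:
    """Hypothetical stand-in that mimics column selection on NumPy arrays."""

    def __init__(self, cols=None, drop_axis=False):
        self.cols = cols
        self.drop_axis = drop_axis

    def fit(self, X, y=None):
        # Nothing is learned from the data; fit exists for pipeline compatibility.
        return self

    def transform(self, X, y=None):
        X = np.asarray(X)
        if self.cols is None:
            return X  # assumption: no selection means all columns are kept
        selected = X[:, list(self.cols)]
        if self.drop_axis and selected.shape[1] == 1:
            # Assumed behavior: a single selected column becomes a 1-D array.
            selected = selected.reshape(-1)
        return selected


X_demo = np.arange(12).reshape(4, 3)  # 4 samples, 3 features
two_cols = MiniColumnSelector(cols=(0, 1)).transform(X_demo)
one_col_2d = MiniColumnSelector(cols=(2,)).transform(X_demo)
one_col_1d = MiniColumnSelector(cols=(2,), drop_axis=True).transform(X_demo)
print(two_cols.shape, one_col_2d.shape, one_col_1d.shape)
```

Keeping a single selected column two-dimensional by default matters in a pipeline, because scikit-learn estimators generally expect inputs of shape (n_samples, n_features).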

Similarly, we can use the ColumnSelector as part of a scikit-learn Pipeline:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline


pipe = make_pipeline(StandardScaler(),
                     ColumnSelector(cols=(0, 1)),
                     KNeighborsClassifier())

pipe.fit(X, y)
pipe.score(X, y)
0.84

Example 2 - Feature Selection via GridSearch

Example 1 showed a simple usage example of the ColumnSelector. Selecting columns from a dataset is trivial, though, and does not require a dedicated transformer class, since slicing the array directly achieves the same result:

# classifier can be any estimator, e.g., the pipeline from Example 1 without the ColumnSelector
classifier = make_pipeline(StandardScaler(), KNeighborsClassifier())
classifier.fit(X[:, :2], y)
classifier.score(X[:, :2], y)

The ColumnSelector becomes really useful, however, when feature selection is part of a grid search, as shown in this example.

Load a simple benchmark dataset:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

Create all possible combinations of feature indices, i.e., every non-empty subset of the four columns:

from itertools import combinations

all_comb = []
for size in range(1, X.shape[1] + 1):
    all_comb += list(combinations(range(X.shape[1]), r=size))
print(all_comb)
[(0,), (1,), (2,), (3,), (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3), (0, 1, 2, 3)]

Feature and model selection via grid search:

from mlxtend.feature_selection import ColumnSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(),
                     ColumnSelector(),
                     KNeighborsClassifier())

param_grid = {'columnselector__cols': all_comb,
              'kneighborsclassifier__n_neighbors': list(range(1, 11))}

grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
grid.fit(X, y)
print('Best parameters:', grid.best_params_)
print('Best performance:', grid.best_score_)
Best parameters: {'columnselector__cols': (2, 3), 'kneighborsclassifier__n_neighbors': 1}
Best performance: 0.98
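
The cost of such an exhaustive search grows quickly with the number of features, since the number of non-empty feature subsets is 2^n - 1. The following sketch counts the candidates and cross-validation fits for the grid above using only the standard library; the variable names are illustrative and not part of mlxtend or scikit-learn.

```python
from itertools import combinations

n_features = 4                         # iris has four feature columns
neighbor_values = list(range(1, 11))   # same n_neighbors grid as above
cv_folds = 5                           # matches cv=5 above

# Every non-empty subset of the feature indices.
subsets = [c for size in range(1, n_features + 1)
           for c in combinations(range(n_features), r=size)]

n_candidates = len(subsets) * len(neighbor_values)
n_fits = n_candidates * cv_folds  # excludes the final refit on the full data

print(len(subsets), n_candidates, n_fits)
```

For four features this gives 15 subsets, 150 parameter combinations, and 750 cross-validation fits. With 20 features there would already be more than a million subsets, so an exhaustive search of this kind is practical only for small feature sets.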

Example 3 - Scaling of a Subset of Features in a scikit-learn Pipeline

The following example illustrates how the ColumnSelector can be combined with scikit-learn's FeatureUnion to scale only certain features of a dataset within a Pipeline (in this toy example, only the first and second features).

from mlxtend.feature_selection import ColumnSelector
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.data import iris_data


X, y = iris_data()

scale_pipe = make_pipeline(ColumnSelector(cols=(0, 1)),
                           MinMaxScaler())

pipeline = Pipeline([
    ('feats', FeatureUnion([
        ('col_1-2', scale_pipe),
        ('col_3-4', ColumnSelector(cols=(2, 3)))
    ])),
    ('clf', KNeighborsClassifier())
])


pipeline.fit(X, y)
Pipeline(memory=None,
     steps=[('feats', FeatureUnion(n_jobs=None,
       transformer_list=[('col_1-2', Pipeline(memory=None,
     steps=[('columnselector', ColumnSelector(cols=(0, 1), drop_axis=False)), ('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1)))])), ('col_3-4', ColumnSelector(cols=(2, 3), drop_axis=Fa...ki',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'))])
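
To make the data flow through the 'feats' step concrete, the following is a rough NumPy-only sketch of what the two branches compute before the classifier sees the data. It assumes MinMaxScaler's default feature range of (0, 1) and uses a small hand-written array of iris-like rows (X_demo) in place of the full dataset; it does not call mlxtend or scikit-learn.

```python
import numpy as np

# A few iris-like rows standing in for the full feature matrix (4 columns).
X_demo = np.array([[5.1, 3.5, 1.4, 0.2],
                   [4.9, 3.0, 1.4, 0.2],
                   [6.3, 3.3, 6.0, 2.5],
                   [5.8, 2.7, 5.1, 1.9]])

left = X_demo[:, [0, 1]]   # what the 'col_1-2' branch selects
right = X_demo[:, [2, 3]]  # what the 'col_3-4' branch selects

# Min-max scaling to [0, 1], applied column by column to the first branch only.
col_min = left.min(axis=0)
col_max = left.max(axis=0)
left_scaled = (left - col_min) / (col_max - col_min)

# FeatureUnion concatenates the branch outputs side by side, in list order.
combined = np.hstack([left_scaled, right])
print(combined.shape)
```

The first two columns of the result lie in [0, 1], while the last two columns are passed through unchanged.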

API

ColumnSelector(cols=None, drop_axis=False)

Object for selecting specific columns from a data set.

Parameters

Examples

For usage examples, please see http://rasbt.github.io/mlxtend/user_guide/feature_selection/ColumnSelector/

Methods


fit(X, y=None)

Mock method. Does nothing.

Parameters

Returns

self


fit_transform(X, y=None)

Return a slice of the input array.

Parameters

Returns


get_params(deep=True)

Get parameters for this estimator.

Parameters

Returns


set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Returns

self
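
The <component>__<parameter> naming is also what the param_grid keys in Example 2 rely on. The short sketch below shows the convention on a two-step scikit-learn pipeline; it leaves out the ColumnSelector so that it depends only on scikit-learn.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline derives each step name from the lowercased class name.
step_names = list(pipe.named_steps)
print(step_names)

# Update a parameter of one step via <component>__<parameter>.
pipe.set_params(kneighborsclassifier__n_neighbors=3)
n_neighbors = pipe.get_params()['kneighborsclassifier__n_neighbors']
print(n_neighbors)
```

By the same rule, a ColumnSelector added through make_pipeline gets the step name columnselector, so its cols parameter is addressed as columnselector__cols.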


transform(X, y=None)

Return a slice of the input array.

Parameters

Returns