Mlxtend.feature selection

mlxtend version: 0.23.4

ColumnSelector

ColumnSelector(cols=None, drop_axis=False)

Object for selecting specific columns from a data set.

Parameters

cols : array-like (default: None)

A list specifying the feature indices to be selected. For example, [1, 4, 5] to select the 2nd, 5th, and 6th feature columns, and ['A','C','D'] to select the name of feature columns A, C and D. If None, returns all columns in the array.
drop_axis : bool (default=False)

Drops last axis if True and the only one column is selected. This is useful, e.g., when the ColumnSelector is used for selecting only one column and the resulting array should be fed to e.g., a scikit-learn column selector. E.g., instead of returning an array with shape (n_samples, 1), drop_axis=True will return an aray with shape (n_samples,).

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/feature_selection/ColumnSelector/

Methods

fit(X, y=None)

Mock method. Does nothing.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples] (default: None)

Returns

self

fit_transform(X, y=None)

Return a slice of the input array.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples] (default: None)

Returns

X_slice : shape = [n_samples, k_features]

Subset of the feature space where k_features <= n_features

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params : dict

Parameter names mapped to their values.

set_params(params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects
(such as :class:`~sklearn.pipeline.Pipeline`). The latter have
parameters of the form ``<component>__<parameter>`` so that it's
possible to update each component of a nested object.

Parameters

**params : dict

Estimator parameters.

Returns

self : estimator instance

Estimator instance.

transform(X, y=None)

Return a slice of the input array.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples] (default: None)

Returns

X_slice : shape = [n_samples, k_features]

Subset of the feature space where k_features <= n_features

ExhaustiveFeatureSelector

ExhaustiveFeatureSelector(estimator, min_features=1, max_features=1, print_progress=True, scoring='accuracy', cv=5, n_jobs=1, pre_dispatch='2n_jobs', clone_estimator=True, fixed_features=None, feature_groups=None)*

Exhaustive Feature Selection for Classification and Regression. (new in v0.4.3)

Parameters

estimator : scikit-learn classifier or regressor
min_features : int (default: 1)

Minumum number of features to select
max_features : int (default: 1)

Maximum number of features to select. If parameter feature_groups is not None, the number of features is equal to the number of feature groups, i.e. len(feature_groups). For example, if feature_groups = [[0], [1], [2, 3], [4]], then the max_features value cannot exceed 4.
print_progress : bool (default: True)

Prints progress as the number of epochs to stderr.
scoring : str, (default='accuracy')

Scoring metric in {accuracy, f1, precision, recall, roc_auc} for classifiers, {'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'r2'} for regressors, or a callable object or function with signature scorer(estimator, X, y).
cv : int (default: 5)

Scikit-learn cross-validation generator or int. If estimator is a classifier (or y consists of integer class labels), stratified k-fold is performed, and regular k-fold cross-validation otherwise. No cross-validation if cv is None, False, or 0.
n_jobs : int (default: 1)

The number of CPUs to use for evaluating different feature subsets in parallel. -1 means 'all CPUs'.
pre_dispatch : int, or string (default: '2*n_jobs')

Controls the number of jobs that get dispatched during parallel execution if n_jobs > 1 or n_jobs=-1. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be: None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs An int, giving the exact number of total jobs that are spawned A string, giving an expression as a function of n_jobs, as in 2*n_jobs
clone_estimator : bool (default: True)

Clones estimator if True; works with the original estimator instance if False. Set to False if the estimator doesn't implement scikit-learn's set_params and get_params methods. In addition, it is required to set cv=0, and n_jobs=1.
fixed_features : tuple (default: None)

If not None, the feature indices provided as a tuple will be regarded as fixed by the feature selector. For example, if fixed_features=(1, 3, 7), the 2nd, 4th, and 8th feature are guaranteed to be present in the solution. Note that if fixed_features is not None, make sure that the number of features to be selected is greater than len(fixed_features). In other words, ensure that k_features > len(fixed_features).
feature_groups : list or None (default: None)

Optional argument for treating certain features as a group. This means, the features within a group are always selected together, never split. For example, feature_groups=[[1], [2], [3, 4, 5]] specifies 3 feature groups.In this case, possible feature selection results with k_features=2 are [[1], [2], [[1], [3, 4, 5]], or [[2], [3, 4, 5]]. Feature groups can be useful for interpretability, for example, if features 3, 4, 5 are one-hot encoded features. (For more details, please read the notes at the bottom of this docstring). New in mlxtend v. 0.21.0.

Attributes

best_idx_ : array-like, shape = [n_predictions]

Feature Indices of the selected feature subsets.
best_feature_names_ : array-like, shape = [n_predictions]

Feature names of the selected feature subsets. If pandas DataFrames are used in the fit method, the feature names correspond to the column names. Otherwise, the feature names are string representation of the feature array indices. New in v 0.13.0.
best_score_ : float

Cross validation average score of the selected subset.
subsets_ : dict

A dictionary of selected feature subsets during the exhaustive selection, where the dictionary keys are the lengths k of these feature subsets. The dictionary values are dictionaries themselves with the following keys: 'feature_idx' (tuple of indices of the feature subset) 'feature_names' (tuple of feature names of the feat. subset) 'cv_scores' (list individual cross-validation scores) 'avg_score' (average cross-validation score) Note that if pandas DataFrames are used in the fit method, the 'feature_names' correspond to the column names. Otherwise, the feature names are string representation of the feature array indices. The 'feature_names' is new in v. 0.13.0.

Notes

(1) If parameter feature_groups is not None, the number of features is equal to the number of feature groups, i.e. len(feature_groups). For example, if feature_groups = [[0], [1], [2, 3], [4]], then the max_features value cannot exceed 4.

(2) Although two or more individual features may be considered as one group
throughout the feature-selection process, it does not mean the individual
features of that group have the same impact on the outcome. For instance, in
linear regression, the coefficient of the feature 2 and 3 can be different
even if they are considered as one group in feature_groups.

(3) If both fixed_features and feature_groups are specified, ensure that each
feature group contains the fixed_features selection. E.g., for a 3-feature set
fixed_features=[0, 1] and feature_groups=[[0, 1], [2]] is valid;
fixed_features=[0, 1] and feature_groups=[[0], [1, 2]] is not valid.

(4) In case of KeyboardInterrupt, the dictionary subsets may not be completed.
If user is still interested in getting the best score, they can use method
`finalize_fit`.

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/feature_selection/ExhaustiveFeatureSelector/

Methods

finalize_fit()

None

fit(X, y, groups=None, fit_params)

Perform feature selection and learn model from training data.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.
y : array-like, shape = [n_samples]

Target values.
groups : array-like, with shape (n_samples,), optional

Group labels for the samples used while splitting the dataset into train/test set. Passed to the fit method of the cross-validator.
fit_params : dict of string -> object, optional

Parameters to pass to to the fit method of classifier.

Returns

self : object

fit_transform(X, y, groups=None, fit_params)

Fit to training data and return the best selected features from X.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.
y : array-like, shape = [n_samples]

Target values.
groups : array-like, with shape (n_samples,), optional

Group labels for the samples used while splitting the dataset into train/test set. Passed to the fit method of the cross-validator.
fit_params : dict of string -> object, optional

Parameters to pass to to the fit method of classifier.

Returns

Feature subset of X, shape={n_samples, k_features}

get_metric_dict(confidence_interval=0.95)

Return metric dictionary

Parameters

confidence_interval : float (default: 0.95)

A positive float between 0.0 and 1.0 to compute the confidence interval bounds of the CV score averages.

Returns

Dictionary with items where each dictionary value is a list with the number of iterations (number of feature subsets) as its length. The dictionary keys corresponding to these lists are as follows: 'feature_idx': tuple of the indices of the feature subset 'cv_scores': list with individual CV scores 'avg_score': of CV average scores 'std_dev': standard deviation of the CV score average 'std_err': standard error of the CV score average 'ci_bound': confidence interval bound of the CV score average

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params : dict

Parameter names mapped to their values.

set_params(params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects
(such as :class:`~sklearn.pipeline.Pipeline`). The latter have
parameters of the form ``<component>__<parameter>`` so that it's
possible to update each component of a nested object.

Parameters

**params : dict

Estimator parameters.

Returns

self : estimator instance

Estimator instance.

transform(X)

Return the best selected features from X.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.

Returns

Feature subset of X, shape={n_samples, k_features}

SequentialFeatureSelector

SequentialFeatureSelector(estimator, k_features=1, forward=True, floating=False, verbose=0, scoring=None, cv=5, n_jobs=1, pre_dispatch='2n_jobs', clone_estimator=True, fixed_features=None, feature_groups=None)*

Sequential Feature Selection for Classification and Regression.

Parameters

estimator : scikit-learn classifier or regressor
k_features : int or tuple or str (default: 1)

Number of features to select, where k_features < the full feature set. New in 0.4.2: A tuple containing a min and max value can be provided, and the SFS will consider return any feature combination between min and max that scored highest in cross-validation. For example, the tuple (1, 4) will return any combination from 1 up to 4 features instead of a fixed number of features k. New in 0.8.0: A string argument "best" or "parsimonious". If "best" is provided, the feature selector will return the feature subset with the best cross-validation performance. If "parsimonious" is provided as an argument, the smallest feature subset that is within one standard error of the cross-validation performance will be selected.
forward : bool (default: True)

Forward selection if True, backward selection otherwise
floating : bool (default: False)

Adds a conditional exclusion/inclusion if True.
verbose : int (default: 0), level of verbosity to use in logging.

If 0, no output, if 1 number of features in current set, if 2 detailed logging i ncluding timestamp and cv scores at step.
scoring : str, callable, or None (default: None)

If None (default), uses 'accuracy' for sklearn classifiers and 'r2' for sklearn regressors. If str, uses a sklearn scoring metric string identifier, for example {accuracy, f1, precision, recall, roc_auc} for classifiers, {'mean_absolute_error', 'mean_squared_error'/'neg_mean_squared_error', 'median_absolute_error', 'r2'} for regressors. If a callable object or function is provided, it has to be conform with sklearn's signature scorer(estimator, X, y); see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html for more information.
cv : int (default: 5)

Integer or iterable yielding train, test splits. If cv is an integer and estimator is a classifier (or y consists of integer class labels) stratified k-fold. Otherwise regular k-fold cross-validation is performed. No cross-validation if cv is None, False, or 0.
n_jobs : int (default: 1)

The number of CPUs to use for evaluating different feature subsets in parallel. -1 means 'all CPUs'.
pre_dispatch : int, or string (default: '2*n_jobs')

Controls the number of jobs that get dispatched during parallel execution if n_jobs > 1 or n_jobs=-1. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be: None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs An int, giving the exact number of total jobs that are spawned A string, giving an expression as a function of n_jobs, as in 2*n_jobs
clone_estimator : bool (default: True)

Clones estimator if True; works with the original estimator instance if False. Set to False if the estimator doesn't implement scikit-learn's set_params and get_params methods. In addition, it is required to set cv=0, and n_jobs=1.
fixed_features : tuple (default: None)

If not None, the feature indices provided as a tuple will be regarded as fixed by the feature selector. For example, if fixed_features=(1, 3, 7), the 2nd, 4th, and 8th feature are guaranteed to be present in the solution. Note that if fixed_features is not None, make sure that the number of features to be selected is greater than len(fixed_features). In other words, ensure that k_features > len(fixed_features). New in mlxtend v. 0.18.0.
feature_groups : list or None (default: None)

Optional argument for treating certain features as a group. This means, the features within a group are always selected together, never split. For example, feature_groups=[[1], [2], [3, 4, 5]] specifies 3 feature groups. In this case, possible feature selection results with k_features=2 are [[1], [2], [[1], [3, 4, 5]], or [[2], [3, 4, 5]]. Feature groups can be useful for interpretability, for example, if features 3, 4, 5 are one-hot encoded features. (For more details, please read the notes at the bottom of this docstring). New in mlxtend v. 0.21.0.

Attributes

k_feature_idx_ : array-like, shape = [n_predictions]

Feature Indices of the selected feature subsets.
k_feature_names_ : array-like, shape = [n_predictions]

Feature names of the selected feature subsets. If pandas DataFrames are used in the fit method, the feature names correspond to the column names. Otherwise, the feature names are string representation of the feature array indices. New in v 0.13.0.
k_score_ : float

Cross validation average score of the selected subset.
subsets_ : dict

A dictionary of selected feature subsets during the sequential selection, where the dictionary keys are the lengths k of these feature subsets. If the parameter feature_groups is not None, the value of key indicates the number of groups that are selected together. The dictionary values are dictionaries themselves with the following keys: 'feature_idx' (tuple of indices of the feature subset) 'feature_names' (tuple of feature names of the feat. subset) 'cv_scores' (list individual cross-validation scores) 'avg_score' (average cross-validation score) Note that if pandas DataFrames are used in the fit method, the 'feature_names' correspond to the column names. Otherwise, the feature names are string representation of the feature array indices. The 'feature_names' is new in v 0.13.0.

Notes

(2) Although two or more individual features may be considered as one group
throughout the feature-selection process, it does not mean the individual
features of that group have the same impact on the outcome. For instance, in
linear regression, the coefficient of the feature 2 and 3 can be different
even if they are considered as one group in feature_groups.

(3) If both fixed_features and feature_groups are specified, ensure that each
feature group contains the fixed_features selection. E.g., for a 3-feature set
fixed_features=[0, 1] and feature_groups=[[0, 1], [2]] is valid;
fixed_features=[0, 1] and feature_groups=[[0], [1, 2]] is not valid.

(4) In case of KeyboardInterrupt, the dictionary subsets may not be completed.
If user is still interested in getting the best score, they can use method
`finalize_fit`.

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/

Methods

finalize_fit()

None

fit(X, y, groups=None, fit_params)

Perform feature selection and learn model from training data.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.
y : array-like, shape = [n_samples]

Target values. New in v 0.13.0: pandas DataFrames are now also accepted as argument for y.
groups : array-like, with shape (n_samples,), optional

Group labels for the samples used while splitting the dataset into train/test set. Passed to the fit method of the cross-validator.
fit_params : various, optional

Additional parameters that are being passed to the estimator. For example, sample_weights=weights.

Returns

self : object

fit_transform(X, y, groups=None, fit_params)

Fit to training data then reduce X to its most important features.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.
y : array-like, shape = [n_samples]

Target values. New in v 0.13.0: a pandas Series are now also accepted as argument for y.
groups : array-like, with shape (n_samples,), optional

Group labels for the samples used while splitting the dataset into train/test set. Passed to the fit method of the cross-validator.
fit_params : various, optional

Additional parameters that are being passed to the estimator. For example, sample_weights=weights.

Returns

Reduced feature subset of X, shape={n_samples, k_features}

generate_error_message_k_features(name)

None

get_metric_dict(confidence_interval=0.95)

Return metric dictionary

Parameters

confidence_interval : float (default: 0.95)

A positive float between 0.0 and 1.0 to compute the confidence interval bounds of the CV score averages.

Returns

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params : dict

Parameter names mapped to their values.

set_params(params)

Set the parameters of this estimator. Valid parameter keys can be listed with get_params().

Returns

self

transform(X)

Reduce X to its most important features.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.

Returns

Reduced feature subset of X, shape={n_samples, k_features}

Properties

named_estimators

Returns

List of named estimator tuples, like [('svc', SVC(...))]

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search