bootstrap_point632_score
An implementation of the .632 bootstrap to evaluate supervised learning algorithms.
from mlxtend.evaluate import bootstrap_point632_score
Overview
Originally, the bootstrap method aims to determine the statistical properties of an estimator when the underlying distribution is unknown and additional samples are not available. To exploit this method for the evaluation of predictive models, such as hypotheses for classification and regression, we may prefer a slightly different approach to bootstrapping: the so-called Out-Of-Bag (OOB) or Leave-One-Out Bootstrap (LOOB) technique. Here, we use out-of-bag samples as test sets for evaluation instead of evaluating the model on the training data. Out-of-bag samples are the unique sets of instances that are not used for model fitting, as shown in the figure below [1].
The figure above illustrates what three random bootstrap samples drawn from an exemplary ten-sample dataset ($X_1, X_2, \dots, X_{10}$) and their out-of-bag samples for testing may look like. In practice, Bradley Efron and Robert Tibshirani recommend drawing 50 to 200 bootstrap samples as being sufficient for reliable estimates [2].
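The sampling scheme described above can be sketched in a few lines of NumPy. This is only an illustration, not the mlxtend implementation: the seed, the variable names, and the choice of three rounds on ten indices are assumptions made to mirror the figure.

```python
import numpy as np

# Assumed setup: ten sample indices and three bootstrap rounds, as in the figure.
rng = np.random.RandomState(123)
n_samples = 10
indices = np.arange(n_samples)

splits = []
for _ in range(3):
    # Draw n_samples indices with replacement: the bootstrap training sample.
    train_idx = rng.choice(indices, size=n_samples, replace=True)
    # Every index that was never drawn forms the out-of-bag test set.
    oob_idx = np.setdiff1d(indices, train_idx)
    splits.append((train_idx, oob_idx))
    print("bootstrap:", np.sort(train_idx), "out-of-bag:", oob_idx)
```

Each bootstrap sample has the same size as the original dataset but contains duplicates, so the out-of-bag set is whatever was left out in that round.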
In 1983, Bradley Efron described the .632 Estimate, a further improvement to address the pessimistic bias of the bootstrap cross-validation approach described above (Efron, 1983). The pessimistic bias in the "classic" bootstrap method can be attributed to the fact that the bootstrap samples only contain approximately 63.2% of the unique samples from the original dataset. For instance, we can compute the probability that a given sample from a dataset of size $n$ is not drawn as a bootstrap sample as
$$P(\text{not chosen}) = \left(1 - \frac{1}{n}\right)^n,$$
which is asymptotically equivalent to $\frac{1}{e} \approx 0.368$ as $n \rightarrow \infty$.
Vice versa, we can then compute the probability that a sample is chosen as $P(\text{chosen}) = 1 - \left(1 - \frac{1}{n}\right)^n \approx 0.632$ for reasonably large datasets, so that we'd select approximately $0.632 \times n$ unique samples as bootstrap training sets and reserve $0.368 \times n$ out-of-bag samples for testing in each iteration.
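The 63.2% figure can be checked numerically. The sketch below first evaluates the closed-form probability for a few dataset sizes and then estimates the fraction of unique samples by simulation; the dataset sizes, the seed, and the number of repetitions are arbitrary choices for illustration.

```python
import numpy as np

# Closed form: probability that a given sample is (not) chosen for a dataset of size n.
for n in (10, 100, 1000, 100000):
    p_not_chosen = (1 - 1 / n) ** n
    print(n, round(p_not_chosen, 4), round(1 - p_not_chosen, 4))

# Simulation: average fraction of unique indices in a bootstrap sample of size n.
rng = np.random.RandomState(0)
n = 1000
fractions = [np.unique(rng.randint(0, n, size=n)).size / n for _ in range(200)]
mean_unique_fraction = float(np.mean(fractions))
print(mean_unique_fraction)
```

Both the closed-form value and the simulated fraction should approach 0.632 as $n$ grows.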
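Now, to address the bias that is due to sampling with replacement, Bradley Efron proposed the .632 Estimate that we mentioned earlier, which is computed via the following equation:
$$\text{ACC}_{boot} = \frac{1}{b} \sum_{i=1}^{b} \left(0.632 \cdot \text{ACC}_{h, i} + 0.368 \cdot \text{ACC}_{r, i}\right),$$
where $\text{ACC}_{r, i}$ is the resubstitution accuracy and $\text{ACC}_{h, i}$ is the accuracy on the out-of-bag sample in the $i$-th of $b$ bootstrap rounds.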
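To make the weighting concrete, here is a minimal NumPy-only sketch of the .632 estimate. It is not the mlxtend implementation. The nearest-centroid classifier, the synthetic two-class data, and the helper names (`nearest_centroid_fit`, `nearest_centroid_predict`, `point632_scores`) are hypothetical and exist only for this illustration. It also assumes that the resubstitution accuracy is measured on the bootstrap training sample of each round; other implementations may measure it on the full dataset instead.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    # Toy model: one centroid per class label.
    labels = np.unique(y)
    centroids = np.array([X[y == label].mean(axis=0) for label in labels])
    return labels, centroids

def nearest_centroid_predict(model, X):
    labels, centroids = model
    # Distance from every row of X to every centroid; pick the closest one.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return labels[np.argmin(dists, axis=1)]

def point632_scores(X, y, n_splits=200, seed=1):
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    scores = []
    while len(scores) < n_splits:
        train_idx = rng.randint(0, n, size=n)            # sample with replacement
        oob_idx = np.setdiff1d(np.arange(n), train_idx)  # out-of-bag indices
        if oob_idx.size == 0:
            continue  # skip the (very unlikely) round with no out-of-bag samples
        model = nearest_centroid_fit(X[train_idx], y[train_idx])
        # Assumption: resubstitution accuracy on the bootstrap training sample.
        acc_r = np.mean(nearest_centroid_predict(model, X[train_idx]) == y[train_idx])
        acc_h = np.mean(nearest_centroid_predict(model, X[oob_idx]) == y[oob_idx])
        scores.append(0.632 * acc_h + 0.368 * acc_r)
    return np.array(scores)

# Synthetic two-class data, used only to exercise the sketch.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])
y = np.repeat([0, 1], 50)

scores = point632_scores(X, y, n_splits=100)
acc_boot = scores.mean()  # average over the b bootstrap rounds
print(acc_boot)
```

The per-round scores are kept in an array so that their mean gives the estimate and their percentiles can be used for an interval.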
References
 [1] https://sebastianraschka.com/blog/2016/model-evaluation-selection-part2.html
 [2] Efron, Bradley, and Robert J. Tibshirani. An Introduction to the Bootstrap. CRC Press, 1994.
Example 1 - Evaluating the predictive performance of a model
The bootstrap_point632_score function mimics the behavior of scikit-learn's cross_val_score, and a typical usage example is shown below:
from sklearn import datasets, linear_model
from mlxtend.evaluate import bootstrap_point632_score
import numpy as np
iris = datasets.load_iris()
X = iris.data
y = iris.target
lr = linear_model.LogisticRegression()
# Model accuracy
scores = bootstrap_point632_score(lr, X, y)
acc = np.mean(scores)
print('Accuracy: %.2f%%' % (100*acc))
# Confidence interval
lower = np.percentile(scores, 2.5)
upper = np.percentile(scores, 97.5)
print('95%% Confidence interval: [%.2f, %.2f]' % (100*lower, 100*upper))
Accuracy: 94.99%
95% Confidence interval: [90.76, 98.28]
API
bootstrap_point632_score(estimator, X, y, n_splits=200, method='.632', scoring=None, random_seed=None)
Implementation of the 0.632 bootstrap for supervised learning
Parameters

estimator : object
An estimator for classification or regression that follows the scikit-learn API and implements "fit" and "predict" methods.

X : array-like
The data to fit. Can be, for example, a list or an array that is at least 2d.

y : array-like, optional, default: None
The target variable to try to predict in the case of supervised learning.

n_splits : int (default=200)
Number of bootstrap iterations. Must be larger than 1.

method : str (default='.632')
The bootstrap method, which can be either the regular '.632' bootstrap (default) or the '.632+' bootstrap (not implemented yet).

scoring : str, callable, or None (default: None)
If None (default), uses 'accuracy' for sklearn classifiers and 'r2' for sklearn regressors. If str, uses a sklearn scoring metric string identifier, for example {'accuracy', 'f1', 'precision', 'recall', 'roc_auc', etc.} for classifiers, {'mean_absolute_error', 'mean_squared_error'/'neg_mean_squared_error', 'median_absolute_error', 'r2', etc.} for regressors. If a callable object or function is provided, it has to conform to sklearn's signature scorer(estimator, X, y); see http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html for more information.
random_seed : int (default=None)
If int, random_seed is the seed used by the random number generator.
Returns

scores : array of float, shape=(len(list(n_splits)),)
Array of scores of the estimator for each bootstrap replicate.
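For the scoring parameter described above, a callable with the scorer(estimator, X, y) signature might look like the following sketch. Both `balanced_accuracy_scorer` and the toy `MajorityClassifier` are hypothetical names introduced for illustration only; they are not part of mlxtend or scikit-learn, and the toy estimator exists solely to exercise the scorer without further dependencies.

```python
import numpy as np

def balanced_accuracy_scorer(estimator, X, y):
    """Hypothetical scorer with the scorer(estimator, X, y) signature.

    Returns the mean per-class recall of the estimator's predictions.
    """
    y = np.asarray(y)
    y_pred = np.asarray(estimator.predict(X))
    recalls = [np.mean(y_pred[y == label] == label) for label in np.unique(y)]
    return float(np.mean(recalls))

class MajorityClassifier:
    """Toy estimator with fit/predict that always predicts the majority class."""

    def fit(self, X, y):
        labels, counts = np.unique(y, return_counts=True)
        self.majority_ = labels[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)

# Four samples of class 0 and two of class 1; features are irrelevant here.
X_demo = np.zeros((6, 2))
y_demo = np.array([0, 0, 0, 0, 1, 1])

clf = MajorityClassifier().fit(X_demo, y_demo)
score = balanced_accuracy_scorer(clf, X_demo, y_demo)
print(score)  # 0.5: recall is 1.0 for class 0 and 0.0 for class 1
```

Going by the parameter description, a callable of this shape would be passed as scoring=balanced_accuracy_scorer together with a scikit-learn compatible estimator. The toy classifier above is not meant to be used that way.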
Examples
>>> from sklearn import datasets, linear_model
>>> from mlxtend.evaluate import bootstrap_point632_score
>>> import numpy as np
>>> iris = datasets.load_iris()
>>> X = iris.data
>>> y = iris.target
>>> lr = linear_model.LogisticRegression()
>>> scores = bootstrap_point632_score(lr, X, y)
>>> acc = np.mean(scores)
>>> print('Accuracy:', acc)
Accuracy: 0.953023146884
>>> lower = np.percentile(scores, 2.5)
>>> upper = np.percentile(scores, 97.5)
>>> print('95%% Confidence interval: [%.2f, %.2f]' % (lower, upper))
95% Confidence interval: [0.90, 0.98]