ftest: F-test for classifier comparisons
F-test for comparing the performance of multiple classifiers.
from mlxtend.evaluate import ftest
Overview
In the context of evaluating machine learning models, the F-test by George W. Snedecor [1] can be regarded as analogous to Cochran's Q test: it can be applied to evaluate multiple classifiers (i.e., to test whether their accuracies estimated on a test set differ), as described by Looney [2][3].
More formally, assume the task is to test the null hypothesis that there is no difference between the classification accuracies $p_1, \dots, p_L$ of $L$ classifiers [1]:
$$H_0: p_1 = p_2 = \cdots = p_L.$$
Let $\{D_1, \dots, D_L\}$ be a set of $L$ classifiers that have all been tested on the same dataset. If the $L$ classifiers do not perform differently, then the F statistic is distributed according to an F distribution with $(L-1)$ and $(L-1)(N-1)$ degrees of freedom, where $N$ is the number of examples in the test set. The calculation of the F statistic consists of several components, which are listed below (adapted from [2]).
We start by defining $ACC_{avg}$ as the average of the accuracies $ACC_i$ of the $L$ models:
$$ACC_{avg} = \frac{1}{L} \sum_{i=1}^{L} ACC_i.$$
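The sum of squares of the classifiers is then computed as
$$SSA = N \sum_{i=1}^{L} (G_i)^2 - N \cdot L \cdot ACC_{avg}^2,$$
where $G_i$ is the proportion of the $N$ examples classified correctly by classifier $i$ (that is, its accuracy $ACC_i$).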
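The sum of squares for the objects is calculated as follows:
$$SSB = \frac{1}{L} \sum_{j=1}^{N} (L_j)^2 - L \cdot N \cdot ACC_{avg}^2.$$
Here, $L_j$ is the number of classifiers out of $L$ that correctly classified object $\mathbf{z}_j \in \mathbf{Z}_N$, where $\mathbf{Z}_N = \{\mathbf{z}_1, \dots, \mathbf{z}_N\}$ is the test dataset on which the classifiers are evaluated.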
Finally, we compute the total sum of squares,
$$SST = L \cdot N \cdot ACC_{avg} (1 - ACC_{avg}),$$
so that we can then compute the sum of squares for the classification-object interaction:
$$SSAB = SST - SSA - SSB.$$
To compute the F statistic, we next compute the mean SSA and mean SSAB values:
$$MSA = \frac{SSA}{L-1},$$
and
$$MSAB = \frac{SSAB}{(L-1)(N-1)}.$$
From the MSA and MSAB, we can then calculate the F-value as
$$F = \frac{MSA}{MSAB}.$$
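The steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the mlxtend implementation: the function name `f_statistic` and the toy labels are hypothetical, and the total sum of squares is assumed to be $L \cdot N \cdot ACC_{avg}(1 - ACC_{avg})$.

```python
import numpy as np


def f_statistic(y_true, *predictions):
    """F statistic for two or more classifiers (illustrative sketch)."""
    if len(predictions) < 2:
        raise ValueError("Provide predictions from at least 2 models.")
    y_true = np.asarray(y_true)
    # correct[j, i] is 1 if classifier i labeled example j correctly, else 0
    correct = np.column_stack(
        [np.asarray(p) == y_true for p in predictions]).astype(int)
    n, num_models = correct.shape

    acc = correct.mean(axis=0)         # G_i: accuracy of each classifier
    acc_avg = acc.mean()               # ACC_avg
    per_example = correct.sum(axis=1)  # L_j: classifiers correct on example j

    ssa = n * np.sum(acc ** 2) - n * num_models * acc_avg ** 2
    ssb = np.sum(per_example ** 2) / num_models - n * num_models * acc_avg ** 2
    sst = n * num_models * acc_avg * (1 - acc_avg)
    ssab = sst - ssa - ssb

    msa = ssa / (num_models - 1)
    msab = ssab / ((num_models - 1) * (n - 1))
    return msa / msab


# Hypothetical toy data: 8 examples, 3 classifiers
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
pred_a = [0, 1, 1, 0, 1, 0, 1, 0]
pred_b = [0, 1, 0, 0, 1, 1, 1, 1]
pred_c = [1, 0, 1, 0, 0, 0, 1, 1]
print('F: %.3f' % f_statistic(y_true, pred_a, pred_b, pred_c))  # prints F: 0.467
```

The sketch does not guard against a zero interaction term (MSAB = 0), which can occur when all classifiers make identical errors.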
After computing the F-value, we can look up the p-value in an F-distribution table for the corresponding degrees of freedom or obtain it computationally from the cumulative F-distribution function. In practice, if we reject the null hypothesis at a previously chosen significance threshold, we could perform multiple post hoc pairwise tests (for example, McNemar tests with a Bonferroni correction) to determine which pairs have different population proportions.
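As a minimal sketch of the computational route, the p-value is the upper tail of the F distribution, which SciPy exposes as the survival function `scipy.stats.f.sf`. The helper name `f_test_p_value` is hypothetical, and the second degrees-of-freedom value $(L-1)(N-1)$ is an assumption chosen to match the MSAB denominator.

```python
from scipy import stats


def f_test_p_value(f_value, num_models, num_examples):
    """Upper-tail p-value for an F statistic (illustrative sketch)."""
    dfn = num_models - 1                        # L - 1
    dfd = (num_models - 1) * (num_examples - 1)  # (L - 1)(N - 1), assumed
    # sf(x) = 1 - cdf(x), evaluated without cancellation error in the tail
    return stats.f.sf(f_value, dfn, dfd)


# F-value of 3.873 from 3 classifiers evaluated on 100 test examples
p_value = f_test_p_value(3.873, num_models=3, num_examples=100)
print('p-value: %.3f' % p_value)  # prints p-value: 0.022
```

Using the survival function rather than `1 - cdf` is a small design choice that keeps very small p-values accurate.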
References
 [1] Snedecor, George W. and Cochran, William G. (1989), Statistical Methods, Eighth Edition, Iowa State University Press.
 [2] Looney, Stephen W. "A statistical technique for comparing the accuracies of several classifiers." Pattern Recognition Letters 8, no. 1 (1988): 5-9.
 [3] Kuncheva, Ludmila I. Combining pattern classifiers: methods and algorithms. John Wiley & Sons, 2004.
Example 1 - F-test
import numpy as np
from mlxtend.evaluate import ftest
## Dataset:
# ground truth labels of the test dataset:
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0])
# predictions by 3 classifiers (`y_model_1`, `y_model_2`, and `y_model_3`):
y_model_1 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0])
y_model_2 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0])
y_model_3 = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1])
Assuming a significance level $\alpha = 0.05$, we can conduct the F-test as follows to test the null hypothesis that there is no difference between the classification accuracies, $H_0: p_1 = p_2 = \cdots = p_L$:
f, p_value = ftest(y_true,
                   y_model_1,
                   y_model_2,
                   y_model_3)
print('F: %.3f' % f)
print('p-value: %.3f' % p_value)
F: 3.873
p-value: 0.022
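As a cross-check, the printed F-value can be reproduced by hand. The three classifiers are correct on 84, 92, and 92 of the $N = 100$ examples. Counting per example, 80 examples are classified correctly by all three models, 11 by two, 6 by one, and 3 by none, so $\sum_j (L_j)^2 = 80 \cdot 9 + 11 \cdot 4 + 6 \cdot 1 = 770$. The worked numbers below assume the total sum of squares $L \cdot N \cdot ACC_{avg}(1 - ACC_{avg})$ and are rounded to four decimals.

```latex
\begin{align*}
ACC_{avg} &= \tfrac{1}{3}(0.84 + 0.92 + 0.92) \approx 0.8933 \\
SSA  &= 100\,(0.84^2 + 0.92^2 + 0.92^2) - 100 \cdot 3 \cdot ACC_{avg}^2 \approx 0.4267 \\
SSB  &= \tfrac{1}{3} \cdot 770 - 3 \cdot 100 \cdot ACC_{avg}^2 \approx 17.2533 \\
SST  &= 3 \cdot 100 \cdot ACC_{avg}\,(1 - ACC_{avg}) \approx 28.5867 \\
SSAB &= SST - SSA - SSB \approx 10.9067 \\
MSA  &= \frac{SSA}{3 - 1} \approx 0.2133 \\
MSAB &= \frac{SSAB}{(3 - 1)(100 - 1)} \approx 0.0551 \\
F    &= \frac{MSA}{MSAB} \approx 3.873
\end{align*}
```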
Since the p-value is smaller than $\alpha$, we can reject the null hypothesis and conclude that there is a difference between the classification accuracies. As mentioned in the overview, we could now perform multiple post hoc pairwise tests (for example, McNemar tests with a Bonferroni correction) to determine which pairs have different population proportions.
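One way such a post hoc analysis could look is sketched below with plain NumPy. The helper names `mcnemar_p_value` and `pairwise_mcnemar` are hypothetical, the test uses the continuity-corrected chi-squared approximation (an exact binomial test is commonly preferred when the number of discordant pairs is small), and the demo arrays are rebuilt compactly with the same error positions as the predictions in this example.

```python
import math
from itertools import combinations

import numpy as np


def mcnemar_p_value(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar test; returns (chi2, p)."""
    y_true = np.asarray(y_true)
    correct_a = np.asarray(pred_a) == y_true
    correct_b = np.asarray(pred_b) == y_true
    b = int(np.sum(correct_a & ~correct_b))  # only model A correct
    c = int(np.sum(~correct_a & correct_b))  # only model B correct
    if b + c == 0:
        return 0.0, 1.0  # the two models never disagree
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Upper tail of a chi-squared distribution with 1 degree of freedom
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


def pairwise_mcnemar(y_true, predictions, alpha=0.05):
    """Run McNemar's test on every model pair with a Bonferroni correction."""
    pairs = list(combinations(range(len(predictions)), 2))
    adjusted_alpha = alpha / len(pairs)  # Bonferroni-adjusted threshold
    results = {}
    for i, j in pairs:
        chi2, p = mcnemar_p_value(y_true, predictions[i], predictions[j])
        results[(i, j)] = (chi2, p, p < adjusted_alpha)
    return results


# Compact stand-ins for y_true and the three prediction arrays
y_true = np.zeros(100, dtype=int)
model_1 = y_true.copy(); model_1[:16] = 1
model_2 = y_true.copy(); model_2[[0, 1, 2, 3, 4, 5, 20, 21]] = 1
model_3 = y_true.copy(); model_3[[0, 1, 2, 6, 20, 21, 98, 99]] = 1

for pair, (chi2, p, significant) in pairwise_mcnemar(
        y_true, [model_1, model_2, model_3]).items():
    print(pair, 'chi2: %.3f' % chi2, 'p: %.3f' % p, 'significant:', significant)
```

Each pairwise p-value is compared against `alpha` divided by the number of pairs, which is what the Bonferroni correction amounts to for three comparisons.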
API
ftest(y_target, *y_model_predictions)
F-test to compare 2 or more models.
Parameters

y_target
: array-like, shape=[n_samples]
True class labels as 1D NumPy array.

*y_model_predictions
: array-likes, shape=[n_samples]
Variable number of 2 or more arrays that contain the predicted class labels from models as 1D NumPy arrays.
Returns

f, p
: float or None, float
Returns the F-value and the p-value.
Examples
For usage examples, please see http://rasbt.github.io/mlxtend/user_guide/evaluate/ftest/