F-Test
F-test for comparing the performance of multiple classifiers.
from mlxtend.evaluate import ftest
Overview
In the context of evaluating machine learning models, the F-test by George W. Snedecor [1] can be regarded as analogous to Cochran's Q test: it can be applied to evaluate multiple classifiers (i.e., to test whether their accuracies estimated on a test set differ), as described by Looney [2][3].
More formally, assume the task is to test the null hypothesis that there is no difference between the classification accuracies [1]:
$$H_0: p_1 = p_2 = \cdots = p_L.$$
Let $\{D_1, \dots, D_L\}$ be a set of classifiers that have all been tested on the same dataset. If the $L$ classifiers do not perform differently, then the F statistic is distributed according to an F distribution with $(L-1)$ and $(L-1)(N-1)$ degrees of freedom, where $N$ is the number of examples in the test set.
The calculation of the F statistic consists of several components, which are listed below (adapted from [3]).
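Sum of squares of the classifiers:
$$SSA = N \sum_{i=1}^{L} ACC_i^2 - L \cdot N \cdot ACC_{avg}^2,$$
where $ACC_i$ is the accuracy of classifier $D_i$ on the test set and $ACC_{avg} = \frac{1}{L} \sum_{i=1}^{L} ACC_i$ is the average of the accuracies of the different models.
The sum of squares for the objects:
$$SSB = \frac{1}{L} \sum_{j=1}^{N} (L_j)^2 - L \cdot N \cdot ACC_{avg}^2,$$
where $L_j$ is the number of classifiers out of $L$ that correctly classified object $\mathbf{z}_j \in \mathbf{Z}_N$, and $\mathbf{Z}_N = \{\mathbf{z}_1, \dots, \mathbf{z}_N\}$ is the test dataset on which the classifiers are tested.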
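The total sum of squares:
$$SST = L \cdot N \cdot ACC_{avg} (1 - ACC_{avg}).$$
The sum of squares for the classification-object interaction:
$$SSAB = SST - SSA - SSB.$$
The mean SSA and mean SSAB values:
$$MSA = \frac{SSA}{L-1}$$
and
$$MSAB = \frac{SSAB}{(L-1)(N-1)}.$$
From the MSA and MSAB, we can then calculate the F-value as
$$F = \frac{MSA}{MSAB}.$$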
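The components above can be sketched in a few lines of plain Python. This is only an illustration that follows the formulas in [3], not the library's implementation: the helper name `f_statistic` and the toy data are hypothetical, and the second degrees-of-freedom value is assumed to be $(L-1)(N-1)$.

```python
def f_statistic(y_true, *y_preds):
    """Return (F, df1, df2) for 2 or more classifiers scored on the same examples.

    Hypothetical helper for illustration; it is not part of mlxtend.
    """
    num_examples, num_models = len(y_true), len(y_preds)
    # correct[i][j] is 1 if classifier i labeled example j correctly, else 0
    correct = [[int(p == t) for p, t in zip(pred, y_true)] for pred in y_preds]

    acc = [sum(row) / num_examples for row in correct]   # ACC_i
    acc_avg = sum(acc) / num_models                      # ACC_avg
    l_j = [sum(col) for col in zip(*correct)]            # L_j per example

    correction = num_models * num_examples * acc_avg ** 2
    ssa = num_examples * sum(a ** 2 for a in acc) - correction
    ssb = sum(v ** 2 for v in l_j) / num_models - correction
    sst = num_models * num_examples * acc_avg * (1 - acc_avg)
    ssab = sst - ssa - ssb

    df1 = num_models - 1
    df2 = (num_models - 1) * (num_examples - 1)  # assumption, following [3]
    msa = ssa / df1
    msab = ssab / df2  # zero if there is no interaction variance; not handled here
    return msa / msab, df1, df2


# Toy data (hypothetical): 4 examples, 2 classifiers
y_true = [0, 0, 0, 0]
pred_a = [0, 0, 0, 1]  # 3 of 4 correct
pred_b = [0, 1, 1, 1]  # 1 of 4 correct

f_value, df1, df2 = f_statistic(y_true, pred_a, pred_b)
print(f"F = {f_value:.3f} with ({df1}, {df2}) degrees of freedom")
```

For the toy data, SSA = 0.5, SSB = 1, SST = 2, and SSAB = 0.5, so MSA = 0.5 and MSAB = 0.5 / 3, which gives F = 3.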
After computing the F-value, we can look up the p-value in an F-distribution table for the corresponding degrees of freedom or obtain it computationally from a cumulative F-distribution function. In practice, if we reject the null hypothesis at a previously chosen significance threshold, we can perform multiple post hoc pairwise tests (for example, McNemar tests with a Bonferroni correction) to determine which pairs have different population proportions.
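As a rough sketch of the computational route, the survival function of SciPy's F distribution (`scipy.stats.f.sf`, i.e. one minus the cumulative distribution function) gives the p-value directly. The F-value, number of models, and number of examples below are illustrative inputs, and the second degrees-of-freedom value is an assumption that follows [3].

```python
from scipy import stats

f_value = 3.873      # illustrative F-value
num_models = 3       # L
num_examples = 100   # N

df1 = num_models - 1
df2 = (num_models - 1) * (num_examples - 1)  # assumption, following [3]

# sf(x) = 1 - cdf(x): probability of observing an F at least this large under H0
p_value = stats.f.sf(f_value, df1, df2)
print(f"p-value: {p_value:.3f}")
```

Using `sf` rather than `1 - cdf` avoids a loss of precision when the p-value is very small.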
References
 [1] Snedecor, George W. and Cochran, William G. (1989), Statistical Methods, Eighth Edition, Iowa State University Press.
 [2] Looney, Stephen W. "A statistical technique for comparing the accuracies of several classifiers." Pattern Recognition Letters 8, no. 1 (1988): 5-9.
 [3] Kuncheva, Ludmila I. Combining pattern classifiers: methods and algorithms. John Wiley & Sons, 2004.
Example 1 - F-test
import numpy as np
from mlxtend.evaluate import ftest
## Dataset:
# ground truth labels of the test dataset:
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0])
# predictions by 3 classifiers (`y_model_1`, `y_model_2`, and `y_model_3`):
y_model_1 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0])
y_model_2 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0])
y_model_3 = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1])
Assuming a significance level $\alpha = 0.05$, we can conduct the F-test as follows to test the null hypothesis that there is no difference between the classification accuracies, $H_0: p_1 = p_2 = \cdots = p_L$:
f, p_value = ftest(y_true,
y_model_1,
y_model_2,
y_model_3)
print('F: %.3f' % f)
print('p-value: %.3f' % p_value)
F: 3.873
p-value: 0.022
Since the p-value is smaller than $\alpha$, we can reject the null hypothesis and conclude that there is a difference between the classification accuracies. As mentioned in the introduction earlier, we could now perform multiple post hoc pairwise tests (for example, McNemar tests with a Bonferroni correction) to determine which pairs have different population proportions.
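One way such a post hoc analysis might look is sketched below with the standard library only. It uses the exact (binomial) form of McNemar's test and compares each p-value against a Bonferroni-adjusted threshold. The helper `exact_mcnemar_p` is hypothetical, and the lists are a compact rebuild of the example arrays above. mlxtend also ships `mcnemar_table` and `mcnemar` in `mlxtend.evaluate`, which could be used instead of the hand-written helper.

```python
from itertools import combinations
from math import comb


def exact_mcnemar_p(y_true, pred_a, pred_b):
    """Two-sided exact (binomial) McNemar p-value; hypothetical helper."""
    # Discordant pairs: b = only A correct, c = only B correct
    b = sum(a == t and m != t for t, a, m in zip(y_true, pred_a, pred_b))
    c = sum(a != t and m == t for t, a, m in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return 1.0  # the two classifiers never disagree in correctness
    # Binomial(n, 0.5) probability of a split at least as uneven as the observed one
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Compact rebuild of the example data (100 examples, all true labels 0)
y_true = [0] * 100
models = {
    "model_1": [1] * 16 + [0] * 84,
    "model_2": [1] * 6 + [0] * 14 + [1] * 2 + [0] * 78,
    "model_3": [1, 1, 1, 0, 0, 0, 1] + [0] * 13 + [1, 1] + [0] * 76 + [1, 1],
}

alpha = 0.05
pairs = list(combinations(models, 2))
alpha_adj = alpha / len(pairs)  # Bonferroni: divide alpha by the number of tests

results = {}
for name_a, name_b in pairs:
    p = exact_mcnemar_p(y_true, models[name_a], models[name_b])
    results[(name_a, name_b)] = p
    print(f"{name_a} vs {name_b}: p = {p:.4f}, "
          f"reject at alpha = {alpha_adj:.4f}: {p < alpha_adj}")
```

The Bonferroni correction is conservative, so a significant omnibus F-test does not guarantee that any single pairwise comparison stays significant after the adjustment.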
API
ftest(y_target, *y_model_predictions)
F-test to compare 2 or more models.
Parameters

y_target
: array-like, shape=[n_samples]
True class labels as 1D NumPy array.

*y_model_predictions
: array-likes, shape=[n_samples]
Variable number of 2 or more arrays that contain the predicted class labels from models as 1D NumPy arrays.
Returns

f, p
: float or None, float
Returns the F-value and the p-value.
Examples
For usage examples, please see http://rasbt.github.io/mlxtend/user_guide/evaluate/ftest/