Cochran's Q Test
Cochran's Q test for comparing the performance of multiple classifiers.
from mlxtend.evaluate import cochrans_q
Overview
Cochran's Q test can be regarded as a generalized version of McNemar's test that can be applied to evaluate multiple classifiers. In a sense, Cochran's Q test is analogous to ANOVA for binary outcomes.
To compare more than two classifiers, we can use Cochran's Q test, which has a test statistic $Q$ that is approximately (similar to McNemar's test) distributed as chi-squared with $L-1$ degrees of freedom, where $L$ is the number of models we evaluate. Since $L=2$ for McNemar's test, McNemar's test statistic approximates a chi-squared distribution with one degree of freedom.
More formally, Cochran's Q test tests the hypothesis that there is no difference between the classification accuracies $p_i$ [1]:

$$H_0: p_1 = p_2 = \cdots = p_L.$$
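Let $\{D_1, \dots, D_L\}$ be a set of classifiers that have all been tested on the same dataset. If the $L$ classifiers don't perform differently, then the following Q statistic is distributed approximately as chi-squared with $L-1$ degrees of freedom:

$$Q_C = (L-1) \frac{L \sum_{i=1}^{L} G_i^2 - T^2}{L T - \sum_{j=1}^{N_{ts}} (L_j)^2}.$$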
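Here, $G_i$ is the number of objects out of $N_{ts}$ correctly classified by $D_i$, $i = 1, \dots, L$; $L_j$ is the number of classifiers out of $L$ that correctly classified object $\mathbf{z}_j \in \mathbf{Z}_{ts}$, where $\mathbf{Z}_{ts} = \{\mathbf{z}_1, \dots, \mathbf{z}_{N_{ts}}\}$ is the test dataset on which the classifiers are tested; and $T$ is the total number of correct votes among the $L$ classifiers [2]:

$$T = \sum_{i=1}^{L} G_i = \sum_{j=1}^{N_{ts}} L_j.$$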
To perform Cochran's Q test, we typically organize the classifier predictions in a binary $N_{ts} \times L$ matrix. The $ij$th entry of such a matrix is 0 if classifier $D_j$ has misclassified data example (vector) $\mathbf{z}_i$ and 1 otherwise (if the classifier predicted the class label correctly) [2].
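As a minimal sketch of this bookkeeping, the binary matrix and the counts that enter the Q statistic can be derived with NumPy. The toy labels and the variable names (`correct`, `G`, `L_j`, `T`) are illustrative assumptions and not part of mlxtend:

```python
import numpy as np

# Toy data (hypothetical): 6 test examples, 3 classifiers
y_true = np.array([0, 1, 1, 0, 1, 0])
predictions = np.array([
    [0, 1, 0, 0, 1, 1],  # classifier 1
    [0, 1, 1, 0, 0, 0],  # classifier 2
    [1, 1, 1, 0, 1, 0],  # classifier 3
])

# Binary matrix with one row per test example and one column per classifier:
# 1 if the classifier predicted the label correctly, 0 otherwise
correct = (predictions == y_true).astype(int).T

G = correct.sum(axis=0)    # correct classifications per classifier
L_j = correct.sum(axis=1)  # correct votes per test example
T = correct.sum()          # total number of correct votes

print(G)    # [4 5 5]
print(L_j)  # [2 3 2 3 2 2]
print(T)    # 14
```

Note that the column sums and the row sums both add up to the same total `T`, which is a quick sanity check on the matrix.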
The following example taken from [2] illustrates how the classification results may be organized. For instance, assume we have the ground truth labels of the test dataset y_true and the following predictions by 3 classifiers (y_model_1, y_model_2, and y_model_3):
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0])
y_model_1 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0])
y_model_2 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0])
y_model_3 = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1])
The table of correct (1) and incorrect (0) classifications may then look as follows:
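| $D_1$ (model 1) | $D_2$ (model 2) | $D_3$ (model 3) | Occurrences |
|---|---|---|---|
| 1 | 1 | 1 | 80 |
| 1 | 1 | 0 | 2 |
| 1 | 0 | 1 | 0 |
| 1 | 0 | 0 | 2 |
| 0 | 1 | 1 | 9 |
| 0 | 1 | 0 | 1 |
| 0 | 0 | 1 | 3 |
| 0 | 0 | 0 | 3 |

Accuracy: 84/100 = 84% (model 1), 92/100 = 92% (model 2), 92/100 = 92% (model 3).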
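To make the link between the table and the binary matrix concrete, the following sketch expands the occurrence counts back into a 100 x 3 matrix and recomputes the accuracies. The array names `patterns`, `occurrences`, and `correct` are illustrative assumptions:

```python
import numpy as np

# Correctness patterns (model 1, model 2, model 3) from the table
patterns = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
])
# Number of test examples that show each pattern
occurrences = np.array([80, 2, 0, 2, 9, 1, 3, 3])

# Repeat each pattern row by its count to recover the full binary matrix
correct = np.repeat(patterns, occurrences, axis=0)

# Column means are the per-classifier accuracies
accuracy = correct.mean(axis=0)

print(correct.shape)  # (100, 3)
print(accuracy)       # [0.84 0.92 0.92]
```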
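By plugging the respective values into the previous equation, we obtain the following $Q$ value [2]:

$$Q_C = 2 \times \frac{3 \times (84^2 + 92^2 + 92^2) - 268^2}{3 \times 268 - (80 \times 9 + 11 \times 4 + 6 \times 1)} \approx 7.5294.$$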
(Note that the value in [2] is listed as 3.7647 due to a typo, as discussed with the author; 7.5294 is the correct value.)
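The arithmetic can be checked directly from the table counts. The sketch below assumes the Q statistic in the form $Q = (L-1)(L \sum_i G_i^2 - T^2) / (L T - \sum_j L_j^2)$; the dictionary `table` and the other variable names are illustrative:

```python
# Correctness patterns (model 1, model 2, model 3) mapped to occurrence counts
table = {
    (1, 1, 1): 80, (1, 1, 0): 2, (1, 0, 1): 0, (1, 0, 0): 2,
    (0, 1, 1): 9, (0, 1, 0): 1, (0, 0, 1): 3, (0, 0, 0): 3,
}
L = 3  # number of classifiers

# G_i: number of test examples classified correctly by classifier i
G = [sum(n * p[i] for p, n in table.items()) for i in range(L)]
# T: total number of correct votes
T = sum(G)
# Sum of L_j^2 over all test examples, where L_j = sum(pattern)
sum_Lj_sq = sum(n * sum(p) ** 2 for p, n in table.items())

Q = (L - 1) * (L * sum(g ** 2 for g in G) - T ** 2) / (L * T - sum_Lj_sq)

print(G, T, sum_Lj_sq)  # [84, 92, 92] 268 770
print(round(Q, 4))      # 7.5294
```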
Now, the Q value (approximating $\chi^2$) corresponds to a p-value of approx. 0.023 assuming a $\chi^2$ distribution with $L-1 = 2$ degrees of freedom. Assuming that we chose a significance level of $\alpha=0.05$, we would reject the null hypothesis that all classifiers perform equally well, since $0.023 < \alpha$.
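As a quick check of this p-value, note that a chi-squared distribution with exactly 2 degrees of freedom has the closed-form upper-tail probability $P(X > q) = e^{-q/2}$, so the standard library suffices for this particular case. This is only a sketch for the worked example; for other degrees of freedom, a general routine such as `scipy.stats.chi2.sf` would be needed:

```python
import math

Q = 7.5294    # Q statistic from the worked example
alpha = 0.05  # chosen significance level

# Upper-tail probability of a chi-squared distribution with 2 degrees
# of freedom (closed form, valid only for df = 2)
p_value = math.exp(-Q / 2)

print(round(p_value, 3))  # 0.023
print(p_value < alpha)    # True
```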
In practice, if we successfully rejected the null hypothesis, we could perform multiple post hoc pairwise tests (for example, McNemar tests with a Bonferroni correction) to determine which pairs have different population proportions.
References
 [1] Fleiss, Joseph L., Bruce Levin, and Myunghee Cho Paik. Statistical methods for rates and proportions. John Wiley & Sons, 2013.
 [2] Kuncheva, Ludmila I. Combining pattern classifiers: methods and algorithms. John Wiley & Sons, 2004.
Example 1 - Cochran's Q test
import numpy as np
from mlxtend.evaluate import cochrans_q
from mlxtend.evaluate import mcnemar_table
from mlxtend.evaluate import mcnemar
## Dataset:
# ground truth labels of the test dataset:
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0])
# predictions by 3 classifiers (`y_model_1`, `y_model_2`, and `y_model_3`):
y_model_1 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0])
y_model_2 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0])
y_model_3 = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1])
Assuming a significance level $\alpha=0.05$, we can conduct Cochran's Q test as follows to test the null hypothesis that there is no difference between the classification accuracies, $H_0: p_1 = p_2 = \cdots = p_L$:
q, p_value = cochrans_q(y_true,
                        y_model_1,
                        y_model_2,
                        y_model_3)
print('Q: %.3f' % q)
print('p-value: %.3f' % p_value)
Q: 7.529
p-value: 0.023
Since the p-value is smaller than $\alpha$, we can reject the null hypothesis and conclude that there is a difference between the classification accuracies. As mentioned in the introduction earlier, we could now perform multiple post hoc pairwise tests (for example, McNemar tests with a Bonferroni correction) to determine which pairs have different population proportions.
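One way such a post hoc analysis might look is sketched below. To keep the arithmetic explicit, it does not call mlxtend. Instead, it hand-computes McNemar's statistic without continuity correction from the tallied correctness patterns of the three models. The helper `mcnemar_uncorrected` and the dictionary `table` are hypothetical names introduced for this illustration:

```python
import math
from itertools import combinations

# Correctness patterns (model 1, model 2, model 3) -> occurrence counts,
# tallied from the predictions of the three models
table = {
    (1, 1, 1): 80, (1, 1, 0): 2, (1, 0, 1): 0, (1, 0, 0): 2,
    (0, 1, 1): 9, (0, 1, 0): 1, (0, 0, 1): 3, (0, 0, 0): 3,
}


def mcnemar_uncorrected(i, j):
    """McNemar chi-squared (no continuity correction) and p-value."""
    # b: model i correct, model j wrong; c: model i wrong, model j correct
    b = sum(n for p, n in table.items() if p[i] == 1 and p[j] == 0)
    c = sum(n for p, n in table.items() if p[i] == 0 and p[j] == 1)
    if b + c == 0:
        return 0.0, 1.0  # no discordant examples, no evidence of a difference
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-squared distribution with 1 degree of freedom
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


pairs = list(combinations(range(3), 2))
alpha_corrected = 0.05 / len(pairs)  # Bonferroni: divide alpha by no. of tests

for i, j in pairs:
    chi2, p = mcnemar_uncorrected(i, j)
    print('model %d vs. model %d: chi2=%.3f, p-value=%.3f, reject=%s'
          % (i + 1, j + 1, chi2, p, p < alpha_corrected))
```

With three pairwise comparisons, the Bonferroni-corrected threshold is 0.05/3, or roughly 0.0167. In this sketch none of the pairwise p-values falls below that threshold, which illustrates that a significant overall test does not guarantee a significant pairwise difference after correction.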
Lastly, let's illustrate that Cochran's Q test is indeed just a generalized version of McNemar's test:
chi2, p_value = cochrans_q(y_true,
                           y_model_1,
                           y_model_2)
print('Cochran\'s Q Chi^2: %.3f' % chi2)
print('Cochran\'s Q p-value: %.3f' % p_value)
Cochran's Q Chi^2: 5.333
Cochran's Q p-value: 0.021
chi2, p_value = mcnemar(mcnemar_table(y_true,
                                      y_model_1,
                                      y_model_2),
                        corrected=False)
print('McNemar\'s Chi^2: %.3f' % chi2)
print('McNemar\'s p-value: %.3f' % p_value)
McNemar's Chi^2: 5.333
McNemar's p-value: 0.021
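The agreement of the two results can also be seen algebraically. As a short derivation sketch, write the Q statistic as $Q = (L-1)(L \sum_i G_i^2 - T^2) / (L T - \sum_j L_j^2)$ and, for $L=2$, let $a$ be the number of examples both models classify correctly, $b$ the number only the first model classifies correctly, and $c$ the number only the second model classifies correctly (this notation is introduced here for illustration). Then

$$G_1 = a + b, \quad G_2 = a + c, \quad T = 2a + b + c, \quad \sum_j L_j^2 = 4a + b + c,$$

and therefore

$$Q = (2-1) \, \frac{2\left[(a+b)^2 + (a+c)^2\right] - (2a+b+c)^2}{2(2a+b+c) - (4a+b+c)} = \frac{(b-c)^2}{b+c},$$

which is McNemar's statistic without continuity correction. For the two models above, $b = 2$ and $c = 10$, so $Q = 64/12 \approx 5.333$.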
API
cochrans_q(y_target, *y_model_predictions)
Cochran's Q test to compare 2 or more models.
Parameters

y_target
: array-like, shape=[n_samples]. True class labels as 1D NumPy array.

*y_model_predictions
: array-likes, shape=[n_samples]. Variable number of 2 or more arrays that contain the predicted class labels from models as 1D NumPy arrays.
Returns

q, p
: float or None, float. Returns the Q (chi-squared) value and the p-value.