proportion_difference: Difference of proportions test for classifiers.
Test of the difference of proportions for classifier performance comparison.
from mlxtend.evaluate import proportion_difference
Overview
There are several statistical hypothesis testing frameworks used in practice to compare the performance of classification models. A common method is the difference of two proportions test (here, the proportions are the estimated generalization accuracies computed from a test set), for which we can construct 95% confidence intervals based on the normal approximation to the binomial that was covered in Part I.
Performing a z-score test for two population proportions is inarguably the most straightforward way to compare two models (but certainly not the best!): in a nutshell, if the 95% confidence intervals of the accuracies of two models do not overlap, we can reject the null hypothesis that the performance of both classifiers is equal at a 95% confidence level (or 5% probability). Violations of assumptions aside (for instance, that the test set samples are not independent), as Thomas Dietterich noted based on empirical results in a simulated study [1], this test tends to have a high false positive rate (here: incorrectly detecting a difference when there is none), which is among the reasons why it is not recommended in practice.
Nonetheless, for the sake of completeness, and since it is a commonly used method in practice, the general procedure is outlined below (the outline also generally applies to the different hypothesis tests presented later):
1. formulate the hypothesis to be tested (for instance, the null hypothesis stating that the proportions are the same; consequently, the alternative hypothesis that the proportions are different, if we use a two-tailed test);
2. decide upon a significance threshold (for instance, if the probability of observing a difference more extreme than the one observed is more than 5%, then we plan to reject the null hypothesis);
3. analyze the data, compute the test statistic (here: the z-score), and compare its associated p-value (probability) to the previously determined significance threshold;
4. based on the p-value and significance threshold, either accept or reject the null hypothesis at the given confidence level and interpret the results.
The z-score is computed as the observed difference divided by the square root of their combined variances,

$$z = \frac{ACC_1 - ACC_2}{\sqrt{\sigma_1^2 + \sigma_2^2}},$$

where $ACC_1$ is the accuracy of one model and $ACC_2$ is the accuracy of a second model estimated from the test set. Recall that we computed the variance of the estimated accuracy as

$$\sigma^2 = \frac{ACC(1 - ACC)}{n}$$

in Part I and then computed the confidence interval (normal approximation interval) as $ACC \pm z\sigma$, where $z = 1.96$ for a 95% confidence interval. Comparing the confidence intervals of two accuracy estimates and checking whether they overlap is then analogous to computing the $z$ value for the difference in proportions and comparing the probability (p-value) to the chosen significance threshold. So, to compute the z-score directly for the difference of two proportions, $ACC_1$ and $ACC_2$, we pool these proportions (assuming that $ACC_1$ and $ACC_2$ are the performances of two models estimated on two independent test sets of size $n_1$ and $n_2$, respectively),

$$ACC_{1,2} = \frac{ACC_1 \cdot n_1 + ACC_2 \cdot n_2}{n_1 + n_2},$$

and compute the standard deviation as

$$\sigma_{1,2} = \sqrt{ACC_{1,2}(1 - ACC_{1,2})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)},$$

such that we can compute the z-score,

$$z = \frac{ACC_1 - ACC_2}{\sigma_{1,2}}.$$

Since we use the same test set (and thereby violate the independence assumption), we have $n_1 = n_2 = n$, so that we can simplify the z-score computation to

$$z = \frac{ACC_1 - ACC_2}{\sqrt{2 \cdot ACC_{1,2}(1 - ACC_{1,2}) / n}},$$

where $ACC_{1,2}$ is simply $(ACC_1 + ACC_2) / 2$.
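To make the simplified formula concrete, here is a minimal standard-library sketch (the function name `z_pooled` is made up for this illustration and is not part of mlxtend):

```python
import math

def z_pooled(acc_1, acc_2, n):
    """z-score for two accuracies measured on the same test set of size n."""
    # Pooled proportion: simply the average of the two accuracies when n_1 = n_2
    acc_avg = (acc_1 + acc_2) / 2.0
    # Simplified z-score: (ACC_1 - ACC_2) / sqrt(2 * ACC_avg * (1 - ACC_avg) / n)
    return (acc_1 - acc_2) / math.sqrt(2.0 * acc_avg * (1.0 - acc_avg) / n)

print(z_pooled(0.84, 0.92, 100))  # roughly -1.74
```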
In the second step, based on the computed $z$ value (this assumes that the test errors are independent, which is usually violated in practice as we use the same test set), we can reject the null hypothesis that the pair of models has equal performance (here, measured in "classification accuracy") at an $\alpha = 0.05$ level if $|z|$ is greater than 1.96. Or, if we want to put in the extra work, we can compute the area under the standard normal cumulative distribution at the z-score threshold. If we find that this p-value is smaller than a significance level we set prior to conducting the test, then we can reject the null hypothesis at that given significance level.
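Converting a z-score into a p-value only requires the standard normal CDF, which can be expressed through the error function without any third-party dependency; `norm_cdf` below is a small helper defined just for this sketch:

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the identity Phi(z) = (1 + erf(z / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

z = -1.754  # example z-score from a difference-of-proportions test
p_one_sided = norm_cdf(z)              # area to the left of z (one tail)
p_two_sided = 2.0 * norm_cdf(-abs(z))  # probability of |z| at least this extreme
print(round(p_one_sided, 3), round(p_two_sided, 3))
```

For z = -1.754, the one-sided p-value comes out to about 0.04; note that the two-sided p-value is twice as large.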
The problem with this test, though, is that we use the same test set to compute the accuracy of the two classifiers; thus, it might be better to use a paired test such as a paired-sample t-test, but a more robust alternative is the McNemar test.
References
 [1] Dietterich, Thomas G. "Approximate statistical tests for comparing supervised classification learning algorithms." Neural Computation 10, no. 7 (1998): 1895-1923.
Example 1 - Difference of Proportions
As an example for applying this test, consider the following 2 model predictions:
import numpy as np
## Dataset:
# ground truth labels of the test dataset:
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0])
# predictions by 2 classifiers (`y_model_1` and `y_model_2`):
y_model_1 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0])
y_model_2 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0])
Assume the test accuracies are as follows:
acc_1 = np.sum(y_true == y_model_1) / y_true.shape[0]
acc_2 = np.sum(y_true == y_model_2) / y_true.shape[0]
print('Accuracy Model 1:', acc_1)
print('Accuracy Model 2:', acc_2)
Accuracy Model 1: 0.84
Accuracy Model 2: 0.92
Now, setting a significance threshold of $\alpha = 0.05$ and conducting the test
from mlxtend.evaluate import proportion_difference
z, p_value = proportion_difference(acc_1, acc_2, n_1=y_true.shape[0])
print('z: %.3f' % z)
print('p-value: %.3f' % p_value)
z: -1.754
p-value: 0.040
we find that there is a statistically significant difference between the model performances. It should be highlighted, though, that due to the typical violation of the independence assumption (the same test set is used for both models) as well as the test's high false positive rate, using this test in practice is not recommended.
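As a sanity check, the z-score can be recomputed by hand from the combined-variances form given in the overview, $z = (ACC_1 - ACC_2)/\sqrt{\sigma_1^2 + \sigma_2^2}$ (the name `z_unpooled` is just an illustrative label for this sketch):

```python
import math

def z_unpooled(acc_1, acc_2, n_1, n_2):
    # Variance of each accuracy estimate: ACC * (1 - ACC) / n
    var_1 = acc_1 * (1.0 - acc_1) / n_1
    var_2 = acc_2 * (1.0 - acc_2) / n_2
    # Observed difference over the square root of the combined variances
    return (acc_1 - acc_2) / math.sqrt(var_1 + var_2)

print(round(z_unpooled(0.84, 0.92, 100, 100), 3))  # -1.754
```

The pooled simplification from the overview gives a slightly different value (about -1.74); both variants of the test appear in the literature, and they agree closely when the two accuracies are similar.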
API
proportion_difference(proportion_1, proportion_2, n_1, n_2=None)

Computes the test statistic and p-value for a difference of proportions test.

Parameters

- `proportion_1` : float

    The first proportion

- `proportion_2` : float

    The second proportion

- `n_1` : int

    The sample size of the first test sample

- `n_2` : int or None (default=None)

    The sample size of the second test sample. If `None`, `n_1` = `n_2`.

Returns

- `z, p` : float or None, float

    Returns the z-score and the p-value
Examples
For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/evaluate/proportion_difference/