A lot of things

- Contingency Tables
- Cost Matrix
- best accuracy or $F_1$-score (see Evaluation of Binary Classifiers)
- best True Error of Model
- best Cumulative Gain Chart
- other things

What if we want to compare

- two learning algorithms $A_1$ and $A_2$
- not just two specific classifiers $C_1$ and $C_2$

We want to measure the expected error

- for $R \subset D$ where $D$ is the population, we want to measure
- $E \Big[ \text{error} \big(A_1(R), D \big) - \text{error} \big(A_2(R), D \big) \Big]$
- with $\text{error} \big(A_1(R), D \big)$ being the true error

How to estimate it?

- average result over many different training sets
- ideally all of these sets must be independent from each other
- but usually use K-Fold Cross-Validation

Estimation:

- given a data set $S$
- for $k = 1..K$,
- let $T_k$ be the training set,
- and $R_k = S - T_k$
- train $C_1 = A_1(R_k)$ and $C_2 = A_2(R_k)$
- let $\delta_k = \text{error}(C_1, T_k) - \text{error}(C_2, T_k)$

- let $\delta^* = \cfrac{1}{K} \sum_{k = 1}^K \delta_k$
- this is the estimate of the expected error $E \Big[ \text{error} \big(A_1(R), D \big) - \text{error} \big(A_2(R), D \big) \Big]$

Let's conduct a Statistical Tests of Significance:

- assume (under the null hypothesis) that $A_1$ and $A_2$ have equal expected accuracy
- $t = \cfrac{\delta^*}{\sigma}$ follows the Student distribution with $K-1$ degrees of freedom
- $\delta^* = \cfrac{1}{K} \sum_{k = 1}^K \delta_k$ - the estimate of the expected error
- $\sigma = \sqrt{ \cfrac{1}{K \cdot (K - 1)} \sum_{k = 1}^K (\delta_k - \delta^*)^2 }$

Test:

- the hypothesis $H_A$: $A_1$ and $A_2$ have equal expected error rate
- $H_A$ is accepted with $N = (1 - \alpha)\%$ confidence
- if $| t | < t_{1 - \alpha / 2, K - 1}$
- $t_{1 - \alpha / 2, K - 1}$ is the $1 - \alpha / 2$ percentile of the Student distribution with $K-1$ degrees of freedom

$k$ | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|

$\text{error}(A_1, T_k)$ | 0.12 | 0.15 | 0.14 | 0.16 | 0.11 |

$\text{error}(A_2, T_k)$ | 0.13 | 0.16 | 0.13 | 0.14 | 0.17 |

$\delta_k$ | -0.01 | -0.01 | 0.01 | 0.02 | -0.06 |

$(\delta_k - \delta^*)^2$ with $\delta^* = -0.01$ | 0.0000 | 0.0000 | 0.0004 | 0.0009 | 0.0025 |

$\cfrac{\delta^*}{\sigma} $ | -0.73 |

So,

- $\left| \cfrac{\delta^*}{\sigma} \right| = 0.73$ and $t_{97.5, 4} = 2.77$
- the hypothesis $H_A$ is accepted with 95% confidence

- given two algorithms $A_1$ and $A_2$
- 10 rounds are performed

$k$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|

$A_1$ | 0.305 | 0.322 | 0.207 | 0.206 | 0.31 | 0.41 | 0.277 | 0.26 | 0.215 | 0.26 |

$A_2$ | 0.224 | 0.145 | 0.224 | 0.196 | 0.207 | 0.204 | 0.221 | 0.194 | 0.162 | 0.35 |

So we calculate and obtain the following table

$k$ | $A_1$ | $A_2$ | $\delta_k$ | $(\delta_k - \delta^*)^2$ |
---|---|---|---|---|

1 | 0.305 | 0.224 | 0.081 | 0.00027225 |

2 | 0.322 | 0.145 | 0.177 | 0.01265625 |

3 | 0.207 | 0.224 | -0.017 | 0.00664225 |

4 | 0.206 | 0.196 | 0.01 | 0.00297025 |

5 | 0.31 | 0.207 | 0.103 | 0.00148225 |

6 | 0.41 | 0.204 | 0.206 | 0.02002225 |

7 | 0.277 | 0.221 | 0.056 | 0.00007225 |

8 | 0.26 | 0.194 | 0.066 | 0.00000225 |

9 | 0.215 | 0.162 | 0.053 | 0.00013225 |

10 | 0.26 | 0.35 | -0.09 | 0.02387025 |

$\delta^* = \cfrac{1}{10} \sum \delta_k =$ | 0.0645 | |||

$\sigma^2 = \cfrac{1}{9 \cdot 10} \sum (\delta_k – \delta^*)^2 =$ | 0.0007569167 |

$t = \cfrac{\delta^*}{\sigma^2} \approx 2.34$

- we have 9 degrees of freedom
- for confidence level $\alpha = 0.05$ $\Rightarrow$ need to look for $1 - \alpha/2 = 0.975/%$ percentile
- Calculate the $t$-value: $t_{9, 0.975} = 2.26$
- $t = 2.34 \geqslant 2.26 = t_{9, 0.975}$
- $\Rightarrow$ the hypothesis that $A_1$ and $A_2$ have the same error rate is rejected