Bayesian comparison of cross-validated algorithms

Model selection via a Bayesian correlated t-test.

ML
Model evaluation
Bayesian
Cross-validation
Cost sensitive analysis
Published

November 19, 2022

Robust model evaluation should be second nature for those of us who work with data and predictive models. When determining whether one model is better than another, there are many techniques one can use; very common ones include bootstrapping and cross-validation.

First and foremost, however, one must decide what “better” means. Better in what way? More accurate? Faster? More (budget) efficient? We will revisit this later. For now, let’s assume that better means “obtains a higher F1 score”. F1 is just an example, and we could consider any other score. Once we have decided that the model with the highest F1 score is better, the more serious question arises: how do we know whether a model is indeed better? Can we just take an average score across a test set and compare the numbers, i.e. plain-old pointwise comparison? I think we can agree that we would like something more robust which, ideally, takes uncertainty into account.

We ideally want a probability distribution which tells us, in some way, the probability of one model being better than the other. That is, we want something like \(P(M_1 > M_2 | D)\), where \(M_1 > M_2\) denotes that model \(M_1\) is better than model \(M_2\) and \(D\) is the data used for the comparison. In fact, we may want to do more than that: ideally, we would like a distribution over the true performance difference of the two algorithms. What a great motivation to enter the wonderful world of Bayesian modelling1.

  • 1 See, e.g. Kruschke (2013), for a nice, rapid tour of Bayesian modelling for group comparisons.

    Problem statement

    Assume that we are interested in comparing the performance (e.g. accuracy, recall, precision, \(\text{F}_{\beta}\), etc.) of two predictors. If using cross-validation, Bouckaert (2003) recommends making this more robust (i.e. returning more stable statistics) by using repeated cross-validation; that is, performing the evaluation via \(m\) repetitions of \(k\)-fold cross-validation. Both classifiers will be trained on the same data and tested on the same data, and we can therefore collect the differences of their performances

    \[ \delta = \{\delta_{1}, \delta_{2}, \ldots, \delta_{n}\} \]

    where \(n=mk\). Denote the sample mean and sample variance of the differences as \(\hat{\delta}\) and \(\hat{\sigma}^{2}\).

    What can we say about these samples? Well, one thing is that they are almost certainly noisy observations of an underlying true difference in performance between the two algorithms. That is, there exists a hidden performance difference which we don’t know, but which manifests itself, with noise, through these samples. For instance, if we assume that the \(\delta_{i}\) are i.i.d., then we could say that the \(\delta_{i}\) are samples from a distribution with a true underlying mean, which would be the mean of interest to us. We could then further assign a prior distribution to that true mean (and any other parameters in the specification of the sampling distribution) and, after observing some samples, perform inference to construct the posterior distribution over the true mean, given the observed data. This is the standard paradigm of Bayesian modelling.

    There are infinitely many ways to model such a process, but we can follow the aphorism, gifted to us by George Box, that none of them will be correct, but some of them may be useful. In particular, the assumptions that we bake into the model will make it more or less useful. Oversimplifying assumptions would lead to a model that cannot describe the underlying process; overly complex interactions would lead to a model that is intractable. Of course, these days, with powerful probabilistic programming languages, it is much easier to build a complex model and have a black-box inference algorithm give us results. This, however, isn’t a free lunch. Complex models, especially hierarchical ones, lead to complex inference processes that require expertise to diagnose. Moreover, the choice of priors also becomes a delicate issue.

    In general, we need to balance two things:

    1. the modelling assumptions,
    2. the techniques to perform inference, i.e. the computational techniques.

    In the case that we are considering here, i.e. comparing models across cross-validation scores, one important modelling aspect is that the scores are not independent: the models from different folds are trained on overlapping training sets and, across repetitions, evaluated on overlapping test sets.

    In this post, we will be presenting a model that was developed in Corani and Benavoli (2015). This is a relatively simple model which uses the properties of exponential families and conjugacy to simplify inference. Note that this is just one possible model of the process, and it is by no means the best model!

    Corani and Benavoli’s Bayesian correlated t-test

    The model proposed by Corani and Benavoli (2015) is as follows. Assume that the observations \(\delta_i\) are identically distributed but dependent. Specifically, let the observations have the same mean, \(\mu\), the same precision, \(\nu\), and be equally correlated with each other with correlation \(\rho > 0\). This is the case when the \(n\) observations are the \(n\) differences in performance between two predictors yielded by cross-validation. The data generation process is then modelled as:

    \[ \delta = I_{n \times 1}\mu + v \]

    where \(I_{n \times 1}\) is a vector of ones and \(v\) is a noise vector with zero mean and covariance matrix \(\Sigma_{n \times n}\) with the following structure: each diagonal element equals \(\sigma^{2} = 1/\nu\); the off-diagonal elements equal \(\rho \sigma^{2}\). This is known as the intraclass (compound-symmetry) covariance structure.

    Define \(\Sigma = \sigma^{2} M\) with \(M\) an \((n \times n)\) correlation matrix, e.g. for \(n = 3\)

    \[ M = \left[ {\begin{array}{ccc} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \\ \end{array} } \right] \]

    Nadeau and Bengio (2003) show that the correlation between the cross-validation results is positive and therefore, since \(\sigma^{2} > 0\), \(\Sigma\) is invertible and positive definite.
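    As a quick illustrative check (the values of \(n\), \(\rho\) and \(\sigma^{2}\) below are arbitrary), we can construct \(\Sigma\) with numpy and confirm that all of its eigenvalues are positive:

    import numpy as np

    n, rho, sigma2 = 5, 0.2, 1.3  # arbitrary example values
    M = (1 - rho) * np.eye(n) + rho * np.ones((n, n))  # intraclass correlation matrix
    Sigma = sigma2 * M
    print(np.linalg.eigvalsh(Sigma))  # all eigenvalues positive => Sigma is positive definite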

    With this we have the basic ingredient describing how we want to model the interactions between the samples. However, we still need to define a generative process: a prior over \(\mu\) (the quantity of interest) and a likelihood (i.e. a sampling distribution). For the sampling distribution of the correlated observations, assume the noise vector \(v\) is distributed as a multivariate Normal, so that:

    \[ P(\delta| \mu, \Sigma) = \frac{ \exp(-\frac{1}{2}(\delta - I\mu)^{T}\Sigma^{-1}(\delta - I\mu))} {(2\pi)^{n/2}\sqrt{|\Sigma|}} \]

    We ultimately want a distribution over \(\mu\). We could go about this by defining priors over the three parameters, \(\mu\), \(\sigma^{2}\) and \(\rho\), and inferring the posterior over these. The approach taken by the authors here is simpler and, importantly, allows us to get the posterior in closed form. This is achieved by avoiding the need to estimate \(\rho\) through use of the Nadeau-Bengio heuristic estimate for the correlation.

    Specifically, Nadeau and Bengio propose taking \(\rho = \frac{n_{\text{test}}}{n_{\text{total}}}\), where \(n_{\text{test}}\) is the size of the test set and \(n_{\text{total}} = n_{\text{test}} + n_{\text{train}}\) is the total dataset size.
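    For standard \(k\)-fold cross-validation each test fold contains roughly \(n_{\text{total}}/k\) instances, so the heuristic reduces to

    \[ \rho = \frac{n_{\text{test}}}{n_{\text{total}}} = \frac{n_{\text{total}}/k}{n_{\text{total}}} = \frac{1}{k} \]

    e.g. \(\rho = 0.2\) for \(5\)-fold cross-validation; this is exactly what the compute_statistics helper further below uses.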

    See Nadeau and Bengio (2003) sections 3 and 5 for details. The main points and heuristics behind this choice for the correlation factor are:

    • The real \(\rho\) is unknown; in fact, it is usually just set to \(0\), meaning that the correlation is not accounted for at all, which can lead to underestimating the variance. In the paper the authors propose \(\rho=\frac{n_{\text{test}}}{n_{\text{total}}}\). As the authors themselves state, this choice is a gross approximation, yet it is better than pretending that \(\rho=0\).

    • The choice is made with an important assumption – that the specific training instances don’t matter, just the training set size. This is a very strong assumption and one that isn’t generically justified.

    • This correction works well for stable algorithms (cf. the point above), i.e. ones that are not too sensitive to perturbations of the training set, or for algorithms with low capacity compared to the training set size.

    • The idea is that the correlation should reduce as more training data is used. More training data should stabilise the algorithm. That is, as the data size increases the model should approach saturation and, therefore, as we keep adding data the resulting decision function shouldn’t change too much.

    Okay, so now we’re in business! With \(\rho\) fixed we can infer the posterior over \(\mu\).

    Choose the joint prior for the mean and precision parameters of the Normal distribution to be

    \[P(\mu, \nu | \mu_{0}, k_{0}, a, b) = NG(\mu, \nu; \mu_{0}, k_{0}, a, b) \]

    where \(NG\) is the standard Normal-Gamma distribution. This is a conjugate prior to the Normal likelihood, which makes inference easier.

    Because of the conjugacy, the posterior over \(\{\mu, \nu\}\) will also be a Normal-Gamma, i.e.

    \[ P(\mu, \nu| \delta, \mu_{0}, k_{0}, a, b, \rho) = NG(\mu, \nu; \tilde{\mu}_{n}, \tilde{k}_{n}, \tilde{a}_{n}, \tilde{b}_{n}) \]

    We are interested in the distribution over \(\mu\). In order to obtain that, we have to marginalise over \(\nu\). Doing so results in a Student’s t-distribution for \(\mu\) (Corani and Benavoli 2015)

    \[ P(\mu|\delta, \mu_{0}, k_{0}, a, b, \rho) = St(\mu; 2\tilde{a}_{n}, \tilde{\mu}_{n}, \frac{\tilde{b}_{n}\tilde{k}_{n}}{\tilde{a}_{n}}) \]

    The expressions for the parameters of the Student distribution are a little unwieldy; however, if we use what is called a matching prior2, they simplify. We use the matching prior given by \(\mu_{0} = 0\), \(k_{0} \rightarrow \infty\), \(a = -1/2\) and \(b = 0\) to get:

  • 2 A matching prior is a prior for which posterior probability statements about the parameter also have an interpretation as confidence statements in the sampling model; i.e. the posterior will return properties that match the frequentist’s analysis.

    \[ St(\mu; n - 1, \hat{\delta}, (\frac{1}{n} + \frac{\rho}{1 - \rho})\hat{\sigma}^{2}) \]

    where \(\hat{\delta} = \frac{1}{n}\sum_{i = 1}^{n} \delta_{i}\) and \(\hat{\sigma}^{2} = \frac{\sum_{i = 1}^{n}(\delta_{i} - \hat{\delta})^{2}}{n - 1}\) are the sample mean and variance defined earlier.

    Time to code this up!

    from time import perf_counter
    import matplotlib.pyplot as plt
    import numpy as np
    from scipy import stats
    from sklearn.datasets import make_moons
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.neural_network import MLPClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score

    To see this in action, the first thing we need is some data and repeated cross-validation predictions for two different algorithms.

    We use sklearn to make a moons dataset:

    X, y = make_moons(n_samples=600, noise=0.4, random_state=23)

    This gives an equal number of samples in each class.
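    A quick sanity check (for 600 samples, make_moons splits the labels evenly across the two classes):

    print(np.unique(y, return_counts=True))
    # expect: (array([0, 1]), array([300, 300]))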

    Now that we have the data, let’s run the repeated cross-validation for the two models. Here we use a Random Forest and a neural network (an MLPClassifier):

    n_repetitions = 10
    n_folds = 5
    
    RSKF = RepeatedStratifiedKFold(
      n_splits=n_folds, n_repeats=n_repetitions, random_state=23
    )
    
    model_1 = "RF"
    model_2 = "MLP"
    models = {
      model_1: RandomForestClassifier(),
      model_2: MLPClassifier(alpha=1, max_iter=1_000)
    }
    
    model_scores = {}
    model_times = {}
    
    # * loop over the models
    for model_name, model in models.items():
      times_fit = []
      scores = []
      # * loop over the repeated-cv folds
      for train_indices, test_indices in RSKF.split(X, y):
        X_train, X_test = X[train_indices], X[test_indices]
        y_train, y_test = y[train_indices], y[test_indices]
        
        # * fit model and time fitting process
        time_start = perf_counter()
        clf = model.fit(X_train, y_train)
        time_end = perf_counter()
        time_fit = time_end - time_start
    
        # * get model prediction
        y_pred = clf.predict(X_test)
    
        # * compute f1 score
        f1 = f1_score(y_test, y_pred)
    
        times_fit.append(time_fit)
        scores.append(f1)
      
      model_scores[model_name] = np.asarray(scores)
      model_times[model_name] = {"fit": times_fit}

    Now that we have the scores for each model, let’s perform the test. We’ll first need to compute the array of differences, \(\delta\), and then we’re in business.

    # * create array of differences; we do model_2 (MLP) - model_1 (RF)
    delta = model_scores[model_2] - model_scores[model_1]
    Code for compute_statistics
    def compute_statistics(
      perf_differences, n_repetitions,
      ):
      """
      Given the m*k length array holding the score differences for the
      m repetitions of k folds, compute the following statistics: the mean,
      the Nadeau-Bengio corrected variance and the number of degrees of
      freedom for the t-distribution.
      """
      mean = np.mean(perf_differences)
      variance = np.var(perf_differences, ddof=1)
      # * Now account for the correlations across measurements with the
      # * Nadeau-Bengio correction of the variance; the heuristic correlation
      # * is rho = n_test / n_total, which for m repetitions of k-fold CV
      # * equals m / (m * k) = 1 / k.
      num_of_measurements = perf_differences.size
      correlation = n_repetitions / num_of_measurements
      variance *= 1 / num_of_measurements + correlation / (1 - correlation)
      return mean.item(), variance.item(), num_of_measurements - 1
    # * obtain the relevant statistics
    mean, variance, dof = compute_statistics(
      perf_differences=delta, n_repetitions=n_repetitions
    )
    t-distribution statistics:
    mean 0.0031, 
    variance 0.00012, 
    degrees of freedom 49.

    A useful thing we can do is to select a region of practical equivalence, or ROPE. This is a region where the difference in performance can be considered practically equivalent, i.e. a difference lying within the ROPE is an inconsequential difference. Clearly, the choice of ROPE is subjective and will depend on the metric and the scale we use to compare the algorithms, in addition to our understanding of equivalence in the given situation. See Kruschke (2013) and Benavoli et al. (2017) for more details.

    Here we will say that a difference of less than 1% in performance between the two models makes them practically equivalent.

    Code for get_posteriors_from_t_distribution
    def get_posteriors_from_t_distribution(
      mean, variance, dof, rope=(0.0, 0.0), precision=4
    ):
      """Compute and return probability mass to the left of the given rope, 
      within the given rope and to the right of the given rope for a 
      t-distribution specified by the given mean, variance and degrees 
      of freedom. 
      NB: probabilities are computed from the cumulative Student distribution, not 
      from a sampled posterior.
      """
      # * Deal with the situation where the variance is very small by assigning entire 
      # * probability mass to the appropriate regions
      if np.isclose(variance, 0.0):
        prob_left = float(mean < rope[0])
        prob_right = float(mean > rope[1])
      # * Otherwise compute the probability for the specified t-distribution.
      else: 
        std = np.sqrt(variance)
        prob_left = stats.t.cdf(rope[0], dof, loc=mean, scale=std)
        prob_right = 1 - stats.t.cdf(rope[1], dof, loc=mean, scale=std)
      prob_centre = 1 - prob_left - prob_right
      return [round(p, precision) for p in [prob_left, prob_centre, prob_right]]
    # * Select a Region of practical equivalence:
    ROPE = (-0.01, 0.01)
    
    prob_model_1_better, prob_rope, prob_model_2_better = get_posteriors_from_t_distribution(
      mean=mean, variance=variance, dof=dof, rope=ROPE
    )
    P(RF > MLP) = 0.1183, 
    P(RF ~ MLP) = 0.6162, 
    P(RF < MLP) = 0.2655

    Even nicer, since the matching prior returns a posterior that is an explicit t-distribution, we can look at the whole distribution:
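    Here is a minimal plotting sketch (reusing the mean, variance, dof and ROPE variables from above) for visualising the posterior over \(\mu\) together with the ROPE:

    # * plot the posterior t-distribution over mu and shade the ROPE
    std = np.sqrt(variance)
    mu_grid = np.linspace(mean - 4 * std, mean + 4 * std, 500)
    density = stats.t.pdf(mu_grid, dof, loc=mean, scale=std)

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(mu_grid, density, label=r"posterior over $\mu$")
    ax.axvspan(ROPE[0], ROPE[1], alpha=0.2, color="grey", label="ROPE")
    ax.set_xlabel("difference in F1 (MLP - RF)")
    ax.set_ylabel("density")
    ax.legend()
    plt.show()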

    The posterior distribution tells us how credible each possible value of the true difference is, so it directly captures the uncertainty in the estimate. From it we can extract a whole lot of useful information, for instance the Highest Density Intervals (HDIs), the mode, the mean, etc. (see the sketch after the list below). Furthermore, equipped with the posterior distribution and the region of practical equivalence we can:

    1. Estimate the posterior probability of a reasonable null hypothesis, i.e. if the difference in performance is within a couple of percentage points the models may well be considered equivalent. This is given by the area within the ROPE, above denoted by \(P(RF \sim MLP | D)\).
    2. Estimate the posterior probability that one model is better than the other, i.e. \(P(RF > MLP | D)\) and \(P(RF < MLP | D)\). These will be given by the areas on either side of the ROPE.
    3. Represent effect size and uncertainty.
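    For example, because this posterior is a symmetric, unimodal t-distribution, the 95% HDI coincides with the central 95% credible interval and can be read off directly (a small sketch using the statistics computed above):

    # * 95% credible interval for mu; for a symmetric t-distribution this is also the 95% HDI
    hdi_low, hdi_high = stats.t.interval(0.95, dof, loc=mean, scale=np.sqrt(variance))
    print(f"95% HDI: ({hdi_low:.4f}, {hdi_high:.4f})")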

    Cost-sensitive decisions

    While we will be dealing with this in more detail in a separate blog post, let’s take a first stab.

    In the real world, it is usually the case that we want to reason and make decisions based on the concept of cost. The choice of the cost measure should depend on how the system is going to be used, rather than on any inherent specification of the training process. The issue with doing this is that it is hard: specifying a cost-aware loss function is non-trivial, cost specifications are domain specific, and, even when we roughly know what the costs are, using this information is difficult, i.e. the specified weighted cost may be a hard objective for optimisers to work with. However, if we can specify costs, then making decisions based on them is the best way to work, as it allows us to take into account the utility of the decision that will be made.

    In our situation, in order to do cost-sensitive analysis/decision making, all we need is to specify a cost function – we won’t need to run an optimisation on this objective function. This is a function that defines the loss (or cost) we incur in making a given decision (e.g. the wrong decision). A typical example is whether to give more importance to a false positive or a false negative. For our given situation, and for the sake of exposition, let’s assume that we are interested in the time taken to fit the model, as we will need to do it often. (Also because, in our example, the models actually take the same time to predict.)
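    The median fit times can be computed from the model_times dictionary we collected earlier (a minimal sketch); we find that:

    median_fit_rf = np.median(model_times[model_1]["fit"])
    median_fit_mlp = np.median(model_times[model_2]["fit"])
    print(f"Median time to fit RF is {median_fit_rf:.4f} s")
    print(f"Median time to fit MLP is {median_fit_mlp:.4f} s")
    print(f"Ratio MLP / RF is {median_fit_mlp / median_fit_rf:.1f}")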

    Median time to fit RF is 0.1127 s, 
    Median time to fit MLP is 2.5925 s, 
    Ratio MLP / RF is 23.0

    That is, MLP is a lot slower to fit than RF. There are three possible states of the world:

    1. RF is better than MLP
    2. RF is equivalent to MLP
    3. RF is worse than MLP

    We consider the following cost-matrix:

    |            | RF is better | They are equivalent | MLP is better |
    |------------|--------------|---------------------|---------------|
    | Choose RF  | 0            | -5                  | 2             |
    | Choose MLP | 7            | 5                   | 0             |

    where the \((i, j)\)th entry is the cost incurred by making decision \(i\) when \(j\) is the true state. Here we have a \(2 \times 3\) matrix as we only consider the options of selecting one model or the other; there is no abstention or any other option.

    In this case we have:

    • cost of choosing the Random Forest is:
      • 0 if it is better,
      • -5 if they are equivalent as we save compute time,
      • 2 if MLP is better, as we would lose performance
    • cost of choosing the MLPClassifier is:
      • 7 if RF is better because we pay for computational and performance cost,
      • 5 if they are equivalent as we add to the compute time,
      • 0 if MLP is indeed better

    The expected cost can then be obtained by multiplying the cost matrix with the relevant posterior probabilities. In this case, the relevant probabilities are \(P(RF > MLP | D)\), \(P(RF \sim MLP | D)\) and \(P(RF < MLP | D)\).

    cost_matrix = np.array([
      [0., -5., 2.],
      [7., 5., 0.],
    ])
    
    probabilities = np.array([
      prob_model_1_better, 
      prob_rope, 
      prob_model_2_better
    ])
    
    expected_cost = cost_matrix.dot(probabilities)
    
    Cost of deciding on RF: -2.55
    
    Cost of deciding on MLP: 3.9091

    The decision with the lowest expected cost is the optimal one. We see that we would incur a significantly higher expected cost by choosing MLP over RF. Of course, we could have made things even more extreme by specifying more aggressive costs. Further, we could have extended the set of possible decisions, e.g. by adding a row for “no decision made”, as sketched below.
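    As a sketch of that last idea, here is a hypothetical extension with a third “abstain” row; the abstention costs below are purely illustrative, and the decision rule is unchanged: pick the row with the lowest expected cost.

    # * hypothetical extension: an "abstain / gather more data" option with made-up, illustrative costs
    cost_matrix_extended = np.array([
      [0., -5., 2.],   # choose RF
      [7., 5., 0.],    # choose MLP
      [1., 1., 1.],    # abstain (illustrative cost of postponing the decision)
    ])

    expected_cost_extended = cost_matrix_extended.dot(probabilities)
    decisions = ["Choose RF", "Choose MLP", "Abstain"]
    best_decision = decisions[np.argmin(expected_cost_extended)]
    print(expected_cost_extended, best_decision)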

    References

    Benavoli, A, G Corani, J Demšar, and M Zaffalon. 2017. “Time for a Change: A Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis.” Journal of Machine Learning Research 18 (77): 1–36.
    Bouckaert, RR. 2003. “Choosing Between Two Learning Algorithms Based on Calibrated Tests.” In ICML, 3:51–58.
    Corani, G, and A Benavoli. 2015. “A Bayesian Approach for Comparing Cross-Validated Algorithms on Multiple Data Sets.” Machine Learning 100 (2): 285–304.
    Kruschke, JK. 2013. “Bayesian Estimation Supersedes the t Test.” Journal of Experimental Psychology: General 142 (2): 573–603.
    Nadeau, C, and Y Bengio. 2003. “Inference for the Generalization Error.” Machine Learning 52: 239–81.
