Everything you wanted to know about R2 but were afraid to ask

Part 1: Using a fundamental decomposition to gain insights into predictive model performance

Authors: Michel Friesenhahn, Christina Rabe, Courtney Schiffman

Published: October 26, 2023
Modified: October 9, 2024

There are two steps to state-of-the-art clinical prediction model performance evaluation. The first step, often referred to as overall model performance evaluation, involves interrogating the general quality of predictions in terms of how close they are to observed outcomes. Overall model performance is assessed via three general concepts that are widely applicable: accuracy, discrimination and calibration (, ).

There is some confusion around the relationship between accuracy, discrimination and calibration and what the best approach is to evaluate these concepts. This post aims to provide some clarity via a remarkable decomposition of R2 that beautifully unifies these three concepts. While similar decompositions have existed for decades in meteorological forecasting (), we wish to call greater attention to this underutilized evaluation tool and provide additional insights based on our own scaled version of the decomposition. Part 1 of this post provides details around which existing and newly proposed performance metrics to use for an overall assessment of prediction model quality. We will reveal how the suggested metrics are affected by transformations of the predicted values, explain the difference between R2 and r2, demonstrate that R2 can indeed be negative, and give practical guidance on how to compare the metrics to gain information about model performance. Part 2 of the post explains how the metrics and decompositions discussed here are applicable to binary outcomes as well and can also be generalized beyond the familiar squared error loss.

While accuracy, calibration, and discrimination provide important insights into the overall quality of the predictions, they are not direct metrics of the model’s fitness for use in a particular application. Therefore, for a complete performance evaluation, the second step is to evaluate utility using tailored metrics and visualizations. For example, if a model’s predictions are intended to be used as a covariate for a trial’s adjusted primary endpoint analysis, the effective sample size increase metric evaluates the expected precision gains from the adjustment (). Another example is net benefit decision curves, an elegant decision-analytic metric for assessing the utility in clinical practice of prognostic and diagnostic prediction models (; ). The focus of this post, however, will be only on the first step of general model performance evaluation, and further discussion around assessing a model’s clinical utility is beyond its scope.

This post focuses mostly on defining and characterizing population-level performance metrics; how to estimate these metrics from data is covered in a later section.

Accuracy

The central desirable property of a prediction model is accuracy. An accurate model is one for which predicted outcomes Y^ are close to observed outcomes Y, as measured with a scoring rule. A scoring rule is a function of y^ and y that assigns a real-valued score s(y^, y). Without loss of generality, we follow the convention of using scoring rules with negative orientation, meaning larger scores represent greater discrepancies between y^ and y. One such example is the quadratic loss (or squared error) scoring rule, s(y^, y) = (y^ − y)^2. Quadratic loss is one of the most common scoring rules because of its simplicity and interpretability, and because it is a strictly proper scoring rule (see Part 2 of this post for more details on proper scoring rules) ().

The mean squared error (MSE) of a prediction model is its expected quadratic loss,

$$\mathrm{MSE} = E\big[(Y - \hat{Y})^2\big] \tag{1}$$

We assume that the evaluation of prediction models is done on an independent test data set not used in the development of the model. Therefore, the expectation in Equation 1 would be taken over an independent test set and the error would be referred to as generalization error, as opposed to in-sample error (). We make a quick note here that it is often convenient to work with R2 = 1 − MSE/Var(Y), a scaled version of MSE. R2 will be revisited later on in this post.
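
For concreteness, here is a minimal sketch of these two quantities in R; the vectors y (observed outcomes) and y_hat (predictions on an independent test set) are hypothetical placeholders, not objects defined elsewhere in this post:

# Minimal sketch: accuracy metrics on a hypothetical independent test set,
# with y = observed outcomes and y_hat = predicted outcomes
mse <- mean((y - y_hat)^2)
R2 <- 1 - mse / var(y)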

Figure 1 is a visual aid for understanding MSE as a metric of accuracy and what performance qualities affect it. Predicted outcomes Y^ are shown on the x-axis vs. observed outcomes Y on the y-axis for three different models (panels (a)-(c)). Red vertical lines show the distance between Y and Y^. MSE is the average of the squared lengths of these red lines.

Figure 1: Visualizing the Effects of Calibration and Discrimination on Mean Squared Error

For Model 1 in panel (a) of Figure 1, data points are relatively close to the trend line which in this case is the identity line (Y=Y^), so there is little difference between predicted and observed outcomes. The closer data points are to the identity line, the smaller the average length of the red lines and the smaller the MSE.

What would increase the distance between Y and Y^? For Model 2 in Figure 1 (b), data points still have little variability around the trend line, but now the trend line is no longer equal to the identity line. The predictions from Model 2 are systematically biased. This increases the MSE, illustrated by the increase in the average length of the red lines for Model 2 compared to Model 1.

Like Model 1, the trend line for Model 3 in Figure 1 (c) is the identity line and predictions are not systematically biased. However, there is more variability around the trend line. In other words, the predictions from Model 3 explain less of the variability in the outcome. The average length of the red lines increases as less of the outcome variability is explained by the predictions.

Figure 1 illustrates how MSE is yet another quantity that is impacted by the two concepts of bias and variance. In the context of predictive modeling, these two concepts are referred to as calibration and discrimination, respectively, and will be formally defined in the upcoming sections. Later we will derive a mathematical decomposition of accuracy into these two components to obtain a simple quantitative relationship between accuracy, calibration, and discrimination.

Calibration

Calibration is the assessment of systematic bias in a model’s predictions. The average of the observed outcomes for data points whose predictions are all y^ should be close to y^. If that is not the case, there is systematic bias in the model’s predictions and it is miscalibrated. Visually, when plotting predicted outcomes on the x-axis vs. observed outcomes on the y-axis, the model is calibrated if points are centered around the identity line Y=Y^, as they are in Figure 2 panel (b).

Figure 2: Example of two models, one calibrated and one miscalibrated, with the same discrimination

More formally, define the calibration curve to be the function C(y^) = E[Y | Y^ = y^]. A model is calibrated if and only if C(Y^) = E[Y | Y^] = Y^. As illustrated in Figure 3, miscalibration is numerically defined to be the expected squared difference between the predictions and the calibration curve, E[(C(Y^) − Y^)^2]. Using the calibration curve, we define the miscalibration index (MI) as miscalibration normalized by the variance of Y. Clearly MI ≥ 0, and MI = 0 if and only if the model is perfectly calibrated. A model can be recalibrated by transforming its predictions via the calibration curve C(Y^) (proof in the Appendix), in other words by applying the function C(Y^) = E[Y | Y^] to the model’s predictions.

Miscalibration Index (MI)

The Miscalibration Index (MI) is the MSE between C(Y^) and Y^ normalized by Var(Y)

$$\mathrm{MI} = \frac{E\big[(C(\hat{Y}) - \hat{Y})^2\big]}{\mathrm{Var}(Y)} \tag{2}$$

MI ≥ 0, and MI = 0 if and only if the model is perfectly calibrated.

Figure 3: The Miscalibration Index (MI) is the MSE between C(Y^) and Y^, scaled by the variance of Y
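
To make the definition concrete, here is a minimal simulated sketch in R. The data-generating process is purely hypothetical: the predictions are twice as spread out as they should be, so the true calibration curve is C(y^) = 0.5 y^ and MI follows directly from the definition.

# Hypothetical simulation: true calibration curve is C(y_hat) = 0.5 * y_hat,
# i.e. the predictions are systematically too extreme by a factor of two
set.seed(1)
y_hat <- rnorm(1e5, mean = 0, sd = 2)     # predictions
y <- 0.5 * y_hat + rnorm(1e5, sd = 1)     # outcomes, so E[Y | Y_hat] = 0.5 * Y_hat

# MI = E[(C(Y_hat) - Y_hat)^2] / Var(Y), using the known calibration curve
MI <- mean((0.5 * y_hat - y_hat)^2) / var(y)
MI                                        # close to 0.5 in this example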

A growing number of researchers are calling for calibration evaluation to become a standard part of model performance evaluation (). This is because for many use cases calibration is an important component of prediction quality that impacts utility. For example, calibration is very important when the prediction model is intended to be used in medical practice to manage patients. Since decisions in such scenarios are based on absolute risks, systematic deviations from true risks can lead to suboptimal care. While the importance of calibration varies depending on the use case, it is simple enough to examine that it should be a routine part of model performance evaluation ().

As demonstrated in Figure 1 (c), while calibration is important, it does not assess the spread of the expected outcome values. For example, a constant prediction model Y^=E(Y) is calibrated (since E[Y|Y^=E(Y)]=E(Y)), but it has zero variance of the expected outcome values and thus no ability to discriminate between observations.

Discrimination

The two models shown in Figure 4 are both calibrated, but the model in panel (b) has better discrimination than the model in panel (a). Compared to the model in panel (a), the model in panel (b) explains the true outcome Y with less spread around the calibration curve C(Y^), which in this instance is the identity line. For a given outcome distribution Y, less spread around C(Y^) translates into a larger variation in C(Y^). Thus there are two equivalent ways to view discrimination, either as the amount of explained variability in the outcome or as the variance of the calibration curve.

Figure 4: Visualizing the effect of discrimination on MSE

To formalize this concept, define discrimination as the variance of the calibration curve, Var(C(Y^)). We then define the Discrimination Index (DI) as discrimination scaled by Var(Y).

Discrimination Index (DI)

The Discrimination Index (DI) is the variance of the calibration curve C(Y^) scaled by Var(Y)

$$\mathrm{DI} = \frac{\mathrm{Var}\big(C(\hat{Y})\big)}{\mathrm{Var}(Y)} \tag{3}$$

$$0 \le \mathrm{DI} \le 1$$

The more variability in the observed outcomes explained by the predictions, the larger the variance in C(Y^) and the larger the DI. For example, the model in Figure 4 panel (b) has a larger DI than the model in panel (a). In general, DI is a non-negative number that can be at most 1 (proof in Appendix). If a model’s predictions explain virtually all of the variability in the observed outcomes, then DI ≈ 1. Unlike MI, DI is unchanged by calibrating the model (proof in Appendix).

A Fundamental Decomposition of R2

Now that calibration and discrimination are well-defined, we return to model accuracy. Recall that MSE, the expected quadratic loss over an independent test set, is an accuracy metric. We noted earlier that MSE depends on the scale of Y. It is therefore sensible to normalize the MSE by the total variance, Var(Y). Finally, subtracting from 1 gives another well-known accuracy metric, R2:

$$R^2 = 1 - \frac{\mathrm{MSE}}{\mathrm{Var}(Y)} \tag{4}$$

Equation 5 shows that MSE can be decomposed into the total variation of the observed outcomes, the variance of the calibration curve, and the mean squared error between the calibration curve and predictions (see proof in the Appendix):

$$\mathrm{MSE} = \mathrm{Var}(Y) - \mathrm{Var}\big(C(\hat{Y})\big) + E\big[(C(\hat{Y}) - \hat{Y})^2\big] \tag{5}$$

Replacing the numerator in Equation 4 with the MSE decomposition from Equation 5, it can be easily shown how R2 is affected by miscalibration and discrimination.

The Fundamental Decomposition of R2

$$R^2 = \mathrm{DI} - \mathrm{MI} \tag{6}$$

This wonderfully simple decomposition of R2 into the difference between DI and MI disentangles the concepts of accuracy, calibration and discrimination. The decomposition reveals that R2 holds valuable information on model performance.

Some very useful observations follow directly from this decomposition (a short numerical check in R follows the list):

  • Since MI ≥ 0 and 0 ≤ DI ≤ 1, R2 ≤ 1. In fact, R2 can be negative due to miscalibration!
  • Calibrating a model or increasing its discriminatory power increases accuracy.
  • If R2<DI, accuracy can be improved by calibrating the model. R2 will equal DI for the calibrated model.
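
As a quick numerical check of the decomposition, we can revisit the hypothetical simulated example from the calibration section, where the true calibration curve C(y^) = 0.5 y^ is known exactly:

# Numerical check of R2 = DI - MI on hypothetical simulated data
# with known calibration curve C(y_hat) = 0.5 * y_hat
set.seed(1)
y_hat <- rnorm(1e5, mean = 0, sd = 2)
y <- 0.5 * y_hat + rnorm(1e5, sd = 1)

R2 <- 1 - mean((y - y_hat)^2) / var(y)         # accuracy
DI <- var(0.5 * y_hat) / var(y)                # discrimination
MI <- mean((0.5 * y_hat - y_hat)^2) / var(y)   # miscalibration
c(R2 = R2, DI_minus_MI = DI - MI)              # agree up to simulation error

In this example DI is about 0.5 but MI is also about 0.5, so R2 is approximately 0: despite reasonable discrimination, the systematic bias wipes out the accuracy, and a slightly more extreme bias would push R2 below zero.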

Figure 5: Visualizing the effects of calibration and discrimination on overall accuracy

Figure 5 demonstrates four examples of the fundamental decomposition. The models in the bottom row of Figure 5 have imperfect calibration (MI > 0), and thus accuracy can be improved via calibration. The top row of Figure 5 is the result of calibrating the bottom row models; DI remains the same but MI = 0 after calibration and thus R2 increases. Although both models in the top row of Figure 5 have perfect calibration, the model on the right has better discrimination and thus greater R2.

Similar concepts and decompositions have been discussed since the 70’s and 80’s, in many cases developed with a focus on binary outcomes (; ; ). Part of the appeal of these decompositions, including our proposed Fundamental Decomposition, is that they apply to both continuous and binary outcomes, assuming that binary outcomes are coded with values in {0,1}. In historical and recent literature, the terms reliability, resolution and uncertainty are sometimes used instead of calibration, discrimination and variance of the observed outcome.

Linear Recalibration

We have already introduced recalibration via the calibration curve C(Y^), but there is also linear recalibration. Linear transformations are easier to estimate than unconstrained calibration curves, and may be more appropriate for smaller sample sizes.

As illustrated in Figure 6, the linear transformation of predictions that maximizes R2 is referred to as the calibration line:

$L(\hat{Y}) = \alpha_{\mathrm{opt}} + \beta_{\mathrm{opt}}\hat{Y}$ is referred to as the calibration line, where

$$(\alpha_{\mathrm{opt}}, \beta_{\mathrm{opt}}) = \underset{\alpha,\,\beta}{\arg\min}\; E\big[(Y - (\alpha + \beta\hat{Y}))^2\big] \tag{7}$$

$$\beta_{\mathrm{opt}} = \frac{\mathrm{Cov}(Y, \hat{Y})}{\mathrm{Var}(\hat{Y})}, \qquad \alpha_{\mathrm{opt}} = E[Y] - \beta_{\mathrm{opt}}\,E[\hat{Y}]$$

L(Y^) is the linear transformation of the predictions that maximizes R2.

Figure 6: The calibration line and calibration curve.

Not surprisingly, αopt and βopt are the familiar population least squares linear regression coefficients for a linear model with an intercept and single predictor (proof in Appendix). αopt and βopt are also referred to as the calibration intercept and slope, respectively.

When predictions are transformed via the calibration line, R2 is equal to the squared Pearson correlation between the predictions and the observed outcomes, R2 = r2 = cor^2(Y, Y^) (see Appendix for proof). Note that, in general, R2 and r2 are not equal, as is easily seen from the fact that R2 can be negative whereas 0 ≤ r2 ≤ 1. DI remains unchanged by linear recalibration, but MI becomes MI = DI − R2 = DI − r2. We refer to DI − r2 as the nonlinearity index (NI), since any residual miscalibration after linear recalibration is a measure of nonlinearity in the calibration curve.
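
Here is a minimal sketch of this point with hypothetical simulated data, using lm as a sample version of the calibration line:

# Hypothetical sketch: after linear recalibration, R2 equals r2
set.seed(2)
y_hat <- rnorm(1000)
y <- 1 + 2 * y_hat + rnorm(1000)             # calibration line is not the identity

cal_line <- lm(y ~ y_hat)                    # sample version of L(Y_hat)
y_hat_lin <- fitted(cal_line)                # linearly recalibrated predictions

R2_before <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
R2_after <- 1 - sum((y - y_hat_lin)^2) / sum((y - mean(y))^2)
r2 <- cor(y, y_hat)^2
c(R2_before = R2_before, R2_after = R2_after, r2 = r2)   # R2_after equals r2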

What about r2?

In the previous section it was noted that after linear recalibration the R2 accuracy metric equals r2, the squared Pearson correlation coefficient, and that in general R2 need not equal r2. Here we note that in some settings the r2 metric is of interest in its own right. An important example is covariate adjustment in clinical trials (Schiffman et al. 2023). There, treatment effects are adjusted using predictions from a model developed on data external to the clinical trial. r2 is a useful metric for this application because it quantifies the expected precision and power gains from adjustment when the outcome is continuous. Our previous post on covariate adjustment shows how the effective sample size increase (ESSI) from an adjusted analysis of a continuous outcome is a simple function of the squared correlations between the outcome and the covariate in the various treatment arms. For example, a squared correlation of r2 = 0.3 between the outcome and the predicted outcome from a prognostic model would translate into an ESSI of 43% if the predictions were adjusted for in the primary analysis. Knowing r2 therefore allows study teams to easily estimate the expected gains from covariate adjustment and to decide which covariates to adjust for.
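
As a rough worked example of that calculation (the simple formula below assumes a continuous outcome and approximately equal correlation across treatment arms; see the covariate adjustment post for the exact expressions):

# Approximate effective sample size increase (ESSI) from adjusting for a covariate
# with squared correlation r2 with a continuous outcome
r2 <- 0.3
essi <- 1 / (1 - r2) - 1
essi                                         # about 0.43, i.e. roughly a 43% increase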

Note that covariate adjustment is an application of prediction models where calibration is not critical. Adjusting implicitly performs linear calibration, as needed, so only very marked nonlinearity in the calibration curve will impact the effectiveness of covariate adjustment. Since such nonlinearity is not common, calibration plays little to no role for this use case and discrimination is much more important.

A Key Inequality

Table 1 summarizes the impact of transforming predictions via the calibration line and curve on R2, r2, DI and MI.

Table 1: Relationship between R2, r2 and DI
| Predictor | R2 | r2 | DI | MI      |
|-----------|----|----|----|---------|
| Y^        | R2 | r2 | DI | DI − R2 |
| L(Y^)     | r2 | r2 | DI | DI − r2 |
| C(Y^)     | DI | DI | DI | 0       |

An extremely useful inequality for evaluating model performance follows from the R2 column of Table 1, and the fact that the calibration curve is the transformation that maximizes R2 (proof in Appendix).

Key Inequality

$$R^2 \le r^2 \le \mathrm{DI} \tag{8}$$

This key inequality has several useful implications:

  • If R2=r2=DI then MI=0 and the model is calibrated.

  • If R2<r2=DI, then the calibration curve is identical to the calibration line, but this line is not the identity. After linear recalibration using the calibration line, R2 increases to the r2 and DI of the original predictions Y^.

  • If R2=r2<DI, then the calibration line is the identity line, but the calibration curve is not linear. Linear recalibration will therefore not change the predictions, while recalibration using the calibration curve will increase both R2 and r2 to the DI of the original predictions Y^.

  • If R2<r2<DI, then the calibration line is not the identity line and the calibration curve is not linear. Linear recalibration will increase R2 to the r2 of the original predictions; recalibration using the calibration curve will increase both R2 and r2 to the DI of the original predictions.

By estimating and comparing these simple metrics, one can gain a comprehensive understanding of a model’s overall performance. Comparing R2, r2 and DI provides valuable information about whether the model is calibrated, whether the calibration curve is linear or nonlinear, whether the calibration line is the identity line, and what happens to the R2 and r2 metrics with linear recalibration or recalibration using the calibration curve.

The Interpretation You’ve Been Looking For

Based on these results we have the following simple interpretations for R2 and DI. Note the important role calibration plays in interpreting R2:

Interpretations of R2 and DI as Proportions of Explained Uncertainty
  • DI is the proportion of variability in the observed outcomes explained by recalibrated predictions.

  • If predictions are calibrated, then R2=DI=Var(C(Y^))/Var(Y)=Var(Y^)/Var(Y), so R2 is the proportion of variability in the observed outcomes explained by the predictions.

  • If predictions are not calibrated, then R2 cannot be interpreted as the proportion of variability in the observed outcomes explained by the predictions.

  • R2 is always the proportional reduction of MSE relative to the best constant prediction, E(Y).

The above interpretations of R2 and DI may be confusing since they appear at odds with the traditional interpretation of R2 as the proportion of variance explained by the predictions when outcomes are continuous. Furthermore, it is often stated that R2 = r2, whereas we have stated that R2 ≤ r2, with equality if and only if the calibration line is the identity line.

The reason for the discrepancy between the traditional interpretation of R2 and what is stated here is that in the traditional setting R2 is an in-sample estimator and the predictions come from ordinary least squares regression, so R2 = Var(Y^)/Var(Y) and R2 = r2 necessarily. However, we do not think this traditional setting is particularly relevant, and moreover the interpretation of R2 as the proportion of explained variability does not hold out of sample. Instead, we prefer to focus on out-of-sample performance and on metrics that are applicable to any prediction model, not just models of continuous outcomes. For out-of-sample interpretation of performance metrics in any setting, use the simple interpretations listed in the callout block above, which highlight the impact of calibration on the metrics’ interpretations.
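
For contrast, here is a minimal sketch of that traditional in-sample setting with hypothetical simulated data, where the identities R2 = Var(Y^)/Var(Y) and R2 = r2 hold by construction:

# Hypothetical in-sample illustration with ordinary least squares
set.seed(3)
x <- rnorm(200)
y <- 1 + x + rnorm(200)
fit <- lm(y ~ x)

summary(fit)$r.squared        # in-sample R2
cor(y, fitted(fit))^2         # identical to the above
var(fitted(fit)) / var(y)     # also identical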

Estimating Performance Metrics

We assume that DI, MI, R2 and r2 are estimated using independent data that were not involved in the training or selection of the model being evaluated. If an independent data set is not available, performance metrics can be estimated using an appropriate resampling method to avoid overly optimistic estimates.

Estimating DI

Recall that DI is discrimination normalized by the variance of the observed outcomes:

$$\mathrm{DI} = \frac{\mathrm{Var}\big(C(\hat{Y})\big)}{\mathrm{Var}(Y)}$$

To estimate discrimination, first estimate the calibration curve using the observed and predicted outcomes in the independent holdout data. Generalized additive models offer a nice method for estimating flexible yet smooth calibration curves. The ‘gam’ function in the R package ‘mgcv’ can be used to fit generalized additive models with regression splines. Note that the observed outcomes are the dependent variable and the predictions are the independent variable.

An alternative to using generalized additive models is to use nonparametric isotonic regression and the pool-adjacent-violators algorithm (PAVA) to estimate calibration curves (; ). PAVA is an appealing bin-and-count method because it is fast, non-parametric, has a monotonicity constraint for regularization, has automatic bin determination, and is a maximum likelihood estimator (; ). We find that this method is helpful for confirming the shape of the calibration curve in a robust, non-parametric manner. However, we note that PAVA is more data-hungry than generalized additive models, and may not be appropriate for smaller data sets. Furthermore, we have observed that R2 estimates based on PAVA tend to be systematically higher than direct estimates of R2 or estimates of R2 based on generalized additive models, even when the estimated calibration curves appear to be visually very similar.

For smaller sample sizes, it may be necessary to estimate only the calibration line and assume the calibration curve is equal to the calibration line. However, note that in this case the r2 and DI estimates will necessarily be equal and curvature will not be evaluated, so we must assume that departures from linearity are minor for the performance metrics to be meaningful. We note that a popular use of calibration lines is examination of their intercept and slope. While this may be helpful for some descriptive purposes, we do not recommend these to be the sole metrics used for evaluation of calibration. For further details see the Appendix.
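
A minimal sketch of this calibration-line-only fallback, reusing the same hypothetical observed_outcomes and original_predictions vectors assumed in the code snippets of this section:

# Small-sample fallback: use the calibration line in place of a flexible curve
cal_line <- lm(observed_outcomes ~ original_predictions)
linearly_calibrated <- fitted(cal_line)

DI <- var(linearly_calibrated) / var(observed_outcomes)   # equals r2 in this case
MI <- mean((linearly_calibrated - original_predictions)^2) / var(observed_outcomes)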

Once the calibration curve is fit, calibrate the original predictions by transforming them via the estimated calibration curve, C^(Y^). Discrimination can then be estimated as the empirical variance of the calibrated predictions. Estimating DI is then as simple as dividing the estimated discrimination by the empirical variance of the observed outcomes:

$$\widehat{\mathrm{DI}} = \frac{\sum_i \big(\hat{C}(\hat{Y}_i) - \overline{\hat{C}(\hat{Y})}\big)^2}{\sum_i \big(Y_i - \bar{Y}\big)^2}$$


# Fit the calibration curve with a regression spline (observed outcomes are the
# dependent variable, original predictions are the independent variable)
library(mgcv)
cal_fun <- gam(observed_outcomes ~ s(original_predictions, k = 3), na.action = na.omit)

# Obtain calibrated predictions by transforming the original predictions
# via the estimated calibration curve
calibrated_predictions <- predict(cal_fun, newdata = data.frame(original_predictions = original_predictions))

# Estimate DI as the variance of the calibrated predictions divided by the
# empirical variance of the observed outcomes
DI <- var(calibrated_predictions) / var(observed_outcomes)

Estimating MI

The calibrated predictions will also be used to estimate the miscalibration index. Recall that MI is miscalibration normalized by the variance of the observed outcomes:

$$\mathrm{MI} = \frac{E\big[(C(\hat{Y}) - \hat{Y})^2\big]}{\mathrm{Var}(Y)}$$

To estimate miscalibration, simply take the empirical mean of the squared differences between the original predictions and the predictions transformed by the calibration curve. Divide by the empirical variance of the observed outcomes to estimate MI:

$$\widehat{\mathrm{MI}} = \frac{\tfrac{1}{n}\sum_i \big(\hat{C}(\hat{Y}_i) - \hat{Y}_i\big)^2}{\tfrac{1}{n-1}\sum_i \big(Y_i - \bar{Y}\big)^2}$$


# Estimate MI

MI <- mean((calibrated_predictions - original_predictions)^2)/var(observed_outcomes)

Estimating R2

Once DI^ and MI^ are calculated using the calibration curve, an estimate of R2 based on the calibration curve can then easily be obtained by subtracting MI^ from DI^. Denote this estimate of R2 based on the calibration curve with a subscript ‘C’ for ‘calibration’:

$$\widehat{R^2_C} = \widehat{\mathrm{DI}} - \widehat{\mathrm{MI}}$$

Note that R2 can also be estimated directly, without needing to first estimate a calibration curve, by calculating one minus the empirical mean squared error between the original predictions and the observed outcomes divided by the variance of the observed outcome. Denote this direct estimate of R2 with a subscript ‘D’ for ‘direct’. In our experience, RC2^ is nearly identical to RD2^ when the calibration curve has been estimated using the default smoother in the ‘gam’ function from the ‘mgcv’ R package.

$$\widehat{R^2_D} = 1 - \frac{\sum_i \big(Y_i - \hat{Y}_i\big)^2}{\sum_i \big(Y_i - \bar{Y}\big)^2}$$


# Estimate of R^2 based on the calibration curve

R2 <- DI - MI

# Direct estimate of R^2

R2_direct <- 1 - (sum((observed_outcomes - original_predictions)^2) / ((length(original_predictions) - 1) * var(observed_outcomes)))

Estimating r2

To estimate r2, simply calculate the squared Pearson correlation coefficient between the observed outcomes and the original predictions:

$$\widehat{r^2} = \mathrm{cor}\big(Y, \hat{Y}_{\mathrm{original}}\big)^2$$



# Direct estimate of r^2

r2_direct <- cor(original_predictions, observed_outcomes)^2

Example

Geographic atrophy (GA) is an advanced form of age-related macular degeneration (AMD) that leads to vision loss. GA progression can be assessed by the change in GA lesion area (mm2) over time using Fundus Autofluorescence (FAF) images. Salvi et al. use data from several clinical trials and observational studies to develop deep learning (DL) models that can predict the future region of growth (ROG) of GA lesions at 1 year using FAF images ().

To develop and evaluate the models, the data were split into a development set (n=388) and test set (n=209). The development set was further split into a training (n=310) and validation (n=78) set. Models were built using the training set, selected using the validation set and evaluated using the test set. See Salvi et al. for further details around the development of the various DL models.

To demonstrate how to use the previously discussed metrics to evaluate model performance, we estimate the metrics for one of the DL models from this publication (referred to as model #5 multiclass whole lesion in the manuscript), before and after recalibration based on the test set ().

Table 2 shows the metric estimates for the model in the test set. The first row of Table 2 shows the metric estimates prior to recalibration. The model is imperfectly calibrated, demonstrated visually with the calibration plot in Figure 7 and with the relatively high estimate of the miscalibration index (MI^ = 0.338). However, the estimated discrimination index is relatively high, DI^ = 0.678. The fundamental decomposition tells us that R2^ could reach this value if the model were calibrated. The shape of the calibration curve and the low estimate of the nonlinearity index (NI^ = 0.030) indicate that linear recalibration would go a long way in improving the model’s accuracy.

Figure 7: Predicted vs observed outcomes in the test set for the whole lesion model from Salvi et al.

Table 2: Performance metrics estimated on the test set based on original, linearly recalibrated and recalibrated results
| Predictor | R2^   | r2^   | DI^   | MI^   | NI^   |
|-----------|-------|-------|-------|-------|-------|
| Y^        | 0.338 | 0.648 | 0.678 | 0.338 | 0.030 |
| L(Y^)     | 0.648 | 0.648 | 0.678 | 0.030 | 0.030 |
| C(Y^)     | 0.678 | 0.678 | 0.678 | 0.000 | 0.000 |

The second row of Table 2 shows the metric estimates after linearly recalibrating the model. The miscalibration index decreases to MI^=0.030 and R2^ increases to R2^=r2^=0.648, as expected. Since recalibration does not affect discrimination, DI^ remains the same. Finally, the bottom row of Table 2 shows the metric estimates after recalibrating the model with the estimated calibration curve (the red gam curve in Figure 7). As expected, after transforming the predictions via C(Y^), MI^=0 and R2^ has increased to DI^.

Summary/Key Points

  • The general quality of predictions is assessed by accuracy, calibration, and discrimination. These three evaluation domains are often assessed completely independently of each other.

  • We propose an approach that unifies these concepts via a scaled decomposition of accuracy into miscalibration and discrimination, R2 = DI − MI. This decomposition is unitless and clarifies that DI is the accuracy when there is no miscalibration, equivalently when the predictions are recalibrated using the calibration curve.

  • Interestingly, r2 can be interpreted as the R2 for the best linear recalibration. That metric is also inherently of interest for some applications such as covariate adjustment.

  • The three key metrics are R2, r2 and DI, and they satisfy a key inequality, R2 ≤ r2 ≤ DI. The remaining metrics, MI and NI, are derived from these.

  • Discrimination can never be improved via recalibration, but miscalibration and accuracy can. These metrics are very informative for assessing how much accuracy can be improved via linear recalibration and recalibration via the calibration curve.

To understand how the metrics and decompositions apply to binary outcomes and generalize beyond quadratic loss, see Part 2 of this post.

Appendix

Appendix A

A Note on the Intercept and Slope of L(Y^)

Calibration is often assessed by reporting the intercept and slope of the calibration line L(Y^) (; ; ). An intercept of 0 and slope of 1 correspond to perfect calibration if the calibration curve is linear. The intercept of L(Y^) is generally interpreted as a measure of “calibration in the large”, i.e. how different the mean outcome is from the mean prediction. The slope of L(Y^) can be interpreted as a measure of overfitting (slope < 1) or underfitting (slope > 1) during model development and when performing internal validation.

However, as metrics for miscalibration, the intercept and slope of L(Y^) have limitations. If the calibration function is nonlinear, calibration performance may not be well assessed by the calibration line. The MI metric discussed above, however, is appropriate for any shape the calibration function may have. The presence of nonlinear miscalibration can be assessed via the Nonlinearity Index NI = DI − r2. NI ≥ 0 by the key inequality, and NI = 0 if and only if the calibration curve is linear. When NI > 0, it indicates how much recalibration via the calibration curve improves accuracy over recalibration via the calibration line.

Furthermore, since there are two parameters, model comparisons for miscalibration based on the intercept and slope of L(Y^) are only partially ordered: two models cannot be ranked if one has a better intercept but a worse slope than the other. MI, on the other hand, is a single metric that enables comparison of any two models.

Even if the calibration curve is linear and the prediction model is calibrated in the large, there is an additional issue: the calibration intercept and slope do not take into account the distribution of the predicted outcomes. Unless the slope of the calibration line is 1, the discrepancies between the predicted outcomes and the calibration line will depend on the predicted outcomes and the distribution of discrepancies will depend on the distribution of predictions. MI accounts for this since it captures the expected squared discrepancies over the distribution of predictions, but the intercept and slope of L(Y^) do not.

For these reasons, we prefer to do the following to gain a comprehensive understanding of miscalibration:

  • Focus on R2, r2, DI, MI and NI
  • If the evaluation data set is too small to estimate a flexible calibration curve, estimate a calibration line and calculate the above metrics using that line as the calibration curve. Since r2=DI and NI=0 in this case, only a subset of the above metrics needs to be reported. Note, MI should still be reported even if recalibration was done with a linear function.
  • If NI ≈ 0, the intercept and slope of L(Y^) may also be reported for the purposes of describing or approximating the calibration curve.

Appendix B

Proof that predictions transformed via the calibration curve are calibrated

If a model is miscalibrated, its predictions can be recalibrated by transforming them via the calibration curve. Here we write $C_{Y,\hat{Y}}$ for the calibration curve of the original predictions $\hat{Y}$, so the recalibrated predictions are $C_{Y,\hat{Y}}(\hat{Y})$. To show that they are calibrated, note that the calibration curve of the new predictions, $C_{Y,\,C_{Y,\hat{Y}}(\hat{Y})}$, is the identity function:

$$\begin{aligned}
C_{Y,\,C_{Y,\hat{Y}}(\hat{Y})}\big(C_{Y,\hat{Y}}(\hat{Y})\big) &= E\big[Y \mid C_{Y,\hat{Y}}(\hat{Y})\big] \\
&= E\big[Y \mid E[Y \mid \hat{Y}]\big] \\
&= E\Big[E\big[Y \mid \hat{Y},\, E[Y \mid \hat{Y}]\big] \;\Big|\; E[Y \mid \hat{Y}]\Big] \\
&= E\Big[E[Y \mid \hat{Y}] \;\Big|\; E[Y \mid \hat{Y}]\Big] \\
&= E[Y \mid \hat{Y}] \\
&= C_{Y,\hat{Y}}(\hat{Y})
\end{aligned}$$

Therefore MI = 0 after transforming the original predictions with $C_{Y,\hat{Y}}$:

$$\mathrm{MI} = \frac{E\Big[\Big(C_{Y,\,C_{Y,\hat{Y}}(\hat{Y})}\big(C_{Y,\hat{Y}}(\hat{Y})\big) - C_{Y,\hat{Y}}(\hat{Y})\Big)^2\Big]}{\mathrm{Var}(Y)} = \frac{E\Big[\big(C_{Y,\hat{Y}}(\hat{Y}) - C_{Y,\hat{Y}}(\hat{Y})\big)^2\Big]}{\mathrm{Var}(Y)} = 0 \tag{9}$$

Proof that 0 ≤ DI ≤ 1

In general, the DI is a non-negative number that can be at most 1 since

$$\mathrm{DI} = \frac{\mathrm{Var}\big(C_{Y,\hat{Y}}(\hat{Y})\big)}{\mathrm{Var}(Y)} = \frac{\mathrm{Var}\big(E[Y \mid \hat{Y}]\big)}{\mathrm{Var}(Y)} = \frac{\mathrm{Var}(Y) - E\big(\mathrm{Var}[Y \mid \hat{Y}]\big)}{\mathrm{Var}(Y)} \le 1 \tag{10}$$

Proof that recalibration does not affect DI

$$\mathrm{DI}\ \text{of calibrated predictions} = \frac{\mathrm{Var}\Big(C_{Y,\,C_{Y,\hat{Y}}(\hat{Y})}\big(C_{Y,\hat{Y}}(\hat{Y})\big)\Big)}{\mathrm{Var}(Y)} = \frac{\mathrm{Var}\big(C_{Y,\hat{Y}}(\hat{Y})\big)}{\mathrm{Var}(Y)} = \mathrm{DI}\ \text{of original predictions} \tag{11}$$

Proof of the MSE decomposition

$$\begin{aligned}
\mathrm{MSE} &= E\big[(Y - \hat{Y})^2\big] \\
&= E\Big[E\big[(Y - \hat{Y})^2 \mid \hat{Y}\big]\Big] \\
&= E\Big[E\big[(Y - E[Y \mid \hat{Y}] + E[Y \mid \hat{Y}] - \hat{Y})^2 \mid \hat{Y}\big]\Big] \\
&= E\Big[E\big[(Y - E[Y \mid \hat{Y}])^2 \mid \hat{Y}\big]\Big] + E\Big[E\big[(E[Y \mid \hat{Y}] - \hat{Y})^2 \mid \hat{Y}\big]\Big] \\
&= E\big[\mathrm{Var}(Y \mid \hat{Y})\big] + E\big[(E[Y \mid \hat{Y}] - \hat{Y})^2\big] \\
&= \mathrm{Var}(Y) - \mathrm{Var}\big(E[Y \mid \hat{Y}]\big) + E\big[(E[Y \mid \hat{Y}] - \hat{Y})^2\big] \\
&= \mathrm{Var}(Y) - \mathrm{Var}\big(C_{Y,\hat{Y}}(\hat{Y})\big) + E\big[(C_{Y,\hat{Y}}(\hat{Y}) - \hat{Y})^2\big]
\end{aligned}$$

The cross term from expanding the square vanishes because E[Y − E[Y | Y^] | Y^] = 0 while E[Y | Y^] − Y^ is a function of Y^; the second-to-last equality is the law of total variance.

Proof of the fundamental decomposition of R2

$$\begin{aligned}
R^2 &= 1 - \frac{\mathrm{MSE}}{\mathrm{Var}(Y)} \\
&= 1 - \frac{\mathrm{Var}(Y) - \mathrm{Var}\big(C_{Y,\hat{Y}}(\hat{Y})\big) + E\big[(C_{Y,\hat{Y}}(\hat{Y}) - \hat{Y})^2\big]}{\mathrm{Var}(Y)} \\
&= \frac{\mathrm{Var}\big(C_{Y,\hat{Y}}(\hat{Y})\big)}{\mathrm{Var}(Y)} - \frac{E\big[(C_{Y,\hat{Y}}(\hat{Y}) - \hat{Y})^2\big]}{\mathrm{Var}(Y)} \\
&= \mathrm{DI} - \mathrm{MI}
\end{aligned} \tag{12}$$

Proof that $C_{Y,\hat{Y}}(\hat{Y})$ maximizes R2

The calibration curve is the transformation that maximizes R2

$$C(\hat{Y}) = \underset{h \in \mathcal{H}}{\arg\max}\; \left(1 - \frac{E\big[(Y - h(\hat{Y}))^2\big]}{\mathrm{Var}(Y)}\right) = \underset{h \in \mathcal{H}}{\arg\min}\; E\big[(Y - h(\hat{Y}))^2\big] \tag{13}$$

$$\begin{aligned}
\underset{h \in \mathcal{H}}{\arg\min}\; E\big[(Y - h(\hat{Y}))^2\big] &= \underset{h \in \mathcal{H}}{\arg\min}\; E\big[(Y - E[Y \mid \hat{Y}] + E[Y \mid \hat{Y}] - h(\hat{Y}))^2\big] \\
&= \underset{h \in \mathcal{H}}{\arg\min}\; E\big[(Y - E[Y \mid \hat{Y}])^2 + 2(Y - E[Y \mid \hat{Y}])(E[Y \mid \hat{Y}] - h(\hat{Y})) + (E[Y \mid \hat{Y}] - h(\hat{Y}))^2\big] \\
&= \underset{h \in \mathcal{H}}{\arg\min}\; E\big[(E[Y \mid \hat{Y}] - h(\hat{Y}))^2\big] \\
&= E[Y \mid \hat{Y}]
\end{aligned}$$

The first term does not depend on h and the cross term has expectation zero (condition on Y^), so only the last term matters, and it is minimized by taking h(Y^) = E[Y | Y^].

Proof that R2=r2 when predictions are transformed via the calibration line L(Y^)

$$\begin{aligned}
R^2 \text{ for } L(\hat{Y}) &= 1 - \frac{E\big[(Y - L(\hat{Y}))^2\big]}{\mathrm{Var}(Y)} \\
&= \frac{\mathrm{Var}(Y) - E\Big[\Big(Y - E(Y) + \tfrac{\mathrm{Cov}(Y,\hat{Y})}{\mathrm{Var}(\hat{Y})}E(\hat{Y}) - \tfrac{\mathrm{Cov}(Y,\hat{Y})}{\mathrm{Var}(\hat{Y})}\hat{Y}\Big)^2\Big]}{\mathrm{Var}(Y)} \\
&= \frac{\mathrm{Var}(Y) - \Big(\mathrm{Var}(Y) - 2\tfrac{\mathrm{Cov}(Y,\hat{Y})}{\mathrm{Var}(\hat{Y})}\mathrm{Cov}(Y,\hat{Y}) + \tfrac{\mathrm{Cov}^2(Y,\hat{Y})}{\mathrm{Var}^2(\hat{Y})}\mathrm{Var}(\hat{Y})\Big)}{\mathrm{Var}(Y)} \\
&= \frac{\mathrm{Cov}^2(Y,\hat{Y})}{\mathrm{Var}(\hat{Y})\,\mathrm{Var}(Y)} \\
&= \mathrm{cor}^2(Y,\hat{Y})
\end{aligned}$$

Proof that L(Y^) is the population least squares regression line

Let Y be an n×1 outcome vector and X be an n×2 design matrix with an intercept, and suppose you want to approximate E(Y|X) with a simple linear function of X:

$$E(Y \mid X) \approx Xb$$

The coefficients that minimize the expected squared error of the approximation are called the population least squares regression coefficients β:

$$\beta = \underset{b \in \mathbb{R}^2}{\arg\min}\; E\big[(E(Y \mid X) - Xb)^2\big]$$

Taking the derivative w.r.t b and setting equal to zero:

$$E\big(-2X'E(Y \mid X) + 2(X'X)\beta\big) = 0 \;\Longrightarrow\; \beta = E(X'X)^{-1}E(X'Y) = \left(E(Y) - \frac{\mathrm{Cov}(Y,\hat{Y})}{\mathrm{Var}(\hat{Y})}E(\hat{Y}),\; \frac{\mathrm{Cov}(Y,\hat{Y})}{\mathrm{Var}(\hat{Y})}\right) = (\alpha_{\mathrm{opt}}, \beta_{\mathrm{opt}})$$

References

Ayer, M., Brunk, H. D., Ewing, G. M., Reid, W. T., & Silverman, E. (1955). An empirical distribution function for sampling with incomplete information. Ann. Math. Statist.
Calster, B. V., McLernon, D. J., Smeden, M. van, Wynants, L., & Steyerberg, E. W. (2019). Calibration: The achilles heel of predictive analytics. BMC Medicine.
Calster, B. V., Nieboer, D., Vergouwe, Y., Cock, B. D., Pencina, M. J., & Steyerberg, E. W. (2016). A calibration hierarchy for risk models was defined: From utopia to empirical data. Journal of Clinical Epidemiology.
Calster, B. V., Wynants, L., Verbeek, J. F. M., Verbakel, J. Y., Christodoulou, E., Vickers, A. J., Roobol, M. J., & Steyerberg, E. W. (2018). Reporting and interpreting decision curve analysis: A guide for investigators. European Urology.
DeGroot, M. H., & Fienberg, S. E. (1983). Comparing probability forecasters: Basic binary concepts and multivariate extensions. Defense Technical Information Center.
Dimitriadis, T., Gneiting, T., & Jordan, A. I. (2021). Stable reliability diagrams for probabilistic classifiers. PNAS.
Fissler, T., Lorentzen, C., & Mayer, M. (2022). Model comparison and calibration assessment. arXiv.
Leeuw, J. de, Hornik, K., & Mair, P. (2009). Isotone optimization in R: Pool-adjacent-violators algorithm (PAVA) and active set methods. Journal of Statistical Software.
Murphy, A. H. (1973). A new vector partition of the probability score. Journal of Applied Meteorology and Climatology.
Salvi, A., Cluceru, J., Gao, S., Rabe, C., Yang, Q., Lee, A., Keane, P., Sadda, S., Holz, F. G., Ferrara, D., & Anegondi, N. (2023). Deep learning to predict the future growth of geographic atrophy from fundus autofluorescence. In Review.
Schiffman, C., Friesenhahn, M., & Rabe, C. (2023, October 26). How to Get the Most Out of Prognostic Baseline Variables in Clinical Trials. Retrieved from https://www.stats4datascience.com/posts/covariate_adjustment/
Stevens, R. J., & Poppe, K. K. (2020). Validation of clinical prediction models: What does the "calibration slope" really measure? Journal of Clinical Epidemiology.
Steyerberg, E. W., & Vergouwe, Y. (2014). Towards better clinical prediction models: Seven steps for development and an ABCD for validation. European Heart Journal.
Steyerberg, E. W., Vickers, A. J., Cook, N. R., Gerds, T., Gonen, M., Obuchowski, N., Pencina, M. J., & Kattan, M. W. (2010). Assessing the performance of prediction models: A framework for some traditional and novel measures. Epidemiology.
Vickers, A. J., & Elkin, E. B. (2006). Decision curve analysis: A novel method for evaluating prediction models. Med Decis Making.

Citation

BibTeX citation:
@misc{friesenhahn_rabe_schiffman_2023,
  author = {Michel Friesenhahn and Christina Rabe and Courtney Schiffman},
  title = {Everything You Wanted to Know about {R2} but Were Afraid to
    Ask. {Part} 1, {Using} a Fundamental Decomposition to Gain Insights
    into Predictive Model Performance.},
  date = {2023-10-26},
  url = {https://www.go.roche.com/stats4datascience},
  langid = {en}
}
For attribution, please cite this work as:
Michel Friesenhahn, Christina Rabe and Courtney Schiffman. (2023, October 26). Everything you wanted to know about R2 but were afraid to ask. Part 1, Using a fundamental decomposition to gain insights into predictive model performance. Retrieved from https://www.go.roche.com/stats4datascience