Calibration Guide

Endgame provides a comprehensive calibration module covering conformal prediction, Venn-ABERS calibration, and classical probability calibration methods. All classes follow the sklearn interface (fit, predict, predict_proba).

Conformal Prediction (Classification)

ConformalClassifier wraps any classifier to produce prediction sets that contain the true label with at least 1 - alpha marginal coverage. No distributional assumptions are required beyond exchangeability.

from endgame.calibration import ConformalClassifier
from endgame.models import LGBMWrapper
from sklearn.model_selection import train_test_split

X_train, X_cal, y_train, y_cal = train_test_split(
    X, y, test_size=0.2, random_state=42
)

base = LGBMWrapper(preset='endgame')
base.fit(X_train, y_train)

cc = ConformalClassifier(
    estimator=base,
    alpha=0.1,          # target miscoverage rate; 90% marginal coverage
    method='lac',       # 'lac' (softmax-based) or 'aps' (adaptive prediction sets)
)

cc.fit(X_cal, y_cal)   # calibrate on hold-out set

# Returns a list of sets, one per test point
prediction_sets = cc.predict(X_test)
for i, pset in enumerate(prediction_sets[:5]):
    print(f"Sample {i}: possible classes = {pset}")

# Hard label predictions: taken from the base estimator here, because
# cc.predict is shown above as returning prediction sets
preds = base.predict(X_test)

# Empirical coverage on a labelled evaluation set
cov = cc.coverage_score(X_eval, y_eval)
print(f"Empirical coverage: {cov:.3f}")  # expected near or above 0.90; fluctuates with sample size

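The 'aps' score (Adaptive Prediction Sets) adapts the set size to the difficulty of each input, which tends to improve conditional coverage but usually yields larger sets on average. The 'lac' score (Least Ambiguous set-valued Classifier) yields the smallest sets on average. Both scores retain the same 1 - alpha marginal coverage guarantee, so use 'lac' when compact sets matter most and 'aps' when coverage should be more even across easy and hard inputs.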
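
To make the calibration step concrete, here is a minimal sketch of split conformal classification with the LAC score in plain Python. It is not Endgame's implementation; the helper names lac_threshold and prediction_set and the toy data are assumptions made for illustration.

```python
import math

def lac_threshold(cal_probs, cal_labels, alpha):
    """Return the score threshold q_hat for the LAC nonconformity score."""
    # LAC score: one minus the predicted probability of the true class
    scores = sorted(1.0 - probs[y] for probs, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank of the empirical quantile
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return 1.0  # too few calibration points: every class is kept
    return scores[rank - 1]

def prediction_set(probs, q_hat):
    """Keep every class whose score 1 - p_k does not exceed q_hat."""
    return {k for k, p in enumerate(probs) if 1.0 - p <= q_hat}

# Toy calibration data (hypothetical): class probabilities and true labels
cal_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]
cal_labels = [0, 1, 2, 1]

q_hat = lac_threshold(cal_probs, cal_labels, alpha=0.3)
example_set = prediction_set([0.5, 0.35, 0.15], q_hat)
print(q_hat, example_set)
```

The (n + 1) factor in the rank is what turns an ordinary empirical quantile into one with a finite-sample coverage guarantee under exchangeability.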

Conformal Prediction (Regression)

ConformalRegressor produces prediction intervals with guaranteed marginal coverage. The width of intervals adapts automatically to the local difficulty of each test point when a difficulty estimator is provided.

import numpy as np
from endgame.calibration import ConformalRegressor
from endgame.models import LGBMWrapper

base = LGBMWrapper(preset='endgame')
base.fit(X_train, y_train)

cr = ConformalRegressor(
    estimator=base,
    alpha=0.05,          # 95% coverage
    method='split',      # 'split' (fast) or 'cv+' (cross-conformal, slower)
)

cr.fit(X_cal, y_cal)

# Returns a tuple of (lower, upper) arrays
lower, upper = cr.predict_interval(X_test)

widths = upper - lower
print(f"Median interval width: {np.median(widths):.4f}")

cov = cr.coverage_score(X_eval, y_eval)
print(f"Empirical coverage: {cov:.3f}")
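
For intuition about the 'split' method, the following sketch shows the usual absolute-residual recipe in plain Python. The function names and toy numbers are illustrative assumptions, and Endgame's internals may differ (for example when a difficulty estimator is supplied).

```python
import math

def split_conformal_halfwidth(cal_preds, cal_targets, alpha):
    """Half-width of the interval: a corrected quantile of absolute residuals."""
    residuals = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # not enough calibration points for this alpha
    return residuals[rank - 1]

def predict_interval(test_preds, halfwidth):
    """Same half-width around every point prediction (constant-width intervals)."""
    return [(p - halfwidth, p + halfwidth) for p in test_preds]

# Toy calibration predictions and targets (hypothetical values)
cal_preds = [1.0, 2.0, 3.0, 4.0, 5.0]
cal_targets = [1.5, 1.0, 3.25, 6.0, 5.5]

halfwidth = split_conformal_halfwidth(cal_preds, cal_targets, alpha=0.4)
intervals = predict_interval([2.0, 10.0], halfwidth)
print(halfwidth, intervals)
```

Because the half-width is one number shared by all test points, plain split conformal cannot express that some inputs are harder than others.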

Conformalized Quantile Regression (CQR)

ConformizedQuantileRegressor combines a quantile regressor with conformal calibration to produce adaptive intervals. Intervals are wider where the model is less certain, unlike split conformal which uses a fixed residual threshold.

from endgame.calibration import ConformizedQuantileRegressor
from endgame.models import LGBMWrapper

# Base model must support quantile regression
qr = LGBMWrapper(objective='quantile', preset='endgame')

cqr = ConformizedQuantileRegressor(
    estimator=qr,
    alpha=0.1,           # 90% coverage target
    quantile_low=0.05,   # lower quantile for the base regressor
    quantile_high=0.95,  # upper quantile for the base regressor
)

cqr.fit(X_train, y_train, X_cal=X_cal, y_cal=y_cal)

lower, upper = cqr.predict_interval(X_test)

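CQR is the recommended method when prediction intervals of varying width are needed. The conformity score is max(q_low - y, y - q_high), and calibration takes a single empirical quantile of these scores. The calibration step therefore only widens or narrows the raw quantile interval by moving both endpoints outward or inward by the same additive constant.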
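
The score and the resulting correction can be sketched in a few lines of plain Python. This is an illustrative sketch under assumed names (cqr_correction, cqr_interval) and toy quantile predictions, not the library's code.

```python
import math

def cqr_correction(q_low_cal, q_high_cal, y_cal, alpha):
    """Corrected quantile of the CQR scores max(q_low - y, y - q_high)."""
    scores = sorted(
        max(lo - y, y - hi) for lo, hi, y in zip(q_low_cal, q_high_cal, y_cal)
    )
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else scores[rank - 1]

def cqr_interval(q_low, q_high, correction):
    """Move both raw quantile endpoints by the same additive correction."""
    return [(lo - correction, hi + correction) for lo, hi in zip(q_low, q_high)]

# Toy calibration quantile predictions and targets (hypothetical values)
q_low_cal = [0.0, 1.0, 2.0, 3.0]
q_high_cal = [2.0, 3.0, 4.0, 5.0]
y_cal = [1.0, 3.5, 1.5, 4.0]

correction = cqr_correction(q_low_cal, q_high_cal, y_cal, alpha=0.3)
adjusted = cqr_interval([10.0], [12.0], correction)
print(correction, adjusted)
```

A score is negative when the target falls inside the raw interval, so the correction can be negative, in which case the calibrated intervals are narrower than the raw quantile intervals.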

Venn-ABERS Calibration

VennABERS produces well-calibrated probability estimates without assuming a specific parametric form for the calibration map. For each sample it outputs a pair of probabilities (p0, p1), and under the exchangeability assumption alone, one of the two is guaranteed to be perfectly calibrated.

from endgame.calibration import VennABERS
from endgame.models import LGBMWrapper

base = LGBMWrapper(preset='endgame')
base.fit(X_train, y_train)

va = VennABERS(estimator=base)
va.fit(X_cal, y_cal)

# Returns point probabilities (geometric mean of the interval bounds)
proba = va.predict_proba(X_test)

# Returns the full Venn-ABERS interval [p0, p1] per sample
intervals = va.predict_interval(X_test)
p0, p1 = intervals[:, 0], intervals[:, 1]

# Interval width indicates epistemic uncertainty
uncertainty = p1 - p0

Unlike Platt scaling, Venn-ABERS assumes no parametric form for the calibration map, and unlike plain isotonic regression, its validity guarantee holds for any calibration set size (smaller sets simply give wider intervals). It is particularly useful when the base model has poorly calibrated raw probabilities (e.g., a gradient boosting model).
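
The interval [p0, p1] comes from fitting isotonic regression twice per test point: once with the test score added under label 0 and once under label 1. The sketch below is a deliberately naive plain-Python version of that idea. The function names and toy data are assumptions, and practical implementations precompute the fits instead of refitting for every test point.

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators fit; returns {score: fitted probability}."""
    # Pool tied scores first so each distinct score gets one fitted value
    totals = {}
    for s, y in zip(scores, labels):
        entry = totals.setdefault(s, [0.0, 0])
        entry[0] += y
        entry[1] += 1
    blocks = []  # each block: [label_sum, count, member_scores]
    for s in sorted(totals):
        blocks.append([totals[s][0], totals[s][1], [s]])
        # Merge backwards while the previous block's mean exceeds the latest one
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            label_sum, count, members = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2].extend(members)
    return {s: total / count for total, count, members in blocks for s in members}

def venn_abers_interval(cal_scores, cal_labels, test_score):
    """Return (p0, p1): isotonic fits with the test point labelled 0, then 1."""
    scores = list(cal_scores) + [test_score]
    p0 = isotonic_fit(scores, list(cal_labels) + [0])[test_score]
    p1 = isotonic_fit(scores, list(cal_labels) + [1])[test_score]
    return p0, p1

# Toy calibration scores and binary labels (hypothetical values)
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
cal_labels = [0, 0, 1, 0, 1, 0, 1, 1]

p0, p1 = venn_abers_interval(cal_scores, cal_labels, test_score=0.5)
print(p0, p1)
```

With only eight calibration points the interval is wide; it narrows as the calibration set grows, because a single extra label has less influence on the isotonic fit.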

Classical Probability Calibration

Temperature Scaling

Temperature scaling divides the logits of a neural network (or any model exposing logits) by a single learnable scalar T. It is the most common post-hoc calibration technique for deep learning.

from endgame.calibration import TemperatureScaling

ts = TemperatureScaling()
ts.fit(logits_cal, y_cal)    # calibrate on logits (pre-softmax)

calibrated_proba = ts.predict_proba(logits_test)
print(f"Learned temperature: {ts.temperature_:.4f}")
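
As a rough illustration of what fitting T involves, the sketch below picks the temperature that minimises negative log-likelihood on calibration logits. The grid search is a simplification chosen for readability (implementations typically use a gradient-based optimiser), and all names and data here are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logits_batch, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    return -sum(
        math.log(softmax(logits, temperature)[y])
        for logits, y in zip(logits_batch, labels)
    ) / len(labels)

def fit_temperature(logits_batch, labels):
    """Pick T from the grid 0.1, 0.2, ..., 10.0 by minimum NLL (a simplification)."""
    grid = [k / 10 for k in range(1, 101)]
    return min(grid, key=lambda t: mean_nll(logits_batch, labels, t))

# Overconfident toy model: always predicts class 0 with ~98% confidence,
# but is right only three times out of four
logits_cal = [[4.0, 0.0]] * 4
labels_cal = [0, 0, 0, 1]

T = fit_temperature(logits_cal, labels_cal)
p_before = softmax([4.0, 0.0])[0]
p_after = softmax([4.0, 0.0], T)[0]
print(T, p_before, p_after)
```

A fitted T above 1 softens overconfident predictions, and a T below 1 sharpens underconfident ones. The predicted class never changes, because dividing by a positive scalar preserves the ordering of the logits.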

Platt Scaling

Platt scaling fits a logistic regression on the model’s raw scores. It works best when the miscalibration is sigmoid-shaped, for example when the raw scores are approximately normally distributed within each class with similar variance.

from endgame.calibration import PlattScaling

ps = PlattScaling()
ps.fit(scores_cal, y_cal)    # 1D array of decision scores

calibrated_proba = ps.predict_proba(scores_test)
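
The fitted map has the form p = sigmoid(a * score + b). A minimal plain-Python sketch of fitting a and b by gradient descent on the log loss follows. It omits the label smoothing from Platt's original procedure, and the function names, learning rate, and toy data are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.5, n_iter=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the mean log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        # Gradient of the mean log loss with respect to a and b
        errors = [sigmoid(a * s + b) - y for s, y in zip(scores, labels)]
        grad_a = sum(e * s for e, s in zip(errors, scores)) / n
        grad_b = sum(errors) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Toy decision scores and binary labels (hypothetical, not linearly separable)
scores_cal = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels_cal = [0, 0, 1, 0, 1, 1]

a, b = fit_platt(scores_cal, labels_cal)
calibrated = [sigmoid(a * s + b) for s in (-2.0, 0.0, 2.0)]
print(a, b, calibrated)
```

With only two parameters the map cannot overfit much, which is why Platt scaling remains usable on small calibration sets.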

Beta Calibration

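Beta calibration fits a three-parameter map derived from the assumption that scores follow a Beta distribution within each class. This offers more flexibility than Platt scaling for scores bounded in [0, 1] (e.g., already-softmaxed probabilities), and the family includes the identity map, so an already calibrated model can be left unchanged.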

from endgame.calibration import BetaCalibration

bc = BetaCalibration()
bc.fit(proba_cal, y_cal)   # uncalibrated probabilities in [0, 1]

calibrated_proba = bc.predict_proba(proba_test)
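
In the commonly cited formulation of beta calibration, the map is a logistic function of ln(s) and ln(1 - s) with parameters a, b, and c. The sketch below evaluates that map for given parameters. The function name is hypothetical and parameter fitting is left out; how BetaCalibration parameterises or fits the map internally is not shown here.

```python
import math

def beta_calibration_map(s, a, b, c):
    """Map an uncalibrated probability s in (0, 1) to a calibrated probability.

    The calibrated log-odds are a * ln(s) - b * ln(1 - s) + c.
    Inputs of exactly 0 or 1 would need clipping before the logarithms.
    """
    z = a * math.log(s) - b * math.log(1.0 - s) + c
    return 1.0 / (1.0 + math.exp(-z))

# a = b = 1, c = 0 gives the identity map
identity_out = beta_calibration_map(0.3, a=1.0, b=1.0, c=0.0)

# a = b = 0.5 pulls an overconfident 0.9 back towards 0.5
softened_out = beta_calibration_map(0.9, a=0.5, b=0.5, c=0.0)
print(identity_out, softened_out)
```

When a equals b the map reduces to Platt scaling applied to the log-odds of s; allowing a and b to differ lets the two ends of the probability range be corrected by different amounts.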

Isotonic Calibration

Isotonic regression fits a non-parametric monotone mapping from scores to probabilities. It can perfectly fit calibration data but may overfit with small calibration sets.

from endgame.calibration import IsotonicCalibration

ic = IsotonicCalibration()
ic.fit(proba_cal, y_cal)

calibrated_proba = ic.predict_proba(proba_test)

Evaluating Calibration Quality

CalibrationAnalyzer computes multiple calibration diagnostics and generates reliability diagrams.

from endgame.calibration import CalibrationAnalyzer

analyzer = CalibrationAnalyzer(n_bins=10, strategy='uniform')
analyzer.fit(proba_test, y_test)

# Scalar metrics
print(f"ECE  : {analyzer.ece_:.4f}")   # Expected Calibration Error
print(f"MCE  : {analyzer.mce_:.4f}")   # Maximum Calibration Error
print(f"Brier: {analyzer.brier_:.4f}") # Brier Score

# Reliability diagram (matplotlib figure)
fig = analyzer.plot_reliability_diagram(title="Model Calibration")
fig.savefig("reliability.png", dpi=150)

# Per-bin breakdown
print(analyzer.bin_stats_)  # DataFrame: bin_lower, bin_upper, fraction_pos, mean_conf, count
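
For reference, the three scalar metrics can be computed by hand as in the sketch below. It assumes binary labels, positive-class probabilities, and equal-width bins (matching strategy='uniform'); the function name is hypothetical, and CalibrationAnalyzer's exact definitions may differ, for instance in multiclass settings.

```python
def calibration_metrics(probs, labels, n_bins=10):
    """Return (ece, mce, brier) for binary positive-class probabilities."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        gap = abs(frac_pos - mean_conf)
        ece += len(members) / n * gap  # gap weighted by bin population
        mce = max(mce, gap)            # worst gap over non-empty bins
    brier = sum((p - y) ** 2 for p, y in zip(probs, labels)) / n
    return ece, mce, brier

# Toy predictions (hypothetical): four confident positives, two confident negatives
probs = [0.85, 0.85, 0.85, 0.85, 0.15, 0.15]
labels = [1, 1, 1, 0, 0, 0]

ece, mce, brier = calibration_metrics(probs, labels)
print(f"ECE={ece:.4f} MCE={mce:.4f} Brier={brier:.4f}")
```

ECE and MCE depend on the binning, so values are only comparable across models when n_bins and the binning strategy are held fixed; the Brier score has no such dependence.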

Choosing a Calibration Method

Method                        Best for
----------------------------  ----------------------------------------------------------------
TemperatureScaling            Neural networks with logit access; large calibration sets
PlattScaling                  SVM or other margin-based models; unimodal score distributions
BetaCalibration               Models outputting probabilities; flexible boundary handling
IsotonicCalibration           Large calibration sets; monotone but non-sigmoidal miscalibration
VennABERS                     Small calibration sets; validity under exchangeability alone; probability intervals
ConformalClassifier           Hard prediction sets with coverage guarantees
ConformalRegressor            Prediction intervals with coverage guarantees
ConformizedQuantileRegressor  Adaptive-width intervals; heteroscedastic regression

See Also