Calibration Guide¶
Endgame provides a calibration module covering conformal prediction,
Venn-ABERS calibration, and classical probability calibration methods. All classes
follow scikit-learn conventions (fit, plus predict or predict_proba where applicable).
Conformal Prediction (Classification)¶
ConformalClassifier wraps any classifier to produce prediction sets that contain
the true label with at least 1 - alpha marginal coverage. No distributional
assumptions are required beyond exchangeability.
from endgame.calibration import ConformalClassifier
from endgame.models import LGBMWrapper
from sklearn.model_selection import train_test_split
X_train, X_cal, y_train, y_cal = train_test_split(
    X, y, test_size=0.2, random_state=42
)
base = LGBMWrapper(preset='endgame')
base.fit(X_train, y_train)
cc = ConformalClassifier(
    estimator=base,
    alpha=0.1,     # target miscoverage rate; aims for 90% marginal coverage
    method='lac',  # 'lac' (softmax-based) or 'aps' (adaptive prediction sets)
)
cc.fit(X_cal, y_cal) # calibrate on hold-out set
# Returns a list of sets, one per test point
prediction_sets = cc.predict(X_test)
for i, pset in enumerate(prediction_sets[:5]):
    print(f"Sample {i}: possible classes = {pset}")
# Standard hard prediction uses the singleton with highest score
preds = cc.predict(X_test)
# Empirical coverage on a labelled evaluation set
cov = cc.coverage_score(X_eval, y_eval)
print(f"Empirical coverage: {cov:.3f}") # should be about 0.90 or higher, up to sampling noise
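The 'aps' score (Adaptive Prediction Sets) adapts set size to the difficulty of
each example, which tends to improve conditional coverage but yields larger sets
on average; the marginal coverage guarantee is the same for both scores. Use 'lac'
(Least Ambiguous Classifier) when the smallest average set size is the priority.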
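To make the calibration step concrete, here is a minimal, dependency-free sketch of split conformal classification with the LAC score. It is not Endgame's implementation; the helper names (conformal_quantile, lac_prediction_sets) and the toy probabilities are assumptions made for illustration only.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    # 1-indexed rank of the order statistic: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points: every class is kept
    return sorted(scores)[k - 1]

def lac_prediction_sets(cal_probs, cal_labels, test_probs, alpha):
    # LAC nonconformity score: 1 - probability assigned to the true class.
    scores = [1.0 - p[y] for p, y in zip(cal_probs, cal_labels)]
    q_hat = conformal_quantile(scores, alpha)
    # Keep every class whose score does not exceed the calibrated threshold.
    return [{c for c, pc in enumerate(p) if 1.0 - pc <= q_hat} for p in test_probs]

# Hypothetical 3-class probability vectors and true labels (calibration set).
cal_probs = [
    [0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8],
    [0.5, 0.4, 0.1], [0.2, 0.2, 0.6], [0.9, 0.05, 0.05],
]
cal_labels = [0, 1, 2, 0, 1, 2, 1, 2, 0]
test_probs = [[0.85, 0.10, 0.05], [0.45, 0.42, 0.13]]

sets = lac_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.2)
for i, pset in enumerate(sets):
    print(f"Sample {i}: possible classes = {pset}")
```

A confident prediction yields a singleton set, while a near-tie between two classes yields a set containing both.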
Conformal Prediction (Regression)¶
ConformalRegressor produces prediction intervals with guaranteed marginal
coverage. The width of intervals adapts automatically to the local difficulty of
each test point when a difficulty estimator is provided.
import numpy as np
from endgame.calibration import ConformalRegressor
from endgame.models import LGBMWrapper
base = LGBMWrapper(preset='endgame')
base.fit(X_train, y_train)
cr = ConformalRegressor(
    estimator=base,
    alpha=0.05,      # 95% coverage
    method='split',  # 'split' (fast) or 'cv+' (cross-conformal, slower)
)
cr.fit(X_cal, y_cal)
# Returns a tuple of (lower, upper) arrays
lower, upper = cr.predict_interval(X_test)
widths = upper - lower
print(f"Median interval width: {np.median(widths):.4f}")
cov = cr.coverage_score(X_eval, y_eval)
print(f"Empirical coverage: {cov:.3f}")
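As a rough illustration of what the 'split' method computes, the sketch below derives a single interval half-width from absolute residuals on a calibration set. The function name and the toy numbers are assumptions for illustration, not part of Endgame's API.

```python
import math

def split_conformal_halfwidth(cal_preds, cal_targets, alpha):
    """Half-width q such that [pred - q, pred + q] targets 1 - alpha coverage."""
    residuals = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(residuals)
    # 1-indexed rank with the finite-sample correction.
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else residuals[k - 1]

# Hypothetical base-model predictions and true targets on the calibration set.
cal_preds = [2.0, 3.5, 1.0, 4.0, 2.5, 3.0, 5.0, 1.5, 4.5]
cal_targets = [2.5, 3.0, 1.0, 5.0, 2.0, 3.25, 4.0, 1.75, 4.5]

q = split_conformal_halfwidth(cal_preds, cal_targets, alpha=0.2)

# Every test point gets the same width: prediction plus or minus q.
test_preds = [3.0, 6.0]
intervals = [(p - q, p + q) for p in test_preds]
print(intervals)
```

Because q is one number for the whole test set, plain split conformal intervals all have the same width unless the residuals are first normalised by a difficulty estimate.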
Conformalized Quantile Regression (CQR)¶
ConformizedQuantileRegressor combines a quantile regressor with conformal
calibration to produce adaptive intervals. Intervals are wider where the model is
less certain, unlike split conformal which uses a fixed residual threshold.
from endgame.calibration import ConformizedQuantileRegressor
from endgame.models import LGBMWrapper
# Base model must support quantile regression
qr = LGBMWrapper(objective='quantile', preset='endgame')
cqr = ConformizedQuantileRegressor(
    estimator=qr,
    alpha=0.1,           # 90% coverage target
    quantile_low=0.05,   # lower quantile for the base regressor
    quantile_high=0.95,  # upper quantile for the base regressor
)
cqr.fit(X_train, y_train, X_cal=X_cal, y_cal=y_cal)
lower, upper = cqr.predict_interval(X_test)
CQR is the recommended method when prediction intervals of varying width are
needed. The conformity score is max(q_low - y, y - q_high), so the calibration
step only stretches or shrinks the raw quantile interval by a single scalar.
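The conformity score above can be sketched in a few lines of plain Python. This is an illustrative sketch under assumed inputs (precomputed lower and upper quantile predictions); the function name cqr_correction is hypothetical and not taken from Endgame.

```python
import math

def cqr_correction(cal_low, cal_high, cal_targets, alpha):
    """Scalar added to both ends of the raw quantile interval."""
    # Score is positive when y falls outside [q_low, q_high], negative inside.
    scores = sorted(
        max(lo - y, y - hi)
        for lo, hi, y in zip(cal_low, cal_high, cal_targets)
    )
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else scores[k - 1]

# Hypothetical quantile predictions and targets on the calibration set.
cal_low = [1.0, 2.0, 0.0, 3.0, 1.5, 2.5, 0.5, 4.0, 3.5]
cal_high = [3.0, 4.0, 1.0, 6.0, 2.5, 5.0, 2.0, 5.0, 6.0]
cal_targets = [2.0, 4.5, 0.5, 2.5, 2.0, 3.0, 2.25, 4.5, 5.0]

q = cqr_correction(cal_low, cal_high, cal_targets, alpha=0.2)

# Raw test intervals of different widths keep their relative widths.
raw_test = [(1.0, 2.0), (3.0, 7.0)]
calibrated = [(lo - q, hi + q) for lo, hi in raw_test]
print(q, calibrated)
```

A positive q widens every interval by the same amount and a negative q narrows it, so the variation in width comes entirely from the underlying quantile regressor.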
Venn-ABERS Calibration¶
VennABERS produces well-calibrated probability estimates without requiring a
specific parametric form. For each sample it outputs a pair of probabilities
(p0, p1), one of which is guaranteed to be perfectly calibrated, assuming only
that the data are exchangeable.
from endgame.calibration import VennABERS
from endgame.models import LGBMWrapper
base = LGBMWrapper(preset='endgame')
base.fit(X_train, y_train)
va = VennABERS(estimator=base)
va.fit(X_cal, y_cal)
# Returns point probabilities (geometric mean of the interval bounds)
proba = va.predict_proba(X_test)
# Returns the full Venn-ABERS interval [p0, p1] per sample
intervals = va.predict_interval(X_test)
p0, p1 = intervals[:, 0], intervals[:, 1]
# Interval width indicates epistemic uncertainty
uncertainty = p1 - p0
Unlike Platt scaling or isotonic regression, Venn-ABERS does not require tuning and is valid for small calibration sets. It is particularly useful when the base model has poorly calibrated raw probabilities (e.g., a gradient boosting model).
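For intuition about where p0 and p1 come from, the following sketch implements a naive inductive Venn-ABERS predictor: the test score is added to the calibration set once with label 0 and once with label 1, and an isotonic regression is fitted each time. It is a slow, illustrative version with hypothetical helper names, not Endgame's implementation.

```python
def isotonic_fit_at(points, x):
    """Nondecreasing least-squares fit of labels on scores, evaluated at x.

    `points` is a list of (score, label) pairs and x must be one of the scores.
    """
    # Pool tied scores first, then run pool-adjacent-violators over the groups.
    grouped = {}
    for s, y in points:
        total, count = grouped.get(s, (0.0, 0))
        grouped[s] = (total + y, count + 1)
    blocks = []  # each block: [label_sum, count, largest_score_in_block]
    for s in sorted(grouped):
        total, count = grouped[s]
        blocks.append([total, count, s])
        # Merge while the previous block's mean exceeds the newest block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            t, c, hi = blocks.pop()
            blocks[-1][0] += t
            blocks[-1][1] += c
            blocks[-1][2] = hi
    # The fitted value at x is the mean of the block that contains x.
    for total, count, hi in blocks:
        if x <= hi:
            return total / count
    raise ValueError("x must be one of the scores in `points`")

def venn_abers_interval(cal_scores, cal_labels, test_score):
    cal = list(zip(cal_scores, cal_labels))
    p0 = isotonic_fit_at(cal + [(test_score, 0)], test_score)
    p1 = isotonic_fit_at(cal + [(test_score, 1)], test_score)
    return p0, p1

# Hypothetical calibration scores and binary labels.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
cal_labels = [0, 0, 1, 0, 1, 0, 1, 1]

p0, p1 = venn_abers_interval(cal_scores, cal_labels, test_score=0.5)
print(f"p0 = {p0:.3f}, p1 = {p1:.3f}, width = {p1 - p0:.3f}")
```

With only eight calibration points the interval is wide; it narrows as the calibration set grows, which is why its width is often read as a measure of uncertainty.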
Classical Probability Calibration¶
Temperature Scaling¶
Temperature scaling divides the logits of a neural network (or any model exposing
logits) by a single learnable scalar T. It is the most common post-hoc
calibration technique for deep learning.
from endgame.calibration import TemperatureScaling
ts = TemperatureScaling()
ts.fit(logits_cal, y_cal) # calibrate on logits (pre-softmax)
calibrated_proba = ts.predict_proba(logits_test)
print(f"Learned temperature: {ts.temperature_:.4f}")
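The idea can be sketched without any dependencies: divide the logits by T, apply softmax, and pick the T that minimises negative log-likelihood on the calibration set. The grid search below is an assumption chosen for simplicity (gradient-based optimisers are more usual), and the function names and toy logits are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits_batch, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(z, temperature)[y])
              for z, y in zip(logits_batch, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logits_batch, labels, grid=None):
    # Coarse grid over T in (0, 10]; a simplification for illustration.
    grid = grid or [0.05 * i for i in range(1, 201)]
    return min(grid, key=lambda t: nll(logits_batch, labels, t))

# Hypothetical overconfident logits: the fourth sample is confidently wrong.
logits_cal = [[4.0, 0.0], [5.0, 0.0], [0.0, 4.0],
              [3.0, 0.0], [0.0, 6.0], [4.0, 0.0]]
y_cal = [0, 0, 1, 1, 1, 0]

T = fit_temperature(logits_cal, y_cal)
print(f"Learned temperature: {T:.2f}")
print("Before:", softmax(logits_cal[0]))
print("After: ", softmax(logits_cal[0], T))
```

A learned T above 1 softens the probabilities without changing which class has the highest score, so accuracy is unaffected.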
Platt Scaling¶
Platt scaling fits a logistic regression on the model’s raw scores. It works best when the class-conditional score distributions are approximately Gaussian with similar variance, which makes the true calibration map sigmoid-shaped.
from endgame.calibration import PlattScaling
ps = PlattScaling()
ps.fit(scores_cal, y_cal) # 1D array of decision scores
calibrated_proba = ps.predict_proba(scores_test)
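Under the hood this amounts to fitting two parameters, a slope a and an intercept b, so that sigmoid(a * score + b) matches the labels. The sketch below uses plain gradient descent on log loss and omits the target smoothing from Platt's original method; the names and data are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.1, n_iter=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Hypothetical decision scores (e.g. SVM margins) with noisy binary labels.
scores_cal = [-2.5, -1.5, -1.0, -0.5, 0.2, 0.5, 1.0, 1.8, 2.2, 3.0]
y_cal = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

a, b = fit_platt(scores_cal, y_cal)
probs = [sigmoid(a * s + b) for s in scores_cal]
print(f"a = {a:.3f}, b = {b:.3f}")
```

Because the map is a monotone sigmoid, the ranking of samples by score is preserved; only the probability values change.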
Beta Calibration¶
Beta calibration fits a three-parameter map derived from assuming Beta-distributed class-conditional scores, offering more flexibility than Platt scaling for scores bounded in [0, 1] (e.g., already-softmaxed probabilities).
from endgame.calibration import BetaCalibration
bc = BetaCalibration()
bc.fit(proba_cal, y_cal) # uncalibrated probabilities in [0, 1]
calibrated_proba = bc.predict_proba(proba_test)
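For reference, the beta calibration family is usually written as a logistic function of log(s) and log(1 - s) with parameters a, b, and c. The sketch below only evaluates that map for given parameters; fitting is omitted, and the function name and the eps clipping are assumptions for illustration.

```python
import math

def beta_calibration_map(s, a, b, c, eps=1e-12):
    """Evaluate sigmoid(a * ln(s) - b * ln(1 - s) + c) for a score s in [0, 1]."""
    s = min(max(s, eps), 1.0 - eps)  # clip so both logarithms stay finite
    z = a * math.log(s) - b * math.log(1.0 - s) + c
    return 1.0 / (1.0 + math.exp(-z))

# a = b = 1, c = 0 recovers the identity map (no recalibration).
identity = beta_calibration_map(0.3, a=1.0, b=1.0, c=0.0)

# a = b < 1 pulls overconfident scores toward 0.5.
softened = beta_calibration_map(0.9, a=0.5, b=0.5, c=0.0)

print(identity, softened)
```

Since the identity is a member of the family, an already well-calibrated model can be left essentially unchanged, which a plain sigmoid fit on probabilities cannot do.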
Isotonic Calibration¶
Isotonic regression fits a non-parametric monotone mapping from scores to probabilities. It can perfectly fit calibration data but may overfit with small calibration sets.
from endgame.calibration import IsotonicCalibration
ic = IsotonicCalibration()
ic.fit(proba_cal, y_cal)
calibrated_proba = ic.predict_proba(proba_test)
Evaluating Calibration Quality¶
CalibrationAnalyzer computes multiple calibration diagnostics and generates
reliability diagrams.
from endgame.calibration import CalibrationAnalyzer
analyzer = CalibrationAnalyzer(n_bins=10, strategy='uniform')
analyzer.fit(proba_test, y_test)
# Scalar metrics
print(f"ECE : {analyzer.ece_:.4f}") # Expected Calibration Error
print(f"MCE : {analyzer.mce_:.4f}") # Maximum Calibration Error
print(f"Brier: {analyzer.brier_:.4f}") # Brier Score
# Reliability diagram (matplotlib figure)
fig = analyzer.plot_reliability_diagram(title="Model Calibration")
fig.savefig("reliability.png", dpi=150)
# Per-bin breakdown
print(analyzer.bin_stats_) # DataFrame: bin_lower, bin_upper, fraction_pos, mean_conf, count
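The three scalar metrics are simple to compute by hand, which can help when checking results. The sketch below assumes binary labels, positive-class probabilities, and equal-width bins; the function name calibration_metrics is hypothetical and the binning details may differ from CalibrationAnalyzer.

```python
def calibration_metrics(probs, labels, n_bins=10):
    """Return (ECE, MCE, Brier) for positive-class probabilities and 0/1 labels."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(p for p, _ in members) / len(members)
        fraction_pos = sum(y for _, y in members) / len(members)
        gap = abs(fraction_pos - mean_conf)
        ece += len(members) / n * gap  # gap weighted by bin population
        mce = max(mce, gap)            # worst gap over non-empty bins
    brier = sum((p - y) ** 2 for p, y in zip(probs, labels)) / n
    return ece, mce, brier

# Hypothetical predictions: the high-confidence bin is right only half the time.
probs = [0.1, 0.1, 0.9, 0.9]
labels = [0, 0, 1, 0]

ece, mce, brier = calibration_metrics(probs, labels)
print(f"ECE : {ece:.4f}")
print(f"MCE : {mce:.4f}")
print(f"Brier: {brier:.4f}")
```

ECE and MCE depend on the number of bins and the binning strategy, so values are only comparable across models when those settings match.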
Choosing a Calibration Method¶
| Method | Best for |
|---|---|
| TemperatureScaling | Neural networks with logit access; large calibration sets |
| PlattScaling | SVM or other margin-based models; unimodal score distributions |
| BetaCalibration | Models outputting probabilities; flexible boundary handling |
| IsotonicCalibration | Large calibration sets; non-sigmoid (but monotone) miscalibration patterns |
| VennABERS | Small calibration sets; no distributional assumptions; individual guarantees |
| ConformalClassifier | Hard prediction sets with coverage guarantees |
| ConformalRegressor | Prediction intervals with coverage guarantees |
| ConformizedQuantileRegressor | Adaptive-width intervals; heteroscedastic regression |
See Also¶
Ensembles Guide for combining calibrated models
Models Guide for base model options