Utils

endgame.utils.quadratic_weighted_kappa(y_true, y_pred, labels=None)

Quadratic Weighted Kappa (QWK) metric.

Used in education competitions (e.g., essay scoring). Measures agreement between two sets of ordinal ratings, penalizing each disagreement by the squared distance between the two labels.

Parameters:

y_true (array-like) – True labels.
y_pred (array-like) – Predicted labels.
labels (List[int], optional) – List of labels to use for the confusion matrix.

Return type:

float

Returns:

float – QWK score in range [-1, 1], where 1 is perfect agreement.

Examples

>>> y_true = [1, 2, 3, 4, 5]
>>> y_pred = [1, 2, 3, 4, 4]
>>> qwk = quadratic_weighted_kappa(y_true, y_pred)
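
As a rough illustration of what the metric computes, the following is a minimal NumPy sketch of the textbook QWK formula. The name `qwk_sketch` is hypothetical and this is not necessarily how `quadratic_weighted_kappa` is implemented; in particular, inferring `labels` from the data when it is omitted is an assumption.

```python
import numpy as np

def qwk_sketch(y_true, y_pred, labels=None):
    """Quadratic weighted kappa from an explicit confusion matrix (illustrative)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if labels is None:
        # Assumption: use every label seen in either rating, in sorted order.
        labels = np.unique(np.concatenate([y_true, y_pred]))
    labels = list(labels)
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)

    # Observed counts: rows are true labels, columns are predicted labels.
    observed = np.zeros((n, n))
    for t, p in zip(y_true, y_pred):
        observed[index[t], index[p]] += 1

    # Expected counts if the two ratings were statistically independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    # Quadratic disagreement weights: 0 on the diagonal, growing with distance.
    grid = np.arange(n)
    weights = (grid[:, None] - grid[None, :]) ** 2 / max((n - 1) ** 2, 1)

    denom = (weights * expected).sum()
    if denom == 0:
        return 1.0  # a single label everywhere: agreement is trivially perfect
    return 1.0 - (weights * observed).sum() / denom

print(qwk_sketch([1, 2, 3, 4, 5], [1, 2, 3, 4, 4]))
```

For the example inputs above, one prediction is off by one category, so the score is slightly below 1.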

endgame.utils.map_at_k(y_true, y_pred, k=5)

Mean Average Precision @ K.

For ranking competitions where each sample has multiple relevant items.

Parameters:

y_true (List[List[int]]) – List of relevant item indices for each sample.
y_pred (List[List[int]]) – List of predicted item indices (ranked) for each sample.
k (int, default=5) – Number of predictions to consider.

Return type:

float

Returns:

float – MAP@K score.

Examples

>>> y_true = [[1, 2, 3], [4, 5]]
>>> y_pred = [[1, 3, 5, 2, 4], [4, 1, 5, 2, 3]]
>>> score = map_at_k(y_true, y_pred, k=5)
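
To make the definition concrete, here is a minimal pure-Python sketch of MAP@K. The names `apk` and `map_at_k_sketch` are hypothetical, and the normalization by `min(len(actual), k)` follows a common competition convention; the library may normalize differently.

```python
def apk(actual, predicted, k=5):
    """Average precision at k for a single sample (illustrative)."""
    if not actual:
        return 0.0
    relevant = set(actual)
    seen = set()
    hits = 0
    score = 0.0
    for rank, item in enumerate(predicted[:k], start=1):
        # Count each relevant item once, at the first rank where it appears.
        if item in relevant and item not in seen:
            hits += 1
            score += hits / rank  # precision at this rank
        seen.add(item)
    return score / min(len(relevant), k)

def map_at_k_sketch(y_true, y_pred, k=5):
    """Mean of the per-sample average precision values."""
    return sum(apk(a, p, k) for a, p in zip(y_true, y_pred)) / len(y_true)

y_true = [[1, 2, 3], [4, 5]]
y_pred = [[1, 3, 5, 2, 4], [4, 1, 5, 2, 3]]
print(map_at_k_sketch(y_true, y_pred, k=5))
```

Under this convention the first sample scores (1 + 1 + 3/4) / 3 and the second (1 + 2/3) / 2, which average to 0.875.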

endgame.utils.ndcg_at_k(y_true, y_pred, k=10)

Normalized Discounted Cumulative Gain @ K.

Used in ranking competitions.

Parameters:

y_true (array-like) – True relevance scores.
y_pred (array-like) – Predicted scores.
k (int, default=10) – Number of predictions to consider.

Return type:

float

Returns:

float – NDCG@K score in [0, 1].
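
The following sketch shows one common way NDCG@K is computed for a single query. It assumes a linear gain and a log2 rank discount; some variants use an exponential gain (2^rel - 1), and the library may differ. The names `dcg_at_k` and `ndcg_at_k_sketch` are hypothetical.

```python
import numpy as np

def dcg_at_k(relevance, k):
    """Discounted cumulative gain of the first k relevance values (linear gain)."""
    relevance = np.asarray(relevance, dtype=float)[:k]
    discounts = np.log2(np.arange(2, relevance.size + 2))  # log2(rank + 1)
    return float(np.sum(relevance / discounts))

def ndcg_at_k_sketch(y_true, y_pred, k=10):
    """NDCG@K for one query: DCG of the predicted order over the ideal DCG."""
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(y_pred)[::-1]   # highest predicted score first
    ideal = np.sort(y_true)[::-1]      # best possible ordering
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items: define the score as 0
    return dcg_at_k(y_true[order], k) / ideal_dcg

print(ndcg_at_k_sketch([3, 2, 1, 0], [0.9, 0.8, 0.3, 0.1], k=4))  # ideal ordering
print(ndcg_at_k_sketch([3, 2, 1, 0], [0.1, 0.3, 0.8, 0.9], k=4))  # reversed ordering
```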

endgame.utils.competition_metric(metric_name)

Get metric function by name.

Handles both sklearn metrics and competition-specific metrics.

Parameters:

metric_name (str) – Metric name: ‘qwk’, ‘map_at_k’, ‘ndcg’, ‘mcrmse’, etc.

Return type:

Callable

Returns:

Callable – Metric function.
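
A name-to-function lookup like this is often built as a small registry. The sketch below is illustrative only: `get_metric`, `_REGISTRY`, and the two metric functions are hypothetical names, the sklearn fallback described above is omitted, and the error raised for an unknown name is an assumption.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over all values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mcrmse(y_true, y_pred):
    """Mean column-wise RMSE: RMSE per target column, then averaged."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Treat 1-D input as a single target column.
    y_true = y_true.reshape(len(y_true), -1)
    y_pred = y_pred.reshape(len(y_pred), -1)
    per_column = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))
    return float(per_column.mean())

# Hypothetical registry mapping metric names to callables.
_REGISTRY = {"rmse": rmse, "mcrmse": mcrmse}

def get_metric(metric_name):
    """Return the metric callable registered under metric_name."""
    try:
        return _REGISTRY[metric_name.lower()]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"Unknown metric {metric_name!r}; known metrics: {known}")

metric = get_metric("mcrmse")
print(metric([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.5], [3.0, 3.5]]))
```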

class endgame.utils.SubmissionHelper(id_col='id', target_col='target', float_precision=6)

Bases: object

Helper for generating properly formatted submission files.

Handles common submission formats for Kaggle competitions.

Parameters:

id_col (str, default='id') – Name of the ID column.
target_col (str or List[str], default='target') – Name(s) of the target column(s).
float_precision (int, default=6) – Decimal places for float values.

Examples

>>> helper = SubmissionHelper(id_col='Id', target_col='Prediction')
>>> helper.to_csv(predictions, ids, 'submission.csv')
>>> helper.validate('submission.csv', 'sample_submission.csv')

to_csv(predictions, ids=None, filepath='submission.csv', sample_submission=None)

Generate submission CSV file.

Parameters:

predictions (array-like) – Predicted values.
ids (array-like, optional) – Sample IDs. If None, uses 0, 1, 2, …
filepath (str, default='submission.csv') – Output file path.
sample_submission (str, optional) – Path to sample submission for ID extraction.

Return type:

str

Returns:

str – Path to generated submission file.
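
To show the file layout this method is described as producing, here is a standard-library sketch of writing a two-column submission. The function `write_submission` is hypothetical; it handles a single target column only and ignores `sample_submission`, so it should be read as an approximation of the documented behavior rather than the actual method.

```python
import csv

def write_submission(predictions, ids=None, filepath="submission.csv",
                     id_col="id", target_col="target", float_precision=6):
    """Write an (id, target) CSV and return its path (illustrative)."""
    predictions = list(predictions)
    if ids is None:
        ids = range(len(predictions))  # default IDs: 0, 1, 2, ...
    ids = list(ids)
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")

    with open(filepath, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([id_col, target_col])
        for row_id, value in zip(ids, predictions):
            if isinstance(value, float):
                # Fixed number of decimal places, as float_precision suggests.
                value = f"{value:.{float_precision}f}"
            writer.writerow([row_id, value])
    return filepath
```

Calling `write_submission([0.5, 0.25], filepath="submission.csv")` would produce a header row followed by `0,0.500000` and `1,0.250000`.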

validate(submission_path, sample_submission_path)

Validate submission against sample submission.

Parameters:

submission_path (str) – Path to submission file.
sample_submission_path (str) – Path to sample submission file.

Return type:

Dict[str, Any]

Returns:

Dict[str, Any] – Validation results with keys:

- valid: bool
- errors: List[str]
- warnings: List[str]
- n_rows: int
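
The sketch below illustrates the kind of checks such a validation might run and how the result dictionary could be assembled. The specific checks (header match, row count, ID order, empty cells) and the function name `validate_submission` are assumptions; the source only documents the keys of the returned dictionary.

```python
import csv

def validate_submission(submission_path, sample_submission_path):
    """Compare a submission CSV with the sample submission (illustrative)."""
    def read(path):
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        return (rows[0], rows[1:]) if rows else ([], [])

    sub_header, sub_rows = read(submission_path)
    ref_header, ref_rows = read(sample_submission_path)
    errors, warnings = [], []

    if sub_header != ref_header:
        errors.append(f"Column mismatch: {sub_header} != {ref_header}")
    if len(sub_rows) != len(ref_rows):
        errors.append(f"Row count mismatch: {len(sub_rows)} != {len(ref_rows)}")
    elif [r[:1] for r in sub_rows] != [r[:1] for r in ref_rows]:
        # Same length but different first-column values or ordering.
        warnings.append("IDs differ from the sample submission")
    if any(cell == "" for row in sub_rows for cell in row):
        errors.append("Submission contains empty cells")

    return {
        "valid": not errors,
        "errors": errors,
        "warnings": warnings,
        "n_rows": len(sub_rows),
    }
```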

from_oof_predictions(oof_models, X_test, weights=None, ids=None, filepath='submission.csv')

Generate submission from OOF models.

Parameters:

oof_models (List) – List of trained models (from cross-validation).
X_test (array-like) – Test features.
weights (Dict[int, float], optional) – Model weights. If None, uses uniform weights.
ids (array-like, optional) – Test sample IDs.
filepath (str) – Output file path.

Return type:

str

Returns:

str – Path to submission file.
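
The blending step implied by the `weights` parameter can be sketched as a weighted average of per-model test predictions. This is an illustration under assumptions: `blend_fold_predictions` is a hypothetical name, each model is assumed to expose a `predict` method, and weights are assumed to be keyed by model position and normalized to sum to 1.

```python
import numpy as np

def blend_fold_predictions(models, X_test, weights=None):
    """Weighted average of test predictions from cross-validation models."""
    # Shape (n_models, n_samples) for single-target predictions.
    preds = np.stack([np.asarray(m.predict(X_test), dtype=float) for m in models])
    if weights is None:
        w = np.full(len(models), 1.0 / len(models))  # uniform weights
    else:
        w = np.array([weights[i] for i in range(len(models))], dtype=float)
        w = w / w.sum()  # assumption: normalize so the weights sum to 1
    # Contract the model axis: sum_i w[i] * preds[i].
    return np.tensordot(w, preds, axes=1)
```

The blended array could then be passed to `to_csv` together with the test IDs.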

class endgame.utils.SeedEverything(seed=42, restore=False)

Bases: object

Context manager for reproducible experiments.

Sets random seeds on entry and optionally restores state on exit.

Parameters:

seed (int, default=42) – Random seed to use.
restore (bool, default=False) – Whether to restore random state on exit.

Examples

>>> with SeedEverything(42):
...     # Reproducible code here
...     pass

>>> seed_ctx = SeedEverything(42)
>>> with seed_ctx:
...     result = train_model()
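
The entry and exit behavior described above can be sketched as follows. This is a simplified illustration covering only Python's `random` module and NumPy's global generator; the class name `SeedContext` is hypothetical and the real class may also handle other libraries.

```python
import random
import numpy as np

class SeedContext:
    """Seed RNGs on entry; optionally restore the previous state on exit."""

    def __init__(self, seed=42, restore=False):
        self.seed = seed
        self.restore = restore
        self._saved = None

    def __enter__(self):
        if self.restore:
            # Remember the current generator states so they can be put back.
            self._saved = (random.getstate(), np.random.get_state())
        random.seed(self.seed)
        np.random.seed(self.seed)
        return self

    def __exit__(self, exc_type, exc, tb):
        if self._saved is not None:
            py_state, np_state = self._saved
            random.setstate(py_state)
            np.random.set_state(np_state)
            self._saved = None
        return False  # never suppress exceptions

with SeedContext(0):
    first = np.random.rand(3)
with SeedContext(0):
    second = np.random.rand(3)
print(np.array_equal(first, second))
```

With `restore=True`, random draws made after the block continue from the state that was active before it, so seeding a local experiment does not disturb the surrounding program.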

endgame.utils.seed_everything(seed=42)

Set random seeds for reproducibility.

Sets seeds for:

- Python random
- NumPy
- PyTorch (if available)
- TensorFlow (if available)
- CUDA (if available)

Also sets environment variables for deterministic behavior.

Parameters:

seed (int, default=42) – Random seed to use.

Return type:

None

Examples

>>> from endgame.utils import seed_everything
>>> seed_everything(42)

endgame.utils.sharpe_ratio(returns, risk_free_rate=0.0, annualization_factor=252.0)

Calculate the annualized Sharpe ratio.

Parameters:

returns (np.ndarray) – Array of periodic returns.
risk_free_rate (float, default=0.0) – Risk-free rate (same period as returns).
annualization_factor (float, default=252.0) – Factor to annualize (252 for daily, 12 for monthly, 52 for weekly).

Return type:

float

Returns:

float – Annualized Sharpe ratio.

Examples

>>> import numpy as np
>>> returns = np.random.randn(252) * 0.01 + 0.0005  # Daily returns
>>> sr = sharpe_ratio(returns)
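
For reference, the usual annualized Sharpe ratio is the mean excess return divided by its standard deviation, scaled by the square root of the annualization factor. The sketch below assumes the sample standard deviation (`ddof=1`) and returns 0 for zero volatility; `sharpe_sketch` is a hypothetical name and the library's conventions may differ.

```python
import numpy as np

def sharpe_sketch(returns, risk_free_rate=0.0, annualization_factor=252.0):
    """Annualized Sharpe ratio of periodic returns (illustrative)."""
    excess = np.asarray(returns, dtype=float) - risk_free_rate
    std = excess.std(ddof=1)  # assumption: sample standard deviation
    if std == 0:
        return 0.0  # assumption: treat zero volatility as SR = 0
    return float(excess.mean() / std * np.sqrt(annualization_factor))

rng = np.random.default_rng(0)
daily = rng.normal(0.0005, 0.01, size=252)
print(sharpe_sketch(daily))
```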

endgame.utils.sharpe_ratio_std(sharpe, n_obs, skewness=0.0, kurtosis=3.0)

Calculate the standard error of the Sharpe ratio estimate.

Uses the Lo (2002) / Mertens (2002) correction for non-normality.

Parameters:

sharpe (float) – Estimated Sharpe ratio.
n_obs (int) – Number of observations.
skewness (float, default=0.0) – Skewness of returns.
kurtosis (float, default=3.0) – Kurtosis of returns (not excess kurtosis).

Return type:

float

Returns:

float – Standard error of the Sharpe ratio.

Notes

The formula accounts for:

- Sampling variability
- Non-normal returns (skewness and fat tails)
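
A commonly cited form of this standard error is sqrt((1 - skew * SR + (kurt - 1) / 4 * SR^2) / (n - 1)), with SR expressed per observation period. The sketch below implements that expression; the use of n - 1 rather than n, and the name `sharpe_se_sketch`, are assumptions rather than confirmed details of `sharpe_ratio_std`.

```python
import math

def sharpe_se_sketch(sharpe, n_obs, skewness=0.0, kurtosis=3.0):
    """Standard error of a Sharpe ratio estimate with a non-normality correction."""
    variance = (
        1.0
        - skewness * sharpe                      # negative skew widens the error
        + (kurtosis - 1.0) / 4.0 * sharpe ** 2   # fat tails widen the error
    ) / (n_obs - 1)
    return math.sqrt(variance)

print(sharpe_se_sketch(0.1, 252))                               # normal returns
print(sharpe_se_sketch(0.1, 252, skewness=-0.5, kurtosis=6.0))  # skewed, fat-tailed
```

With normal returns (skewness 0, kurtosis 3) and SR = 0, the expression reduces to 1 / sqrt(n - 1).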

References

Lo, A. (2002). “The Statistics of Sharpe Ratios.” Financial Analysts Journal, 58(4), 36-52.

endgame.utils.probabilistic_sharpe_ratio(sharpe, benchmark_sharpe, n_obs, skewness=0.0, kurtosis=3.0)

Calculate the Probabilistic Sharpe Ratio (PSR).

PSR is the probability that the true Sharpe ratio exceeds the benchmark, accounting for non-normality of returns.

Parameters:

sharpe (float) – Estimated Sharpe ratio.
benchmark_sharpe (float) – Benchmark Sharpe ratio to compare against.
n_obs (int) – Number of observations.
skewness (float, default=0.0) – Skewness of returns.
kurtosis (float, default=3.0) – Kurtosis of returns (not excess kurtosis).

Return type:

float

Returns:

float – Probability in [0, 1] that true SR > benchmark SR.

Examples

>>> # Test if strategy beats SR = 0
>>> psr = probabilistic_sharpe_ratio(sharpe=1.5, benchmark_sharpe=0,
...                                   n_obs=252, skewness=-0.2, kurtosis=4.0)
>>> print(f"Probability true SR > 0: {psr:.2%}")

Notes

PSR corrects for:

- Sample length (finite track record)
- Non-normal returns (skewness and fat tails)

It does NOT correct for multiple testing - use DSR for that.
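
In the cited paper, PSR is the standard normal CDF evaluated at the gap between the estimated and benchmark Sharpe ratios, divided by the standard error of the estimate. The sketch below follows that definition using only the standard library. It assumes both Sharpe ratios are expressed per observation period, and `psr_sketch` is a hypothetical name rather than the library function.

```python
import math
from statistics import NormalDist

def psr_sketch(sharpe, benchmark_sharpe, n_obs, skewness=0.0, kurtosis=3.0):
    """Probability that the true SR exceeds the benchmark SR (illustrative)."""
    # Standard error of the SR estimate, corrected for skewness and kurtosis.
    se = math.sqrt(
        (1.0 - skewness * sharpe + (kurtosis - 1.0) / 4.0 * sharpe ** 2)
        / (n_obs - 1)
    )
    z = (sharpe - benchmark_sharpe) / se
    return NormalDist().cdf(z)

print(f"{psr_sketch(0.1, 0.0, n_obs=253):.2%}")
```

When the estimate equals the benchmark, the result is exactly 0.5, and longer track records push the probability toward 0 or 1.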

References

Bailey, D.H. and López de Prado, M. (2012). “The Sharpe Ratio Efficient Frontier.” Journal of Risk, 15(2), 3-44.

endgame.utils.expected_max_sharpe(n_trials, sharpe_std, mean_sharpe=0.0)

Calculate the expected maximum Sharpe ratio under the null hypothesis.

This is the expected maximum SR when all strategies have true SR = mean_sharpe, but we observe inflated values due to multiple testing.

Parameters:

n_trials (int) – Number of independent trials/strategies tested.
sharpe_std (float) – Standard deviation of Sharpe ratio estimates across trials.
mean_sharpe (float, default=0.0) – Mean Sharpe ratio under null (typically 0).

Return type:

float

Returns:

float – Expected maximum Sharpe ratio E[max{SR_i}].

Notes

Uses the approximation from Bailey & López de Prado (2014):

E[max{SR}] ≈ μ + σ * [(1-γ)*Φ^(-1)(1-1/N) + γ*Φ^(-1)(1-1/(N*e))]

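where μ is mean_sharpe, σ is sharpe_std, N is n_trials, Φ^(-1) is the inverse standard normal CDF, e is Euler’s number, and γ ≈ 0.5772 is the Euler-Mascheroni constant.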
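
The approximation above translates almost directly into code. The following standard-library sketch uses the hypothetical name `expected_max_sketch`; returning the mean when only one trial is given is an assumption, made because the inverse CDF is undefined at probability 0.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def expected_max_sketch(n_trials, sharpe_std, mean_sharpe=0.0):
    """Approximate E[max SR] across n_trials independent trials (illustrative)."""
    if n_trials < 2:
        return mean_sharpe  # assumption: a single trial has no selection effect
    inv_cdf = NormalDist().inv_cdf
    z = (
        (1.0 - EULER_GAMMA) * inv_cdf(1.0 - 1.0 / n_trials)
        + EULER_GAMMA * inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    )
    return mean_sharpe + sharpe_std * z

for n in (10, 100, 1000):
    print(n, round(expected_max_sketch(n, sharpe_std=0.5), 3))
```

The value grows with the number of trials, which is why a high Sharpe ratio found after many backtests is weaker evidence than the same ratio from a single test.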

Examples

>>> # After 100 trials, what SR do we expect by chance?
>>> e_max = expected_max_sharpe(n_trials=100, sharpe_std=0.5)
>>> print(f"Expected max SR: {e_max:.2f}")

endgame.utils.deflated_sharpe_ratio(sharpe, n_trials, sharpe_std_trials, n_obs, skewness=0.0, kurtosis=3.0, mean_sharpe_null=0.0)

Calculate the Deflated Sharpe Ratio (DSR).

DSR corrects for multiple testing by computing the probability that the observed Sharpe ratio exceeds the expected maximum SR under the null hypothesis that all strategies have zero true SR.

Parameters:

sharpe (float) – Estimated Sharpe ratio of the selected strategy.
n_trials (int) – Number of independent trials/strategies tested.
sharpe_std_trials (float) – Standard deviation of Sharpe ratios across all trials.
n_obs (int) – Number of observations (track record length).
skewness (float, default=0.0) – Skewness of returns.
kurtosis (float, default=3.0) – Kurtosis of returns (not excess kurtosis).
mean_sharpe_null (float, default=0.0) – Mean Sharpe ratio under null hypothesis.

Return type:

float

Returns:

float – Deflated Sharpe Ratio in [0, 1].

Examples

>>> # Tested 100 strategies, best has SR = 2.0
>>> dsr = deflated_sharpe_ratio(
...     sharpe=2.0,
...     n_trials=100,
...     sharpe_std_trials=0.5,
...     n_obs=252,
...     skewness=-0.3,
...     kurtosis=4.5,
... )
>>> print(f"DSR: {dsr:.2%}")
>>> # If DSR < 0.95, the strategy may be a statistical fluke

Notes

DSR answers: “What is the probability that this strategy would have beaten random chance, given that we tested N strategies?”

A DSR of 0.95 means there’s a 95% probability that the strategy’s performance is real and not due to overfitting from multiple testing.
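
Conceptually, DSR is a PSR whose benchmark is the expected maximum Sharpe ratio under the null. The sketch below combines the two pieces in a single hypothetical function, `dsr_sketch`. It assumes all Sharpe ratios are per observation period and uses n - 1 in the standard error; it is not necessarily identical to the library implementation.

```python
import math
from statistics import NormalDist

_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def dsr_sketch(sharpe, n_trials, sharpe_std_trials, n_obs,
               skewness=0.0, kurtosis=3.0, mean_sharpe_null=0.0):
    """Deflated Sharpe ratio: PSR against the expected maximum SR (illustrative)."""
    norm = NormalDist()
    if n_trials > 1:
        z_max = (
            (1.0 - _GAMMA) * norm.inv_cdf(1.0 - 1.0 / n_trials)
            + _GAMMA * norm.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
        )
    else:
        z_max = 0.0  # assumption: one trial means no multiple-testing penalty
    threshold = mean_sharpe_null + sharpe_std_trials * z_max

    # Standard error of the SR estimate with the non-normality correction.
    se = math.sqrt(
        (1.0 - skewness * sharpe + (kurtosis - 1.0) / 4.0 * sharpe ** 2)
        / (n_obs - 1)
    )
    return norm.cdf((sharpe - threshold) / se)

print(f"{dsr_sketch(0.2, n_trials=1, sharpe_std_trials=0.05, n_obs=253):.2%}")
print(f"{dsr_sketch(0.2, n_trials=100, sharpe_std_trials=0.05, n_obs=253):.2%}")
```

The same observed Sharpe ratio yields a lower DSR when it was selected from 100 trials than when it came from a single trial.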

References

Bailey, D.H. and López de Prado, M. (2014). “The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality.” The Journal of Portfolio Management, 40(5), 94-107.

endgame.utils.analyze_sharpe(returns, n_trials=1, sharpe_std_trials=None, all_sharpes=None, risk_free_rate=0.0, annualization_factor=252.0, significance_level=0.05)

Comprehensive Sharpe ratio analysis with multiple testing correction.

Parameters:

returns (np.ndarray) – Array of periodic returns for the selected strategy.
n_trials (int, default=1) – Number of independent trials/strategies tested.
sharpe_std_trials (float, optional) – Standard deviation of Sharpe ratios across all trials. If not provided and all_sharpes is given, computed from all_sharpes. If neither provided, estimated as 1/sqrt(n_obs).
all_sharpes (np.ndarray, optional) – Sharpe ratios of all tested strategies (for computing variance).
risk_free_rate (float, default=0.0) – Risk-free rate (same period as returns).
annualization_factor (float, default=252.0) – Factor to annualize Sharpe ratio.
significance_level (float, default=0.05) – Significance level for hypothesis testing.

Return type:

SharpeAnalysis

Returns:

SharpeAnalysis – Comprehensive analysis results.

Examples

>>> # Single strategy analysis
>>> import numpy as np
>>> returns = np.random.randn(252) * 0.01 + 0.0005
>>> analysis = analyze_sharpe(returns)
>>> print(f"SR: {analysis.sharpe_ratio:.2f}")
>>> print(f"PSR (SR > 0): {analysis.probabilistic_sharpe:.2%}")

>>> # Multiple testing scenario: 100 strategies tested, best one selected
>>> all_returns = np.random.randn(100, 252) * 0.01
>>> all_sharpes = np.array([sharpe_ratio(r) for r in all_returns])
>>> best_returns = all_returns[np.argmax(all_sharpes)]
>>> analysis = analyze_sharpe(
...     returns=best_returns,
...     n_trials=100,
...     all_sharpes=all_sharpes,
... )
>>> print(f"DSR: {analysis.deflated_sharpe:.2%}")
>>> print(f"Significant: {analysis.is_significant}")

endgame.utils.minimum_track_record_length(sharpe, benchmark_sharpe=0.0, confidence=0.95, skewness=0.0, kurtosis=3.0)

Calculate minimum track record length needed for statistical significance.

Answers: “How many observations do we need to be confident that the strategy’s Sharpe ratio is real?”

Parameters:

sharpe (float) – Target Sharpe ratio.
benchmark_sharpe (float, default=0.0) – Benchmark to beat.
confidence (float, default=0.95) – Required confidence level.
skewness (float, default=0.0) – Expected skewness of returns.
kurtosis (float, default=3.0) – Expected kurtosis of returns.

Return type:

int

Returns:

int – Minimum number of observations needed.

Examples

>>> # How long to verify SR = 1.0 strategy?
>>> n_min = minimum_track_record_length(sharpe=1.0)
>>> print(f"Need at least {n_min} observations")

Notes

This is the “MinTRL” from Bailey & López de Prado (2012).

A strategy with SR = 2.0 and normal returns needs only ~16 observations. A strategy with SR = 0.5 needs ~256 observations!

endgame.utils.haircut_sharpe_ratio(sharpe, n_trials, sharpe_std_trials=0.5)

Apply haircut to Sharpe ratio for multiple testing.

Returns an adjusted Sharpe ratio that accounts for data mining.

Parameters:

sharpe (float) – Observed Sharpe ratio.
n_trials (int) – Number of strategies tested.
sharpe_std_trials (float, default=0.5) – Standard deviation of SR estimates across trials.

Return type:

tuple[float, float]

Returns:

Tuple[float, float] – (haircut_sharpe, haircut_percent):

- haircut_sharpe: Adjusted Sharpe ratio
- haircut_percent: Percentage reduction applied

Examples

>>> sr_adj, haircut = haircut_sharpe_ratio(sharpe=2.0, n_trials=100)
>>> print(f"Adjusted SR: {sr_adj:.2f} (haircut: {haircut:.1%})")

Notes

The haircut is the expected maximum SR under the null hypothesis. The adjusted SR is: SR_adjusted = SR_observed - E[max{SR}|null]
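
The adjustment itself is simple arithmetic once the expected maximum is known. In the sketch below, `apply_haircut` is a hypothetical helper that takes a precomputed expected maximum SR (for example, from `expected_max_sharpe`). Expressing the haircut as a fraction of the observed SR is an assumption based on the `:.1%` formatting in the example above.

```python
def apply_haircut(sharpe, expected_max):
    """Subtract the expected maximum null SR from the observed SR (illustrative)."""
    adjusted = sharpe - expected_max
    # Assumption: the haircut is reported as a fraction of the observed SR.
    haircut_fraction = expected_max / sharpe if sharpe != 0 else 0.0
    return adjusted, haircut_fraction

sr_adj, haircut = apply_haircut(sharpe=2.0, expected_max=1.25)
print(f"Adjusted SR: {sr_adj:.2f} (haircut: {haircut:.1%})")
```

Whether the library clips the adjusted value at zero or caps the fraction at 100% is not stated in the source.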

endgame.utils.estimate_n_independent_trials(sharpe_ratios, method='variance')

Estimate effective number of independent trials from correlated strategies.

When strategies are correlated, the effective number of independent trials is less than the total number tested.

Parameters:

sharpe_ratios (np.ndarray) – Array of Sharpe ratios from all tested strategies.
method (str, default="variance") – Method to estimate N:

- “variance”: Use variance ratio (conservative)
- “count”: Just use the raw count (anti-conservative)

Return type:

int

Returns:

int – Estimated number of independent trials.

Notes

López de Prado (2018) recommends using clustering (ONC algorithm) for more accurate estimation. This function provides simpler heuristics.

endgame.utils.multiple_testing_summary(sharpe_ratios, returns_list=None, n_obs=252, significance_level=0.05)

Generate a summary report for multiple testing analysis.

Parameters:

sharpe_ratios (np.ndarray) – Sharpe ratios of all tested strategies.
returns_list (List[np.ndarray], optional) – List of return arrays for each strategy (for detailed stats).
n_obs (int, default=252) – Number of observations per strategy.
significance_level (float, default=0.05) – Significance level for testing.

Return type:

dict

Returns:

dict – Summary statistics including:

- n_trials: Total strategies tested
- n_effective: Estimated independent trials
- best_sharpe: Highest observed SR
- expected_max: Expected max SR under null
- best_dsr: DSR of best strategy
- haircut: Haircut percentage
- n_significant: Number passing DSR threshold

class endgame.utils.SharpeAnalysis(sharpe_ratio, probabilistic_sharpe, deflated_sharpe, expected_max_sharpe, p_value, is_significant, n_trials, skewness, kurtosis, track_record_length)

Bases: object

Results from Sharpe ratio analysis.

Parameters:

sharpe_ratio (float)
probabilistic_sharpe (float)
deflated_sharpe (float)
expected_max_sharpe (float)
p_value (float)
is_significant (bool)
n_trials (int)
skewness (float)
kurtosis (float)
track_record_length (int)

sharpe_ratio

The estimated Sharpe ratio.

Type: float

probabilistic_sharpe

PSR - probability that true SR > benchmark.

Type: float

deflated_sharpe

DSR - PSR adjusted for multiple testing.

Type: float

expected_max_sharpe

Expected maximum SR under null hypothesis.

Type: float

p_value

P-value for the null hypothesis that true SR = 0.

Type: float

is_significant

Whether DSR exceeds significance threshold.

Type: bool

n_trials

Number of trials considered.

Type: int

skewness

Skewness of returns.

Type: float

kurtosis

Excess kurtosis of returns.

Type: float

track_record_length

Number of observations.

Type: int
