Benchmark¶

class endgame.benchmark.SuiteLoader(suite='sklearn-classic', max_datasets=None, max_samples=None, max_features=None, cache_dir=None, random_state=42, verbose=True)[source]¶

Bases: object

Load benchmark datasets from various sources.

Supports OpenML benchmark suites, sklearn built-in datasets, and custom datasets. Provides a standardized interface for benchmark experiments.

Parameters:

suite (str or List[int]) – Suite name (e.g., “OpenML-CC18”) or list of OpenML task IDs.
max_datasets (int, optional) – Maximum number of datasets to load.
max_samples (int, optional) – Maximum samples per dataset (larger datasets are subsampled).
max_features (int, optional) – Maximum features per dataset.
cache_dir (str, optional) – Directory for caching downloaded datasets.
random_state (int, default=42) – Random seed for sampling.
verbose (bool, default=True) – Enable verbose output.

Examples

>>> from endgame.benchmark import SuiteLoader
>>> loader = SuiteLoader(suite="sklearn-classic")
>>> for dataset in loader.load():
...     print(f"{dataset.name}: {dataset.n_samples} samples, {dataset.n_features} features")

>>> loader = SuiteLoader(suite="OpenML-CC18", max_datasets=10)
>>> datasets = list(loader.load())

load()[source]¶

Load datasets from the suite.

Yields:: DatasetInfo – Dataset information and data.
Return type:: Generator[DatasetInfo, None, None]

static list_suites()[source]¶

List available benchmark suites.

Return type:: dict[str, str]

static get_suite_info(suite_name)[source]¶

Get detailed information about a suite.

Return type:: dict[str, Any]
Parameters:: suite_name (str)

class endgame.benchmark.DatasetInfo(name, task_type, X, y, feature_names=<factory>, categorical_indicator=<factory>, n_samples=0, n_features=0, n_classes=0, class_distribution=<factory>, source='unknown', openml_id=None, cv_splits=None, metadata=<factory>)[source]¶

Bases: object

Container for dataset information and data.

Parameters:

name (str)
task_type (TaskType)
X (ndarray)
y (ndarray)
feature_names (list[str])
categorical_indicator (list[bool])
n_samples (int)
n_features (int)
n_classes (int)
class_distribution (dict[Any, int])
source (str)
openml_id (int | None)
cv_splits (list[tuple[ndarray, ndarray]] | None)
metadata (dict[str, Any])

name¶

Name of the dataset.

Type:: str

task_type¶

Type of ML task.

Type:: TaskType

X¶

Feature matrix.

Type:: np.ndarray

y¶

Target variable.

Type:: np.ndarray

feature_names¶

Names of features.

Type:: List[str]

categorical_indicator¶

Boolean mask indicating categorical features.

Type:: List[bool]

n_samples¶

Number of samples.

Type:: int

n_features¶

Number of features.

Type:: int

n_classes¶

Number of classes (for classification).

Type:: int

class_distribution¶

Distribution of classes.

Type:: Dict[Any, int]

source¶

Source of the dataset (e.g., ‘openml’, ‘sklearn’).

Type:: str

openml_id¶

OpenML dataset ID if applicable.

Type:: Optional[int]

cv_splits¶

Predefined cross-validation splits.

Type:: Optional[List[Tuple[np.ndarray, np.ndarray]]]

metadata¶

Additional metadata.

Type:: Dict[str, Any]

name: str¶

task_type: TaskType¶

X: ndarray¶

y: ndarray¶

feature_names: list[str]¶

categorical_indicator: list[bool]¶

n_samples: int = 0¶

n_features: int = 0¶

n_classes: int = 0¶

class_distribution: dict[Any, int]¶

source: str = 'unknown'¶

openml_id: int | None = None¶

cv_splits: list[tuple[ndarray, ndarray]] | None = None¶

metadata: dict[str, Any]¶

property n_categorical: int¶: Number of categorical features.

property n_numerical: int¶: Number of numerical features.

property imbalance_ratio: float¶: Class imbalance ratio (max_count / min_count).
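
The ratio described above can be reproduced by hand from a class_distribution mapping. This is a minimal sketch, not the library's implementation: the helper name imbalance_ratio_sketch and the handling of an empty distribution are assumptions made for illustration.

```python
from collections import Counter


def imbalance_ratio_sketch(class_distribution):
    """Return max_count / min_count for a {label: count} mapping."""
    counts = [c for c in class_distribution.values() if c > 0]
    if not counts:
        # Assumption: an empty distribution is reported as balanced.
        return 1.0
    return max(counts) / min(counts)


labels = ["a", "a", "a", "b", "b", "c"]
distribution = dict(Counter(labels))  # counts per label: a=3, b=2, c=1
ratio = imbalance_ratio_sketch(distribution)
print(ratio)  # 3.0
```

A value of 1.0 means perfectly balanced classes; larger values mean the rarest class is increasingly outnumbered.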

get_cv_splits(n_splits=10, shuffle=True, random_state=42)[source]¶

Get cross-validation splits.

Returns predefined splits if available, otherwise generates new ones.

Return type:

list[tuple[ndarray, ndarray]]

Parameters:

n_splits (int)
shuffle (bool)
random_state (int)
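
The "predefined splits first, generated splits otherwise" behaviour can be sketched in plain Python. The function below is hypothetical: it works on index lists rather than ndarrays, and it does not stratify, which the real method may or may not do.

```python
import random


def get_cv_splits_sketch(n_samples, cv_splits=None, n_splits=10,
                         shuffle=True, random_state=42):
    """Return predefined splits if given, else build k-fold index splits."""
    if cv_splits is not None:
        return cv_splits  # predefined splits take priority

    indices = list(range(n_samples))
    if shuffle:
        # A dedicated Random instance keeps the shuffle reproducible.
        random.Random(random_state).shuffle(indices)

    # Deal indices round-robin into n_splits folds of near-equal size.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    splits = []
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idx, test_idx))
    return splits


splits = get_cv_splits_sketch(25, n_splits=5)
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 5 20 5
```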

class endgame.benchmark.MetaProfiler(groups=None, use_pymfe=True, landmarking_cv=3, random_state=42, verbose=False)[source]¶

Bases: object

Extract meta-features from datasets for meta-learning.

Uses pymfe when available, with fallback to pure numpy/sklearn implementations.

Parameters:

groups (List[str], optional) – Meta-feature groups to extract. Default: [“simple”, “statistical”, “info-theory”]. Options: “simple”, “statistical”, “info-theory”, “landmarking”, “complexity”.
use_pymfe (bool, default=True) – Use pymfe library when available (more comprehensive features).
landmarking_cv (int, default=3) – Number of CV folds for landmarking meta-features.
random_state (int, default=42) – Random seed for reproducibility.
verbose (bool, default=False) – Enable verbose output.

Examples

>>> from endgame.benchmark import MetaProfiler
>>> profiler = MetaProfiler(groups=["simple", "statistical"])
>>> meta_features = profiler.profile(X, y)
>>> print(meta_features.features)

>>> # With landmarking
>>> profiler = MetaProfiler(groups=["simple", "landmarking"])
>>> meta_features = profiler.profile(X, y)

profile(X, y, categorical_indicator=None, task_type='classification')[source]¶

Extract meta-features from a dataset.

Parameters:

X (np.ndarray) – Feature matrix of shape (n_samples, n_features).
y (np.ndarray) – Target variable of shape (n_samples,).
categorical_indicator (List[bool], optional) – Boolean mask indicating categorical features.
task_type (str, default="classification") – Type of task: “classification” or “regression”.

Return type:

MetaFeatureSet

Returns:

MetaFeatureSet – Extracted meta-features.

get_feature_names()[source]¶

Get list of all possible meta-feature names.

Return type:: list[str]

class endgame.benchmark.MetaFeatureSet(features=<factory>, groups=<factory>, extraction_time=0.0, errors=<factory>)[source]¶

Bases: object

Container for extracted meta-features.

Parameters:

features (dict[str, float])
groups (dict[str, list[str]])
extraction_time (float)
errors (list[str])

features¶

Dictionary of meta-feature name to value.

Type:: Dict[str, float]

groups¶

Mapping from group name to feature names in that group.

Type:: Dict[str, List[str]]

extraction_time¶

Time taken to extract features (seconds).

Type:: float

errors¶

Any errors encountered during extraction.

Type:: List[str]

features: dict[str, float]¶

groups: dict[str, list[str]]¶

extraction_time: float = 0.0¶

errors: list[str]¶

to_dict()[source]¶

Convert to dictionary.

Return type:: dict[str, float]

to_array(feature_names=None)[source]¶

Convert to numpy array.

Parameters:: feature_names (List[str], optional) – Specific features to include (in order). If None, uses all features in sorted order.
Return type:: ndarray

get_group(group)[source]¶

Get features from a specific group.

Return type:: dict[str, float]
Parameters:: group (str)

class endgame.benchmark.ExperimentTracker(name='benchmark', auto_save=False, save_path=None)[source]¶

Bases: object

Track and store experiment results.

Provides methods for logging experiments, querying results, and exporting to various formats.

Parameters:

name (str, default="benchmark") – Name for this tracking session.
auto_save (bool, default=False) – Automatically save after each experiment.
save_path (str, optional) – Path for auto-saving results.

Examples

>>> from endgame.benchmark import ExperimentTracker
>>> tracker = ExperimentTracker(name="my_benchmark")
>>> tracker.log_experiment(
...     dataset_name="iris",
...     model_name="RandomForest",
...     metrics={"accuracy": 0.95, "f1": 0.94},
...     hyperparameters={"n_estimators": 100},
... )
>>> df = tracker.to_dataframe()

log_experiment(dataset_name, model_name, metrics, hyperparameters=None, pipeline_config=None, meta_features=None, cv_scores=None, fit_time=0.0, predict_time=0.0, memory_mb=0.0, n_samples=0, n_features=0, task_type='classification', dataset_id=None, status='success', error_message=None, tags=None, notes='', model_structure=None)[source]¶

Log a single experiment.

Parameters:

dataset_name (str) – Name of the dataset.
model_name (str) – Name of the model/pipeline.
metrics (Dict[str, float]) – Performance metrics.
hyperparameters (Dict, optional) – Model hyperparameters.
pipeline_config (Dict, optional) – Full pipeline configuration.
meta_features (Dict, optional) – Dataset meta-features.
cv_scores (List[float], optional) – Per-fold CV scores.
fit_time (float) – Training time in seconds.
predict_time (float) – Prediction time in seconds.
memory_mb (float) – Peak memory usage in MB.
n_samples (int) – Number of samples.
n_features (int) – Number of features.
task_type (str) – Task type.
dataset_id (str, optional) – External dataset ID.
status (str) – Experiment status.
error_message (str, optional) – Error message if failed.
tags (List[str], optional) – Tags for filtering.
notes (str) – Additional notes.
model_structure (str | None)

Return type:

ExperimentRecord

Returns:

ExperimentRecord – The logged experiment record.

log_failure(dataset_name, model_name, error_message, **kwargs)[source]¶

Log a failed experiment.

Return type:

ExperimentRecord

Parameters:

dataset_name (str)
model_name (str)
error_message (str)

property records: list[ExperimentRecord]¶: Get all experiment records.

get_by_dataset(dataset_name)[source]¶

Get records for a specific dataset.

Return type:: list[ExperimentRecord]
Parameters:: dataset_name (str)

get_by_model(model_name)[source]¶

Get records for a specific model.

Return type:: list[ExperimentRecord]
Parameters:: model_name (str)

get_by_tag(tag)[source]¶

Get records with a specific tag.

Return type:: list[ExperimentRecord]
Parameters:: tag (str)

get_successful()[source]¶

Get successful experiments only.

Return type:: list[ExperimentRecord]

to_dataframe(include_meta_features=True)[source]¶

Convert to DataFrame.

Parameters:: include_meta_features (bool, default=True) – Include meta-features as columns.
Returns:: DataFrame – Polars DataFrame (or Pandas if Polars unavailable).

to_dict_list()[source]¶

Convert to list of dictionaries.

Return type:: list[dict[str, Any]]

save(path, append=False, deduplicate=True)[source]¶

Save results to file.

Parameters:

path (str) – Output path. Supports: .parquet, .csv, .json
append (bool, default=False) – If True and file exists, append new records to existing file. If False, overwrite existing file.
deduplicate (bool, default=True) – When appending, skip records with duplicate config_hash.

Return type:

None

load(path)[source]¶

Load results from file.

Parameters:: path (str) – Input path.
Return type:: ExperimentTracker
Returns:: self

summary()[source]¶

Get summary of tracked experiments.

Return type:: str

clear()[source]¶

Clear all records.

Return type:: None

get_config_hashes()[source]¶

Get set of all config hashes in the tracker.

Return type:: set

merge(other, deduplicate=True)[source]¶

Merge another tracker into this one.

Parameters:

other (ExperimentTracker) – Tracker to merge.
deduplicate (bool, default=True) – Skip records with duplicate config_hash.

Return type:

ExperimentTracker

Returns:

self
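
Deduplicated merging amounts to a set-membership check on config_hash. The sketch below uses plain dictionaries in place of ExperimentRecord objects; merge_records is a hypothetical helper, and whether the real method also deduplicates within the incoming records is an assumption.

```python
def merge_records(existing, incoming, deduplicate=True):
    """Append incoming records, optionally skipping known config hashes."""
    seen = {record["config_hash"] for record in existing}
    merged = list(existing)
    for record in incoming:
        if deduplicate and record["config_hash"] in seen:
            continue  # same dataset + model + hyperparameters already stored
        merged.append(record)
        seen.add(record["config_hash"])
    return merged


existing = [{"config_hash": "a1", "model_name": "RF"}]
incoming = [
    {"config_hash": "a1", "model_name": "RF"},  # duplicate of an existing record
    {"config_hash": "b2", "model_name": "LR"},
]
merged = merge_records(existing, incoming)
print([r["config_hash"] for r in merged])  # ['a1', 'b2']
```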

save_to_master(path=None, deduplicate=True)[source]¶

Save results to master database, appending to existing records.

This is the primary method for building a meta-learning dataset. New experiments are appended to the master database, with duplicate configurations (same dataset + model + hyperparameters) skipped.

Parameters:

path (str or Path, optional) – Path to master database. Defaults to ~/.endgame/meta_learning_db.parquet
deduplicate (bool, default=True) – Skip records with duplicate config_hash.

Return type:

int

Returns:

int – Number of new records added.

Examples

>>> from endgame.benchmark import ExperimentTracker
>>> tracker = ExperimentTracker()
>>> # ... run experiments ...
>>> n_added = tracker.save_to_master()
>>> print(f"Added {n_added} new experiments to master database")

classmethod load_master(path=None)[source]¶

Load the master meta-learning database.

Parameters:: path (str or Path, optional) – Path to master database. Defaults to ~/.endgame/meta_learning_db.parquet
Return type:: ExperimentTracker
Returns:: ExperimentTracker – Tracker with all historical experiments.

Examples

>>> from endgame.benchmark import ExperimentTracker
>>> tracker = ExperimentTracker.load_master()
>>> print(f"Master database has {len(tracker)} experiments")

static get_master_db_path()[source]¶

Get the default master database path.

Return type:: Path
Returns:: Path – Default path: ~/.endgame/meta_learning_db.parquet

filter_existing(master_path=None)[source]¶

Find which (dataset, model) pairs already exist in master DB.

Useful for skipping already-benchmarked combinations.

Parameters:: master_path (str or Path, optional) – Path to master database.
Return type:: list[tuple[str, str]]
Returns:: List[Tuple[str, str]] – List of (dataset_name, model_name) pairs that exist.
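
One way to use the returned pairs is to prune a planned benchmark grid before running it. The snippet is a sketch with hard-coded data: the existing list stands in for the return value of filter_existing(), and the dataset and model names are placeholders.

```python
from itertools import product

datasets = ["iris", "wine", "digits"]
models = ["RF", "LR"]

# Stand-in for the (dataset_name, model_name) pairs found in the master DB.
existing = [("iris", "RF"), ("wine", "LR")]

done = set(existing)
pending = [pair for pair in product(datasets, models) if pair not in done]
print(pending)
# [('iris', 'LR'), ('wine', 'RF'), ('digits', 'RF'), ('digits', 'LR')]
```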

class endgame.benchmark.ExperimentRecord(experiment_id='', timestamp='', dataset_name='', dataset_id=None, model_name='', pipeline_config=<factory>, hyperparameters=<factory>, metrics=<factory>, meta_features=<factory>, cv_scores=None, fit_time=0.0, predict_time=0.0, memory_mb=0.0, n_samples=0, n_features=0, task_type='classification', status='pending', error_message=None, tags=<factory>, notes='', model_structure=None, config_hash='')[source]¶

Bases: object

Single experiment record.

Parameters:

experiment_id (str)
timestamp (str)
dataset_name (str)
dataset_id (str | None)
model_name (str)
pipeline_config (dict[str, Any])
hyperparameters (dict[str, Any])
metrics (dict[str, float])
meta_features (dict[str, float])
cv_scores (list[float] | None)
fit_time (float)
predict_time (float)
memory_mb (float)
n_samples (int)
n_features (int)
task_type (str)
status (str)
error_message (str | None)
tags (list[str])
notes (str)
model_structure (str | None)
config_hash (str)

experiment_id¶

Unique identifier for this experiment.

Type:: str

timestamp¶

ISO timestamp of when the experiment was run.

Type:: str

dataset_name¶

Name of the dataset.

Type:: str

dataset_id¶

External ID (e.g., OpenML ID).

Type:: Optional[str]

model_name¶

Name/identifier of the model or pipeline.

Type:: str

pipeline_config¶

Serialized pipeline configuration.

Type:: Dict

hyperparameters¶

Model hyperparameters.

Type:: Dict

metrics¶

Performance metrics.

Type:: Dict[str, float]

meta_features¶

Dataset meta-features.

Type:: Dict[str, float]

cv_scores¶

Per-fold CV scores.

Type:: Optional[List[float]]

fit_time¶

Training time in seconds.

Type:: float

predict_time¶

Prediction time in seconds.

Type:: float

memory_mb¶

Peak memory usage in MB.

Type:: float

n_samples¶

Number of training samples.

Type:: int

n_features¶

Number of features.

Type:: int

task_type¶

Type of task.

Type:: str

status¶

Experiment status: “pending” (the default), “success”, “failed”, or “timeout”.

Type:: str

error_message¶

Error message if failed.

Type:: Optional[str]

tags¶

User-defined tags.

Type:: List[str]

notes¶

Additional notes.

Type:: str

experiment_id: str = ''¶

timestamp: str = ''¶

dataset_name: str = ''¶

dataset_id: str | None = None¶

model_name: str = ''¶

pipeline_config: dict[str, Any]¶

hyperparameters: dict[str, Any]¶

metrics: dict[str, float]¶

meta_features: dict[str, float]¶

cv_scores: list[float] | None = None¶

fit_time: float = 0.0¶

predict_time: float = 0.0¶

memory_mb: float = 0.0¶

n_samples: int = 0¶

n_features: int = 0¶

task_type: str = 'classification'¶

status: str = 'pending'¶

error_message: str | None = None¶

tags: list[str]¶

notes: str = ''¶

model_structure: str | None = None¶

config_hash: str = ''¶

to_dict()[source]¶

Convert to dictionary.

Return type:: dict[str, Any]

classmethod from_dict(data)[source]¶

Create from dictionary.

Return type:: ExperimentRecord
Parameters:: data (dict[str, Any])

endgame.benchmark.get_experiment_hash(dataset_name, model_name, hyperparameters, task_type='classification')[source]¶

Generate a unique hash for an experiment configuration.

This hash is used to detect duplicate experiments in the master database. Two experiments are considered duplicates if they have the same:

- dataset name
- model name
- hyperparameters
- task type

Parameters:

dataset_name (str) – Name of the dataset.
model_name (str) – Name of the model/pipeline.
hyperparameters (Dict[str, Any]) – Model hyperparameters.
task_type (str) – Task type (classification/regression).

Return type:

str

Returns:

str – SHA256 hash (first 16 characters) uniquely identifying this config.
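
A hash with these properties can be built from the standard library alone. The sketch below is an assumption about how such a function might work, not the library's code: the JSON serialization, the key names in the payload, and the default=str fallback are all illustrative choices.

```python
import hashlib
import json


def experiment_hash_sketch(dataset_name, model_name, hyperparameters,
                           task_type="classification"):
    """Return the first 16 hex characters of a SHA-256 config digest."""
    payload = {
        "dataset": dataset_name,
        "model": model_name,
        "hyperparameters": hyperparameters,
        "task_type": task_type,
    }
    # sort_keys makes the digest independent of dict insertion order;
    # default=str is a fallback for values JSON cannot encode natively.
    canonical = json.dumps(payload, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


h1 = experiment_hash_sketch("iris", "RF", {"n_estimators": 100, "max_depth": 5})
h2 = experiment_hash_sketch("iris", "RF", {"max_depth": 5, "n_estimators": 100})
h3 = experiment_hash_sketch("iris", "RF", {"n_estimators": 200, "max_depth": 5})
print(h1 == h2, h1 == h3)  # True False
```

Truncating to 16 hex characters keeps 64 bits of the digest, which is compact while leaving accidental collisions very unlikely at the scale of a benchmark database.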

class endgame.benchmark.BenchmarkRunner(suite='sklearn-classic', config=None, max_datasets=None, fast_run=False, verbose=True, **kwargs)[source]¶

Bases: object

Run systematic benchmarks across datasets and models.

Orchestrates the complete benchmark workflow:

1. Load datasets from the benchmark suite.
2. Profile datasets (extract meta-features).
3. Run cross-validation for each model on each dataset.
4. Record results with full provenance.

Parameters:

suite (str, default="sklearn-classic") – Benchmark suite name.
config (BenchmarkConfig, optional) – Full configuration object.
max_datasets (int, optional) – Override maximum number of datasets.
fast_run (bool, default=False) – Quick run with reduced settings.
verbose (bool, default=True) – Enable verbose output.
**kwargs – Additional configuration parameters.

Examples

>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.linear_model import LogisticRegression
>>> from endgame.benchmark import BenchmarkRunner
>>>
>>> models = [
...     ("RF", RandomForestClassifier(n_estimators=100, random_state=42)),
...     ("LR", LogisticRegression(max_iter=1000)),
... ]
>>>
>>> runner = BenchmarkRunner(suite="sklearn-classic")
>>> results = runner.run(models)
>>> print(results.summary())
>>>
>>> # Save results
>>> results.save("benchmark_results.parquet")

run(models, output_file=None, continue_on_error=True)[source]¶

Run benchmark on all models and datasets.

Parameters:

models (List[Union[Tuple[str, BaseEstimator], Tuple[str, BaseEstimator, BaseEstimator]]]) – List of model specifications. Each entry is either:
    - (name, estimator): a single estimator used for all tasks.
    - (name, classifier, regressor): a pair of estimators; the classifier is used for classification tasks and the regressor for regression tasks. Either can be None to skip that task type.
output_file (str, optional) – Path to save results.
continue_on_error (bool, default=True) – Continue if a model fails on a dataset.

Return type:

ExperimentTracker

Returns:

ExperimentTracker – Tracker with all experiment results.

property tracker: ExperimentTracker¶: Get the experiment tracker.

property datasets: list[DatasetInfo]¶: Get loaded datasets.

property meta_features: dict[str, MetaFeatureSet]¶: Get extracted meta-features.

get_results_dataframe()[source]¶: Get results as DataFrame.

class endgame.benchmark.BenchmarkConfig(suite='sklearn-classic', max_datasets=None, max_samples=None, cv_folds=5, scoring_classification=<factory>, scoring_regression=<factory>, profile_datasets=True, profile_groups=<factory>, cache_meta_features=True, meta_features_cache_dir=None, timeout_per_fit=300, n_jobs=1, random_state=42, verbose=True, skip_completed=True)[source]¶

Bases: object

Configuration for benchmark runs.

Parameters:

suite (str)
max_datasets (int | None)
max_samples (int | None)
cv_folds (int)
scoring_classification (list[str])
scoring_regression (list[str])
profile_datasets (bool)
profile_groups (list[str])
cache_meta_features (bool)
meta_features_cache_dir (str | None)
timeout_per_fit (int)
n_jobs (int)
random_state (int)
verbose (bool)
skip_completed (bool)

suite¶

Benchmark suite name or list of task IDs.

Type:: str

max_datasets¶

Maximum number of datasets to run.

Type:: int, optional

max_samples¶

Maximum samples per dataset.

Type:: int, optional

cv_folds¶

Number of cross-validation folds.

Type:: int

scoring_classification¶

Metrics for classification tasks.

Type:: List[str]

scoring_regression¶

Metrics for regression tasks.

Type:: List[str]

profile_datasets¶

Whether to extract meta-features.

Type:: bool

profile_groups¶

Meta-feature groups to extract.

Type:: List[str]

cache_meta_features¶

Whether to cache meta-features to disk.

Type:: bool

meta_features_cache_dir¶

Directory to cache meta-features. Defaults to ~/.cache/endgame/meta_features.

Type:: str, optional

timeout_per_fit¶

Timeout per model fit in seconds.

Type:: int

n_jobs¶

Number of parallel jobs for CV.

Type:: int

random_state¶

Random seed.

Type:: int

verbose¶

Enable verbose output.

Type:: bool

skip_completed¶

Skip experiments that already succeeded.

Type:: bool

suite: str = 'sklearn-classic'¶

max_datasets: int | None = None¶

max_samples: int | None = None¶

cv_folds: int = 5¶

scoring_classification: list[str]¶

scoring_regression: list[str]¶

profile_datasets: bool = True¶

profile_groups: list[str]¶

cache_meta_features: bool = True¶

meta_features_cache_dir: str | None = None¶

timeout_per_fit: int = 300¶

n_jobs: int = 1¶

random_state: int = 42¶

verbose: bool = True¶

skip_completed: bool = True¶

endgame.benchmark.quick_benchmark(model, model_name='model', suite='quick-test', **kwargs)[source]¶

Quick benchmark a single model on test datasets.

Parameters:

model (BaseEstimator) – Model to benchmark.
model_name (str, default="model") – Name for the model.
suite (str, default="quick-test") – Benchmark suite.
**kwargs – Additional arguments to BenchmarkRunner.

Return type:

ExperimentTracker

Returns:

ExperimentTracker – Results tracker.

Examples

>>> from sklearn.ensemble import RandomForestClassifier
>>> from endgame.benchmark import quick_benchmark
>>> results = quick_benchmark(RandomForestClassifier(), "RF")
>>> print(results.summary())

endgame.benchmark.compare_models(models, suite='sklearn-classic', **kwargs)[source]¶

Compare multiple models on benchmark datasets.

Parameters:

models (List[Tuple[str, BaseEstimator]]) – List of (name, model) tuples.
suite (str, default="sklearn-classic") – Benchmark suite.
**kwargs – Additional arguments to BenchmarkRunner.

Return type:

ExperimentTracker

Returns:

ExperimentTracker – Results tracker.

class endgame.benchmark.ResultsAnalyzer(tracker, metric='accuracy', higher_is_better=True, significance_level=0.05)[source]¶

Bases: object

Analyze and compare benchmark results.

Provides methods for:

- Ranking models across datasets
- Statistical significance testing
- Critical difference diagrams
- Performance profiles
- Meta-feature correlation analysis

Parameters:

tracker (ExperimentTracker) – Tracker containing experiment results.
metric (str, default="accuracy") – Primary metric for comparisons.
higher_is_better (bool, default=True) – Whether higher metric values are better.
significance_level (float, default=0.05) – Alpha level for statistical tests.

Examples

>>> from endgame.benchmark import ResultsAnalyzer
>>> analyzer = ResultsAnalyzer(tracker, metric="accuracy")
>>> rankings = analyzer.rank_models()
>>> print(rankings)
>>>
>>> # Statistical comparison
>>> comparison = analyzer.compare_models("RF", "XGBoost")
>>> print(f"P-value: {comparison.p_value}")

classmethod from_pivot(pivot, metric='accuracy', higher_is_better=True, significance_level=0.05)[source]¶

Create a ResultsAnalyzer from a pivot dict.

Convenience factory for external experiment systems that already have results in {dataset: {method: score}} form.

Parameters:

pivot (Dict[str, Dict[str, float]]) – Mapping of dataset_name -> {method_name: score}.
metric (str, default="accuracy") – Name of the metric the scores represent.
higher_is_better (bool, default=True) – Whether higher metric values are better.
significance_level (float, default=0.05) – Alpha level for statistical tests.

Return type:

ResultsAnalyzer

Returns:

ResultsAnalyzer – Analyzer ready for ranking, comparison, and statistical tests.

Examples

>>> from endgame.benchmark import ResultsAnalyzer
>>> pivot = {
...     "iris": {"RF": 0.95, "XGB": 0.96},
...     "wine": {"RF": 0.97, "XGB": 0.95},
... }
>>> analyzer = ResultsAnalyzer.from_pivot(pivot, metric="accuracy")
>>> print(analyzer.summary_table())

property df¶: Get results as DataFrame.

get_pivot_table(metric=None)[source]¶

Get pivot table of models vs datasets.

Parameters:: metric (str, optional) – Metric to use. If None, uses default metric.
Returns:: DataFrame – Pivot table with models as rows, datasets as columns.

rank_models(method=RankingMethod.MEAN_RANK, metric=None)[source]¶

Rank models across all datasets.

Parameters:

method (RankingMethod) – Ranking method to use.
metric (str, optional) – Metric to rank by.

Return type:

dict[str, float]

Returns:

Dict[str, float] – Model name to rank/score mapping (sorted).
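
For intuition, mean-rank ordering can be computed by hand from a {dataset: {model: score}} pivot. This is a sketch under stated assumptions: rank 1 is best, tied scores share the average of their positions, and the helper names are hypothetical. The library's tie handling is not documented here.

```python
def average_ranks(scores, higher_is_better=True):
    """Rank one dataset's scores; 1 is best, ties share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of models tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for name in ordered[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks


def mean_rank(pivot, higher_is_better=True):
    """Average each model's per-dataset rank and sort best-first."""
    collected = {}
    for scores in pivot.values():
        for name, rank in average_ranks(scores, higher_is_better).items():
            collected.setdefault(name, []).append(rank)
    means = {name: sum(r) / len(r) for name, r in collected.items()}
    return dict(sorted(means.items(), key=lambda item: item[1]))


pivot = {
    "iris": {"RF": 0.95, "XGB": 0.96, "LR": 0.90},
    "wine": {"RF": 0.97, "XGB": 0.95, "LR": 0.95},
}
rankings = mean_rank(pivot)
print(rankings)  # {'RF': 1.5, 'XGB': 1.75, 'LR': 2.75}
```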

compare_models(model_a, model_b, metric=None, test='wilcoxon')[source]¶

Compare two models statistically.

Parameters:

model_a (str) – Name of first model.
model_b (str) – Name of second model.
metric (str, optional) – Metric to compare on.
test (str, default="wilcoxon") – Statistical test: “wilcoxon”, “paired_t”, “sign”.

Return type:

ModelComparison

Returns:

ModelComparison – Comparison results.
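
Of the three tests listed, the sign test is the simplest to reproduce by hand. The following is an exact two-sided sign test written from the textbook definition; it is a sketch, not the library's code, and dropping tied datasets is an assumption about tie handling.

```python
from math import comb


def sign_test(scores_a, scores_b):
    """Exact two-sided sign test on paired per-dataset scores."""
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b  # tied datasets are dropped
    if n == 0:
        return wins_a, wins_b, 1.0
    k = min(wins_a, wins_b)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return wins_a, wins_b, min(1.0, 2 * tail)


model_a = [0.95, 0.97, 0.90, 0.88, 0.92, 0.85]
model_b = [0.93, 0.95, 0.91, 0.80, 0.90, 0.85]
wins_a, wins_b, p_value = sign_test(model_a, model_b)
print(wins_a, wins_b, p_value)  # 4 1 0.375
```

With only five informative datasets, four wins out of five is not significant at the 0.05 level, which shows why paired tests across datasets need a reasonably large suite.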

friedman_test(metric=None)[source]¶

Perform Friedman test across all models.

Parameters:: metric (str, optional) – Metric to test on.
Return type:: tuple[float, float]
Returns:: Tuple[float, float] – (chi2 statistic, p-value)
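
The chi-square statistic can be sketched from the standard Friedman formula, chi2 = 12N / (k(k+1)) * (sum of squared mean ranks - k(k+1)^2 / 4), for N datasets and k models. The function below is illustrative only: it assumes no tied scores within a dataset, applies no tie correction, and omits the p-value, which comes from a chi-square distribution with k - 1 degrees of freedom.

```python
def friedman_statistic(pivot, higher_is_better=True):
    """Return (chi2, mean_ranks) for a {dataset: {model: score}} pivot."""
    models = sorted(next(iter(pivot.values())))
    n, k = len(pivot), len(models)
    rank_sums = dict.fromkeys(models, 0.0)
    for scores in pivot.values():
        ordered = sorted(models, key=lambda m: scores[m], reverse=higher_is_better)
        for position, name in enumerate(ordered, start=1):
            rank_sums[name] += position  # assumes no ties within a dataset
    mean_ranks = {m: rank_sums[m] / n for m in models}
    sum_sq = sum(r * r for r in mean_ranks.values())
    chi2 = 12 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)
    return chi2, mean_ranks


pivot = {
    "d1": {"A": 0.90, "B": 0.80, "C": 0.70},
    "d2": {"A": 0.85, "B": 0.80, "C": 0.90},
    "d3": {"A": 0.95, "B": 0.90, "C": 0.85},
    "d4": {"A": 0.70, "B": 0.60, "C": 0.65},
}
chi2, mean_ranks = friedman_statistic(pivot)
print(chi2, mean_ranks)  # 3.5 {'A': 1.25, 'B': 2.5, 'C': 2.25}
```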

nemenyi_critical_difference(alpha=0.05)[source]¶

Compute critical difference for Nemenyi test.

Parameters:: alpha (float, default=0.05) – Significance level.
Return type:: float
Returns:: float – Critical difference value.
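
The critical difference follows the usual Nemenyi formula, CD = q_alpha * sqrt(k(k+1) / (6N)), for k models and N datasets. The sketch below hard-codes the commonly tabulated q_alpha values for alpha = 0.05 and k up to 10; the table and the function name are assumptions, and the library may obtain its critical values differently.

```python
import math

# Critical values of the Studentized range statistic divided by sqrt(2),
# for alpha = 0.05, indexed by the number of models k.
Q_ALPHA_05 = {
    2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
    7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164,
}


def nemenyi_cd(n_models, n_datasets, q_table=Q_ALPHA_05):
    """Return the critical difference in mean rank."""
    q_alpha = q_table[n_models]
    return q_alpha * math.sqrt(n_models * (n_models + 1) / (6 * n_datasets))


cd = nemenyi_cd(n_models=4, n_datasets=14)
print(round(cd, 2))  # 1.25
```

Two models whose mean ranks differ by more than this value are considered significantly different at the chosen alpha.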

get_model_summary(model_name, metric=None)[source]¶

Get detailed summary for a specific model.

Parameters:

model_name (str) – Name of the model.
metric (str, optional) – Metric to summarize.

Return type:

dict[str, Any]

Returns:

Dict[str, Any] – Summary statistics.

get_dataset_summary(dataset_name, metric=None)[source]¶

Get detailed summary for a specific dataset.

Parameters:

dataset_name (str) – Name of the dataset.
metric (str, optional) – Metric to summarize.

Return type:

dict[str, Any]

Returns:

Dict[str, Any] – Summary statistics.

summary_table(metric=None, sort_by='mean_rank')[source]¶

Generate formatted summary table.

Parameters:

metric (str, optional) – Metric to summarize.
sort_by (str, default="mean_rank") – Column to sort by.

Return type:

str

Returns:

str – Formatted table string.

meta_feature_correlation(metric=None, model_name=None)[source]¶

Compute correlation between meta-features and performance.

Parameters:

metric (str, optional) – Performance metric.
model_name (str, optional) – Specific model to analyze. If None, averages across models.

Return type:

dict[str, float]

Returns:

Dict[str, float] – Meta-feature name to correlation mapping.
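
The shape of the result can be illustrated with a hand-rolled Pearson correlation per meta-feature. Which correlation coefficient the library uses is not stated, so Pearson is an assumption here, as are the record layout and the choice to return 0.0 for a constant column.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column carries no correlation
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-dataset results for a single model.
records = [
    {"meta": {"n_samples": 150.0, "n_features": 4.0}, "accuracy": 0.95},
    {"meta": {"n_samples": 178.0, "n_features": 13.0}, "accuracy": 0.97},
    {"meta": {"n_samples": 569.0, "n_features": 30.0}, "accuracy": 0.96},
]
scores = [r["accuracy"] for r in records]
correlations = {
    name: pearson([r["meta"][name] for r in records], scores)
    for name in records[0]["meta"]
}
print(sorted(correlations))  # ['n_features', 'n_samples']
```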

class endgame.benchmark.RankingMethod(*values)[source]¶

Bases: str, Enum

Methods for ranking models.

MEAN_SCORE = 'mean_score'¶

MEAN_RANK = 'mean_rank'¶

WIN_COUNT = 'win_count'¶

BORDA_COUNT = 'borda_count'¶

FRIEDMAN = 'friedman'¶
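
Two of these methods are easy to illustrate on a small pivot. The definitions below are common textbook variants and only an assumption about what WIN_COUNT and BORDA_COUNT compute: a win is a best (possibly tied) score on a dataset, and a model earns one Borda point for each model it strictly beats on a dataset.

```python
def win_count(pivot, higher_is_better=True):
    """Count the datasets on which each model has the best score."""
    pick = max if higher_is_better else min
    wins = {}
    for scores in pivot.values():
        best = pick(scores.values())
        for name, score in scores.items():
            wins[name] = wins.get(name, 0) + (score == best)
    return wins


def borda_count(pivot, higher_is_better=True):
    """Sum, over datasets, the number of models each model strictly beats."""
    points = {}
    for scores in pivot.values():
        for name, score in scores.items():
            beaten = sum(
                (score > other) if higher_is_better else (score < other)
                for other in scores.values()
            )
            points[name] = points.get(name, 0) + beaten
    return points


pivot = {
    "iris": {"RF": 0.95, "XGB": 0.96, "LR": 0.90},
    "wine": {"RF": 0.97, "XGB": 0.95, "LR": 0.95},
}
wins = win_count(pivot)
borda = borda_count(pivot)
print(wins)   # {'RF': 1, 'XGB': 1, 'LR': 0}
print(borda)  # {'RF': 3, 'XGB': 2, 'LR': 0}
```

Win count ignores how close the runners-up were, while Borda count rewards consistently good placements, so the two can order models differently.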

class endgame.benchmark.MetaLearner(approach='ranking', base_estimator=None, metric='accuracy', n_top_models=3, random_state=42, verbose=False)[source]¶

Bases: object

Learn to predict optimal models from dataset meta-features.

Trains a meta-model that predicts which model will perform best on a new dataset based on its meta-features.

Parameters:

approach (str, default="ranking") – Meta-learning approach:
    - “ranking”: Predict model rankings
    - “classification”: Predict best model (classification)
    - “regression”: Predict model scores (regression)
base_estimator (BaseEstimator, optional) – Base model for meta-learning. If None, uses RandomForest.
metric (str, default="accuracy") – Target metric to optimize.
n_top_models (int, default=3) – Number of top models to consider for recommendations.
random_state (int, default=42) – Random seed.
verbose (bool, default=False) – Enable verbose output.

Examples

>>> from endgame.benchmark import MetaLearner
>>> # Train meta-learner from benchmark results
>>> meta_learner = MetaLearner()
>>> meta_learner.fit(tracker)
>>>
>>> # Get recommendation for new dataset
>>> recommendation = meta_learner.recommend(X_new, y_new)
>>> print(f"Best model: {recommendation.model_name}")

fit(tracker, metric=None)[source]¶

Fit meta-learner from benchmark results.

Parameters:

tracker (ExperimentTracker) – Tracker containing benchmark results.
metric (str, optional) – Override target metric.

Return type:

MetaLearner

Returns:

self

recommend(X, y, categorical_indicator=None, task_type='classification')[source]¶

Get model recommendation for a new dataset.

Parameters:

X (np.ndarray) – Feature matrix.
y (np.ndarray) – Target variable.
categorical_indicator (List[bool], optional) – Boolean mask for categorical features.
task_type (str, default="classification") – Task type: “classification” or “regression”.

Return type:

ModelRecommendation

Returns:

ModelRecommendation – Recommended model with confidence and alternatives.

recommend_from_features(meta_features)[source]¶

Get recommendation from pre-computed meta-features.

Parameters:: meta_features (MetaFeatureSet or Dict) – Pre-computed meta-features.
Return type:: ModelRecommendation
Returns:: ModelRecommendation – Recommended model.

get_feature_importances()[source]¶

Get feature importances from meta-model.

Return type:: dict[str, float]
Returns:: Dict[str, float] – Feature name to importance mapping.

class endgame.benchmark.PipelineRecommender(meta_learner=None, preprocessing_options=None, verbose=False)[source]¶

Bases: object

Recommend complete pipelines (preprocessing + model) for new datasets.

Extends MetaLearner to recommend full preprocessing pipelines in addition to models.

Parameters:

meta_learner (MetaLearner, optional) – Pre-trained meta-learner.
preprocessing_options (List[str], default=["none", "scaling", "imputation"]) – Available preprocessing options.
verbose (bool, default=False) – Enable verbose output.

Examples

>>> from endgame.benchmark import PipelineRecommender
>>> recommender = PipelineRecommender()
>>> recommender.fit(tracker)
>>> pipeline = recommender.recommend_pipeline(X, y)
>>> print(pipeline)

fit(tracker, **kwargs)[source]¶

Fit recommender from benchmark results.

Return type:: PipelineRecommender
Parameters:: tracker (ExperimentTracker)

recommend_pipeline(X, y, categorical_indicator=None, task_type='classification')[source]¶

Recommend a complete pipeline.

Parameters:

X (np.ndarray) – Feature matrix.
y (np.ndarray) – Target variable.
categorical_indicator (List[bool], optional) – Boolean mask for categorical features.
task_type (str) – Task type.

Return type:

dict[str, Any]

Returns:

Dict[str, Any] – Pipeline recommendation with model and preprocessing.

class endgame.benchmark.BenchmarkReportGenerator(tracker, title='Endgame Benchmark Report')[source]¶

Bases: object

Generate HTML reports from benchmark results.

Parameters:

tracker (ExperimentTracker) – The experiment tracker with benchmark results.
title (str, optional) – Report title.

Examples

>>> from endgame.benchmark import BenchmarkRunner, BenchmarkReportGenerator
>>> runner = BenchmarkRunner(suite="sklearn-classic")
>>> tracker = runner.run(models)
>>> report = BenchmarkReportGenerator(tracker)
>>> report.generate("benchmark_report.html")

add_interpretability_output(model_name, dataset_name, output, output_type='text')[source]¶

Add interpretability output for a model.

Parameters:

model_name (str) – Name of the model.
dataset_name (str) – Name of the dataset.
output (str) – The interpretability output (rules, tree structure, equation, etc.)
output_type (str) – Type of output: “text”, “html”, “latex”, “code”

Return type:

None

generate(output_path, include_interpretability=True, include_meta_features=False)[source]¶

Generate the HTML report.

Parameters:

output_path (str) – Path to save the HTML report.
include_interpretability (bool) – Include interpretability outputs section.
include_meta_features (bool) – Include dataset meta-features section.

Return type:

str

Returns:

str – Path to the generated report.

endgame.benchmark.extract_interpretability_outputs(models, X_sample, y_sample, dataset_name, feature_names=None)[source]¶

Extract interpretability outputs from fitted models.

Parameters:

models (List[Tuple]) – List of (name, fitted_model) tuples.
X_sample (np.ndarray) – Sample data used for fitting.
y_sample (np.ndarray) – Sample targets.
dataset_name (str) – Name of the dataset.
feature_names (List[str], optional) – Feature names for better output.

Return type:

Dict[str, str]

Returns:

Dict[str, str] – Dictionary mapping model names to their interpretability outputs.

class endgame.benchmark.LearningCurveExperiment(suite, config=None, max_datasets=None, verbose=True)[source]¶

Bases: object

Run learning curve experiments across datasets.

Implements the LCDB (Learning Curve Database) protocol for systematic evaluation of sample efficiency.

Parameters:

suite (str or List[DatasetInfo]) – Benchmark suite name or list of datasets.
config (LearningCurveConfig, optional) – Experiment configuration.
max_datasets (int, optional) – Maximum number of datasets.
verbose (bool) – Enable verbose output.

Examples

>>> from endgame.benchmark import LearningCurveExperiment, LearningCurveConfig
>>> from endgame.models import LGBMWrapper
>>>
>>> config = LearningCurveConfig(anchors=[0.1, 0.5, 1.0], n_seeds=3)
>>> exp = LearningCurveExperiment(suite="sklearn-classic", config=config)
>>>
>>> models = [
...     ("LGBM", LGBMWrapper(preset="fast")),
... ]
>>> results = exp.run(models)
>>> print(results.summary())

run(models, output_file=None, continue_on_error=True)[source]¶

Run learning curve experiments.

Parameters:

models (List[Tuple[str, BaseEstimator]]) – List of (name, model) tuples.
output_file (str, optional) – Path to save results.
continue_on_error (bool) – Continue if a model fails.

Return type:

LearningCurveResults

Returns:

LearningCurveResults – Experiment results.

class endgame.benchmark.LearningCurveConfig(anchors=<factory>, n_seeds=5, cv_folds=0, test_fraction=0.2, metrics_classification=<factory>, metrics_regression=<factory>, timeout_per_fit=600, random_state=42, verbose=True)[source]¶

Bases: object

Configuration for learning curve experiments.

Parameters:

anchors (List[float]) – Training set fractions (LCDB protocol default).
n_seeds (int) – Number of random seeds per anchor point.
cv_folds (int) – Cross-validation folds per seed (0 = holdout only).
test_fraction (float) – Holdout test set fraction.
metrics_classification (List[str]) – Metrics for classification tasks.
metrics_regression (List[str]) – Metrics for regression tasks.
timeout_per_fit (int) – Timeout per model fit in seconds.
random_state (int) – Base random seed.
verbose (bool) – Enable verbose output.

anchors: list[float]¶

n_seeds: int = 5¶

cv_folds: int = 0¶

test_fraction: float = 0.2¶

metrics_classification: list[str]¶

metrics_regression: list[str]¶

timeout_per_fit: int = 600¶

random_state: int = 42¶

verbose: bool = True¶
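As a rough illustration of how anchors and test_fraction could interact, the sketch below turns training fractions into absolute training-set sizes. It assumes the holdout set is split off first and each anchor is a fraction of the remaining pool. train_sizes is a hypothetical helper, not part of endgame, and the library's own rounding may differ.

```python
def train_sizes(n_samples, anchors, test_fraction=0.2):
    """Map anchor fractions to training-set sizes (assumed convention)."""
    # Assumption: the holdout test set is removed before subsampling.
    n_test = round(n_samples * test_fraction)
    n_pool = n_samples - n_test
    # Assumption: each anchor is a fraction of the remaining training pool,
    # and at least one sample is always kept.
    return [max(1, round(a * n_pool)) for a in anchors]


sizes = train_sizes(1000, [0.1, 0.5, 1.0], test_fraction=0.2)
print(sizes)
```

Under these assumptions, an anchor of 1.0 uses the whole training pool, which is smaller than the full dataset whenever test_fraction is positive.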

class endgame.benchmark.LearningCurveResults(records=<factory>, config=None)[source]¶

Bases: object

Container for learning curve results with analysis methods.

Parameters:

records (list[LearningCurveRecord])
config (LearningCurveConfig | None)

records¶

All experiment records.

Type:: List[LearningCurveRecord]

config¶

Configuration used.

Type:: LearningCurveConfig

records: list[LearningCurveRecord]¶

config: LearningCurveConfig | None = None¶

add_record(record)[source]¶

Add a record to results.

Parameters:: record (LearningCurveRecord)

to_dataframe()[source]¶

Convert results to DataFrame.

Returns:: DataFrame – Results in tabular format.

save(path)[source]¶

Save results to file.

Parameters:: path (str) – Output path (.parquet, .csv, or .json).

get_learning_curve(dataset, model, metric='accuracy')[source]¶

Get learning curve for a specific dataset/model.

Parameters:

dataset (str) – Dataset name.
model (str) – Model name.
metric (str) – Metric to retrieve.

Return type:

tuple[ndarray, ndarray, ndarray]

Returns:

anchors (ndarray) – Training fractions.
means (ndarray) – Mean metric values.
stds (ndarray) – Standard deviations.
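The aggregation behind a learning curve can be sketched as follows: group the records for one dataset/model pair by anchor, then take the mean and standard deviation of the chosen metric across seeds. The example uses plain dictionaries as stand-ins for LearningCurveRecord, skips failed runs, and uses the population standard deviation; all three are assumptions for illustration, and aggregate_curve is a hypothetical helper rather than the library's implementation.

```python
from collections import defaultdict
from statistics import mean, pstdev


def aggregate_curve(records, dataset, model, metric="accuracy"):
    """Return (anchors, means, stds) for one dataset/model pair."""
    by_anchor = defaultdict(list)
    for rec in records:
        if (
            rec["dataset_name"] == dataset
            and rec["model_name"] == model
            and rec["status"] == "success"  # assumption: failed runs are ignored
            and metric in rec["metrics"]
        ):
            by_anchor[rec["anchor"]].append(rec["metrics"][metric])
    anchors = sorted(by_anchor)
    means = [mean(by_anchor[a]) for a in anchors]
    # Assumption: population standard deviation across seeds.
    stds = [pstdev(by_anchor[a]) for a in anchors]
    return anchors, means, stds


# Dictionaries mirroring the documented LearningCurveRecord fields.
records = [
    {"dataset_name": "iris", "model_name": "LGBM", "anchor": 0.1, "seed": 0,
     "status": "success", "metrics": {"accuracy": 0.80}},
    {"dataset_name": "iris", "model_name": "LGBM", "anchor": 0.1, "seed": 1,
     "status": "success", "metrics": {"accuracy": 0.90}},
    {"dataset_name": "iris", "model_name": "LGBM", "anchor": 1.0, "seed": 0,
     "status": "success", "metrics": {"accuracy": 0.95}},
    {"dataset_name": "iris", "model_name": "LGBM", "anchor": 1.0, "seed": 1,
     "status": "error", "metrics": {}},
]

anchors, means, stds = aggregate_curve(records, "iris", "LGBM")
```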

compute_aulc(dataset, model, metric='accuracy')[source]¶

Compute Area Under Learning Curve.

Higher AULC indicates better sample efficiency (learns faster).

Parameters:

dataset (str) – Dataset name.
model (str) – Model name.
metric (str) – Metric to use.

Return type:

float

Returns:

float – Area under learning curve (normalized to [0, 1]).
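One common way to obtain a normalized area under a learning curve is the trapezoidal rule over the (anchor, mean) points, divided by the anchor span. The sketch below follows that approach as an assumption; the exact integration and normalization used by compute_aulc are not stated here, and aulc is a hypothetical helper.

```python
def aulc(anchors, means):
    """Trapezoidal area under (anchor, mean) points, divided by the anchor span."""
    if len(anchors) != len(means) or len(anchors) < 2:
        raise ValueError("need at least two matching (anchor, mean) points")
    points = sorted(zip(anchors, means))
    span = points[-1][0] - points[0][0]
    if span <= 0:
        raise ValueError("anchors must cover a non-zero range")
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    # Dividing by the span keeps the result in the metric's own range,
    # so a metric bounded by [0, 1] gives an AULC in [0, 1].
    return area / span


# Both models end at 0.90 accuracy, but the first gets there with less data.
fast = aulc([0.1, 0.5, 1.0], [0.85, 0.90, 0.90])
slow = aulc([0.1, 0.5, 1.0], [0.60, 0.80, 0.90])
print(fast > slow)
```

With this normalization, two models with the same final score are separated by how quickly they approach it, which is the sample-efficiency reading described above.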

summary(metric='accuracy')[source]¶

Generate summary statistics.

Parameters:: metric (str) – Primary metric for summary.
Return type:: Dict[str, Dict[str, float]]
Returns:: dict – Summary with AULC and final performance per model.

plot_learning_curves(dataset, metric='accuracy', models=None, ax=None, **kwargs)[source]¶

Plot learning curves for a dataset.

Parameters:

dataset (str) – Dataset name.
metric (str) – Metric to plot.
models (List[str], optional) – Models to include (default: all).
ax (matplotlib.axes.Axes, optional) – Axes to plot on.
**kwargs – Additional arguments to plt.plot.

Returns:

ax (matplotlib.axes.Axes) – The axes with the plot.

class endgame.benchmark.LearningCurveRecord(dataset_name, model_name, anchor, n_train, seed, metrics, fit_time, status='success', error_message=None)[source]¶

Bases: object

Single learning curve data point.

Parameters:

dataset_name (str)
model_name (str)
anchor (float)
n_train (int)
seed (int)
metrics (dict[str, float])
fit_time (float)
status (str)
error_message (str | None)

dataset_name¶

Name of the dataset.

Type:: str

model_name¶

Name of the model.

Type:: str

anchor¶

Training set fraction.

Type:: float

n_train¶

Actual number of training samples.

Type:: int

seed¶

Random seed used.

Type:: int

metrics¶

Performance metrics.

Type:: Dict[str, float]

fit_time¶

Training time in seconds.

Type:: float

status¶

‘success’ or ‘error’.

Type:: str

error_message¶

Error message if failed.

Type:: str, optional

dataset_name: str¶

model_name: str¶

anchor: float¶

n_train: int¶

seed: int¶

metrics: dict[str, float]¶

fit_time: float¶

status: str = 'success'¶

error_message: str | None = None¶

endgame.benchmark.quick_learning_curve(model, X, y, anchors=None, n_seeds=3, test_fraction=0.2, random_state=42)[source]¶

Quick learning curve for a single model/dataset.

Parameters:

model (BaseEstimator) – Model to evaluate.
X (ndarray) – Features.
y (ndarray) – Targets.
anchors (List[float], optional) – Training fractions.
n_seeds (int) – Seeds per anchor.
test_fraction (float) – Test set fraction.
random_state (int) – Random seed.

Return type:

tuple[ndarray, ndarray, ndarray]

Returns:

anchors (ndarray) – Training fractions.
means (ndarray) – Mean accuracies.
stds (ndarray) – Standard deviations.

endgame.benchmark.make_rotated_blobs(n_samples=1000, n_features=10, n_classes=3, rotation_angle=45.0, cluster_std=1.0, noise=0.0, random_state=None)[source]¶

Generate synthetic dataset with known rotation.

Creates Gaussian blobs that are axis-aligned in a rotated coordinate system. Rotation learning should be able to recover the rotation and achieve high accuracy with axis-aligned splits in the rotated space.

This is the critical control experiment from the paper. Standard GBDTs fail on this because the decision boundaries are diagonal, while rotation learning should match MLP performance by learning the rotation.

Parameters:

n_samples (int, default=1000) – Number of samples.
n_features (int, default=10) – Number of features.
n_classes (int, default=3) – Number of classes (blob centers).
rotation_angle (float, default=45.0) – Rotation angle in degrees applied pairwise to features.
cluster_std (float, default=1.0) – Standard deviation of clusters before rotation.
noise (float, default=0.0) – Additional Gaussian noise after rotation.
random_state (int, optional) – Random seed.

Return type:

DatasetInfo

Returns:

DatasetInfo – Synthetic dataset with metadata including ground truth rotation.

Examples

>>> from endgame.benchmark.synthetic import make_rotated_blobs
>>> dataset = make_rotated_blobs(n_samples=500, rotation_angle=45.0)
>>> print(dataset.name)
synthetic_rotated_45
>>> print(dataset.metadata['ground_truth_rotation'].shape)
(10, 10)
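To make "applied pairwise to features" concrete, the sketch below builds a block-diagonal rotation matrix that rotates consecutive feature pairs (0, 1), (2, 3), and so on by the same angle. The pairing scheme and the helpers pairwise_rotation and rotate are assumptions for illustration; the matrix stored in metadata['ground_truth_rotation'] may be constructed differently.

```python
import math


def pairwise_rotation(n_features, angle_deg):
    """Block-diagonal rotation acting on consecutive feature pairs."""
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    # Start from the identity matrix.
    R = [[1.0 if i == j else 0.0 for j in range(n_features)]
         for i in range(n_features)]
    # Assumption: pairs are (0, 1), (2, 3), ...; with an odd number of
    # features the last one is left unrotated.
    for i in range(0, n_features - 1, 2):
        R[i][i], R[i][i + 1] = c, -s
        R[i + 1][i], R[i + 1][i + 1] = s, c
    return R


def rotate(point, R):
    """Apply the rotation matrix R to a single point."""
    return [sum(R[i][j] * point[j] for j in range(len(point)))
            for i in range(len(R))]


R = pairwise_rotation(4, 45.0)
rotated = rotate([1.0, 0.0, 0.0, 0.0], R)
```

A point on the first axis ends up on the diagonal of the first feature pair, which is why a single axis-aligned split in the observed coordinates can no longer separate structure that was axis-aligned before rotation.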

endgame.benchmark.make_hidden_structure(n_samples=1000, n_features=20, n_informative=5, structure_type='diagonal', flip_y=0.01, random_state=None)[source]¶

Generate dataset with hidden linear structure.

The true decision boundary is simple (axis-aligned) in a rotated coordinate system. This tests whether rotation learning can discover the useful feature combinations.

Parameters:

n_samples (int, default=1000) – Number of samples.
n_features (int, default=20) – Total number of features.
n_informative (int, default=5) – Number of truly informative features.
structure_type (str, default='diagonal') – Type of hidden structure:
- ‘diagonal’: Linear combination of pairs
- ‘block’: Block structure in feature space
- ‘random’: Random orthogonal transformation
flip_y (float, default=0.01) – Fraction of labels to flip (noise).
random_state (int, optional) – Random seed.

Return type:

DatasetInfo

Returns:

DatasetInfo – Synthetic dataset with hidden structure.

endgame.benchmark.make_xor_rotated(n_samples=1000, n_features=10, rotation_angle=45.0, noise=0.1, random_state=None)[source]¶

Generate XOR problem in rotated space.

Classic XOR problem where the label is determined by the sign of the product of two features, rotated so that axis-aligned trees fail.

Parameters:

n_samples (int, default=1000) – Number of samples.
n_features (int, default=10) – Total features (XOR uses first 2, rest are noise).
rotation_angle (float, default=45.0) – Rotation angle for XOR features.
noise (float, default=0.1) – Gaussian noise level.
random_state (int, optional) – Random seed.

Return type:

DatasetInfo

Returns:

DatasetInfo – XOR dataset with rotation.
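The effect of the rotation can be sketched with two features only. The example below labels points by the sign of the product of two latent coordinates, rotates them, and then checks how well the same quadrant rule works when applied to the observed coordinates. It is a simplified stand-in for make_xor_rotated: it omits the extra noise features and the Gaussian noise, and all function names are hypothetical.

```python
import math
import random


def xor_rotated_points(n_samples, angle_deg=45.0, seed=0):
    """Sample 2-D XOR data in latent coordinates, then rotate it."""
    rng = random.Random(seed)
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    X, y = [], []
    for _ in range(n_samples):
        u, v = rng.uniform(-1, 1), rng.uniform(-1, 1)
        y.append(1 if u * v > 0 else 0)          # XOR label in latent space
        X.append((c * u - s * v, s * u + c * v))  # observed (rotated) features
    return X, y


def quadrant_rule_accuracy(X, y):
    """Accuracy of the axis-aligned rule 'x0 * x1 > 0' on observed features."""
    hits = sum((1 if x0 * x1 > 0 else 0) == label
               for (x0, x1), label in zip(X, y))
    return hits / len(y)


acc_unrotated = quadrant_rule_accuracy(*xor_rotated_points(2000, angle_deg=0.0))
acc_rotated = quadrant_rule_accuracy(*xor_rotated_points(2000, angle_deg=45.0))
print(acc_unrotated, acc_rotated)
```

Without rotation the quadrant rule is exact, while at 45 degrees it drops to roughly chance level, since the observed product then depends on the difference of the squared latent coordinates rather than on their product.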

endgame.benchmark.make_regression_rotated(n_samples=1000, n_features=10, n_informative=5, rotation_angle=45.0, noise=0.1, random_state=None)[source]¶

Generate regression dataset with rotated structure.

Linear regression problem where the true coefficients are axis-aligned in a rotated space.

Parameters:

n_samples (int, default=1000) – Number of samples.
n_features (int, default=10) – Total features.
n_informative (int, default=5) – Number of features with non-zero coefficients.
rotation_angle (float, default=45.0) – Rotation angle.
noise (float, default=0.1) – Target noise level.
random_state (int, optional) – Random seed.

Return type:

DatasetInfo

Returns:

DatasetInfo – Regression dataset.
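The phrase "axis-aligned in a rotated space" can be illustrated with a two-feature sketch: if the target is y = w · z in latent coordinates and the observed features are x = R z, then y = (R w) · x, so a sparse latent coefficient vector becomes dense in the observed coordinates. The code below checks this identity numerically; the helper names and the specific values are assumptions for illustration, not the generator's implementation.

```python
import math


def rotation_2d(angle_deg):
    """2x2 rotation matrix for the given angle in degrees."""
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


R = rotation_2d(45.0)
w_latent = [2.0, 0.0]          # only the first latent feature is informative
z = [0.3, -1.2]                # a sample in latent coordinates
x = matvec(R, z)               # the same sample as observed after rotation

w_observed = matvec(R, w_latent)   # effective coefficients on observed features
y_latent = dot(w_latent, z)
y_observed = dot(w_observed, x)
print(w_observed, y_latent, y_observed)
```

Both targets agree, but the observed coefficient vector has two non-zero entries, so a model restricted to one feature at a time needs many splits to approximate what is a single-coordinate effect in the latent space.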

endgame.benchmark.get_synthetic_suite(random_state=42)[source]¶

Get dictionary of all synthetic datasets for benchmarking.

Returns a comprehensive suite of synthetic datasets designed to test rotation learning methods.

Parameters:: random_state (int, default=42) – Random seed for reproducibility.
Return type:: Dict[str, DatasetInfo]
Returns:: Dict[str, DatasetInfo] – Dictionary mapping dataset names to DatasetInfo objects.

Examples

>>> from endgame.benchmark.synthetic import get_synthetic_suite
>>> suite = get_synthetic_suite()
>>> for name, dataset in suite.items():
...     print(f"{name}: {dataset.n_samples} samples, {dataset.n_features} features")

endgame.benchmark.get_control_dataset(random_state=42)[source]¶

Get the primary control dataset from the paper.

This is the Synthetic Rotated dataset used as the critical control experiment. Standard GBDTs should fail here while rotation learning should recover the rotation and match MLP performance.

Parameters:: random_state (int, default=42) – Random seed.
Return type:: DatasetInfo
Returns:: DatasetInfo – The control dataset.