MCP Server¶

Endgame ships an MCP server that lets any MCP-compatible LLM host (Claude Code, Claude Desktop, VS Code Copilot, etc.) build ML pipelines through natural language.

Instead of registering 300+ tools, the server exposes 20 meta-tools and 6 resources — keeping schema overhead under 2K tokens while giving the LLM full access to the toolkit.

Installation¶

pip install endgame-ml[mcp]
# or, if already installed:
pip install "mcp>=1.2.0"

Setup¶

Claude Code¶

Add .mcp.json to your project root (Endgame ships one by default):

{
  "mcpServers": {
    "endgame": {
      "command": "/path/to/your/.venv/bin/python",
      "args": ["-m", "endgame.mcp"]
    }
  }
}

Restart Claude Code. The server auto-starts on first tool call.

Claude Desktop¶

Add to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS):

{
  "mcpServers": {
    "endgame": {
      "command": "/path/to/your/.venv/bin/python",
      "args": ["-m", "endgame.mcp"]
    }
  }
}

Manual / Standalone¶

# stdio transport (default — used by MCP hosts)
python -m endgame.mcp

# SSE transport (for web-based clients)
python -m endgame.mcp --sse

How It Works¶

The LLM never sees 300+ model definitions. Instead:

Resources (zero-cost) let the LLM browse the model catalog, presets, metrics, and visualizers without a tool call
Discovery tools help the LLM find the right model for the dataset
Action tools load data, train, evaluate, visualize, and export
A SessionManager tracks loaded datasets, trained models, and artifacts across tool calls via short IDs (ds_a1b2c3d4, model_e5f6g7h8)

User: "Build a classifier to predict loan defaults"
  → LLM reads endgame://catalog/models        (browse 97 models)
  → LLM calls load_data(source="loans.csv", target_column="default")
  → LLM calls recommend_models(dataset_id="ds_...", time_budget="medium")
  → LLM calls train_model(dataset_id="ds_...", model_name="lgbm")
  → LLM calls evaluate_model(model_id="model_...")
  → LLM calls create_visualization(chart_type="roc_curve", model_id="model_...")
  → LLM calls export_script(model_id="model_...")

Tools Reference¶

Data (3 tools)¶

Tool	Purpose
`load_data`	Load from CSV/Parquet/URL/OpenML. Auto-detects task type. Returns dataset ID.
`inspect_data`	Explore a dataset: summary, describe, correlations, missing, distribution, head, dtypes.
`split_data`	Create stratified train/test splits. Returns two new dataset IDs.

load_data parameters:

source — File path, URL, or "openml:31" / "openml:credit-g"
target_column — Name of the target column
name — Optional display name
sample_n — Subsample to N rows

inspect_data operations:

summary — Shape, dtypes, missing values, meta-features
describe — Statistical summary (mean, std, quartiles)
correlations — Top 20 pairwise correlations
missing — Missing value counts and percentages
distribution — Value counts or quantile stats for a column
head — First 10 rows
dtypes — Column data types

Discovery (3 tools)¶

Tool	Purpose
`list_models`	Search available models by task type, family, interpretability, speed.
`recommend_models`	Smart recommendations based on dataset meta-features and time budget.
`describe_model`	Full metadata for a model (params, capabilities, speed, notes).

list_models filters:

task_type — "classification" or "regression"
family — "gbdt", "neural", "tree", "linear", "kernel", "rules", "bayesian", "foundation", "ensemble"
interpretable_only — Only glass-box models
fast_only — Exclude slow/very_slow models
max_samples — Only models that scale to N samples

Training (3 tools)¶

Tool	Purpose
`train_model`	Train a single model with cross-validation. Returns model ID + metrics.
`automl`	Full AutoML pipeline (preprocessing → training → ensembling).
`quick_compare`	Quick multi-model comparison with leaderboard.

train_model parameters:

dataset_id — From load_data
model_name — Registry key (e.g. "lgbm", "xgb", "ebm")
params — JSON string of hyperparameter overrides: '{"n_estimators": 500}'
cv_folds — Number of CV folds (default 5)
metric — Evaluation metric (default "auto")

automl presets: best_quality, high_quality, good_quality, medium_quality, fast, interpretable

Evaluation (2 tools)¶

Tool	Purpose
`evaluate_model`	Compute metrics on test data or OOF predictions.
`explain_model`	Feature importance (`importance`) or permutation importance (`permutation`).

evaluate_model metrics (comma-separated string):

Classification: accuracy, roc_auc, f1, precision, recall, balanced_accuracy, log_loss, matthews_corrcoef, cohen_kappa
Regression: rmse, r2, mae, mape, median_ae, max_error, explained_variance

Prediction (1 tool)¶

Tool	Purpose
`predict`	Generate predictions, optionally save to CSV. Supports probabilities.

Preprocessing (1 tool)¶

Tool	Purpose
`preprocess`	Chain preprocessing operations. Returns a new dataset ID.

Operations (JSON array):

[
  {"type": "impute", "strategy": "median"},
  {"type": "scale", "method": "standard"},
  {"type": "encode", "method": "label"},
  {"type": "balance", "method": "smote"},
  {"type": "select_features", "method": "mutual_info", "top_k": 20},
  {"type": "drop_columns", "columns": ["id", "name"]}
]

Visualization (2 tools)¶

Tool	Purpose
`create_visualization`	Generate a self-contained HTML chart.
`create_report`	Full classification or regression evaluation report.

Chart types:

ML evaluation: roc_curve, pr_curve, confusion_matrix, calibration_plot, lift_chart, feature_importance
Data exploration: histogram, scatterplot, heatmap, box_plot, bar_chart, line_chart

Export (2 tools)¶

Tool	Purpose
`export_script`	Generate a standalone Python script reproducing the pipeline.
`save_model`	Save trained model to disk (`.egm` format).

Advanced (3 tools)¶

Tool	Purpose
`cluster`	Clustering: `auto`, `kmeans`, `hdbscan`, `dbscan`, `agglomerative`, `gaussian_mixture`.
`detect_anomalies`	Outlier detection: `isolation_forest`, `lof`, `elliptic_envelope`.
`forecast`	Time series forecasting: `auto`/`arima`, `ets`, `theta`, `naive`.

Resources Reference¶

Resources are read-only catalogs the LLM can browse without making a tool call — zero overhead for discovery.

URI	Content
`endgame://catalog/models`	All 97 models grouped by family with name, fit time, and description
`endgame://catalog/presets`	6 AutoML presets with time limits, model pools, and settings
`endgame://catalog/visualizers`	Available chart types with required inputs
`endgame://catalog/metrics`	Classification + regression metrics with descriptions
`endgame://session/state`	Current loaded datasets, trained models, and visualizations
`endgame://guide/examples`	Example workflows for common ML tasks

Example Workflows¶

Train a single model¶

You: Load iris.csv and train a LightGBM classifier

LLM calls:
  load_data(source="iris.csv", target_column="species")
  train_model(dataset_id="ds_...", model_name="lgbm")
  evaluate_model(model_id="model_...")

Full AutoML¶

You: Run AutoML on my dataset with high quality

LLM calls:
  load_data(source="data.csv", target_column="label")
  automl(dataset_id="ds_...", preset="high_quality")

Interpretable pipeline¶

You: I need an interpretable model for regulatory compliance

LLM calls:
  load_data(source="loans.csv", target_column="default")
  list_models(task_type="classification", interpretable_only=true)
  train_model(dataset_id="ds_...", model_name="ebm")
  explain_model(model_id="model_...", method="importance")
  create_report(model_id="model_...")
  export_script(model_id="model_...")

Data exploration¶

You: Explore this dataset and show me the correlations

LLM calls:
  load_data(source="housing.csv", target_column="price")
  inspect_data(dataset_id="ds_...", operation="summary")
  inspect_data(dataset_id="ds_...", operation="correlations")
  create_visualization(chart_type="heatmap", dataset_id="ds_...")
  create_visualization(chart_type="histogram", dataset_id="ds_...", params='{"column": "price"}')

Preprocessing + training¶

You: Impute missing values, scale features, then train XGBoost

LLM calls:
  load_data(source="messy_data.csv", target_column="outcome")
  preprocess(dataset_id="ds_...", operations='[{"type":"impute","strategy":"median"},{"type":"scale","method":"standard"}]')
  train_model(dataset_id="ds_preprocessed_...", model_name="xgb")

Session Management¶

Every artifact gets a short ID:

Datasets: ds_a1b2c3d4
Models: model_e5f6g7h8
Visualizations: viz_i9j0k1l2

These IDs are passed between tools to chain operations. The endgame://session/state resource shows all current artifacts at any time.

Artifacts live in memory for the duration of the server process. Files (visualizations, exported scripts, saved models) are written to the working directory (/tmp/endgame_mcp by default, configurable via ENDGAME_MCP_WORKDIR).

Error Handling¶

All tools return structured JSON with consistent format:

// Success
{"status": "ok", "dataset_id": "ds_a1b2c3d4", "shape": [1000, 15], ...}

// Error
{"status": "error", "error_type": "not_found", "message": "Dataset 'ds_xxx' not found", "hint": "Use load_data() first"}

Error types: not_found, validation, missing_dependency, timeout, internal.

Configuration¶

Environment Variable	Default	Description
`ENDGAME_MCP_WORKDIR`	`/tmp/endgame_mcp`	Working directory for output files
`ENDGAME_MCP_TIMEOUT`	`600`	Max seconds for training operations before timeout

Troubleshooting¶

Categorical features produce wrong predictions¶

Endgame’s MCP server stores fitted label encoders from training and reuses them during evaluation and prediction. This ensures that categorical values like "red" -> 0, "green" -> 1 are encoded consistently across the entire pipeline. If you see unexpected predictions on categorical data, verify you are using a model trained through the MCP server (which stores encoders automatically).

Categories that appear in test data but were not seen during training are encoded as -1.

Forecasting fails with “missing_dependency”¶

The forecast tool requires statsforecast for ARIMA, ETS, and Theta methods. Install it with:

pip install statsforecast

The naive method works without extra dependencies and returns the last observed value repeated for the forecast horizon.

Training hangs or takes too long¶

Training operations have a configurable timeout (default: 10 minutes). Set a custom timeout via the ENDGAME_MCP_TIMEOUT environment variable:

ENDGAME_MCP_TIMEOUT=300 python -m endgame.mcp  # 5-minute timeout

If training consistently times out, try:

A simpler model (e.g., lgbm instead of ft_transformer)
A smaller dataset (use sample_n parameter in load_data)
The fast preset for automl

ROC/PR curves fail on multiclass problems¶

ROC curves and PR curves require binary classification. For multiclass problems, use confusion_matrix instead:

create_visualization(chart_type="confusion_matrix", model_id="model_...")

Server stdout corruption¶

If you see garbled output or JSON parse errors, ensure no Endgame code is printing to stdout. The MCP server redirects stdout to stderr during tool calls, but any code running outside tool calls could corrupt the stdio transport. Use the --sse flag for debugging:

python -m endgame.mcp --sse