MCP Server¶
Endgame ships an MCP server that lets any MCP-compatible LLM host (Claude Code, Claude Desktop, VS Code Copilot, etc.) build ML pipelines through natural language.
Instead of registering 300+ tools, the server exposes 20 meta-tools and 6 resources — keeping schema overhead under 2K tokens while giving the LLM full access to the toolkit.
Installation¶
pip install endgame-ml[mcp]
# or, if already installed:
pip install "mcp>=1.2.0"
Setup¶
Claude Code¶
Add .mcp.json to your project root (Endgame ships one by default):
{
  "mcpServers": {
    "endgame": {
      "command": "/path/to/your/.venv/bin/python",
      "args": ["-m", "endgame.mcp"]
    }
  }
}
Restart Claude Code. The server starts automatically on the first tool call.
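The command entry has to point at the interpreter of the environment where Endgame is installed. As a minimal sketch, the snippet below builds the same structure as the config shown above from the running interpreter. The helper name build_mcp_config is hypothetical and not part of Endgame.

```python
import json
import sys


def build_mcp_config(python_path: str = sys.executable) -> dict:
    """Return the .mcp.json structure for a given Python interpreter."""
    return {
        "mcpServers": {
            "endgame": {
                "command": python_path,
                "args": ["-m", "endgame.mcp"],
            }
        }
    }


if __name__ == "__main__":
    # Print the config; redirect the output to .mcp.json to save it.
    print(json.dumps(build_mcp_config(), indent=2))
```

Run it with the virtual environment's Python so that sys.executable resolves to the right path.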
Claude Desktop¶
Add to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS):
{
  "mcpServers": {
    "endgame": {
      "command": "/path/to/your/.venv/bin/python",
      "args": ["-m", "endgame.mcp"]
    }
  }
}
Manual / Standalone¶
# stdio transport (default — used by MCP hosts)
python -m endgame.mcp
# SSE transport (for web-based clients)
python -m endgame.mcp --sse
How It Works¶
The LLM never sees 300+ model definitions. Instead:
Resources (zero-cost) let the LLM browse the model catalog, presets, metrics, and visualizers without a tool call
Discovery tools help the LLM find the right model for the dataset
Action tools load data, train, evaluate, visualize, and export
A SessionManager tracks loaded datasets, trained models, and artifacts across tool calls via short IDs (ds_a1b2c3d4, model_e5f6g7h8)
User: "Build a classifier to predict loan defaults"
→ LLM reads endgame://catalog/models (browse 97 models)
→ LLM calls load_data(source="loans.csv", target_column="default")
→ LLM calls recommend_models(dataset_id="ds_...", time_budget="medium")
→ LLM calls train_model(dataset_id="ds_...", model_name="lgbm")
→ LLM calls evaluate_model(model_id="model_...")
→ LLM calls create_visualization(chart_type="roc_curve", model_id="model_...")
→ LLM calls export_script(model_id="model_...")
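The ID-based chaining in the flow above can be pictured with a toy in-memory registry. This is only a sketch under stated assumptions: the class SessionSketch, its methods, and the choice of 8 random hex characters are illustrative, and Endgame's actual SessionManager may work differently.

```python
import uuid


class SessionSketch:
    """Toy in-memory registry; not Endgame's actual SessionManager."""

    def __init__(self) -> None:
        self._artifacts: dict[str, object] = {}

    def add(self, prefix: str, obj: object) -> str:
        # Assumed ID format: prefix plus 8 random hex characters.
        artifact_id = f"{prefix}_{uuid.uuid4().hex[:8]}"
        self._artifacts[artifact_id] = obj
        return artifact_id

    def get(self, artifact_id: str) -> object:
        if artifact_id not in self._artifacts:
            raise KeyError(f"Artifact '{artifact_id}' not found")
        return self._artifacts[artifact_id]


session = SessionSketch()
ds_id = session.add("ds", {"rows": 1000})            # stands in for load_data
model_id = session.add("model", {"dataset": ds_id})  # stands in for train_model
```

Each tool call only needs to return the short ID, which keeps responses small while later calls look the full object up by that ID.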
Tools Reference¶
Data (3 tools)¶
| Tool | Purpose |
|---|---|
| load_data | Load from CSV/Parquet/URL/OpenML. Auto-detects task type. Returns dataset ID. |
| inspect_data | Explore a dataset: summary, describe, correlations, missing, distribution, head, dtypes. |
|  | Create stratified train/test splits. Returns two new dataset IDs. |
load_data parameters:
source: File path, URL, or "openml:31" / "openml:credit-g"
target_column: Name of the target column
name: Optional display name
sample_n: Subsample to N rows
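The source parameter accepts three kinds of strings. The sketch below shows one way to tell them apart. The function classify_source and its return labels are hypothetical, and the server's real detection logic is not documented here.

```python
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Label a load_data source string as 'openml', 'url', or 'file'."""
    if source.startswith("openml:"):
        return "openml"
    # Assumption: only http(s) URLs count as remote sources.
    if urlparse(source).scheme in ("http", "https"):
        return "url"
    return "file"
```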
inspect_data operations:
summary: Shape, dtypes, missing values, meta-features
describe: Statistical summary (mean, std, quartiles)
correlations: Top 20 pairwise correlations
missing: Missing value counts and percentages
distribution: Value counts or quantile stats for a column
head: First 10 rows
dtypes: Column data types
Discovery (3 tools)¶
| Tool | Purpose |
|---|---|
| list_models | Search available models by task type, family, interpretability, speed. |
| recommend_models | Smart recommendations based on dataset meta-features and time budget. |
|  | Full metadata for a model (params, capabilities, speed, notes). |
list_models filters:
task_type: "classification" or "regression"
family: "gbdt", "neural", "tree", "linear", "kernel", "rules", "bayesian", "foundation", "ensemble"
interpretable_only: Only glass-box models
fast_only: Exclude slow/very_slow models
max_samples: Only models that scale to N samples
Training (3 tools)¶
| Tool | Purpose |
|---|---|
| train_model | Train a single model with cross-validation. Returns model ID + metrics. |
| automl | Full AutoML pipeline (preprocessing → training → ensembling). |
|  | Quick multi-model comparison with leaderboard. |
train_model parameters:
dataset_id: From load_data
model_name: Registry key (e.g. "lgbm", "xgb", "ebm")
params: JSON string of hyperparameter overrides, e.g. '{"n_estimators": 500}'
cv_folds: Number of CV folds (default 5)
metric: Evaluation metric (default "auto")
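Because params is a JSON string rather than a nested object, a client building the call in code has to serialize the overrides first. A minimal sketch, assuming the argument names listed above and a placeholder dataset ID:

```python
import json

# Hyperparameter overrides as a normal Python dict.
overrides = {"n_estimators": 500}

# Arguments for a train_model call; "ds_a1b2c3d4" is a placeholder ID.
arguments = {
    "dataset_id": "ds_a1b2c3d4",
    "model_name": "lgbm",
    "params": json.dumps(overrides),  # serialized to a JSON string
    "cv_folds": 5,
    "metric": "auto",
}
```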
automl presets: best_quality, high_quality, good_quality, medium_quality, fast, interpretable
Evaluation (2 tools)¶
| Tool | Purpose |
|---|---|
| evaluate_model | Compute metrics on test data or OOF predictions. |
| explain_model | Feature importance ( |
evaluate_model metrics (comma-separated string):
Classification: accuracy, roc_auc, f1, precision, recall, balanced_accuracy, log_loss, matthews_corrcoef, cohen_kappa
Regression: rmse, r2, mae, mape, median_ae, max_error, explained_variance
Prediction (1 tool)¶
| Tool | Purpose |
|---|---|
|  | Generate predictions, optionally save to CSV. Supports probabilities. |
Preprocessing (1 tool)¶
| Tool | Purpose |
|---|---|
| preprocess | Chain preprocessing operations. Returns a new dataset ID. |
Operations (JSON array):
[
{"type": "impute", "strategy": "median"},
{"type": "scale", "method": "standard"},
{"type": "encode", "method": "label"},
{"type": "balance", "method": "smote"},
{"type": "select_features", "method": "mutual_info", "top_k": 20},
{"type": "drop_columns", "columns": ["id", "name"]}
]
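The operations list is passed to the tool as a JSON string, as the preprocessing workflow later in this page shows. The sketch below validates the operation types against the ones listed here before serializing. The helper operations_argument is hypothetical and only illustrates the shape of the argument.

```python
import json

# Operation types that appear in the example above.
KNOWN_TYPES = {
    "impute", "scale", "encode", "balance", "select_features", "drop_columns",
}


def operations_argument(operations: list[dict]) -> str:
    """Validate operation types, then serialize the list to a JSON string."""
    for op in operations:
        if op.get("type") not in KNOWN_TYPES:
            raise ValueError(f"Unknown operation type: {op.get('type')!r}")
    return json.dumps(operations)


ops = operations_argument([
    {"type": "impute", "strategy": "median"},
    {"type": "scale", "method": "standard"},
])
```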
Visualization (2 tools)¶
| Tool | Purpose |
|---|---|
| create_visualization | Generate a self-contained HTML chart. |
| create_report | Full classification or regression evaluation report. |
Chart types:
ML evaluation: roc_curve, pr_curve, confusion_matrix, calibration_plot, lift_chart, feature_importance
Data exploration: histogram, scatterplot, heatmap, box_plot, bar_chart, line_chart
Export (2 tools)¶
| Tool | Purpose |
|---|---|
| export_script | Generate a standalone Python script reproducing the pipeline. |
|  | Save trained model to disk ( |
Advanced (3 tools)¶
| Tool | Purpose |
|---|---|
|  | Clustering: |
|  | Outlier detection: |
| forecast | Time series forecasting: |
Resources Reference¶
Resources are read-only catalogs the LLM can browse without making a tool call — zero overhead for discovery.
| URI | Content |
|---|---|
| endgame://catalog/models | All 97 models grouped by family with name, fit time, and description |
|  | 6 AutoML presets with time limits, model pools, and settings |
|  | Available chart types with required inputs |
|  | Classification + regression metrics with descriptions |
| endgame://session/state | Current loaded datasets, trained models, and visualizations |
|  | Example workflows for common ML tasks |
Example Workflows¶
Train a single model¶
You: Load iris.csv and train a LightGBM classifier
LLM calls:
load_data(source="iris.csv", target_column="species")
train_model(dataset_id="ds_...", model_name="lgbm")
evaluate_model(model_id="model_...")
Full AutoML¶
You: Run AutoML on my dataset with high quality
LLM calls:
load_data(source="data.csv", target_column="label")
automl(dataset_id="ds_...", preset="high_quality")
Interpretable pipeline¶
You: I need an interpretable model for regulatory compliance
LLM calls:
load_data(source="loans.csv", target_column="default")
list_models(task_type="classification", interpretable_only=true)
train_model(dataset_id="ds_...", model_name="ebm")
explain_model(model_id="model_...", method="importance")
create_report(model_id="model_...")
export_script(model_id="model_...")
Data exploration¶
You: Explore this dataset and show me the correlations
LLM calls:
load_data(source="housing.csv", target_column="price")
inspect_data(dataset_id="ds_...", operation="summary")
inspect_data(dataset_id="ds_...", operation="correlations")
create_visualization(chart_type="heatmap", dataset_id="ds_...")
create_visualization(chart_type="histogram", dataset_id="ds_...", params='{"column": "price"}')
Preprocessing + training¶
You: Impute missing values, scale features, then train XGBoost
LLM calls:
load_data(source="messy_data.csv", target_column="outcome")
preprocess(dataset_id="ds_...", operations='[{"type":"impute","strategy":"median"},{"type":"scale","method":"standard"}]')
train_model(dataset_id="ds_preprocessed_...", model_name="xgb")
Session Management¶
Every artifact gets a short ID:
Datasets: ds_a1b2c3d4
Models: model_e5f6g7h8
Visualizations: viz_i9j0k1l2
These IDs are passed between tools to chain operations. The endgame://session/state resource shows all current artifacts at any time.
Artifacts live in memory for the duration of the server process. Files (visualizations, exported scripts, saved models) are written to the working directory (/tmp/endgame_mcp by default, configurable via ENDGAME_MCP_WORKDIR).
Error Handling¶
All tools return structured JSON with consistent format:
// Success
{"status": "ok", "dataset_id": "ds_a1b2c3d4", "shape": [1000, 15], ...}
// Error
{"status": "error", "error_type": "not_found", "message": "Dataset 'ds_xxx' not found", "hint": "Use load_data() first"}
Error types: not_found, validation, missing_dependency, timeout, internal.
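A client can branch on the status field and surface the hint when a call fails. The following is a sketch of such a wrapper, assuming only the fields shown in the examples above. ToolError and parse_tool_response are hypothetical names, not part of Endgame.

```python
import json


class ToolError(Exception):
    """Raised when a tool response has status 'error'."""

    def __init__(self, error_type: str, message: str, hint=None) -> None:
        super().__init__(message)
        self.error_type = error_type
        self.hint = hint


def parse_tool_response(raw: str) -> dict:
    """Decode a tool response and raise ToolError on an error status."""
    payload = json.loads(raw)
    if payload.get("status") == "error":
        raise ToolError(
            payload.get("error_type", "internal"),
            payload.get("message", ""),
            payload.get("hint"),
        )
    return payload
```

Keeping error_type on the exception lets callers treat not_found or validation errors differently from timeout or internal ones.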
Configuration¶
| Environment Variable | Default | Description |
|---|---|---|
| ENDGAME_MCP_WORKDIR | /tmp/endgame_mcp | Working directory for output files |
| ENDGAME_MCP_TIMEOUT | 600 | Max seconds for training operations before timeout |
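As an illustration of how these two variables and their defaults could be resolved, here is a small sketch. The function read_config is hypothetical, and the 600-second default is derived from the 10-minute timeout mentioned under Troubleshooting.

```python
import os
from pathlib import Path

DEFAULT_WORKDIR = "/tmp/endgame_mcp"
DEFAULT_TIMEOUT_SECONDS = 600  # 10 minutes


def read_config(env=os.environ) -> tuple[Path, int]:
    """Return (working directory, timeout in seconds) from the environment."""
    workdir = Path(env.get("ENDGAME_MCP_WORKDIR", DEFAULT_WORKDIR))
    timeout = int(env.get("ENDGAME_MCP_TIMEOUT", DEFAULT_TIMEOUT_SECONDS))
    return workdir, timeout
```

Passing a plain dict as env makes the function easy to check without touching the real environment.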
Troubleshooting¶
Categorical features produce wrong predictions¶
Endgame’s MCP server stores fitted label encoders from training and reuses them during evaluation and prediction. This ensures that categorical values like "red" -> 0, "green" -> 1 are encoded consistently across the entire pipeline. If you see unexpected predictions on categorical data, verify you are using a model trained through the MCP server (which stores encoders automatically).
Categories that appear in test data but were not seen during training are encoded as -1.
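The encoding behaviour described above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and assigning codes in first-seen order is an assumption chosen to match the "red" -> 0, "green" -> 1 example.

```python
def fit_label_mapping(values: list[str]) -> dict[str, int]:
    """Assign each distinct training category an integer, in first-seen order."""
    mapping: dict[str, int] = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def encode(values: list[str], mapping: dict[str, int]) -> list[int]:
    """Encode values with a fitted mapping; unseen categories become -1."""
    return [mapping.get(value, -1) for value in values]


mapping = fit_label_mapping(["red", "green", "red"])
encoded = encode(["green", "blue", "red"], mapping)  # "blue" was never seen
```

Reusing the mapping fitted at training time is what keeps codes stable. Refitting on test data could assign different integers to the same category.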
Forecasting fails with “missing_dependency”¶
The forecast tool requires statsforecast for ARIMA, ETS, and Theta methods. Install it with:
pip install statsforecast
The naive method works without extra dependencies and returns the last observed value repeated for the forecast horizon.
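The naive method is simple enough to state directly. The sketch below implements the behaviour described above (repeat the last observed value). The function name and signature are illustrative rather than the tool's actual interface.

```python
def naive_forecast(history: list[float], horizon: int) -> list[float]:
    """Repeat the last observed value for each step of the horizon."""
    if not history:
        raise ValueError("history must contain at least one observation")
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    return [history[-1]] * horizon
```

For example, a history of [10.0, 12.0, 11.5] with a horizon of 3 yields [11.5, 11.5, 11.5].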
Training hangs or takes too long¶
Training operations have a configurable timeout (default: 10 minutes). Set a custom timeout via the ENDGAME_MCP_TIMEOUT environment variable:
ENDGAME_MCP_TIMEOUT=300 python -m endgame.mcp # 5-minute timeout
If training consistently times out, try:
A simpler model (e.g., lgbm instead of ft_transformer)
A smaller dataset (use the sample_n parameter in load_data)
The fast preset for automl
ROC/PR curves fail on multiclass problems¶
ROC curves and PR curves require binary classification. For multiclass problems, use confusion_matrix instead:
create_visualization(chart_type="confusion_matrix", model_id="model_...")
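A client that picks charts automatically can apply the rule above based on the number of target classes. A minimal sketch, with a hypothetical helper name:

```python
def evaluation_charts(n_classes: int) -> list[str]:
    """Return chart types that are safe to request for a classifier."""
    if n_classes < 2:
        raise ValueError("a classifier needs at least two classes")
    if n_classes == 2:
        # ROC and PR curves require binary classification.
        return ["roc_curve", "pr_curve", "confusion_matrix"]
    return ["confusion_matrix"]
```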
Server stdout corruption¶
If you see garbled output or JSON parse errors, ensure no Endgame code is printing to stdout. The MCP server redirects stdout to stderr during tool calls, but any code running outside tool calls could corrupt the stdio transport. Use the --sse flag for debugging:
python -m endgame.mcp --sse
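The redirection technique mentioned above can be reproduced with the standard library when wrapping your own code. This sketch uses contextlib.redirect_stdout and is not Endgame's actual implementation. The names run_quietly and noisy_add are hypothetical.

```python
import contextlib
import sys


def run_quietly(func, *args, **kwargs):
    """Call func with anything it prints sent to stderr instead of stdout."""
    with contextlib.redirect_stdout(sys.stderr):
        return func(*args, **kwargs)


def noisy_add(a: int, b: int) -> int:
    print("adding")  # would corrupt a stdio transport if it reached stdout
    return a + b


result = run_quietly(noisy_add, 2, 3)
```

The return value is unaffected, and only the printed text is rerouted. Output written directly to file descriptor 1 by native extensions is not captured by this approach.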