Clustering

class endgame.clustering.KMeansClusterer(n_clusters=8, init='k-means++', n_init=20, max_iter=300, tol=0.0001, algorithm='lloyd', random_state=None, n_jobs=-1)[source]

Bases: BaseEstimator, ClusterMixin

K-Means / K-Means++ clustering with competition-tuned defaults.

Wraps sklearn’s KMeans with higher n_init for stability and k-means++ initialization by default.

Parameters:
  • n_clusters (int, default=8) – Number of clusters.

  • init (str or ndarray, default='k-means++') – Initialization method: ‘k-means++’, ‘random’, or centroid array.

  • n_init (int, default=20) – Number of initializations. Higher than sklearn’s historical default of 10 for stability (newer sklearn releases default to ‘auto’).

  • max_iter (int, default=300) – Maximum iterations per run.

  • tol (float, default=1e-4) – Convergence tolerance.

  • algorithm (str, default='lloyd') – Algorithm: ‘lloyd’ or ‘elkan’.

  • random_state (int or None, default=None) – Random seed.

  • n_jobs (int, default=-1) – Parallel jobs.

labels_

Cluster labels.

Type:

ndarray of shape (n_samples,)

cluster_centers_

Centroids.

Type:

ndarray of shape (n_clusters, n_features)

inertia_

Sum of squared distances to nearest centroid.

Type:

float

n_iter_

Number of iterations run.

Type:

int

fit(X, y=None)[source]

Fit K-Means.

Return type:

KMeansClusterer

Parameters:

X (ArrayLike)

predict(X)[source]

Predict cluster labels for new data.

Return type:

ndarray

Parameters:

X (ArrayLike)

fit_predict(X, y=None)[source]

Fit and return cluster labels.

Return type:

ndarray

Parameters:

X (ArrayLike)

transform(X)[source]

Transform X to cluster-distance space.

Return type:

ndarray

Parameters:

X (ArrayLike)
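
As a rough illustration of what "cluster-distance space" means, the sketch below computes the Euclidean distance from each sample to each centroid with plain NumPy. The helper name cluster_distances and the hand-written centroids are assumptions for illustration, not part of endgame; the nearest-centroid labels and the inertia follow from the same distance matrix.

```python
import numpy as np

def cluster_distances(X, centers):
    """Euclidean distance from every sample to every centroid."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Broadcast to (n_samples, n_clusters, n_features), then reduce over features.
    return np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
centers = np.array([[0.0, 1.0], [10.0, 0.0]])  # illustrative centroids, not fitted

D = cluster_distances(X, centers)            # shape (n_samples, n_clusters)
labels = D.argmin(axis=1)                    # nearest centroid per sample
inertia = float((D.min(axis=1) ** 2).sum())  # sum of squared distances to nearest centroid
```

Each row of D is one sample expressed as its distances to the centroids, which is why the transformed array has one column per cluster.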

class endgame.clustering.MiniBatchKMeansClusterer(n_clusters=8, batch_size=1024, init='k-means++', n_init=10, max_iter=300, max_no_improvement=10, random_state=None)[source]

Bases: BaseEstimator, ClusterMixin

Mini-Batch K-Means for large-scale clustering.

Trades a small loss in accuracy for large speed gains on datasets with more than 100K samples by updating centroids from random mini-batches instead of full passes over the data.

Parameters:
  • n_clusters (int, default=8) – Number of clusters.

  • batch_size (int, default=1024) – Mini-batch size.

  • init (str, default='k-means++') – Initialization method.

  • n_init (int, default=10) – Number of initializations.

  • max_iter (int, default=300) – Maximum iterations.

  • max_no_improvement (int, default=10) – Early stopping patience.

  • random_state (int or None, default=None) – Random seed.

labels_

Cluster labels.

Type:

ndarray of shape (n_samples,)

cluster_centers_

Centroids.

Type:

ndarray of shape (n_clusters, n_features)

inertia_

Sum of squared distances.

Type:

float

fit(X, y=None)[source]

Fit Mini-Batch K-Means.

Return type:

MiniBatchKMeansClusterer

Parameters:

X (ArrayLike)

predict(X)[source]

Predict cluster labels.

Return type:

ndarray

Parameters:

X (ArrayLike)

fit_predict(X, y=None)[source]

Fit and return cluster labels.

Return type:

ndarray

Parameters:

X (ArrayLike)

partial_fit(X, y=None)[source]

Incremental fit on a batch of data.

Return type:

MiniBatchKMeansClusterer

Parameters:

X (ArrayLike)
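
The incremental idea behind mini-batch updates can be sketched in NumPy as follows. This is a minimal sketch of the classic per-centre running-average update, not endgame's or sklearn's implementation; minibatch_update, the starting centres, and the toy data are assumptions for illustration.

```python
import numpy as np

def minibatch_update(centers, counts, batch):
    """One mini-batch step: move each centre toward the batch points assigned to it."""
    # Assign every batch point to its nearest current centre.
    d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    for x, j in zip(batch, nearest):
        counts[j] += 1
        eta = 1.0 / counts[j]  # per-centre learning rate shrinks as the centre sees more points
        centers[j] = (1.0 - eta) * centers[j] + eta * x
    return centers, counts

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(1000, 2)),
               rng.normal(10.0, 0.5, size=(1000, 2))])
rng.shuffle(X)  # shuffle rows so each batch mixes both groups

centers = np.array([[1.0, 1.0], [9.0, 9.0]])  # hand-picked starting centres
counts = np.zeros(2, dtype=int)
for start in range(0, len(X), 100):           # stream the data in batches of 100
    centers, counts = minibatch_update(centers, counts, X[start:start + 100])
```

Because each batch is processed independently, the same loop works when the data arrives in chunks, which is the situation partial_fit is meant for.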

class endgame.clustering.KStarMeansClusterer(k_init=2, k_max=50, max_splits=20, max_iter=300, random_state=None)[source]

Bases: BaseEstimator, ClusterMixin

k*-Means: automatic k determination via Minimum Description Length.

Extends K-Means by splitting and merging clusters based on MDL cost. Starts with k_init clusters, then iteratively splits a cluster when the split reduces the description length and merges clusters when the merge reduces it.

Parameters:
  • k_init (int, default=2) – Initial number of clusters.

  • k_max (int, default=50) – Maximum number of clusters to consider.

  • max_splits (int, default=20) – Maximum split/merge iterations.

  • max_iter (int, default=300) – K-Means iterations per refinement step.

  • random_state (int or None, default=None) – Random seed.

labels_

Cluster labels.

Type:

ndarray of shape (n_samples,)

cluster_centers_

Centroids.

Type:

ndarray of shape (k_optimal, n_features)

n_clusters_

Optimal number of clusters found.

Type:

int

mdl_history_

MDL cost at each iteration.

Type:

list of float

References

k*-Means (2025): automatic k via MDL sub-cluster splitting.

fit(X, y=None)[source]

Fit k*-Means with automatic k selection.

Return type:

KStarMeansClusterer

Parameters:

X (ArrayLike)

predict(X)[source]

Predict cluster labels for new data.

Return type:

ndarray

Parameters:

X (ArrayLike)

fit_predict(X, y=None)[source]

Fit and return cluster labels.

Return type:

ndarray

Parameters:

X (ArrayLike)

class endgame.clustering.DBSCANClusterer(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, n_jobs=-1)[source]

Bases: BaseEstimator, ClusterMixin

DBSCAN density-based clustering with competition defaults.

Finds arbitrary-shaped clusters and labels noise points as -1.

Parameters:
  • eps (float, default=0.5) – Neighbourhood radius.

  • min_samples (int, default=5) – Minimum samples in a neighbourhood for a core point.

  • metric (str, default='euclidean') – Distance metric.

  • algorithm (str, default='auto') – Nearest neighbours algorithm: ‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’.

  • leaf_size (int, default=30) – Leaf size for tree-based algorithms.

  • n_jobs (int, default=-1) – Parallel jobs.

labels_

Cluster labels (-1 for noise).

Type:

ndarray of shape (n_samples,)

core_sample_indices_

Indices of core samples.

Type:

ndarray

n_clusters_

Number of clusters found (excluding noise).

Type:

int

fit(X, y=None)[source]

Fit DBSCAN.

Return type:

DBSCANClusterer

Parameters:

X (ArrayLike)

fit_predict(X, y=None)[source]

Fit and return cluster labels.

Return type:

ndarray

Parameters:

X (ArrayLike)
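
To make the roles of eps and min_samples concrete, here is a small NumPy sketch of the core-point test and of counting clusters while ignoring the noise label. The helper names are hypothetical, and the sketch assumes the sklearn convention that a point counts as a member of its own neighbourhood.

```python
import numpy as np

def core_sample_mask(X, eps=0.5, min_samples=5):
    """True where a point has at least min_samples neighbours within eps (itself included)."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return (dist <= eps).sum(axis=1) >= min_samples

def count_clusters(labels):
    """Number of distinct cluster labels, ignoring the noise label -1."""
    return len({int(label) for label in labels} - {-1})

# Five close points plus one far-away outlier.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
              [5.0, 5.0]])
mask = core_sample_mask(X, eps=0.5, min_samples=5)
n_found = count_clusters([0, 0, 1, -1, 1, -1])
```

The outlier has only itself within eps, so it cannot be a core point; unless it lies within eps of some core point, DBSCAN labels such a point as noise.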

class endgame.clustering.HDBSCANClusterer(min_cluster_size=15, min_samples=None, metric='euclidean', cluster_selection_method='eom', cluster_selection_epsilon=0.0, alpha=1.0, allow_single_cluster=False, n_jobs=-1)[source]

Bases: BaseEstimator, ClusterMixin

HDBSCAN hierarchical density-based clustering.

Conceptually runs DBSCAN across all eps values at once by building a minimum spanning tree (MST) over mutual reachability distances, then extracts the most stable clusters. The main parameter to tune is min_cluster_size. Handles clusters of varying density.

Uses sklearn’s HDBSCAN (>=1.3) with fallback to the hdbscan package.

Parameters:
  • min_cluster_size (int, default=15) – Minimum cluster size.

  • min_samples (int or None, default=None) – Core distance samples. Defaults to min_cluster_size.

  • metric (str, default='euclidean') – Distance metric.

  • cluster_selection_method (str, default='eom') – Cluster extraction: ‘eom’ (Excess of Mass) or ‘leaf’.

  • cluster_selection_epsilon (float, default=0.0) – Distance threshold for merging clusters.

  • alpha (float, default=1.0) – Mutual reachability smoothing.

  • allow_single_cluster (bool, default=False) – Whether to allow a single-cluster result.

  • n_jobs (int, default=-1) – Parallel jobs.

labels_

Cluster labels (-1 for noise).

Type:

ndarray of shape (n_samples,)

probabilities_

Cluster membership probabilities.

Type:

ndarray of shape (n_samples,)

n_clusters_

Number of clusters found.

Type:

int

fit(X, y=None)[source]

Fit HDBSCAN.

Return type:

HDBSCANClusterer

Parameters:

X (ArrayLike)

fit_predict(X, y=None)[source]

Fit and return cluster labels.

Return type:

ndarray

Parameters:

X (ArrayLike)
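
The mutual reachability distance that the MST is built on can be sketched as below. This is an illustrative NumPy version, not the library's code: the function name is hypothetical, the alpha scaling is omitted, and it assumes a point counts as its own first neighbour when computing core distances (implementations differ on this).

```python
import numpy as np

def mutual_reachability(X, min_samples=5):
    """Return the mutual reachability matrix and the per-point core distances."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Core distance: distance to the min_samples-th nearest neighbour.
    # Column 0 of each sorted row is the point itself (distance 0).
    core = np.sort(dist, axis=1)[:, min_samples - 1]
    # d_mreach(a, b) = max(core(a), core(b), d(a, b))
    mreach = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)
    return mreach, core

rng = np.random.default_rng(1)
P = rng.normal(size=(30, 2))
d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
mreach, core = mutual_reachability(P, min_samples=5)
```

Points in sparse regions have large core distances, so their mutual reachability to everything else is inflated, which pushes them toward the noise label.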

class endgame.clustering.OPTICSClusterer(min_samples=5, max_eps=inf, metric='minkowski', p=2, cluster_method='xi', xi=0.05, min_cluster_size=None, n_jobs=-1)[source]

Bases: BaseEstimator, ClusterMixin

OPTICS ordering-based clustering.

Produces a reachability plot and extracts clusters, generalizing DBSCAN to handle varying density.

Parameters:
  • min_samples (int or float, default=5) – Core distance parameter.

  • max_eps (float, default=inf) – Maximum neighbourhood radius.

  • metric (str, default='minkowski') – Distance metric.

  • p (float, default=2) – Minkowski power (2 = Euclidean).

  • cluster_method (str, default='xi') – Extraction method: ‘xi’ or ‘dbscan’.

  • xi (float, default=0.05) – Steepness threshold for xi extraction.

  • min_cluster_size (int or float or None, default=None) – Minimum cluster size for extraction.

  • n_jobs (int, default=-1) – Parallel jobs.

labels_

Cluster labels (-1 for noise).

Type:

ndarray of shape (n_samples,)

reachability_

Reachability distances.

Type:

ndarray of shape (n_samples,)

ordering_

OPTICS ordering.

Type:

ndarray of shape (n_samples,)

n_clusters_

Number of clusters found.

Type:

int

fit(X, y=None)[source]

Fit OPTICS.

Return type:

OPTICSClusterer

Parameters:

X (ArrayLike)

fit_predict(X, y=None)[source]

Fit and return cluster labels.

Return type:

ndarray

Parameters:

X (ArrayLike)

class endgame.clustering.DensityPeaksClusterer(n_clusters=None, percent=2.0, gamma_threshold=None, metric='euclidean', random_state=None)[source]

Bases: BaseEstimator, ClusterMixin

Density Peaks Clustering (DPC).

Cluster centres are points with simultaneously high local density (rho) and large distance to any denser point (delta). Each remaining point is then assigned to the same cluster as its nearest denser neighbour.

Parameters:
  • n_clusters (int or None, default=None) – Number of clusters. If None, auto-select from the decision graph using gamma_threshold.

  • percent (float, default=2.0) – Percentage of data to use as cutoff distance (d_c) for density estimation. E.g. 2.0 means d_c is the distance at the 2nd percentile of all pairwise distances.

  • gamma_threshold (float or None, default=None) – If n_clusters is None, points with rho * delta above this threshold are chosen as centres. If None, uses Otsu-like thresholding on gamma values.

  • metric (str, default='euclidean') – Distance metric.

  • random_state (int or None, default=None) – Random seed (for tie-breaking).

labels_

Cluster labels.

Type:

ndarray of shape (n_samples,)

rho_

Local densities.

Type:

ndarray of shape (n_samples,)

delta_

Distance to nearest denser point.

Type:

ndarray of shape (n_samples,)

centers_

Indices of cluster centres.

Type:

ndarray of shape (n_centers,)

n_clusters_

Number of clusters found.

Type:

int

References

Rodriguez & Laio, “Clustering by fast search and find of density peaks”, Science, 2014.

fit(X, y=None)[source]

Fit DPC.

Return type:

DensityPeaksClusterer

Parameters:

X (ArrayLike)

fit_predict(X, y=None)[source]

Fit and return cluster labels.

Return type:

ndarray

Parameters:

X (ArrayLike)
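
The quantities rho_, delta_ and their product gamma can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: it uses a Gaussian-kernel density (a smooth variant of the hard cutoff count, chosen here to avoid ties), the function name density_peaks is hypothetical, and the toy data is generated only for illustration.

```python
import numpy as np

def density_peaks(X, percent=2.0):
    """Compute local density rho and distance-to-denser-point delta."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Cutoff d_c: the given percentile of all pairwise distances (upper triangle).
    d_c = np.percentile(dist[np.triu_indices(n, k=1)], percent)
    # Gaussian-kernel density; subtract 1 to drop each point's own contribution.
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0
    delta = np.empty(n)
    for i in range(n):
        denser = rho > rho[i]
        # The globally densest point has no denser neighbour: use its largest distance.
        delta[i] = dist[i, denser].min() if denser.any() else dist[i].max()
    return rho, delta

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(10.0, 0.5, size=(50, 2))])
rho, delta = density_peaks(X, percent=2.0)
gamma = rho * delta
top_two = np.argsort(gamma)[-2:]  # candidate centres: the two largest rho * delta values
```

On two well-separated groups, the two largest gamma values should fall in different groups, which is the pattern the decision graph is meant to expose.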

class endgame.clustering.AgglomerativeClusterer(n_clusters=2, linkage='ward', metric='euclidean', distance_threshold=None, connectivity=None, compute_full_tree='auto', compute_distances=False)[source]

Bases: BaseEstimator, ClusterMixin

Agglomerative hierarchical clustering with multiple linkage options.

Ward’s linkage (default) is the strongest general-purpose option. Average linkage is robust. Single linkage is fast but chaining-sensitive. Complete linkage produces compact clusters.

Parameters:
  • n_clusters (int or None, default=2) – Number of clusters. If None, must provide distance_threshold.

  • linkage (str, default='ward') – Linkage criterion: ‘ward’, ‘average’, ‘complete’, ‘single’.

  • metric (str, default='euclidean') – Distance metric (only used with non-ward linkage).

  • distance_threshold (float or None, default=None) – Distance threshold for stopping. If set, n_clusters must be None.

  • connectivity (array-like or callable or None, default=None) – Connectivity constraints.

  • compute_full_tree (bool or 'auto', default='auto') – Whether to compute the full dendrogram.

  • compute_distances (bool, default=False) – Whether to compute distances between clusters.

labels_

Cluster labels.

Type:

ndarray of shape (n_samples,)

n_clusters_

Number of clusters.

Type:

int

n_leaves_

Number of leaves in the dendrogram.

Type:

int

children_

Merge history.

Type:

ndarray of shape (n_nodes-1, 2)

distances_

Distances between merged clusters (if compute_distances=True).

Type:

ndarray or None

fit(X, y=None)[source]

Fit agglomerative clustering.

Return type:

AgglomerativeClusterer

Parameters:

X (ArrayLike)

fit_predict(X, y=None)[source]

Fit and return cluster labels.

Return type:

ndarray

Parameters:

X (ArrayLike)

class endgame.clustering.GaussianMixtureClusterer(n_components=8, covariance_type='full', n_init=5, max_iter=200, tol=0.001, reg_covar=1e-06, init_params='k-means++', random_state=None)[source]

Bases: BaseEstimator, ClusterMixin

Gaussian Mixture Model clustering.

Fits n_components Gaussians via EM. As the probabilistic analog of K-Means, it gives soft assignments and handles elliptical clusters. Supports BIC/AIC for model selection.

Parameters:
  • n_components (int, default=8) – Number of mixture components.

  • covariance_type (str, default='full') – Covariance type: ‘full’, ‘tied’, ‘diag’, ‘spherical’.

  • n_init (int, default=5) – Number of EM initializations.

  • max_iter (int, default=200) – Maximum EM iterations.

  • tol (float, default=1e-3) – Convergence tolerance.

  • reg_covar (float, default=1e-6) – Covariance regularization.

  • init_params (str, default='k-means++') – Initialization: ‘kmeans’, ‘k-means++’, ‘random’, ‘random_from_data’.

  • random_state (int or None, default=None) – Random seed.

labels_

Hard cluster assignments (argmax of responsibilities).

Type:

ndarray of shape (n_samples,)

probabilities_

Soft assignment probabilities (responsibilities).

Type:

ndarray of shape (n_samples, n_components)

means_

Component means.

Type:

ndarray of shape (n_components, n_features)

covariances_

Component covariances.

Type:

ndarray

weights_

Mixing weights.

Type:

ndarray of shape (n_components,)

bic_

Bayesian Information Criterion of the fitted model.

Type:

float

aic_

Akaike Information Criterion of the fitted model.

Type:

float

fit(X, y=None)[source]

Fit GMM.

Return type:

GaussianMixtureClusterer

Parameters:

X (ArrayLike)

predict(X)[source]

Predict hard cluster labels.

Return type:

ndarray

Parameters:

X (ArrayLike)

predict_proba(X)[source]

Predict soft cluster probabilities.

Return type:

ndarray

Parameters:

X (ArrayLike)

fit_predict(X, y=None)[source]

Fit and return hard cluster labels.

Return type:

ndarray

Parameters:

X (ArrayLike)

score(X)[source]

Return average log-likelihood.

Return type:

float

Parameters:

X (ArrayLike)

select_n_components(X, k_range=None, criterion='bic')[source]

Select optimal n_components via BIC or AIC.

Parameters:
  • X (array-like) – Data to evaluate.

  • k_range (range or None, default=None) – Range of k values. Defaults to range(1, 21).

  • criterion (str, default='bic') – Selection criterion: ‘bic’ or ‘aic’.

Return type:

int

Returns:

int – Optimal number of components.
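
How BIC and AIC trade fit against model size can be sketched with the standard formulas. The helpers below are hypothetical and are not endgame's implementation. They assume the usual free-parameter count for a Gaussian mixture and take per-sample average log-likelihoods (the quantity score(X) is documented to return) as input. The numbers in avg_ll are made up purely to exercise the code.

```python
import math

def gmm_n_parameters(k, d, covariance_type="full"):
    """Free parameters of a k-component GMM in d dimensions."""
    cov = {
        "full": k * d * (d + 1) // 2,
        "tied": d * (d + 1) // 2,
        "diag": k * d,
        "spherical": k,
    }[covariance_type]
    return cov + k * d + (k - 1)  # covariances + means + mixing weights

def bic(total_log_likelihood, n_params, n_samples):
    return -2.0 * total_log_likelihood + n_params * math.log(n_samples)

def aic(total_log_likelihood, n_params):
    return -2.0 * total_log_likelihood + 2.0 * n_params

def select_k(avg_loglik_by_k, n_samples, d, criterion="bic"):
    """Pick the k with the lowest criterion value."""
    scores = {}
    for k, avg_ll in avg_loglik_by_k.items():
        total = avg_ll * n_samples  # average log-likelihood -> total log-likelihood
        p = gmm_n_parameters(k, d)
        scores[k] = bic(total, p, n_samples) if criterion == "bic" else aic(total, p)
    return min(scores, key=scores.get)

# Made-up average log-likelihoods for illustration only (not measured results).
avg_ll = {1: -4.60, 2: -4.10, 3: -3.50, 4: -3.49, 5: -3.485}
best_k = select_k(avg_ll, n_samples=300, d=2)
```

In this toy input the likelihood barely improves beyond k=3, so the parameter penalty outweighs the gain and the smaller model is preferred.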

class endgame.clustering.FuzzyCMeansClusterer(n_clusters=8, m=2.0, max_iter=300, tol=0.0001, random_state=None)[source]

Bases: BaseEstimator, ClusterMixin

Fuzzy C-Means clustering.

Soft version of K-Means where each point has a degree of membership in each cluster. Useful when clusters genuinely overlap.

Parameters:
  • n_clusters (int, default=8) – Number of clusters.

  • m (float, default=2.0) – Fuzziness coefficient (m > 1). Higher values give softer assignments: as m approaches 1 the result approaches hard K-Means, and for m >> 1 memberships approach uniform.

  • max_iter (int, default=300) – Maximum iterations.

  • tol (float, default=1e-4) – Convergence tolerance on membership matrix change.

  • random_state (int or None, default=None) – Random seed.

labels_

Hard cluster labels (argmax of membership).

Type:

ndarray of shape (n_samples,)

membership_

Fuzzy membership matrix.

Type:

ndarray of shape (n_samples, n_clusters)

cluster_centers_

Cluster centroids.

Type:

ndarray of shape (n_clusters, n_features)

n_iter_

Number of iterations run.

Type:

int

fit(X, y=None)[source]

Fit Fuzzy C-Means.

Return type:

FuzzyCMeansClusterer

Parameters:

X (ArrayLike)

predict(X)[source]

Predict hard cluster labels for new data.

Return type:

ndarray

Parameters:

X (ArrayLike)

predict_memberships(X)[source]

Predict fuzzy membership for new data.

Parameters:

X (array-like of shape (n_samples, n_features))

Return type:

ndarray

Returns:

ndarray of shape (n_samples, n_clusters) – Membership matrix.

fit_predict(X, y=None)[source]

Fit and return hard cluster labels.

Return type:

ndarray

Parameters:

X (ArrayLike)
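
The effect of the fuzziness coefficient m can be seen from the standard Fuzzy C-Means membership update, sketched here in NumPy. The function name and the fixed centres are assumptions for illustration; this is not the library's code, only the textbook formula u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)).

```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0, eps=1e-12):
    """Membership of each sample in each cluster for fixed centres."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    dist = np.maximum(dist, eps)                 # avoid division by zero at a centre
    inv = dist ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)  # each row sums to 1

centers = np.array([[0.0, 0.0], [4.0, 0.0]])     # illustrative centres, not fitted
X = np.array([[1.0, 0.0], [2.0, 0.0]])

U = fuzzy_memberships(X, centers, m=2.0)         # moderate fuzziness
U_sharp = fuzzy_memberships(X, centers, m=1.1)   # close to hard assignment
U_soft = fuzzy_memberships(X, centers, m=10.0)   # close to uniform membership
```

The first sample sits three times closer to the first centre, so its membership there is high for small m and drifts toward 0.5 as m grows; the second sample is equidistant and stays at 0.5 for any m.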

class endgame.clustering.SpectralClusterer(n_clusters=8, affinity='rbf', gamma=None, n_neighbors=10, n_init=10, assign_labels='kmeans', random_state=None, n_jobs=-1)[source]

Bases: BaseEstimator, ClusterMixin

Spectral clustering via graph Laplacian eigenvectors.

Constructs a similarity graph, computes eigenvectors of the graph Laplacian, then runs k-means in the spectral embedding. Excels at non-convex clusters (concentric circles, spirals).

Parameters:
  • n_clusters (int, default=8) – Number of clusters.

  • affinity (str, default='rbf') – Similarity measure: ‘rbf’, ‘nearest_neighbors’, ‘precomputed’.

  • gamma (float or None, default=None) – RBF kernel bandwidth. If None, uses 1/n_features.

  • n_neighbors (int, default=10) – Number of neighbours for ‘nearest_neighbors’ affinity.

  • n_init (int, default=10) – k-means initializations in spectral space.

  • assign_labels (str, default='kmeans') – Label assignment: ‘kmeans’ or ‘discretize’.

  • random_state (int or None, default=None) – Random seed.

  • n_jobs (int, default=-1) – Parallel jobs.

labels_

Cluster labels.

Type:

ndarray of shape (n_samples,)

affinity_matrix_

Computed affinity matrix.

Type:

ndarray of shape (n_samples, n_samples)

n_clusters_

Number of clusters.

Type:

int

fit(X, y=None)[source]

Fit spectral clustering.

Return type:

SpectralClusterer

Parameters:

X (ArrayLike)

fit_predict(X, y=None)[source]

Fit and return cluster labels.

Return type:

ndarray

Parameters:

X (ArrayLike)
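
The pipeline described for this class (similarity graph, Laplacian eigenvectors, clustering in the embedding) can be sketched for the two-cluster case with NumPy alone. This is a simplified sketch, not endgame's implementation: spectral_bipartition is a hypothetical helper, and it replaces the k-means step with the sign of the second eigenvector, which is enough for k=2.

```python
import numpy as np

def spectral_bipartition(X, gamma=None):
    """Split X into two groups using the second eigenvector of the normalised Laplacian."""
    X = np.asarray(X, dtype=float)
    if gamma is None:
        gamma = 1.0 / X.shape[1]           # 1/n_features, as in the gamma description
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-gamma * sq)                # RBF similarity graph
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # Symmetric normalised Laplacian: I - D^{-1/2} W D^{-1/2}
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                # eigenvector of the second-smallest eigenvalue
    return (fiedler > 0).astype(int)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 0.3, size=(20, 2)),
               rng.normal([3.0, 0.0], 0.3, size=(20, 2))])
labels = spectral_bipartition(X)
```

For more than two clusters, the usual approach is to keep the first n_clusters eigenvectors and cluster their rows, which is where assign_labels comes in.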

class endgame.clustering.AffinityPropagationClusterer(damping=0.5, max_iter=200, convergence_iter=15, preference=None, affinity='euclidean', random_state=None)[source]

Bases: BaseEstimator, ClusterMixin

Affinity Propagation clustering via message passing.

Simultaneously chooses exemplars and assigns points via responsibility and availability messages. No k required.

Parameters:
  • damping (float, default=0.5) – Damping factor (0.5 to 1). Higher = more stable but slower.

  • max_iter (int, default=200) – Maximum message-passing iterations.

  • convergence_iter (int, default=15) – Iterations without change for convergence.

  • preference (float or array-like or None, default=None) – Preference for each point to be an exemplar. Larger = more clusters. None uses the median of the similarity matrix.

  • affinity (str, default='euclidean') – Affinity type: ‘euclidean’ or ‘precomputed’.

  • random_state (int or None, default=None) – Random seed.

labels_

Cluster labels.

Type:

ndarray of shape (n_samples,)

cluster_centers_indices_

Indices of exemplar points.

Type:

ndarray

cluster_centers_

Exemplar coordinates.

Type:

ndarray of shape (n_clusters, n_features)

n_clusters_

Number of clusters found.

Type:

int

n_iter_

Iterations run.

Type:

int

fit(X, y=None)[source]

Fit Affinity Propagation.

Return type:

AffinityPropagationClusterer

Parameters:

X (ArrayLike)

predict(X)[source]

Predict cluster labels for new data.

Return type:

ndarray

Parameters:

X (ArrayLike)

fit_predict(X, y=None)[source]

Fit and return cluster labels.

Return type:

ndarray

Parameters:

X (ArrayLike)

class endgame.clustering.BIRCHClusterer(n_clusters=3, threshold=0.5, branching_factor=50, compute_labels=True)[source]

Bases: BaseEstimator, ClusterMixin

BIRCH incremental hierarchical clustering.

Builds a CF-tree (Clustering Feature tree) for incremental clustering. Designed for very large datasets or streaming scenarios.

Parameters:
  • n_clusters (int or None, default=3) – Final number of clusters. If None, the subclusters from the CF-tree leaf nodes are returned directly.

  • threshold (float, default=0.5) – CF-tree leaf radius threshold.

  • branching_factor (int, default=50) – Maximum CF entries per node.

  • compute_labels (bool, default=True) – Whether to compute labels for training data.

labels_

Cluster labels.

Type:

ndarray of shape (n_samples,)

subcluster_centers_

CF-tree subcluster centres.

Type:

ndarray

n_clusters_

Number of clusters.

Type:

int

fit(X, y=None)[source]

Fit BIRCH.

Return type:

BIRCHClusterer

Parameters:

X (ArrayLike)

predict(X)[source]

Predict cluster labels for new data.

Return type:

ndarray

Parameters:

X (ArrayLike)

fit_predict(X, y=None)[source]

Fit and return cluster labels.

Return type:

ndarray

Parameters:

X (ArrayLike)

partial_fit(X, y=None)[source]

Incremental fit on a batch of data.

Return type:

BIRCHClusterer

Parameters:

X (ArrayLike)

class endgame.clustering.MeanShiftClusterer(bandwidth=None, bin_seeding=False, min_bin_freq=1, cluster_all=True, n_jobs=-1)[source]

Bases: BaseEstimator, ClusterMixin

Mean Shift mode-finding clustering.

Non-parametric mode finding via kernel density gradient ascent. Automatically determines k by finding density modes.

Parameters:
  • bandwidth (float or None, default=None) – Kernel bandwidth. If None, estimated automatically.

  • bin_seeding (bool, default=False) – Speed up by discretising seed points.

  • min_bin_freq (int, default=1) – Minimum bin frequency for seeding.

  • cluster_all (bool, default=True) – If False, orphan points get label -1.

  • n_jobs (int, default=-1) – Parallel jobs.

labels_

Cluster labels.

Type:

ndarray of shape (n_samples,)

cluster_centers_

Mode locations.

Type:

ndarray of shape (n_clusters, n_features)

n_clusters_

Number of clusters found.

Type:

int

fit(X, y=None)[source]

Fit Mean Shift.

Return type:

MeanShiftClusterer

Parameters:

X (ArrayLike)

predict(X)[source]

Predict cluster labels for new data.

Return type:

ndarray

Parameters:

X (ArrayLike)

fit_predict(X, y=None)[source]

Fit and return cluster labels.

Return type:

ndarray

Parameters:

X (ArrayLike)
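
Mode finding by gradient ascent can be illustrated for a single seed point. The sketch below assumes a flat kernel (every point within bandwidth counts equally); mean_shift_point is a hypothetical helper and the data is a toy example, not the library's procedure.

```python
import numpy as np

def mean_shift_point(x, X, bandwidth, max_iter=100, tol=1e-6):
    """Follow one seed uphill: repeatedly jump to the mean of points within bandwidth."""
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        in_window = np.linalg.norm(X - x, axis=1) <= bandwidth
        if not in_window.any():
            break                          # empty window: nothing to move toward
        new_x = X[in_window].mean(axis=0)
        if np.linalg.norm(new_x - x) < tol:
            return new_x                   # converged to a density mode
        x = new_x
    return x

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(100, 2)),
               rng.normal(5.0, 0.3, size=(100, 2))])
mode_a = mean_shift_point([0.5, 0.5], X, bandwidth=1.0)
mode_b = mean_shift_point([4.5, 5.5], X, bandwidth=1.0)
```

Seeds that converge to the same mode end up in the same cluster, so the number of clusters is the number of distinct modes found rather than a value supplied up front.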

class endgame.clustering.AutoCluster(n_clusters='auto', detect_noise=False, prefer=None, random_state=None, verbose=False, **kwargs)[source]

Bases: BaseEstimator, ClusterMixin

Automatic clustering with method selection based on data properties.

Selects the best clustering algorithm based on:
  • Dataset size (n)

  • Dimensionality (d)

  • Whether k is specified

  • Whether noise detection is needed

Parameters:
  • n_clusters (int or 'auto', default='auto') – Number of clusters. ‘auto’ uses algorithms that determine k automatically (HDBSCAN, k*-Means, or GMM with BIC).

  • detect_noise (bool, default=False) – Whether to detect noise/outlier points (label -1). If True, prefers density-based methods (HDBSCAN, DBSCAN).

  • prefer (str or None, default=None) – Override automatic selection: ‘centroid’, ‘density’, ‘hierarchical’, ‘distribution’, ‘spectral’. If None, auto-selects.

  • random_state (int or None, default=None) – Random seed.

  • verbose (bool, default=False) – Enable verbose output.

  • **kwargs – Additional parameters passed to the selected clusterer.

labels_

Cluster labels.

Type:

ndarray of shape (n_samples,)

selected_method_

Name of the selected algorithm.

Type:

str

clusterer_

The fitted clusterer instance.

Type:

BaseEstimator

n_clusters_

Number of clusters found.

Type:

int

Examples

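>>> from sklearn.datasets import make_blobs
>>> from endgame.clustering import AutoCluster
>>> X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # toy data for illustration
>>> ac = AutoCluster(n_clusters='auto', detect_noise=True)
>>> labels = ac.fit_predict(X)
>>> print(f"Selected: {ac.selected_method_}, k={ac.n_clusters_}")

fit(X, y=None)[source]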

Fit the auto-selected clusterer.

Parameters:

X (array-like of shape (n_samples, n_features))

Return type:

AutoCluster

Returns:

self

predict(X)[source]

Predict cluster labels for new data (if supported).

Parameters:

X (array-like of shape (n_samples, n_features))

Return type:

ndarray

Returns:

ndarray of shape (n_samples,)

fit_predict(X, y=None)[source]

Fit and return cluster labels.

Return type:

ndarray

Parameters:

X (ArrayLike)

class endgame.clustering.GenieClusterer(n_clusters=2, gini_threshold=0.3, affinity='euclidean', exact=True, compute_full_tree=True, M=1)[source]

Bases: BaseEstimator, ClusterMixin

Genie clustering: MST-based with Gini index threshold.

Builds a minimum spanning tree and merges clusters using single linkage, but applies a Gini index threshold on cluster sizes to prevent single linkage’s pathological chaining behavior. Its authors report that it often outperforms Ward and average linkage on benchmark datasets.

Requires the genieclust package.

Parameters:
  • n_clusters (int, default=2) – Number of clusters.

  • gini_threshold (float, default=0.3) – Gini index threshold for cluster size inequality. Lower values enforce more balanced clusters: 1 disables the correction (plain single linkage), while values near 0 force the most balanced cluster sizes.

  • affinity (str, default='euclidean') – Distance metric.

  • exact (bool, default=True) – Use exact (True) or approximate (False) algorithm.

  • compute_full_tree (bool, default=True) – Whether to compute the full hierarchy.

  • M (int, default=1) – Smoothing factor for the mutual reachability distance. M=1 is standard MST; larger M approaches HDBSCAN*-like behavior.

labels_

Cluster labels.

Type:

ndarray of shape (n_samples,)

n_clusters_

Number of clusters.

Type:

int

References

Gagolewski, M., Bartoszuk, M., Cena, A. (2016). “Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm.” Information Sciences.

Gagolewski, M. (2025). Journal of Classification.

fit(X, y=None)[source]

Fit Genie clustering.

Return type:

GenieClusterer

Parameters:

X (ArrayLike)

fit_predict(X, y=None)[source]

Fit and return cluster labels.

Return type:

ndarray

Parameters:

X (ArrayLike)
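
The quantity that gini_threshold is compared against can be sketched as the normalised Gini index of the current cluster sizes. The helper below is an illustrative NumPy version with a hypothetical name, not genieclust's code. As described in the Genie paper, when this index exceeds the threshold the next merge is restricted to one involving a smallest cluster.

```python
import numpy as np

def gini_index(cluster_sizes):
    """Normalised Gini index of cluster sizes: 0 when all equal, near 1 when very unequal."""
    c = np.asarray(cluster_sizes, dtype=float)
    k = len(c)
    if k < 2:
        return 0.0
    pairwise = np.abs(c[:, None] - c[None, :]).sum() / 2.0  # sum of |c_i - c_j| over i < j
    return float(pairwise / ((k - 1) * c.sum()))

balanced = gini_index([5, 5, 5])       # equal sizes
chained = gini_index([1, 1, 1, 97])    # one dominant cluster plus singletons
```

A partition with one large cluster and a few singletons, the typical result of single-linkage chaining, scores close to 1 and would trigger the correction at the default threshold of 0.3.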

class endgame.clustering.FINCHClusterer(req_clust=None, distance='euclidean', verbose=False)[source]

Bases: BaseEstimator, ClusterMixin

FINCH: First Integer Neighbour Clustering Hierarchy.

Parameter-free clustering that uses first-neighbour relations to recursively merge clusters in O(n log n) time with O(n) memory. Typically produces a hierarchy of partitions in 4-10 steps.

Requires the finch-clust package.

Parameters:
  • req_clust (int or None, default=None) – Requested number of clusters. If None, returns the partition at the first hierarchy level (i.e. the finest partition the hierarchy produces).

  • distance (str, default='euclidean') – Distance metric: ‘euclidean’ or ‘cosine’.

  • verbose (bool, default=False) – Print hierarchy information.

labels_

Cluster labels at the selected partition level.

Type:

ndarray of shape (n_samples,)

all_partitions_

All hierarchy levels.

Type:

ndarray of shape (n_samples, n_levels)

n_clusters_

Number of clusters at the selected level.

Type:

int

n_levels_

Number of hierarchy levels found.

Type:

int

References

Sarfraz et al., “Efficient Parameter-Free Clustering Using First Neighbor Relations”, CVPR 2019.

fit(X, y=None)[source]

Fit FINCH.

Return type:

FINCHClusterer

Parameters:

X (ArrayLike)

fit_predict(X, y=None)[source]

Fit and return cluster labels.

Return type:

ndarray

Parameters:

X (ArrayLike)
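
The first-neighbour idea can be sketched for a single hierarchy level: link every point to its nearest neighbour and take connected components. The code below is a simplified illustration with a hypothetical helper name. It uses a dense distance matrix for brevity (so it does not have the O(n log n) cost quoted above) and is not the finch-clust implementation.

```python
import numpy as np

def first_neighbour_partition(X):
    """One FINCH-style level: connected components of the first-neighbour graph."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)         # a point is not its own neighbour
    first = dist.argmin(axis=1)            # index of each point's first neighbour

    parent = list(range(n))                # union-find over points

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Linking i to its first neighbour also joins points that share a first neighbour.
    for i, j in enumerate(first):
        ri, rj = find(i), find(int(j))
        if ri != rj:
            parent[ri] = rj

    roots = [find(i) for i in range(n)]
    relabel = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return np.array([relabel[r] for r in roots])

X = np.array([[0.0], [1.0], [2.1], [10.0], [11.0], [12.1]])
labels = first_neighbour_partition(X)
```

Repeating the same step on the cluster means gives the coarser levels of the hierarchy, which is why only a handful of levels are needed.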