Clustering¶

class endgame.clustering.KMeansClusterer(n_clusters=8, init='k-means++', n_init=20, max_iter=300, tol=0.0001, algorithm='lloyd', random_state=None, n_jobs=-1)[source]¶

Bases: BaseEstimator, ClusterMixin

K-Means / K-Means++ clustering with competition-tuned defaults.

Wraps sklearn’s KMeans with higher n_init for stability and k-means++ initialization by default.

Parameters:

n_clusters (int, default=8) – Number of clusters.
init (str or ndarray, default='k-means++') – Initialization method: ‘k-means++’, ‘random’, or centroid array.
n_init (int, default=20) – Number of initializations (higher than sklearn’s long-standing default of 10, for stability).
max_iter (int, default=300) – Maximum iterations per run.
tol (float, default=1e-4) – Convergence tolerance.
algorithm (str, default='lloyd') – Algorithm: ‘lloyd’ or ‘elkan’.
random_state (int or None, default=None) – Random seed.
n_jobs (int, default=-1) – Parallel jobs.

labels_¶

Cluster labels.

Type:: ndarray of shape (n_samples,)

cluster_centers_¶

Centroids.

Type:: ndarray of shape (n_clusters, n_features)

inertia_¶

Sum of squared distances to nearest centroid.

Type:: float

n_iter_¶

Number of iterations run.

Type:: int

fit(X, y=None)[source]¶

Fit K-Means.

Return type:: KMeansClusterer
Parameters:: X (ArrayLike)

predict(X)[source]¶

Predict cluster labels for new data.

Return type:: ndarray
Parameters:: X (ArrayLike)

fit_predict(X, y=None)[source]¶

Fit and return cluster labels.

Return type:: ndarray
Parameters:: X (ArrayLike)

transform(X)[source]¶

Transform X to cluster-distance space.

Return type:: ndarray
Parameters:: X (ArrayLike)
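
As a rough illustration of what "cluster-distance space" and inertia_ mean, the sketch below computes both with the standard library only. It assumes Euclidean distance to each centroid, which is how sklearn's KMeans.transform behaves; the helper names are hypothetical and are not part of endgame.

```python
import math

def to_cluster_distance_space(points, centers):
    """Return one row per point holding its Euclidean distance to every centre."""
    return [[math.dist(p, c) for c in centers] for p in points]

def inertia(points, centers):
    """Sum of squared distances from each point to its nearest centre."""
    return sum(min(row) ** 2 for row in to_cluster_distance_space(points, centers))

points = [(0.0, 0.0), (3.0, 4.0)]
centers = [(0.0, 0.0), (6.0, 8.0)]

distances = to_cluster_distance_space(points, centers)
# The predicted label is the column index of the smallest distance in each row.
labels = [row.index(min(row)) for row in distances]
```

Each output row has n_clusters columns, so the transformed data can be fed to a downstream model as distance features.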

class endgame.clustering.MiniBatchKMeansClusterer(n_clusters=8, batch_size=1024, init='k-means++', n_init=10, max_iter=300, max_no_improvement=10, random_state=None)[source]¶

Bases: BaseEstimator, ClusterMixin

Mini-Batch K-Means for large-scale clustering.

Trades a small loss in accuracy for large speed gains on datasets with more than about 100K samples by updating centroids from random mini-batches instead of full passes over the data.

Parameters:

n_clusters (int, default=8) – Number of clusters.
batch_size (int, default=1024) – Mini-batch size.
init (str, default='k-means++') – Initialization method.
n_init (int, default=10) – Number of initializations.
max_iter (int, default=300) – Maximum iterations.
max_no_improvement (int, default=10) – Early stopping patience.
random_state (int or None, default=None) – Random seed.

labels_¶

Cluster labels.

Type:: ndarray of shape (n_samples,)

cluster_centers_¶

Centroids.

Type:: ndarray of shape (n_clusters, n_features)

inertia_¶

Sum of squared distances.

Type:: float

fit(X, y=None)[source]¶

Fit Mini-Batch K-Means.

Return type:: MiniBatchKMeansClusterer
Parameters:: X (ArrayLike)

predict(X)[source]¶

Predict cluster labels.

Return type:: ndarray
Parameters:: X (ArrayLike)

fit_predict(X, y=None)[source]¶

Fit and return cluster labels.

Return type:: ndarray
Parameters:: X (ArrayLike)

partial_fit(X, y=None)[source]¶

Incremental fit on a batch of data.

Return type:: MiniBatchKMeansClusterer
Parameters:: X (ArrayLike)
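
The incremental update behind a mini-batch step can be sketched as follows. This is the textbook mini-batch K-Means rule (a per-centre learning rate of 1/count), written as an assumption about what partial_fit does conceptually, not as endgame's actual implementation; minibatch_step and squared_distance are hypothetical helpers.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def minibatch_step(centers, counts, batch):
    """Update centers and counts in place from one mini-batch; return the batch labels."""
    # Assign every sample to its nearest centre before moving anything.
    labels = [min(range(len(centers)), key=lambda k: squared_distance(x, centers[k]))
              for x in batch]
    for x, j in zip(batch, labels):
        counts[j] += 1
        eta = 1.0 / counts[j]  # learning rate shrinks as the centre sees more data
        centers[j] = [(1.0 - eta) * c + eta * v for c, v in zip(centers[j], x)]
    return labels

centers = [[0.0], [10.0]]
counts = [0, 0]
minibatch_step(centers, counts, [[1.0], [9.0]])
minibatch_step(centers, counts, [[3.0], [11.0]])
```

Because each centre is a running mean of the samples assigned to it, batches can arrive one at a time without revisiting earlier data.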

class endgame.clustering.KStarMeansClusterer(k_init=2, k_max=50, max_splits=20, max_iter=300, random_state=None)[source]¶

Bases: BaseEstimator, ClusterMixin

k*-Means: automatic k determination via Minimum Description Length.

Extends K-Means by splitting and merging clusters based on MDL cost. Starts with k_init clusters, then iteratively splits a cluster when the split reduces the description length and merges clusters when the merge reduces it.

Parameters:

k_init (int, default=2) – Initial number of clusters.
k_max (int, default=50) – Maximum number of clusters to consider.
max_splits (int, default=20) – Maximum split/merge iterations.
max_iter (int, default=300) – K-Means iterations per refinement step.
random_state (int or None, default=None) – Random seed.

labels_¶

Cluster labels.

Type:: ndarray of shape (n_samples,)

cluster_centers_¶

Centroids.

Type:: ndarray of shape (k_optimal, n_features)

n_clusters_¶

Optimal number of clusters found.

Type:: int

mdl_history_¶

MDL cost at each iteration.

Type:: list of float

References

k*-Means (2025): automatic k via MDL sub-cluster splitting.

fit(X, y=None)[source]¶

Fit k*-Means with automatic k selection.

Return type:: KStarMeansClusterer
Parameters:: X (ArrayLike)

predict(X)[source]¶

Predict cluster labels for new data.

Return type:: ndarray
Parameters:: X (ArrayLike)

fit_predict(X, y=None)[source]¶

Fit and return cluster labels.

Return type:: ndarray
Parameters:: X (ArrayLike)

class endgame.clustering.DBSCANClusterer(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, n_jobs=-1)[source]¶

Bases: BaseEstimator, ClusterMixin

DBSCAN density-based clustering with competition defaults.

Finds arbitrary-shaped clusters and labels noise points as -1.

Parameters:

eps (float, default=0.5) – Neighbourhood radius.
min_samples (int, default=5) – Minimum samples in a neighbourhood for a core point.
metric (str, default='euclidean') – Distance metric.
algorithm (str, default='auto') – Nearest neighbours algorithm: ‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’.
leaf_size (int, default=30) – Leaf size for tree-based algorithms.
n_jobs (int, default=-1) – Parallel jobs.

labels_¶

Cluster labels (-1 for noise).

Type:: ndarray of shape (n_samples,)

core_sample_indices_¶

Indices of core samples.

Type:: ndarray

n_clusters_¶

Number of clusters found (excluding noise).

Type:: int

fit(X, y=None)[source]¶

Fit DBSCAN.

Return type:: DBSCANClusterer
Parameters:: X (ArrayLike)

fit_predict(X, y=None)[source]¶

Fit and return cluster labels.

Return type:: ndarray
Parameters:: X (ArrayLike)
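
A minimal sketch of how eps and min_samples define core points, and how n_clusters_ can be derived from labels that contain noise. It assumes Euclidean distance and that a point counts as its own neighbour (the sklearn convention); the function names are hypothetical.

```python
import math

def core_sample_indices(points, eps, min_samples):
    """Indices of points with at least min_samples neighbours within eps (self included)."""
    core = []
    for i, p in enumerate(points):
        n_neighbours = sum(1 for q in points if math.dist(p, q) <= eps)
        if n_neighbours >= min_samples:
            core.append(i)
    return core

def count_clusters(labels):
    """Number of distinct labels, ignoring the noise label -1."""
    return len(set(labels) - {-1})

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
core = core_sample_indices(points, eps=0.5, min_samples=3)
```

The isolated point at (5.0, 5.0) has no neighbours within eps, so it is not a core point and would be labelled -1.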

class endgame.clustering.HDBSCANClusterer(min_cluster_size=15, min_samples=None, metric='euclidean', cluster_selection_method='eom', cluster_selection_epsilon=0.0, alpha=1.0, allow_single_cluster=False, n_jobs=-1)[source]¶

Bases: BaseEstimator, ClusterMixin

HDBSCAN hierarchical density-based clustering.

Conceptually runs DBSCAN across all eps values at once, using a minimum spanning tree of mutual reachability distances, and extracts the most stable clusters. The main parameter to tune is min_cluster_size. Handles clusters of varying density.

Uses sklearn’s HDBSCAN (>=1.3) with fallback to the hdbscan package.

Parameters:

min_cluster_size (int, default=15) – Minimum cluster size.
min_samples (int or None, default=None) – Core distance samples. Defaults to min_cluster_size.
metric (str, default='euclidean') – Distance metric.
cluster_selection_method (str, default='eom') – Cluster extraction: ‘eom’ (Excess of Mass) or ‘leaf’.
cluster_selection_epsilon (float, default=0.0) – Distance threshold for merging clusters.
alpha (float, default=1.0) – Mutual reachability smoothing.
allow_single_cluster (bool, default=False) – Whether to allow a single-cluster result.
n_jobs (int, default=-1) – Parallel jobs.

labels_¶

Cluster labels (-1 for noise).

Type:: ndarray of shape (n_samples,)

probabilities_¶

Cluster membership probabilities.

Type:: ndarray of shape (n_samples,)

n_clusters_¶

Number of clusters found.

Type:: int

fit(X, y=None)[source]¶

Fit HDBSCAN.

Return type:: HDBSCANClusterer
Parameters:: X (ArrayLike)

fit_predict(X, y=None)[source]¶

Fit and return cluster labels.

Return type:: ndarray
Parameters:: X (ArrayLike)
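
The mutual reachability distance that HDBSCAN builds its spanning tree on can be sketched as below. The sketch assumes Euclidean distance and counts the point itself when finding the min_samples-th neighbour; implementations differ on that convention, so treat the exact indexing as an assumption. The helper names are hypothetical.

```python
import math

def core_distance(points, i, min_samples):
    """Distance from point i to its min_samples-th nearest neighbour (self counted)."""
    dists = sorted(math.dist(points[i], q) for q in points)
    return dists[min_samples - 1]

def mutual_reachability(points, i, j, min_samples):
    """max(core distance of i, core distance of j, plain distance between i and j)."""
    return max(core_distance(points, i, min_samples),
               core_distance(points, j, min_samples),
               math.dist(points[i], points[j]))

points = [(0.0,), (1.0,), (2.0,), (10.0,)]
dense_pair = mutual_reachability(points, 0, 1, min_samples=2)
sparse_pair = mutual_reachability(points, 2, 3, min_samples=2)
```

Points in sparse regions have large core distances, so they are pushed away from everything else, which is what makes the resulting hierarchy robust to noise.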

class endgame.clustering.OPTICSClusterer(min_samples=5, max_eps=inf, metric='minkowski', p=2, cluster_method='xi', xi=0.05, min_cluster_size=None, n_jobs=-1)[source]¶

Bases: BaseEstimator, ClusterMixin

OPTICS ordering-based clustering.

Produces a reachability plot and extracts clusters, generalizing DBSCAN to handle varying density.

Parameters:

min_samples (int or float, default=5) – Core distance parameter.
max_eps (float, default=inf) – Maximum neighbourhood radius.
metric (str, default='minkowski') – Distance metric.
p (float, default=2) – Minkowski power (2 = Euclidean).
cluster_method (str, default='xi') – Extraction method: ‘xi’ or ‘dbscan’.
xi (float, default=0.05) – Steepness threshold for xi extraction.
min_cluster_size (int or float or None, default=None) – Minimum cluster size for extraction.
n_jobs (int, default=-1) – Parallel jobs.

labels_¶

Cluster labels (-1 for noise).

Type:: ndarray of shape (n_samples,)

reachability_¶

Reachability distances.

Type:: ndarray of shape (n_samples,)

ordering_¶

OPTICS ordering.

Type:: ndarray of shape (n_samples,)

n_clusters_¶

Number of clusters found.

Type:: int

fit(X, y=None)[source]¶

Fit OPTICS.

Return type:: OPTICSClusterer
Parameters:: X (ArrayLike)

fit_predict(X, y=None)[source]¶

Fit and return cluster labels.

Return type:: ndarray
Parameters:: X (ArrayLike)

class endgame.clustering.DensityPeaksClusterer(n_clusters=None, percent=2.0, gamma_threshold=None, metric='euclidean', random_state=None)[source]¶

Bases: BaseEstimator, ClusterMixin

Density Peaks Clustering (DPC).

Cluster centres are points that have both a high local density (rho) and a large distance to the nearest point of higher density (delta). Each remaining point is assigned to the cluster of its nearest denser neighbour, following that chain until a centre is reached.

Parameters:

n_clusters (int or None, default=None) – Number of clusters. If None, auto-select from the decision graph using gamma_threshold.
percent (float, default=2.0) – Percentile of the pairwise distances used as the cutoff distance (d_c) for density estimation. E.g. 2.0 means d_c is the distance at the 2nd percentile of all pairwise distances.
gamma_threshold (float or None, default=None) – If n_clusters is None, points with rho * delta above this threshold are chosen as centres. If None, uses Otsu-like thresholding on gamma values.
metric (str, default='euclidean') – Distance metric.
random_state (int or None, default=None) – Random seed (for tie-breaking).

labels_¶

Cluster labels.

Type:: ndarray of shape (n_samples,)

rho_¶

Local densities.

Type:: ndarray of shape (n_samples,)

delta_¶

Distance to nearest denser point.

Type:: ndarray of shape (n_samples,)

centers_¶

Indices of cluster centres.

Type:: ndarray of shape (n_centers,)

n_clusters_¶

Number of clusters found.

Type:: int

References

Rodriguez & Laio, “Clustering by fast search and find of density peaks”, Science, 2014.

fit(X, y=None)[source]¶

Fit DPC.

Return type:: DensityPeaksClusterer
Parameters:: X (ArrayLike)

fit_predict(X, y=None)[source]¶

Fit and return cluster labels.

Return type:: ndarray
Parameters:: X (ArrayLike)
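
The quantities rho, delta and gamma = rho * delta can be sketched as follows. This assumes the cut-off kernel from Rodriguez & Laio (count of points closer than d_c) and assigns the densest point its largest distance as delta; endgame may use a different kernel or tie-breaking rule. density_peaks_scores is a hypothetical helper.

```python
import math

def density_peaks_scores(points, d_c):
    """Return (rho, delta, gamma) lists for a small data set."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho: number of other points closer than the cutoff distance d_c
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    delta = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # A point with no denser neighbour takes its largest distance instead.
        delta.append(min(denser) if denser else max(dist[i]))
    gamma = [r * d for r, d in zip(rho, delta)]
    return rho, delta, gamma

points = [(0.0,), (0.5,), (1.0,), (10.0,), (10.5,), (11.0,)]
rho, delta, gamma = density_peaks_scores(points, d_c=0.75)
# Take the two highest-gamma points as centres.
centres = sorted(range(len(points)), key=lambda i: gamma[i], reverse=True)[:2]
```

In this toy data the middle point of each group has the highest gamma, so choosing the top two gamma values recovers one centre per group.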

class endgame.clustering.AgglomerativeClusterer(n_clusters=2, linkage='ward', metric='euclidean', distance_threshold=None, connectivity=None, compute_full_tree='auto', compute_distances=False)[source]¶

Bases: BaseEstimator, ClusterMixin

Agglomerative hierarchical clustering with multiple linkage options.

Ward’s linkage (default) merges the pair of clusters that least increases within-cluster variance and is a good general-purpose choice. Average linkage is less sensitive to outliers. Single linkage is fast but prone to chaining. Complete linkage produces compact clusters.

Parameters:

n_clusters (int or None, default=2) – Number of clusters. If None, must provide distance_threshold.
linkage (str, default='ward') – Linkage criterion: ‘ward’, ‘average’, ‘complete’, ‘single’.
metric (str, default='euclidean') – Distance metric (only used with non-ward linkage).
distance_threshold (float or None, default=None) – Distance threshold for stopping. If set, n_clusters must be None.
connectivity (array-like or callable or None, default=None) – Connectivity constraints.
compute_full_tree (bool or 'auto', default='auto') – Whether to compute the full dendrogram.
compute_distances (bool, default=False) – Whether to compute distances between clusters.

labels_¶

Cluster labels.

Type:: ndarray of shape (n_samples,)

n_clusters_¶

Number of clusters.

Type:: int

n_leaves_¶

Number of leaves in the dendrogram.

Type:: int

children_¶

Merge history.

Type:: ndarray of shape (n_nodes-1, 2)

distances_¶

Distances between merged clusters (if compute_distances=True).

Type:: ndarray or None

fit(X, y=None)[source]¶

Fit agglomerative clustering.

Return type:: AgglomerativeClusterer
Parameters:: X (ArrayLike)

fit_predict(X, y=None)[source]¶

Fit and return cluster labels.

Return type:: ndarray
Parameters:: X (ArrayLike)

class endgame.clustering.GaussianMixtureClusterer(n_components=8, covariance_type='full', n_init=5, max_iter=200, tol=0.001, reg_covar=1e-06, init_params='k-means++', random_state=None)[source]¶

Bases: BaseEstimator, ClusterMixin

Gaussian Mixture Model clustering.

Fits k Gaussians via EM. The probabilistic analog of K-Means — gives soft assignments and handles elliptical clusters. Supports BIC/AIC for model selection.

Parameters:

n_components (int, default=8) – Number of mixture components.
covariance_type (str, default='full') – Covariance type: ‘full’, ‘tied’, ‘diag’, ‘spherical’.
n_init (int, default=5) – Number of EM initializations.
max_iter (int, default=200) – Maximum EM iterations.
tol (float, default=1e-3) – Convergence tolerance.
reg_covar (float, default=1e-6) – Covariance regularization.
init_params (str, default='k-means++') – Initialization: ‘kmeans’, ‘k-means++’, ‘random’, ‘random_from_data’.
random_state (int or None, default=None) – Random seed.

labels_¶

Hard cluster assignments (argmax of responsibilities).

Type:: ndarray of shape (n_samples,)

probabilities_¶

Soft assignment probabilities (responsibilities).

Type:: ndarray of shape (n_samples, n_components)

means_¶

Component means.

Type:: ndarray of shape (n_components, n_features)

covariances_¶

Component covariances.

Type:: ndarray

weights_¶

Mixing weights.

Type:: ndarray of shape (n_components,)

bic_¶

Bayesian Information Criterion of the fitted model.

Type:: float

aic_¶

Akaike Information Criterion of the fitted model.

Type:: float

fit(X, y=None)[source]¶

Fit GMM.

Return type:: GaussianMixtureClusterer
Parameters:: X (ArrayLike)

predict(X)[source]¶

Predict hard cluster labels.

Return type:: ndarray
Parameters:: X (ArrayLike)

predict_proba(X)[source]¶

Predict soft cluster probabilities.

Return type:: ndarray
Parameters:: X (ArrayLike)

fit_predict(X, y=None)[source]¶

Fit and return hard cluster labels.

Return type:: ndarray
Parameters:: X (ArrayLike)

score(X)[source]¶

Return average log-likelihood.

Return type:: float
Parameters:: X (ArrayLike)

select_n_components(X, k_range=None, criterion='bic')[source]¶

Select optimal n_components via BIC or AIC.

Parameters:

X (array-like) – Data to evaluate.
k_range (range or None, default=None) – Range of k values. Defaults to range(1, 21).
criterion (str, default='bic') – Selection criterion: ‘bic’ or ‘aic’.

Return type:

int

Returns:

int – Optimal number of components.
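
The selection rule can be sketched with the standard definitions BIC = -2 log L + p ln(n) and AIC = -2 log L + 2p, where log L is the total log-likelihood (the average returned by score(X) times n_samples) and p is the number of free parameters. The sketch assumes covariance_type='full'; the log-likelihood values are made up for illustration, and the helper names are hypothetical.

```python
import math

def gmm_n_parameters(k, d):
    """Free parameters of a k-component, d-dimensional GMM with full covariances."""
    means = k * d
    covariances = k * d * (d + 1) // 2
    weights = k - 1  # mixing weights sum to one
    return means + covariances + weights

def bic(total_log_likelihood, n_params, n_samples):
    return -2.0 * total_log_likelihood + n_params * math.log(n_samples)

def aic(total_log_likelihood, n_params):
    return -2.0 * total_log_likelihood + 2.0 * n_params

def select_k(log_likelihoods, d, n_samples, criterion="bic"):
    """log_likelihoods maps k to the total log-likelihood of the fitted model."""
    def score(k):
        p = gmm_n_parameters(k, d)
        ll = log_likelihoods[k]
        return bic(ll, p, n_samples) if criterion == "bic" else aic(ll, p)
    return min(log_likelihoods, key=score)  # lower is better for both criteria

# Made-up log-likelihoods for illustration only.
log_likelihoods = {1: -520.0, 2: -430.0, 3: -425.0}
best_k = select_k(log_likelihoods, d=2, n_samples=200)
```

Going from two to three components barely improves the likelihood here, so the parameter penalty makes k=2 the preferred model under both criteria.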

class endgame.clustering.FuzzyCMeansClusterer(n_clusters=8, m=2.0, max_iter=300, tol=0.0001, random_state=None)[source]¶

Bases: BaseEstimator, ClusterMixin

Fuzzy C-Means clustering.

Soft version of K-Means where each point has a degree of membership in each cluster. Useful when clusters genuinely overlap.

Parameters:

n_clusters (int, default=8) – Number of clusters.
m (float, default=2.0) – Fuzziness coefficient (must be > 1). Higher values give softer assignments. As m approaches 1, memberships approach hard K-Means assignments; for m >> 1 they approach uniform membership.
max_iter (int, default=300) – Maximum iterations.
tol (float, default=1e-4) – Convergence tolerance on membership matrix change.
random_state (int or None, default=None) – Random seed.

labels_¶

Hard cluster labels (argmax of membership).

Type:: ndarray of shape (n_samples,)

membership_¶

Fuzzy membership matrix.

Type:: ndarray of shape (n_samples, n_clusters)

cluster_centers_¶

Cluster centroids.

Type:: ndarray of shape (n_clusters, n_features)

n_iter_¶

Number of iterations run.

Type:: int

fit(X, y=None)[source]¶

Fit Fuzzy C-Means.

Return type:: FuzzyCMeansClusterer
Parameters:: X (ArrayLike)

predict(X)[source]¶

Predict hard cluster labels for new data.

Return type:: ndarray
Parameters:: X (ArrayLike)

predict_memberships(X)[source]¶

Predict fuzzy membership for new data.

Parameters:: X (array-like of shape (n_samples, n_features))
Return type:: ndarray
Returns:: ndarray of shape (n_samples, n_clusters) – Membership matrix.
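
For a single point, the standard Fuzzy C-Means membership formula u_j = 1 / sum_k (d_j / d_k)^(2 / (m - 1)) can be sketched as below. It assumes Euclidean distances to fixed centroids; whether endgame uses exactly this form is an assumption, and fuzzy_memberships is a hypothetical helper.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Membership of one point in every cluster; the values sum to 1."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a centre belongs fully to that centre.
    if 0.0 in dists:
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** power for d_k in dists) for d_j in dists]

centers = [(0.0, 0.0), (4.0, 0.0)]
u = fuzzy_memberships((1.0, 0.0), centers, m=2.0)
sharp = fuzzy_memberships((1.0, 0.0), centers, m=1.1)   # close to a hard assignment
soft = fuzzy_memberships((1.0, 0.0), centers, m=50.0)   # close to uniform
```

Varying m on the same point shows the effect of the fuzziness coefficient: small m concentrates membership on the nearest centroid, large m spreads it almost evenly.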

fit_predict(X, y=None)[source]¶

Fit and return hard cluster labels.

Return type:: ndarray
Parameters:: X (ArrayLike)

class endgame.clustering.SpectralClusterer(n_clusters=8, affinity='rbf', gamma=None, n_neighbors=10, n_init=10, assign_labels='kmeans', random_state=None, n_jobs=-1)[source]¶

Bases: BaseEstimator, ClusterMixin

Spectral clustering via graph Laplacian eigenvectors.

Constructs a similarity graph, computes eigenvectors of the graph Laplacian, then runs k-means in the spectral embedding. Excels at non-convex clusters (concentric circles, spirals).

Parameters:

n_clusters (int, default=8) – Number of clusters.
affinity (str, default='rbf') – Similarity measure: ‘rbf’, ‘nearest_neighbors’, ‘precomputed’.
gamma (float or None, default=None) – RBF kernel coefficient (larger values give a narrower kernel). If None, uses 1/n_features.
n_neighbors (int, default=10) – Number of neighbours for ‘nearest_neighbors’ affinity.
n_init (int, default=10) – k-means initializations in spectral space.
assign_labels (str, default='kmeans') – Label assignment: ‘kmeans’ or ‘discretize’.
random_state (int or None, default=None) – Random seed.
n_jobs (int, default=-1) – Parallel jobs.

labels_¶

Cluster labels.

Type:: ndarray of shape (n_samples,)

affinity_matrix_¶

Computed affinity matrix.

Type:: ndarray of shape (n_samples, n_samples)

n_clusters_¶

Number of clusters.

Type:: int

fit(X, y=None)[source]¶

Fit spectral clustering.

Return type:: SpectralClusterer
Parameters:: X (ArrayLike)

fit_predict(X, y=None)[source]¶

Fit and return cluster labels.

Return type:: ndarray
Parameters:: X (ArrayLike)

class endgame.clustering.AffinityPropagationClusterer(damping=0.5, max_iter=200, convergence_iter=15, preference=None, affinity='euclidean', random_state=None)[source]¶

Bases: BaseEstimator, ClusterMixin

Affinity Propagation clustering via message passing.

Simultaneously chooses exemplars and assigns points via responsibility and availability messages. No k required.

Parameters:

damping (float, default=0.5) – Damping factor (0.5 to 1). Higher = more stable but slower.
max_iter (int, default=200) – Maximum message-passing iterations.
convergence_iter (int, default=15) – Iterations without change for convergence.
preference (float or array-like or None, default=None) – Preference for each point to be an exemplar. Larger = more clusters. None uses the median of the similarity matrix.
affinity (str, default='euclidean') – Affinity type: ‘euclidean’ or ‘precomputed’.
random_state (int or None, default=None) – Random seed.

labels_¶

Cluster labels.

Type:: ndarray of shape (n_samples,)

cluster_centers_indices_¶

Indices of exemplar points.

Type:: ndarray

cluster_centers_¶

Exemplar coordinates.

Type:: ndarray of shape (n_clusters, n_features)

n_clusters_¶

Number of clusters found.

Type:: int

n_iter_¶

Iterations run.

Type:: int

fit(X, y=None)[source]¶

Fit Affinity Propagation.

Return type:: AffinityPropagationClusterer
Parameters:: X (ArrayLike)

predict(X)[source]¶

Predict cluster labels for new data.

Return type:: ndarray
Parameters:: X (ArrayLike)

fit_predict(X, y=None)[source]¶

Fit and return cluster labels.

Return type:: ndarray
Parameters:: X (ArrayLike)

class endgame.clustering.BIRCHClusterer(n_clusters=3, threshold=0.5, branching_factor=50, compute_labels=True)[source]¶

Bases: BaseEstimator, ClusterMixin

BIRCH incremental hierarchical clustering.

Builds a CF-tree (Clustering Feature tree) for incremental clustering. Designed for very large datasets or streaming scenarios.

Parameters:

n_clusters (int or None, default=3) – Final number of clusters. If None, the subclusters from the CF-tree leaf nodes are returned directly.
threshold (float, default=0.5) – CF-tree leaf radius threshold.
branching_factor (int, default=50) – Maximum CF entries per node.
compute_labels (bool, default=True) – Whether to compute labels for training data.

labels_¶

Cluster labels.

Type:: ndarray of shape (n_samples,)

subcluster_centers_¶

CF-tree subcluster centres.

Type:: ndarray

n_clusters_¶

Number of clusters.

Type:: int

fit(X, y=None)[source]¶

Fit BIRCH.

Return type:: BIRCHClusterer
Parameters:: X (ArrayLike)

predict(X)[source]¶

Predict cluster labels for new data.

Return type:: ndarray
Parameters:: X (ArrayLike)

fit_predict(X, y=None)[source]¶

Fit and return cluster labels.

Return type:: ndarray
Parameters:: X (ArrayLike)

partial_fit(X, y=None)[source]¶

Incremental fit on a batch of data.

Return type:: BIRCHClusterer
Parameters:: X (ArrayLike)
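
Incremental fitting works because a Clustering Feature is an additive summary. The sketch below follows the classic BIRCH definition, CF = (N, LS, SS), and is an illustration under that assumption, not endgame's internal data structure; the ClusteringFeature class is hypothetical.

```python
import math

class ClusteringFeature:
    """CF summary: count N, per-dimension linear sum LS, and sum of squared norms SS."""

    def __init__(self, n_features):
        self.n = 0
        self.ls = [0.0] * n_features
        self.ss = 0.0

    def add(self, x):
        """Absorb a single sample."""
        self.n += 1
        self.ls = [s + v for s, v in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def merge(self, other):
        """Absorb another CF; all three components simply add."""
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        """Root-mean-square distance of the summarised samples from the centroid."""
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(v * v for v in c), 0.0))

batch_a = ClusteringFeature(2)
for x in [(0.0, 0.0), (2.0, 0.0)]:
    batch_a.add(x)
batch_b = ClusteringFeature(2)
batch_b.add((4.0, 0.0))
batch_a.merge(batch_b)
```

The centroid and radius are recoverable from the summary alone, so the raw samples never need to be kept in memory; the radius is the quantity a leaf threshold would be compared against.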

class endgame.clustering.MeanShiftClusterer(bandwidth=None, bin_seeding=False, min_bin_freq=1, cluster_all=True, n_jobs=-1)[source]¶

Bases: BaseEstimator, ClusterMixin

Mean Shift mode-finding clustering.

Non-parametric mode finding via kernel density gradient ascent. Automatically determines k by finding density modes.

Parameters:

bandwidth (float or None, default=None) – Kernel bandwidth. If None, estimated automatically.
bin_seeding (bool, default=False) – Speed up by discretising seed points.
min_bin_freq (int, default=1) – Minimum bin frequency for seeding.
cluster_all (bool, default=True) – If False, orphan points get label -1.
n_jobs (int, default=-1) – Parallel jobs.

labels_¶

Cluster labels.

Type:: ndarray of shape (n_samples,)

cluster_centers_¶

Mode locations.

Type:: ndarray of shape (n_clusters, n_features)

n_clusters_¶

Number of clusters found.

Type:: int

fit(X, y=None)[source]¶

Fit Mean Shift.

Return type:: MeanShiftClusterer
Parameters:: X (ArrayLike)

predict(X)[source]¶

Predict cluster labels for new data.

Return type:: ndarray
Parameters:: X (ArrayLike)

fit_predict(X, y=None)[source]¶

Fit and return cluster labels.

Return type:: ndarray
Parameters:: X (ArrayLike)
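
Mode finding can be sketched in one dimension with a flat kernel: each seed moves to the mean of the data inside its window until it stops moving. This is a simplified illustration under those assumptions (1-D data, flat kernel, seeds taken from the data); mean_shift_mode is a hypothetical helper.

```python
def mean_shift_mode(x, data, bandwidth, max_iter=100, tol=1e-6):
    """Shift a 1-D seed x to the mean of its window until it stops moving."""
    for _ in range(max_iter):
        window = [v for v in data if abs(v - x) <= bandwidth]
        new_x = sum(window) / len(window)
        if abs(new_x - x) < tol:
            break
        x = new_x
    return x

data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]
# Seeds that converge to the same location (after rounding) share a cluster.
modes = sorted({round(mean_shift_mode(v, data, bandwidth=1.0), 3) for v in data})
```

The number of distinct modes is the number of clusters, which is why no k has to be supplied; the bandwidth controls how many modes survive.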

class endgame.clustering.AutoCluster(n_clusters='auto', detect_noise=False, prefer=None, random_state=None, verbose=False, **kwargs)[source]¶

Bases: BaseEstimator, ClusterMixin

Automatic clustering with method selection based on data properties.

Selects the best clustering algorithm based on:

- Dataset size (n)
- Dimensionality (d)
- Whether k is specified
- Whether noise detection is needed

Parameters:

n_clusters (int or 'auto', default='auto') – Number of clusters. ‘auto’ uses algorithms that determine k automatically (HDBSCAN, k*-Means, or GMM with BIC).
detect_noise (bool, default=False) – Whether to detect noise/outlier points (label -1). If True, prefers density-based methods (HDBSCAN, DBSCAN).
prefer (str or None, default=None) – Override automatic selection: ‘centroid’, ‘density’, ‘hierarchical’, ‘distribution’, ‘spectral’. If None, auto-selects.
random_state (int or None, default=None) – Random seed.
verbose (bool, default=False) – Enable verbose output.
**kwargs – Additional parameters passed to the selected clusterer.

labels_¶

Cluster labels.

Type:: ndarray of shape (n_samples,)

selected_method_¶

Name of the selected algorithm.

Type:: str

clusterer_¶

The fitted clusterer instance.

Type:: BaseEstimator

n_clusters_¶

Number of clusters found.

Type:: int

Examples

>>> import numpy as np
>>> from endgame.clustering import AutoCluster
>>> X = np.random.default_rng(0).normal(size=(200, 2))  # placeholder data
>>> ac = AutoCluster(n_clusters='auto', detect_noise=True)
>>> labels = ac.fit_predict(X)
>>> print(f"Selected: {ac.selected_method_}, k={ac.n_clusters_}")

fit(X, y=None)[source]¶

Fit the auto-selected clusterer.

Parameters:

X (array-like of shape (n_samples, n_features))
y (ignored)

Return type:

AutoCluster

Returns:

self

predict(X)[source]¶

Predict cluster labels for new data (if supported).

Parameters:: X (array-like of shape (n_samples, n_features))
Return type:: ndarray
Returns:: ndarray of shape (n_samples,)

fit_predict(X, y=None)[source]¶

Fit and return cluster labels.

Return type:: ndarray
Parameters:: X (ArrayLike)

class endgame.clustering.GenieClusterer(n_clusters=2, gini_threshold=0.3, affinity='euclidean', exact=True, compute_full_tree=True, M=1)[source]¶

Bases: BaseEstimator, ClusterMixin

Genie clustering: MST-based with Gini index threshold.

Builds a minimum spanning tree and merges clusters using single linkage, but applies a Gini index threshold on cluster sizes to prevent single linkage’s chaining behavior. The Genie authors report that it often outperforms Ward and average linkage on standard benchmarks.

Requires the genieclust package.

Parameters:

n_clusters (int, default=2) – Number of clusters.
gini_threshold (float, default=0.3) – Gini index threshold for cluster size inequality. Lower values enforce more balanced clusters. 1 = plain single linkage (no correction); values near 0 force the most balanced cluster sizes.
affinity (str, default='euclidean') – Distance metric.
exact (bool, default=True) – Use exact (True) or approximate (False) algorithm.
compute_full_tree (bool, default=True) – Whether to compute the full hierarchy.
M (int, default=1) – Smoothing factor for the mutual reachability distance. M=1 is standard MST; larger M approaches HDBSCAN*-like behavior.

labels_¶

Cluster labels.

Type:: ndarray of shape (n_samples,)

n_clusters_¶

Number of clusters.

Type:: int

References

Gagolewski, M., Bartoszuk, M., & Cena, A. (2016). “Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm.” Information Sciences.

Gagolewski, M. (2025). Journal of Classification.

fit(X, y=None)[source]¶

Fit Genie clustering.

Return type:: GenieClusterer
Parameters:: X (ArrayLike)

fit_predict(X, y=None)[source]¶

Fit and return cluster labels.

Return type:: ndarray
Parameters:: X (ArrayLike)
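
The inequality measure that gini_threshold is compared against can be sketched as the normalised Gini index of the current cluster sizes, as defined in the Genie paper. The function name is hypothetical, and how endgame or genieclust evaluates it internally is not shown here.

```python
def gini_index(cluster_sizes):
    """Normalised Gini index of cluster sizes: 0 when all sizes are equal."""
    sizes = list(cluster_sizes)
    k = len(sizes)
    total = sum(sizes)
    if k < 2 or total == 0:
        return 0.0
    # Sum of absolute differences over all unordered pairs of clusters.
    pairwise = sum(abs(a - b) for i, a in enumerate(sizes) for b in sizes[i + 1:])
    return pairwise / ((k - 1) * total)

balanced = gini_index([5, 5, 5, 5])
chained = gini_index([17, 1, 1, 1])
```

A partition dominated by one large cluster plus several tiny ones, the typical outcome of single-linkage chaining, scores close to 1, while evenly sized clusters score 0.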

class endgame.clustering.FINCHClusterer(req_clust=None, distance='euclidean', verbose=False)[source]¶

Bases: BaseEstimator, ClusterMixin

FINCH: First Integer Neighbour Clustering Hierarchy.

Zero-parameter clustering that uses first-neighbour relations to recursively merge clusters in O(n log n) with O(n) memory. Produces a hierarchy of partitions in 4-10 steps.

Requires the finch-clust package.

Parameters:

req_clust (int or None, default=None) – Requested number of clusters. If None, returns the partition at the first hierarchy level (i.e. the finest partition FINCH produces).
distance (str, default='euclidean') – Distance metric: ‘euclidean’ or ‘cosine’.
verbose (bool, default=False) – Print hierarchy information.

labels_¶

Cluster labels at the selected partition level.

Type:: ndarray of shape (n_samples,)

all_partitions_¶

All hierarchy levels.

Type:: ndarray of shape (n_samples, n_levels)

n_clusters_¶

Number of clusters at the selected level.

Type:: int

n_levels_¶

Number of hierarchy levels found.

Type:: int

References

Sarfraz et al., “Efficient Parameter-Free Clustering Using First Neighbor Relations”, CVPR 2019.

fit(X, y=None)[source]¶

Fit FINCH.

Return type:: FINCHClusterer
Parameters:: X (ArrayLike)

fit_predict(X, y=None)[source]¶

Fit and return cluster labels.

Return type:: ndarray
Parameters:: X (ArrayLike)
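
One level of the first-neighbour idea can be sketched as follows: link every point to its nearest other point and take the connected components of the resulting graph. This is a simplified, quadratic-time illustration under the assumption of Euclidean distance; it does not reproduce the finch-clust package or its recursion over cluster means. first_neighbour_partition is a hypothetical helper.

```python
import math

def first_neighbour_partition(points):
    """Link each point to its nearest other point; return connected-component labels."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    for i in range(n):
        nearest = min((j for j in range(n) if j != i),
                      key=lambda j: math.dist(points[i], points[j]))
        parent[find(i)] = find(nearest)  # union the two components

    roots = {}
    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (10.0, 2.5)]
labels = first_neighbour_partition(points)
```

Points that share a first neighbour end up in the same component automatically, so no distance threshold or cluster count is needed at this level. Repeating the step on cluster means would yield the coarser levels of the hierarchy.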