Clustering
- class endgame.clustering.KMeansClusterer(n_clusters=8, init='k-means++', n_init=20, max_iter=300, tol=0.0001, algorithm='lloyd', random_state=None, n_jobs=-1)
Bases: BaseEstimator, ClusterMixin
K-Means / K-Means++ clustering with competition-tuned defaults.
Wraps sklearn’s KMeans with a higher n_init for stability and k-means++ initialization by default.
- Parameters:
n_clusters (int, default=8) – Number of clusters.
init (str or ndarray, default='k-means++') – Initialization method: ‘k-means++’, ‘random’, or centroid array.
n_init (int, default=20) – Number of initializations; the run with the lowest inertia is kept (higher than sklearn’s historical default of 10, for stability).
max_iter (int, default=300) – Maximum iterations per run.
tol (float, default=1e-4) – Convergence tolerance.
algorithm (str, default='lloyd') – Algorithm: ‘lloyd’ or ‘elkan’.
random_state (int or None, default=None) – Random seed.
n_jobs (int, default=-1) – Parallel jobs.
- cluster_centers_
Centroids.
- Type:
ndarray of shape (n_clusters, n_features)
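The k-means++ initialization named by the init parameter can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the code that KMeansClusterer or sklearn runs; the function name kmeans_pp_init and the synthetic data are assumptions made for the example.

```python
import numpy as np

def kmeans_pp_init(X, n_clusters, rng):
    """Pick initial centroids with k-means++ (D^2-weighted) sampling."""
    n_samples = X.shape[0]
    # First centre: chosen uniformly at random from the data.
    centers = [X[rng.integers(n_samples)]]
    # Squared distance from each point to its closest chosen centre so far.
    d2 = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(1, n_clusters):
        # Sample the next centre with probability proportional to d2,
        # so far-away points are more likely to seed a new cluster.
        idx = rng.choice(n_samples, p=d2 / d2.sum())
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)

rng = np.random.default_rng(0)
# Three well-separated 2-D blobs (synthetic data, for illustration only).
X = np.vstack([rng.normal(loc, 0.1, size=(50, 2)) for loc in (0.0, 5.0, 10.0)])
centers = kmeans_pp_init(X, n_clusters=3, rng=rng)
print(centers.shape)
```

Repeating the whole procedure n_init times with different seeds and keeping the run with the lowest inertia is what makes a larger n_init more stable, at proportionally higher cost.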
- class endgame.clustering.MiniBatchKMeansClusterer(n_clusters=8, batch_size=1024, init='k-means++', n_init=10, max_iter=300, max_no_improvement=10, random_state=None)
Bases: BaseEstimator, ClusterMixin
Mini-Batch K-Means for large-scale clustering.
Trades a small loss in accuracy for large speed gains on datasets with more than 100K samples by updating centroids from random mini-batches instead of full passes over the data.
- Parameters:
n_clusters (int, default=8) – Number of clusters.
batch_size (int, default=1024) – Mini-batch size.
init (str, default='k-means++') – Initialization method.
n_init (int, default=10) – Number of initializations.
max_iter (int, default=300) – Maximum iterations.
max_no_improvement (int, default=10) – Early-stopping patience: number of consecutive mini-batches without improvement before training stops.
random_state (int or None, default=None) – Random seed.
- cluster_centers_
Centroids.
- Type:
ndarray of shape (n_clusters, n_features)
- fit_predict(X, y=None)
Fit and return cluster labels.
- Return type:
- Parameters:
X (ArrayLike)
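To make the mini-batch idea concrete, the following sketch updates centroids from random batches with a per-centre step size of 1/count. It is a simplified illustration under stated assumptions (random initialization, a fixed number of iterations, no early stopping) and not the implementation behind MiniBatchKMeansClusterer; the function name minibatch_kmeans is hypothetical.

```python
import numpy as np

def minibatch_kmeans(X, n_clusters, batch_size, n_iter, rng):
    """Sketch of mini-batch K-Means with per-centre learning rates."""
    # Plain random initialization keeps the sketch short.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    counts = np.zeros(n_clusters)
    for _ in range(n_iter):
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assign each batch point to its nearest centre.
        d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        for x, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]  # step size shrinks as the centre sees more points
            centers[j] = (1 - eta) * centers[j] + eta * x
    return centers

rng = np.random.default_rng(0)
# Two synthetic blobs, for illustration only.
X = np.vstack([rng.normal(loc, 0.2, size=(500, 2)) for loc in (0.0, 4.0)])
centers = minibatch_kmeans(X, n_clusters=2, batch_size=64, n_iter=50, rng=rng)
print(centers)
```

Each iteration touches only batch_size points, which is where the speed gain over full passes comes from.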
- class endgame.clustering.KStarMeansClusterer(k_init=2, k_max=50, max_splits=20, max_iter=300, random_state=None)
Bases: BaseEstimator, ClusterMixin
k*-Means: automatic k determination via Minimum Description Length.
Extends K-Means by splitting and merging clusters based on MDL cost. Starts with k_init clusters, then iteratively splits a cluster when the split reduces the description length and merges clusters when keeping them separate increases it.
- Parameters:
k_init (int, default=2) – Initial number of clusters.
k_max (int, default=50) – Maximum number of clusters to consider.
max_splits (int, default=20) – Maximum split/merge iterations.
max_iter (int, default=300) – K-Means iterations per refinement step.
random_state (int or None, default=None) – Random seed.
- cluster_centers_
Centroids.
- Type:
ndarray of shape (k_optimal, n_features)
References
k*-Means (2025): automatic k via MDL sub-cluster splitting.
- fit(X, y=None)
Fit k*-Means with automatic k selection.
- Return type:
- Parameters:
X (ArrayLike)
- class endgame.clustering.DBSCANClusterer(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, n_jobs=-1)
Bases: BaseEstimator, ClusterMixin
DBSCAN density-based clustering with competition defaults.
Finds arbitrary-shaped clusters and labels noise points as -1.
- Parameters:
eps (float, default=0.5) – Neighbourhood radius.
min_samples (int, default=5) – Minimum samples in a neighbourhood for a core point.
metric (str, default='euclidean') – Distance metric.
algorithm (str, default='auto') – Nearest neighbours algorithm: ‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’.
leaf_size (int, default=30) – Leaf size for tree-based algorithms.
n_jobs (int, default=-1) – Parallel jobs.
- core_sample_indices_
Indices of core samples.
- Type:
ndarray
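How eps and min_samples decide which points are core samples can be shown with a brute-force sketch. It assumes the sklearn convention that a point counts as its own neighbour; the helper name core_sample_mask is hypothetical and this is not the wrapper's implementation.

```python
import numpy as np

def core_sample_mask(X, eps, min_samples):
    """Boolean mask of DBSCAN core points (a point counts itself as a neighbour)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Number of points within eps of each point, including the point itself.
    neighbour_counts = (dist <= eps).sum(axis=1)
    return neighbour_counts >= min_samples

# Five tightly packed points plus one isolated point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [0.1, 0.1], [0.05, 0.05], [5.0, 5.0]])
mask = core_sample_mask(X, eps=0.5, min_samples=5)
core_sample_indices = np.flatnonzero(mask)
print(core_sample_indices)  # -> [0 1 2 3 4]
```

The isolated point is neither a core point nor within eps of one, so DBSCAN would label it as noise (-1).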
- class endgame.clustering.HDBSCANClusterer(min_cluster_size=15, min_samples=None, metric='euclidean', cluster_selection_method='eom', cluster_selection_epsilon=0.0, alpha=1.0, allow_single_cluster=False, n_jobs=-1)
Bases: BaseEstimator, ClusterMixin
HDBSCAN hierarchical density-based clustering.
Conceptually runs DBSCAN across all eps values at once via a minimum spanning tree of mutual reachability distances, then extracts the most stable clusters. The main parameter to tune is min_cluster_size. Handles variable-density clusters.
Uses sklearn’s HDBSCAN (>=1.3) with fallback to the hdbscan package.
- Parameters:
min_cluster_size (int, default=15) – Minimum cluster size.
min_samples (int or None, default=None) – Number of samples used to compute core distances. Defaults to min_cluster_size.
metric (str, default='euclidean') – Distance metric.
cluster_selection_method (str, default='eom') – Cluster extraction: ‘eom’ (Excess of Mass) or ‘leaf’.
cluster_selection_epsilon (float, default=0.0) – Distance threshold for merging clusters.
alpha (float, default=1.0) – Mutual reachability smoothing.
allow_single_cluster (bool, default=False) – Whether to allow a single-cluster result.
n_jobs (int, default=-1) – Parallel jobs.
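The mutual reachability distance that HDBSCAN builds its spanning tree on can be sketched as follows. The convention used here (core distance is the distance to the min_samples-th nearest neighbour, not counting the point itself) is an assumption for illustration; libraries differ on whether the point itself is counted. The function name mutual_reachability is hypothetical.

```python
import numpy as np

def mutual_reachability(X, min_samples):
    """Pairwise mutual reachability distances (brute force, O(n^2) memory)."""
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    # Column 0 of each sorted row is the zero distance of a point to itself,
    # so column `min_samples` is the min_samples-th nearest other point.
    core = np.sort(dist, axis=1)[:, min_samples]
    # d_mreach(a, b) = max(core(a), core(b), d(a, b))
    mreach = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)
    return mreach

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))  # synthetic data, for illustration only
mreach = mutual_reachability(X, min_samples=5)
print(mreach.shape)
```

Points in sparse regions have large core distances, so they are pushed away from everything else, which is what makes the subsequent single-linkage step robust to noise.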
- class endgame.clustering.OPTICSClusterer(min_samples=5, max_eps=inf, metric='minkowski', p=2, cluster_method='xi', xi=0.05, min_cluster_size=None, n_jobs=-1)
Bases: BaseEstimator, ClusterMixin
OPTICS ordering-based clustering.
Produces a reachability plot and extracts clusters, generalizing DBSCAN to handle varying density.
- Parameters:
min_samples (int or float, default=5) – Core distance parameter.
max_eps (float, default=inf) – Maximum neighbourhood radius.
metric (str, default='minkowski') – Distance metric.
p (float, default=2) – Minkowski power (2 = Euclidean).
cluster_method (str, default='xi') – Extraction method: ‘xi’ or ‘dbscan’.
xi (float, default=0.05) – Steepness threshold for xi extraction.
min_cluster_size (int or float or None, default=None) – Minimum cluster size for extraction.
n_jobs (int, default=-1) – Parallel jobs.
- class endgame.clustering.DensityPeaksClusterer(n_clusters=None, percent=2.0, gamma_threshold=None, metric='euclidean', random_state=None)
Bases: BaseEstimator, ClusterMixin
Density Peaks Clustering (DPC).
Cluster centres are points with both a high local density (rho) and a large distance to the nearest point of higher density (delta). Each remaining point is assigned to the cluster of its nearest denser neighbour, following the chain until a centre is reached.
- Parameters:
n_clusters (int or None, default=None) – Number of clusters. If None, auto-select from the decision graph using gamma_threshold.
percent (float, default=2.0) – Percentile of the pairwise distances used as the cutoff distance (d_c) for density estimation. E.g. 2.0 means d_c is the distance at the 2nd percentile of all pairwise distances.
gamma_threshold (float or None, default=None) – If n_clusters is None, points with rho * delta above this threshold are chosen as centres. If None, uses Otsu-like thresholding on gamma values.
metric (str, default='euclidean') – Distance metric.
random_state (int or None, default=None) – Random seed (for tie-breaking).
- centers_
Indices of cluster centres.
- Type:
ndarray of shape (n_centers,)
References
Rodriguez & Laio, “Clustering by fast search and find of density peaks”, Science, 2014.
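The rho, delta and gamma quantities described for DensityPeaksClusterer can be computed with a short brute-force sketch. It assumes a Gaussian kernel for rho (which avoids ties) and Euclidean distance; the function name density_peaks_scores is hypothetical and the library's own computation may differ.

```python
import numpy as np

def density_peaks_scores(X, percent=2.0):
    """Return (rho, delta) for every point, using a Gaussian density kernel."""
    n = len(X)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    # d_c: the `percent`-th percentile of all pairwise distances.
    d_c = np.percentile(dist[np.triu_indices(n, k=1)], percent)
    # rho: kernel-weighted neighbour count (subtract 1 for the point itself).
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0
    delta = np.empty(n)
    for i in range(n):
        denser = np.flatnonzero(rho > rho[i])
        # The densest point has no denser neighbour; by convention it gets
        # its largest distance to any other point.
        delta[i] = dist[i, denser].min() if denser.size else dist[i].max()
    return rho, delta

rng = np.random.default_rng(0)
# Two synthetic blobs, for illustration only.
X = np.vstack([rng.normal(0.0, 0.3, size=(40, 2)),
               rng.normal(5.0, 0.3, size=(40, 2))])
rho, delta = density_peaks_scores(X, percent=2.0)
gamma = rho * delta
centers = np.argsort(gamma)[-2:]  # indices of the two highest-gamma points
print(centers)
```

Plotting delta against rho gives the decision graph: candidate centres are the points that stand out in the upper right corner.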
- class endgame.clustering.AgglomerativeClusterer(n_clusters=2, linkage='ward', metric='euclidean', distance_threshold=None, connectivity=None, compute_full_tree='auto', compute_distances=False)
Bases: BaseEstimator, ClusterMixin
Agglomerative hierarchical clustering with multiple linkage options.
Ward’s linkage (default) is usually the strongest general-purpose option. Average linkage is robust. Single linkage is fast but chaining-sensitive. Complete linkage produces compact clusters.
- Parameters:
n_clusters (int or None, default=2) – Number of clusters. If None, distance_threshold must be provided.
linkage (str, default='ward') – Linkage criterion: ‘ward’, ‘average’, ‘complete’, ‘single’.
metric (str, default='euclidean') – Distance metric (only used with non-Ward linkage; Ward requires Euclidean).
distance_threshold (float or None, default=None) – Distance threshold for stopping. If set, n_clusters must be None.
connectivity (array-like or callable or None, default=None) – Connectivity constraints.
compute_full_tree (bool or 'auto', default='auto') – Whether to compute the full dendrogram.
compute_distances (bool, default=False) – Whether to compute distances between clusters.
- children_
Merge history.
- Type:
ndarray of shape (n_nodes-1, 2)
- distances_
Distances between merged clusters (if compute_distances=True).
- Type:
ndarray or None
- class endgame.clustering.GaussianMixtureClusterer(n_components=8, covariance_type='full', n_init=5, max_iter=200, tol=0.001, reg_covar=1e-06, init_params='k-means++', random_state=None)
Bases: BaseEstimator, ClusterMixin
Gaussian Mixture Model clustering.
Fits k Gaussians via EM. It is the probabilistic analog of K-Means: it gives soft assignments and handles elliptical clusters. Supports BIC/AIC for model selection.
- Parameters:
n_components (int, default=8) – Number of mixture components.
covariance_type (str, default='full') – Covariance type: ‘full’, ‘tied’, ‘diag’, ‘spherical’.
n_init (int, default=5) – Number of EM initializations.
max_iter (int, default=200) – Maximum EM iterations.
tol (float, default=1e-3) – Convergence tolerance.
reg_covar (float, default=1e-6) – Covariance regularization.
init_params (str, default='k-means++') – Initialization: ‘kmeans’, ‘k-means++’, ‘random’, ‘random_from_data’.
random_state (int or None, default=None) – Random seed.
- probabilities_
Soft assignment probabilities (responsibilities).
- Type:
ndarray of shape (n_samples, n_components)
- means_
Component means.
- Type:
ndarray of shape (n_components, n_features)
- covariances_
Component covariances.
- Type:
ndarray
- weights_
Mixing weights.
- Type:
ndarray of shape (n_components,)
- predict_proba(X)
Predict soft cluster probabilities.
- Return type:
- Parameters:
X (ArrayLike)
- fit_predict(X, y=None)
Fit and return hard cluster labels.
- Return type:
- Parameters:
X (ArrayLike)
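The soft assignments (responsibilities) of a Gaussian mixture come from the E-step of EM. The sketch below computes them for a one-dimensional mixture with known parameters; it is a simplified illustration, and the function name responsibilities and the toy parameters are assumptions, not part of the library.

```python
import numpy as np

def responsibilities(x, weights, means, variances):
    """E-step for a 1-D Gaussian mixture: P(component k | x_i)."""
    x = np.asarray(x, dtype=float)[:, None]          # shape (n_samples, 1)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    # Log density of each sample under each component.
    log_pdf = -0.5 * (np.log(2 * np.pi * variances) + (x - means) ** 2 / variances)
    log_joint = np.log(weights) + log_pdf
    # Subtract the row maximum before exponentiating for numerical stability.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    resp = np.exp(log_joint)
    return resp / resp.sum(axis=1, keepdims=True)

# Two equally weighted unit-variance components centred at 0 and 4.
resp = responsibilities([0.0, 2.0, 4.0], weights=[0.5, 0.5],
                        means=[0.0, 4.0], variances=[1.0, 1.0])
print(resp.round(3))
```

The sample at 2.0 sits exactly halfway between the two means, so it is split evenly between the components, while the samples at 0.0 and 4.0 belong almost entirely to their nearest component. Taking the argmax of each row gives the hard labels.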
- class endgame.clustering.FuzzyCMeansClusterer(n_clusters=8, m=2.0, max_iter=300, tol=0.0001, random_state=None)
Bases: BaseEstimator, ClusterMixin
Fuzzy C-Means clustering.
Soft version of K-Means where each point has a degree of membership in each cluster. Useful when clusters genuinely overlap.
- Parameters:
n_clusters (int, default=8) – Number of clusters.
m (float, default=2.0) – Fuzziness coefficient (m > 1). Higher values give softer assignments: as m approaches 1 the result approaches hard K-Means, and for m >> 1 memberships approach uniform.
max_iter (int, default=300) – Maximum iterations.
tol (float, default=1e-4) – Convergence tolerance on membership matrix change.
random_state (int or None, default=None) – Random seed.
- cluster_centers_
Cluster centroids.
- Type:
ndarray of shape (n_clusters, n_features)
- predict(X)
Predict hard cluster labels for new data.
- Return type:
- Parameters:
X (ArrayLike)
- predict_memberships(X)
Predict fuzzy membership for new data.
- Parameters:
X (array-like of shape (n_samples, n_features))
- Return type:
- Returns:
ndarray of shape (n_samples, n_clusters) – Membership matrix.
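The effect of the fuzziness coefficient m is easiest to see from the standard Fuzzy C-Means membership formula, u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1)). The sketch below evaluates it for fixed centres; it is an illustration of the formula, not the code behind predict_memberships, and the function name fuzzy_memberships is hypothetical.

```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0, eps=1e-12):
    """Membership matrix of shape (n_samples, n_clusters); rows sum to 1."""
    # Distance from every sample to every centre.
    dist = np.sqrt(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    dist = np.maximum(dist, eps)  # avoid division by zero when a sample sits on a centre
    inv = dist ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

X = np.array([[0.5, 0.0], [1.0, 0.0]])
centers = np.array([[0.0, 0.0], [2.0, 0.0]])

U = fuzzy_memberships(X, centers, m=2.0)
print(U)  # first row is [0.9, 0.1], second row is [0.5, 0.5]

# Membership of the first sample in its nearest cluster for sharper and softer m.
u_sharp = fuzzy_memberships(X, centers, m=1.1)[0, 0]
u_soft = fuzzy_memberships(X, centers, m=10.0)[0, 0]
```

With m close to 1 the nearest-cluster membership approaches 1 (hard assignment); with large m it drifts toward 1 / n_clusters.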
- class endgame.clustering.SpectralClusterer(n_clusters=8, affinity='rbf', gamma=None, n_neighbors=10, n_init=10, assign_labels='kmeans', random_state=None, n_jobs=-1)
Bases: BaseEstimator, ClusterMixin
Spectral clustering via graph Laplacian eigenvectors.
Constructs a similarity graph, computes eigenvectors of the graph Laplacian, then runs k-means in the spectral embedding. Excels at non-convex clusters (concentric circles, spirals).
- Parameters:
n_clusters (int, default=8) – Number of clusters.
affinity (str, default='rbf') – Similarity measure: ‘rbf’, ‘nearest_neighbors’, ‘precomputed’.
gamma (float or None, default=None) – RBF kernel bandwidth. If None, uses 1/n_features.
n_neighbors (int, default=10) – Number of neighbours for ‘nearest_neighbors’ affinity.
n_init (int, default=10) – k-means initializations in spectral space.
assign_labels (str, default='kmeans') – Label assignment: ‘kmeans’ or ‘discretize’.
random_state (int or None, default=None) – Random seed.
n_jobs (int, default=-1) – Parallel jobs.
- class endgame.clustering.AffinityPropagationClusterer(damping=0.5, max_iter=200, convergence_iter=15, preference=None, affinity='euclidean', random_state=None)
Bases: BaseEstimator, ClusterMixin
Affinity Propagation clustering via message passing.
Simultaneously chooses exemplars and assigns points via responsibility and availability messages. No k required.
- Parameters:
damping (float, default=0.5) – Damping factor (0.5 to 1). Higher = more stable but slower.
max_iter (int, default=200) – Maximum message-passing iterations.
convergence_iter (int, default=15) – Iterations without change for convergence.
preference (float or array-like or None, default=None) – Preference for each point to be an exemplar. Larger values yield more clusters. If None, the median of the input similarities is used.
affinity (str, default='euclidean') – Affinity type: ‘euclidean’ or ‘precomputed’.
random_state (int or None, default=None) – Random seed.
- cluster_centers_indices_
Indices of exemplar points.
- Type:
ndarray
- cluster_centers_
Exemplar coordinates.
- Type:
ndarray of shape (n_clusters, n_features)
- class endgame.clustering.BIRCHClusterer(n_clusters=3, threshold=0.5, branching_factor=50, compute_labels=True)
Bases: BaseEstimator, ClusterMixin
BIRCH incremental hierarchical clustering.
Builds a CF-tree (Clustering Feature tree) for incremental clustering. Designed for very large datasets or streaming scenarios.
- Parameters:
n_clusters (int or None, default=3) – Final number of clusters. If None, the subclusters from the CF-tree leaf nodes are returned directly.
threshold (float, default=0.5) – CF-tree leaf radius threshold.
branching_factor (int, default=50) – Maximum CF entries per node.
compute_labels (bool, default=True) – Whether to compute labels for training data.
- subcluster_centers_
CF-tree subcluster centres.
- Type:
ndarray
- fit_predict(X, y=None)
Fit and return cluster labels.
- Return type:
- Parameters:
X (ArrayLike)
- class endgame.clustering.MeanShiftClusterer(bandwidth=None, bin_seeding=False, min_bin_freq=1, cluster_all=True, n_jobs=-1)
Bases: BaseEstimator, ClusterMixin
Mean Shift mode-finding clustering.
Non-parametric mode finding via kernel density gradient ascent. Automatically determines k by finding density modes.
- Parameters:
bandwidth (float or None, default=None) – Kernel bandwidth. If None, estimated automatically.
bin_seeding (bool, default=False) – Speed up by discretising seed points.
min_bin_freq (int, default=1) – Minimum bin frequency for seeding.
cluster_all (bool, default=True) – If False, orphan points get label -1.
n_jobs (int, default=-1) – Parallel jobs.
- cluster_centers_
Mode locations.
- Type:
ndarray of shape (n_clusters, n_features)
- class endgame.clustering.AutoCluster(n_clusters='auto', detect_noise=False, prefer=None, random_state=None, verbose=False, **kwargs)
Bases: BaseEstimator, ClusterMixin
Automatic clustering with method selection based on data properties.
Selects the best clustering algorithm based on:
- Dataset size (n)
- Dimensionality (d)
- Whether k is specified
- Whether noise detection is needed
- Parameters:
n_clusters (int or 'auto', default='auto') – Number of clusters. ‘auto’ uses algorithms that determine k automatically (HDBSCAN, k*-Means, or GMM with BIC).
detect_noise (bool, default=False) – Whether to detect noise/outlier points (label -1). If True, prefers density-based methods (HDBSCAN, DBSCAN).
prefer (str or None, default=None) – Override automatic selection: ‘centroid’, ‘density’, ‘hierarchical’, ‘distribution’, ‘spectral’. If None, auto-selects.
random_state (int or None, default=None) – Random seed.
verbose (bool, default=False) – Enable verbose output.
**kwargs – Additional parameters passed to the selected clusterer.
- clusterer_
The fitted clusterer instance.
- Type:
BaseEstimator
Examples
>>> from endgame.clustering import AutoCluster
>>> ac = AutoCluster(n_clusters='auto', detect_noise=True)
>>> labels = ac.fit_predict(X)
>>> print(f"Selected: {ac.selected_method_}, k={ac.n_clusters_}")
- fit(X, y=None)
Fit the auto-selected clusterer.
- Parameters:
X (array-like of shape (n_samples, n_features))
y (ignored)
- Return type:
- Returns:
self
- predict(X)
Predict cluster labels for new data (if supported).
- Parameters:
X (array-like of shape (n_samples, n_features))
- Return type:
- Returns:
ndarray of shape (n_samples,)
- class endgame.clustering.GenieClusterer(n_clusters=2, gini_threshold=0.3, affinity='euclidean', exact=True, compute_full_tree=True, M=1)
Bases: BaseEstimator, ClusterMixin
Genie clustering: MST-based with Gini index threshold.
Builds a minimum spanning tree and merges clusters using single linkage, but applies a Gini index threshold on cluster sizes to prevent pathological chaining. Reported to outperform Ward and average linkage on many standard benchmarks.
Requires the genieclust package.
- Parameters:
n_clusters (int, default=2) – Number of clusters.
gini_threshold (float, default=0.3) – Gini index threshold for cluster size inequality. Lower values enforce more balanced clusters; a value of 1 disables the correction and reduces to single linkage.
affinity (str, default='euclidean') – Distance metric.
exact (bool, default=True) – Use exact (True) or approximate (False) algorithm.
compute_full_tree (bool, default=True) – Whether to compute the full hierarchy.
M (int, default=1) – Smoothing factor for the mutual reachability distance. M=1 is standard MST; larger M approaches HDBSCAN*-like behavior.
References
Gagolewski, M. (2016). “Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm.” Information Sciences. Gagolewski, M. (2025). Journal of Classification.
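The quantity that gini_threshold is compared against is the normalised Gini index of the current cluster sizes. The sketch below computes it with the pairwise-difference formula; it illustrates the measure only and is not code from genieclust, and the function name gini_index is hypothetical.

```python
import numpy as np

def gini_index(sizes):
    """Normalised Gini index of cluster sizes (0 = perfectly balanced)."""
    x = np.asarray(sizes, dtype=float)
    n = x.size
    if n < 2:
        return 0.0
    # Sum of |x_i - x_j| over unordered pairs (the full matrix counts each pair twice).
    pairwise = np.abs(x[:, None] - x[None, :]).sum() / 2.0
    return pairwise / ((n - 1) * x.sum())

balanced = gini_index([25, 25, 25, 25])
chained = gini_index([97, 1, 1, 1])
print(balanced, chained)  # -> 0.0 0.96
```

In the Genie scheme, when the index exceeds gini_threshold (as 0.96 does for the default 0.3), the next merge is restricted to involve one of the smallest clusters, which is what counteracts single-linkage chaining.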
- class endgame.clustering.FINCHClusterer(req_clust=None, distance='euclidean', verbose=False)
Bases: BaseEstimator, ClusterMixin
FINCH: First Integer Neighbour Clustering Hierarchy.
Zero-parameter clustering that uses first-neighbour relations to recursively merge clusters in O(n log n) with O(n) memory. Produces a hierarchy of partitions in 4-10 steps.
Requires the finch-clust package.
- Parameters:
req_clust (int or None, default=None) – Requested number of clusters. If None, returns the partition at the first level of the hierarchy (i.e. the finest partition FINCH produces).
distance (str, default='euclidean') – Distance metric: ‘euclidean’ or ‘cosine’.
verbose (bool, default=False) – Print hierarchy information.
References
Sarfraz et al., “Efficient Parameter-Free Clustering Using First Neighbor Relations”, CVPR 2019.