Skip to content

Latest commit

 

History

History
211 lines (156 loc) · 10.1 KB

File metadata and controls

211 lines (156 loc) · 10.1 KB

GBSK: Skeleton Clustering via
Granular-Ball Computing and Multi-Sampling
for Large-Scale Data

arXiv Demo Datasets BibTeX

Yewang Chen, Junfeng Li, Shuyin Xia, Qinghong Lai, Xinbo Gao, Guoyin Wang, Dongdong Cheng, Yi Liu, Yi Wang

Granular-ball Skeleton Clustering (GBSK) is a scalable clustering algorithm designed for large-scale data. By constructing multi-grained granular-balls from sampled data, it approximates the underlying structure of data as a compact "skeleton," reducing computation while maintaining accuracy. With linear time complexity (O(n)), GBSK handles massive datasets—up to 100M points in 256 dimension. The adaptive variant, AGBSK, simplifies parameter tuning for ease of use.

e.g. SYN2 e.g. ChainLink

GBSK framework

🚀 Getting Started

Prerequisites

GBSK is currently implemented in MATLAB. The code requires MATLAB version R2021a or later.

Installation

  1. Clone the repositoy:

    git clone https://github.com/XFastDataLab/GBSK.git
  2. Add the GBSK/ directory of the repository to your MATLAB path:

    addpath(genpath('path_to_GBSK'));

Demo

Run demo/demo1.m or demo/demo2.m. Run demo/ClusteringQualityEvaluation.py to get evaluation results.

⚙️ Usage

Basic Workflow

  1. Prepare your dataset in .mat or .txt format, ensuring it contains a variable representing an $n \times d$ matrix (n instances, d feature).

  2. Place your dataset in the datasets/ directoy.

  3. Navigate to the algorithms/ directory in MATLAB.

  4. Run the main scrit: main.m for a dataset that does not exceed memory. Otherwise, edit and run main_big.m.

AGBSK: Recommended Configuration

AGBSK works well with default parameters, requiring only the cluster number k as input:

Parameter Description Default Value
k Number of clusters (root key balls representing centers) User-specified
s Number of sample sets 30
alpha Sampling proportion $\frac{1}{\sqrt{n}}$
M Number of balls per sample set $10 \times k$

GBSK: Advanced Configuration

For expert users, GBSK offers four adjustable parameters with empirically validated ranges:

Parameter Recommended Range Notes
s 10-30 Number of sample sets
alpha Dataset-dependent (e.g. 0.01) Ensure sufficient sampling coverage
k Based on prior knowledge Number of target clusters
M k < M <= n Number of balls per sample set

📊 Output

Upon execution, results are stored in the experiment_records/ directory, organized by dataset name and parameter settings. Each run includes:

  • labels.txt: Cluster labels assigned to each data pont.
  • aggRepBallCenters.txt: Centers of representative balls.
  • keyBallCenters.txt: Centers of final key balls.
  • log.txt: Detailed log of the run, including timing and parameter settigs.

📈 Performace

GBSK is designed for efficiency and scalabiity:

  • Time Complexity: $O(n)$, significantly reduced compared to traditional clustering methods on large dataets.
  • Scalability: Capable of handling datasets with millions of instances.
  • Robustness: Effective in the presence of noise and outliers due to multi-sampling and density-based techniques.

Time scaling comparison of algorithms

📁 Datasets

GBSK has been tested on diverse synthetic and real-world datasets:

Dataset Instances Dims Classes Type Size Usage Label Description
Chainlink 1,000 3 2 S 20KB Q Y Two interlocking 3D rings
Twenty 1,000 2 20 S 17KB Q Y 20 evenly distributed clusters
SYN1 2,000 2 5 S 30KB Q Y Varying cluster densities
Segmentation 2,310 18 7 R 246KB Q Y Image segmentation data
SYN3 3,000 2 2 S 4KB Q Y Mixed density clusters
EngyTime 4,096 2 2 S 76KB Q Y Gaussian distributions with overlap
Waveform 5,000 21 3 R 520KB Q Y Physics waveform data
S3 5,000 2 15 S 136KB Q Y 15 Gaussian clusters
SYN2 5,800 2 7 S 7KB Q Y Non-convex shapes
Pendigits 10,992 16 10 R 685KB Q,S Y Handwritten digits
DryBean 13,611 16 7 R 3.6MB Q,S Y Bean classification
CIFAR-10 60,000 3,072 10 R 174MB Q,S Y Image classification
MNIST 70,000 784 10 R 121MB Q,S Y Handwritten digits
MoCap 78,095 36 5 R 43MB Q,S Y Motion capture data
CoverType 581K 54 7 R 71MB Q,S Y Forest cover types
3M2D5 3M 2 5 S 54MB Q,S Y Large Gaussian mixtures
N-BaIoT 7M 115 9 R 7.6GB S N IoT malware traffic
MNIST8M 8.1M 784 10 R 5.9GB Q,S Y Infinite MNIST
AGC100M 100M 256 17 S 95GB Q,S Y Ultra-large-scale benchmark
  • Type: S = Synthetic, R = Real-world
  • Usage: Q = Quality evaluation, S = Speed testing
  • Label: Y = Labeled, N = Unlabeled

📄 Licnse

This project is licensed under the GNU General Public License v3.0.

🤝 Acknowledgents

Developed and maintained by XFastDataLab. For questions or collaborations, please contact MarveenLee.

🙏 Citing GBSK

Cite our work if you use GBSK or the AGC100M dataset!

@article{chen2025GBSK,
  title={GBSK: Skeleton Clustering via Granular-Ball Computing and Multi-Sampling for Large-Scale Data},
  author={Yewang Chen and Junfeng Li and Shuyin Xia and Qinghong Lai and Xinbo Gao and Guoyin Wang and Dongdong Cheng and Yi Liu and Yi Wang},
  year={2025},
  eprint={2509.23742},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

For more information and updates, visit the GBSK GitHub repository.