A Python library for customer segmentation in marketing data pipelines — RFM analysis, clustering, segment profiling, and streaming updates in one import.
ClusterAudienceKit is a Python library that replaces the scikit-learn + pandas + lifetimes stack for customer segmentation. If you've built this before, you've probably written hundreds of lines of boilerplate glue and still ended up with a pipeline that can't handle 100k customers in any reasonable time. ClusterAudienceKit does it in a single import, backed by a Rust engine.
Pick one:
pip install clusteraudiencekit
OR
uv add clusteraudiencekit
OR
curl -sSfL https://raw.githubusercontent.com/Mullassery/ClusterAudienceKit/main/install.sh | sh
Pre-built wheels for all platforms: INSTALL.md
from clusteraudiencekit import AudienceSegmenter
import pandas as pd
# Required columns: customer_id, transaction_date, amount
transactions = pd.read_csv('transactions.csv')
segmenter = AudienceSegmenter(method='rfm_kmeans', n_clusters=4)
segmenter.fit(transactions)
segments = segmenter.predict(transactions)
profiles = segmenter.segment_profiles()
print(profiles)
# segment | size | avg_recency | avg_frequency | avg_monetary
# 0 | 250k | 15.3 days | 8.2 purchases | $450 <- high-value loyalists
# 1 | 180k | 45.2 days | 3.1 purchases | $120 <- regular buyers
# 2 | 320k | 2.1 days | 2.0 purchases | $80 <- new / recent
# 3 | 250k | 60.5 days | 1.0 purchases | $30 <- at-risk / dormant
print(f"Silhouette score: {segmenter.silhouette_score():.3f}")
You can do all of this with scikit-learn and pandas directly, until your audience grows. sklearn.metrics.silhouette_score is O(n²) in the number of customers: at 100k customers it takes over 2.7 hours, and at 1M customers it did not complete. ClusterAudienceKit handles both in under half a second.
Measured timings on Apple M1 (sklearn 1.6.1, pandas 3.0.3):
| Customer base | sklearn + pandas | ClusterAudienceKit |
|---|---|---|
| 1,000 | 38ms | <9ms |
| 10,000 | 606ms | <37ms |
| 100,000 | >2.7 hours | <130ms |
| 1,000,000 | Did not complete | <470ms |
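The quadratic growth behind the sklearn column can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: it counts the pairwise distances an exact silhouette score needs and the size of a dense float64 distance matrix if it were held in memory in one piece. `silhouette_cost` is a hypothetical helper written for this example, not part of scikit-learn or ClusterAudienceKit, and it says nothing about how either library is implemented.

```python
def silhouette_cost(n_customers: int, bytes_per_distance: int = 8) -> dict:
    """Estimate the work behind an exact silhouette score for n points."""
    # Every point is compared with every other point once.
    pairs = n_customers * (n_customers - 1) // 2
    # Size of a dense n x n float64 matrix, if materialised in one piece.
    dense_matrix_gb = n_customers ** 2 * bytes_per_distance / 1e9
    return {"pairs": pairs, "dense_matrix_gb": dense_matrix_gb}


for n in (1_000, 10_000, 100_000, 1_000_000):
    cost = silhouette_cost(n)
    print(f"{n:>9,} customers: {cost['pairs']:.2e} pairs, "
          f"{cost['dense_matrix_gb']:,.2f} GB dense")
```

At 100,000 customers this comes to roughly 5 billion pairs, and every tenfold increase in customers multiplies the work by about a hundred. If you stay on scikit-learn, `silhouette_score` also accepts a `sample_size` argument that scores a random subsample instead of the full set.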
Beyond performance, you also get RFM scoring, segment profiling, drift detection, and streaming updates — none of which scikit-learn or pandas provide out of the box.
| Capability | scikit-learn | pandas | lifetimes | ClusterAudienceKit |
|---|---|---|---|---|
| RFM calculation | — | manual | — | ✓ |
| KMeans clustering | ✓ | — | — | ✓ |
| K-Prototypes (mixed data) | — | — | — | ✓ |
| Marketing segment profiles | — | manual | — | ✓ |
| Silhouette + quality metrics | ✓ | — | — | ✓ |
| Streaming / incremental updates | — | — | — | ✓ |
| Segment drift detection | — | — | — | ✓ |
| Save / load model state | — | — | ✓ | ✓ |
| Multi-core by default | partial | — | — | ✓ |
| Customer lifetime value (CLV) | — | — | ✓ | planned |
Full comparison with code examples: docs/comparison.md · Full benchmark methodology: BENCHMARKS.md
Scores each customer on Recency, Frequency, and Monetary value, then groups them with KMeans. The standard approach for most Martech teams.
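To make the three dimensions concrete, here is a minimal standard-library sketch of how RFM values can be derived from rows with the required `customer_id`, `transaction_date`, and `amount` fields. It is an illustration under assumptions, not the library's implementation: `rfm_table` and `as_of` are hypothetical names, and using total spend (rather than average spend) for the monetary value is a choice made for this example.

```python
from collections import defaultdict
from datetime import date


def rfm_table(transactions, as_of):
    """Aggregate (customer_id, transaction_date, amount) rows into RFM values."""
    last_seen = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer_id, transaction_date, amount in transactions:
        # Recency is driven by the most recent purchase date.
        if customer_id not in last_seen or transaction_date > last_seen[customer_id]:
            last_seen[customer_id] = transaction_date
        frequency[customer_id] += 1      # number of purchases
        monetary[customer_id] += amount  # total spend (assumption: sum, not mean)
    return {
        cid: {
            "recency_days": (as_of - last_seen[cid]).days,
            "frequency": frequency[cid],
            "monetary": monetary[cid],
        }
        for cid in last_seen
    }


rows = [
    ("c1", date(2024, 3, 1), 120.0),
    ("c1", date(2024, 3, 20), 80.0),
    ("c2", date(2024, 1, 5), 30.0),
]
rfm = rfm_table(rows, as_of=date(2024, 3, 31))
print(rfm["c1"])  # {'recency_days': 11, 'frequency': 2, 'monetary': 200.0}
```

Each customer ends up as one three-number row, which is the kind of input a clustering step such as KMeans works on. With the library, the equivalent call is: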
segmenter = AudienceSegmenter(method='rfm_kmeans', n_clusters=4)
segmenter.fit(df)
Extends RFM with categorical attributes (acquisition channel, product category, region) so your segments reflect more than just spend behaviour.
segmenter = AudienceSegmenter(method='rfm_kprototypes', n_clusters=5)
segmenter.fit(df, categorical_columns=['channel', 'region', 'product_category'])
Update segments incrementally as daily events arrive, without reprocessing your full customer history. Detect and react to campaign-driven drift:
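segmenter.fit(historical_data)
# Baseline assignments, so the first stability check has something to compare against
previous_segments = segmenter.predict(customers)
for daily_events in event_stream:
    segmenter.update(daily_events)
    stability = segmenter.segment_stability(previous_segments)
    if stability < 0.85:
        # Segments drifted too far: refit on the full history
        segmenter.fit(all_data, refit=True)
    previous_segments = segmenter.predict(customers)
Use ClusterAudienceKit with data stored in Apache Spark: read and filter with Spark, then run segmentation in memory on the driver.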
from pyspark.sql import SparkSession
import polars as pl
from clusteraudiencekit import AudienceSegmenter
spark = SparkSession.builder.appName("audience-segmentation").getOrCreate()
# Load customer transaction data from Spark
spark_df = spark.read.parquet("s3://bucket/transactions/")
# Collect to the driver (the data must fit in memory) and rename the columns
# to the required customer_id, transaction_date, amount schema
pandas_df = spark_df.selectExpr(
    "customer_id",
    "purchase_date AS transaction_date",
    "purchase_amount AS amount",
).toPandas()
polars_df = pl.from_pandas(pandas_df)
# Fit segmentation model
segmenter = AudienceSegmenter(method='rfm_kmeans', n_clusters=5)
segmenter.fit(polars_df)
# Get segment assignments
segments = segmenter.predict(polars_df)
# Write segments back to Spark
segments_df = spark.createDataFrame(
    segments.to_pandas(),
    schema=["customer_id", "segment"]
)
segments_df.write.mode("overwrite").parquet("s3://bucket/segments/")
print(f"Segmented {segments_df.count()} customers into {segmenter.n_clusters} segments")
Note: For very large datasets, consider:
- Sampling/filtering in Spark before converting to Polars
- Running segmentation on aggregated RFM scores per customer (reduces memory footprint)
- Caching the Polars DataFrame if running multiple predictions
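One caveat on the sampling suggestion: sampling individual transaction rows would thin out each customer's history and distort frequency and monetary values, so it is usually safer to sample whole customers. The sketch below shows the idea in plain Python with a deterministic hash of `customer_id`; `in_sample` is a hypothetical helper, and in Spark the same effect would come from a hash-based filter on `customer_id` applied before collecting to the driver.

```python
import hashlib


def in_sample(customer_id: str, fraction: float) -> bool:
    """Deterministically keep roughly `fraction` of all customers."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # Read the first 8 hash bytes as an integer in [0, 2**64) and
    # compare it against the cut-off for the requested fraction.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < fraction * 2**64


# 1,000 customers with exactly 5 transactions each
rows = [{"customer_id": f"c{i % 1000}", "amount": 10.0} for i in range(5000)]

sampled = [row for row in rows if in_sample(row["customer_id"], 0.1)]
kept_customers = {row["customer_id"] for row in sampled}
print(f"kept {len(kept_customers)} of 1000 customers, {len(sampled)} rows")
```

Because the decision depends only on the customer ID, a kept customer retains all of their transactions, and the same customers are selected on every run.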
AudienceSegmenter(
    method='rfm_kmeans',        # 'rfm_kmeans' | 'rfm_kprototypes' | 'kmeans_only'
    n_clusters=4,               # number of segments
    recency_window_days=90,     # lookback window in days
    decay_function='linear',    # 'linear' | 'exponential' | 'inverse'
    decay_half_life_days=30,    # half-life for exponential decay
    frequency_threshold=1,      # minimum transactions to include a customer
    monetary_threshold=0.0,     # minimum spend to include a customer
    random_state=42,
    n_jobs=-1,                  # -1 = all cores
)
| Document | Contents |
|---|---|
| INSTALL.md | pip, uv, and pre-built wheel installation |
| docs/api-reference.md | All 13 methods |
| docs/getting-started-simple.md | Guide for non-technical marketing teams |
| docs/comparison.md | Side-by-side vs sklearn / pandas / lifetimes |
| BENCHMARKS.md | Benchmark methodology and raw results |
| docs/troubleshooting.md | Common errors |
| docs/architecture.md | Design decisions |
| examples/ | Runnable scripts |
Segmentation
- Customer lifetime value (CLV) — BG/NBD and Gamma-Gamma models, matching `lifetimes` parity
- DBSCAN and HDBSCAN — density-based clustering for audiences with irregular shapes
- Hierarchical clustering — dendrogram output for exploratory segment discovery
- Auto-cluster selection — silhouette + elbow method to recommend optimal `n_clusters`
- Geographic segmentation — cluster on lat/lon fields with haversine distance
RFM and features
- Engagement RFM — adapt RFM for non-transactional signals (email opens, app sessions, ad clicks)
- Weighted RFM — configurable weights per dimension rather than equal thirds
- Custom feature columns — include arbitrary numeric columns alongside RFM in clustering
Pipeline integrations
- dbt macro — expose `segment_profiles()` as a dbt model after each run
- Airflow operator — `AudienceSegmenterOperator` for scheduled retraining
- Kafka input — ingest streaming events and update segments without batch jobs
- Export to CRM — direct push of segment assignments to Salesforce, HubSpot, Braze
Ad platform exports
- Google Ads — upload segment lists to Google Customer Match (CRM, engagement, similar audiences)
- Meta Ads — export to Meta Conversions API and Audience Manager (pixel events, custom audiences)
- Amazon Ads — push audiences to Amazon DSP and sponsored ads platform
- TikTok Ads — segment export to TikTok Custom Audience API
- LinkedIn Ads — matched audience export for B2B account-based marketing
Data warehouse & CDP
- Segment.com integration — push segments to 100+ downstream tools via Segment protocol
- mParticle — native export with segment attributes and metadata
- Customer Data Platforms (Treasure Data, Tealium, Lytics) — streaming and batch sync
- Snowflake native app — segment creation and management within Snowflake UI
Marketing automation & email
- Klaviyo — export audiences for email campaigns, SMS, push notifications with segment traits
- Mailchimp — segment sync with dynamic tag assignment
- SendGrid — audience upload for email marketing workflows
- Iterable — streaming segment membership updates for triggered campaigns
Advanced audience features
- Lookalike audience generation — find similar high-value customers outside current segments
- Segment expansion — identify upsell/cross-sell opportunities within each segment
- Churn prediction within segments — risk-score customers by segment for retention campaigns
- Real-time segment API — serve `customer_id → segment` membership via REST/gRPC
Privacy & compliance
- Differential privacy — add noise to segments to protect individual privacy
- K-anonymity enforcement — ensure segments contain at least k customers
- GDPR/CCPA cleanup — auto-remove opted-out customer IDs from segments
- Audit logs — track all segment exports and access for compliance
Output and observability
- Segment naming — auto-label segments ("high-value loyalists", "at-risk") based on profile stats
- Cohort tracking — compare how individual customers move between segments over time
- HTML segment report — shareable one-page visual summary for marketing teams
- Prometheus metrics — expose segment health and drift as scrapeable endpoints
Bug reports and feature requests: GitHub Issues
Questions and discussion: GitHub Discussions
Pull requests: read CONTRIBUTING.md first.
Georgi Mammen Mullassery — github.com/Mullassery