Mullassery/ClusterAudienceKit

ClusterAudienceKit

A Python library for customer segmentation in marketing data pipelines — RFM analysis, clustering, segment profiling, and streaming updates in one import.

ClusterAudienceKit replaces the scikit-learn + pandas + lifetimes stack for customer segmentation. If you have built that pipeline before, you have probably written hundreds of lines of glue code and still ended up with something that takes hours at 100k customers (see the timings under "Why not just use scikit-learn?"). ClusterAudienceKit covers the same workflow in a single import, backed by a Rust engine. Customer lifetime value (CLV), the main thing lifetimes adds, is still on the roadmap.

Install

Pick one:

pip install clusteraudiencekit

OR

uv add clusteraudiencekit

OR

curl -sSfL https://raw.githubusercontent.com/Mullassery/ClusterAudienceKit/main/install.sh | sh

Pre-built wheels for all platforms: INSTALL.md

Get started in 10 lines

from clusteraudiencekit import AudienceSegmenter
import pandas as pd

# Required columns: customer_id, transaction_date, amount
transactions = pd.read_csv('transactions.csv')

segmenter = AudienceSegmenter(method='rfm_kmeans', n_clusters=4)
segmenter.fit(transactions)

segments = segmenter.predict(transactions)
profiles = segmenter.segment_profiles()
print(profiles)
#   segment | size  | avg_recency | avg_frequency | avg_monetary
#   0       | 250k  | 15.3 days   | 8.2 purchases | $450   <- high-value loyalists
#   1       | 180k  | 45.2 days   | 3.1 purchases | $120   <- regular buyers
#   2       | 320k  | 2.1 days    | 2.0 purchases | $80    <- new / recent
#   3       | 250k  | 60.5 days   | 1.0 purchases | $30    <- at-risk / dormant

print(f"Silhouette score: {segmenter.silhouette_score():.3f}")

Why not just use scikit-learn?

You can, until your audience grows. sklearn.metrics.silhouette_score computes pairwise distances, so its cost is O(n²) in the number of customers: in the benchmark below it took over 2.7 hours at 100k customers and did not complete at 1M. ClusterAudienceKit handled both in under half a second.

Measured timings on Apple M1 (sklearn 1.6.1, pandas 3.0.3):

| Customer base | sklearn + pandas | ClusterAudienceKit |
|---|---|---|
| 1,000 | 38 ms | <9 ms |
| 10,000 | 606 ms | <37 ms |
| 100,000 | >2.7 hours | <130 ms |
| 1,000,000 | Did not complete | <470 ms |
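To see where the quadratic cost comes from, here is a minimal brute-force sketch of the mean silhouette coefficient in plain Python. It is an illustration of the textbook definition, not the implementation used by scikit-learn or ClusterAudienceKit; the function name `brute_force_silhouette` is made up for this example.

```python
import math


def brute_force_silhouette(points, labels):
    """Mean silhouette coefficient using every pairwise distance (O(n^2))."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, point in enumerate(points):
        # For each point, visit every other point once: n - 1 distance
        # evaluations per point, so roughly n^2 evaluations overall.
        sums = {c: 0.0 for c in clusters}
        counts = {c: 0 for c in clusters}
        for j, other in enumerate(points):
            if i == j:
                continue
            sums[labels[j]] += math.dist(point, other)
            counts[labels[j]] += 1

        own = labels[i]
        if counts[own] == 0:
            scores.append(0.0)  # singleton cluster: score is defined as 0
            continue

        a = sums[own] / counts[own]  # mean distance within own cluster
        b = min(  # mean distance to the nearest other cluster
            sums[c] / counts[c] for c in clusters if c != own and counts[c] > 0
        )
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))

    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
score = brute_force_silhouette(points, labels)
```

Going from 100k to 1M customers multiplies the number of distance evaluations by roughly 100, which is consistent with the jump from hours to "did not complete" in the table.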

Beyond performance, you also get RFM scoring, segment profiling, drift detection, and streaming updates — none of which scikit-learn or pandas provide out of the box.

Capability scikit-learn pandas lifetimes ClusterAudienceKit
RFM calculation manual
KMeans clustering
K-Prototypes (mixed data)
Marketing segment profiles manual
Silhouette + quality metrics
Streaming / incremental updates
Segment drift detection
Save / load model state
Multi-core by default partial
Customer lifetime value (CLV) planned

Full comparison with code examples: docs/comparison.md · Full benchmark methodology: BENCHMARKS.md

Segmentation methods

RFM + KMeans

Scores each customer on Recency, Frequency, and Monetary value, then groups them with KMeans. A common default for Martech teams.

segmenter = AudienceSegmenter(method='rfm_kmeans', n_clusters=4)
segmenter.fit(df)
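For readers new to RFM, the three values can be sketched in plain Python as follows. This is an illustration only: the helper `rfm_table` is hypothetical, and it assumes recency is days since the last purchase and monetary is total spend, which may differ from how the library defines them.

```python
from collections import defaultdict
from datetime import date


def rfm_table(transactions, as_of):
    """Return {customer_id: (recency_days, frequency, monetary)}.

    transactions: iterable of (customer_id, transaction_date, amount).
    """
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)

    for customer_id, transaction_date, amount in transactions:
        previous = last_purchase.get(customer_id)
        if previous is None or transaction_date > previous:
            last_purchase[customer_id] = transaction_date
        frequency[customer_id] += 1
        monetary[customer_id] += amount

    return {
        cid: ((as_of - last_purchase[cid]).days, frequency[cid], monetary[cid])
        for cid in last_purchase
    }


transactions = [
    ("c1", date(2024, 1, 5), 40.0),
    ("c1", date(2024, 3, 1), 60.0),
    ("c2", date(2024, 2, 10), 25.0),
]
rfm = rfm_table(transactions, as_of=date(2024, 3, 31))
```

Each customer ends up as one three-value row, which is the shape a clustering step such as KMeans consumes.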

RFM + K-Prototypes

Extends RFM with categorical attributes — acquisition channel, product category, region — so your segments reflect more than just spend behaviour.

segmenter = AudienceSegmenter(method='rfm_kprototypes', n_clusters=5)
segmenter.fit(df, categorical_columns=['channel', 'region', 'product_category'])
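The idea behind K-Prototypes can be sketched with its usual dissimilarity measure: squared Euclidean distance on the numeric features plus a weighted count of categorical mismatches. The function name and the `gamma` default below are assumptions for illustration; how ClusterAudienceKit weights the two parts is not documented here.

```python
def mixed_dissimilarity(numeric_a, numeric_b, categorical_a, categorical_b, gamma=1.0):
    """K-Prototypes style distance between two customers.

    gamma balances the categorical part against the numeric part.
    """
    squared_euclidean = sum((x - y) ** 2 for x, y in zip(numeric_a, numeric_b))
    mismatches = sum(1 for x, y in zip(categorical_a, categorical_b) if x != y)
    return squared_euclidean + gamma * mismatches


# Two customers with scaled (recency, frequency) values and (channel, region).
distance = mixed_dissimilarity(
    (1.0, 2.0), (2.0, 4.0),
    ("email", "EU"), ("paid", "EU"),
    gamma=0.5,
)
```

A larger `gamma` makes customers from different channels or regions less likely to share a segment, even when their spend behaviour is similar.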

Streaming updates

Update segments incrementally as daily events arrive, without reprocessing your full customer history. Detect and react to campaign-driven drift:

segmenter.fit(historical_data)
previous_segments = segmenter.predict(customers)  # baseline for the first stability check

for daily_events in event_stream:
    segmenter.update(daily_events)

    stability = segmenter.segment_stability(previous_segments)
    if stability < 0.85:
        segmenter.fit(all_data, refit=True)

    previous_segments = segmenter.predict(customers)
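One plausible way to read a stability score is the fraction of customers whose segment did not change between two snapshots. The sketch below implements that reading; it is an assumption, since the definition behind `segment_stability` is not given here, and `label_stability` is a made-up name.

```python
def label_stability(previous, current):
    """Share of customers present in both snapshots with an unchanged segment.

    previous, current: dicts mapping customer_id -> segment label.
    """
    shared = previous.keys() & current.keys()
    if not shared:
        raise ValueError("snapshots have no customers in common")
    unchanged = sum(1 for cid in shared if previous[cid] == current[cid])
    return unchanged / len(shared)


previous = {"a": 0, "b": 1, "c": 2, "d": 3}
current = {"a": 0, "b": 1, "c": 3, "d": 3, "e": 1}  # "c" moved, "e" is new
stability = label_stability(previous, current)
```

This comparison only makes sense while segment labels keep their meaning. After a full refit, cluster IDs can be permuted, so labels from before and after the refit should not be compared directly.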

PySpark integration

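Use ClusterAudienceKit with Apache Spark DataFrames: select the columns you need in Spark, collect them to the driver, and segment in memory. Segmentation itself runs on a single node, so reduce the data in Spark first (see the note after the example).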

from pyspark.sql import SparkSession
import polars as pl
from clusteraudiencekit import AudienceSegmenter

spark = SparkSession.builder.appName("audience-segmentation").getOrCreate()

# Load customer transaction data from Spark
spark_df = spark.read.parquet("s3://bucket/transactions/")

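# Collect to the driver and convert to Polars (in-memory, so keep the input small).
# Columns are renamed to the names the quick start lists as required.
pandas_df = (
    spark_df.select("customer_id", "purchase_amount", "purchase_date")
    .withColumnRenamed("purchase_date", "transaction_date")
    .withColumnRenamed("purchase_amount", "amount")
    .toPandas()
)
polars_df = pl.from_pandas(pandas_df)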

# Fit segmentation model
segmenter = AudienceSegmenter(method='rfm_kmeans', n_clusters=5)
segmenter.fit(polars_df)

# Get segment assignments
segments = segmenter.predict(polars_df)

# Write segments back to Spark
segments_df = spark.createDataFrame(
    segments.to_pandas(),
    schema=["customer_id", "segment"]
)
segments_df.write.mode("overwrite").parquet("s3://bucket/segments/")

print(f"Segmented {segments_df.count()} customers into {segmenter.n_clusters} segments")

Note: For very large datasets, consider:

  • Sampling/filtering in Spark before converting to Polars
  • Running segmentation on aggregated RFM scores per customer (reduces memory footprint)
  • Caching the Polars DataFrame if running multiple predictions

Configuration

AudienceSegmenter(
    method='rfm_kmeans',        # 'rfm_kmeans' | 'rfm_kprototypes' | 'kmeans_only'
    n_clusters=4,               # number of segments
    recency_window_days=90,     # lookback window in days
    decay_function='linear',    # 'linear' | 'exponential' | 'inverse'
    decay_half_life_days=30,    # half-life for exponential decay
    frequency_threshold=1,      # minimum transactions to include a customer
    monetary_threshold=0.0,     # minimum spend to include a customer
    random_state=42,
    n_jobs=-1,                  # -1 = all cores
)
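The three `decay_function` options suggest that older transactions count for less. The formulas below are a guess at what such weights could look like, not the library's definitions; only the half-life form of exponential decay is standard. `recency_weight` and its parameter names are hypothetical.

```python
def recency_weight(age_days, decay_function="linear", window_days=90, half_life_days=30):
    """Weight in [0, 1] for a transaction that is age_days old."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if decay_function == "linear":
        # Falls from 1 to 0 across the lookback window (assumed form).
        return max(0.0, 1.0 - age_days / window_days)
    if decay_function == "exponential":
        # Halves every half_life_days.
        return 0.5 ** (age_days / half_life_days)
    if decay_function == "inverse":
        # Hyperbolic decay (assumed form).
        return 1.0 / (1.0 + age_days)
    raise ValueError(f"unknown decay_function: {decay_function!r}")


weights = {
    name: recency_weight(30, decay_function=name)
    for name in ("linear", "exponential", "inverse")
}
```

Under these assumed formulas, a 30-day-old purchase keeps about two thirds of its weight with linear decay, half with exponential decay, and very little with inverse decay.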

Documentation

| Document | Description |
|---|---|
| INSTALL.md | pip, uv, and pre-built wheel installation |
| docs/api-reference.md | All 13 methods |
| docs/getting-started-simple.md | Guide for non-technical marketing teams |
| docs/comparison.md | Side-by-side vs sklearn / pandas / lifetimes |
| BENCHMARKS.md | Benchmark methodology and raw results |
| docs/troubleshooting.md | Common errors |
| docs/architecture.md | Design decisions |
| examples/ | Runnable scripts |

Roadmap

Segmentation

  • Customer lifetime value (CLV) — BG/NBD and Gamma-Gamma models, matching lifetimes parity
  • DBSCAN and HDBSCAN — density-based clustering for audiences with irregular shapes
  • Hierarchical clustering — dendrogram output for exploratory segment discovery
  • Auto-cluster selection — silhouette + elbow method to recommend optimal n_clusters
  • Geographic segmentation — cluster on lat/lon fields with haversine distance
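For the geographic item above, the haversine formula gives the great-circle distance between two latitude/longitude points. The sketch below is the standard formula on a spherical Earth with a mean radius of 6371 km; it shows the distance function only and says nothing about how the planned feature will be implemented.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    # Clamp to guard against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


london_to_paris = haversine_km(51.5074, -0.1278, 48.8566, 2.3522)
```

Plain Euclidean distance on raw degrees distorts east-west distances away from the equator, which is why a spherical distance is the usual choice for clustering on coordinates.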

RFM and features

  • Engagement RFM — adapt RFM for non-transactional signals (email opens, app sessions, ad clicks)
  • Weighted RFM — configurable weights per dimension rather than equal thirds
  • Custom feature columns — include arbitrary numeric columns alongside RFM in clustering
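As an illustration of the weighted RFM item, a combined score can be written as a weighted average of the three per-dimension scores. The function and its defaults are hypothetical; equal thirds is the baseline the roadmap item describes.

```python
def weighted_rfm_score(r_score, f_score, m_score, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of recency, frequency, and monetary scores."""
    w_r, w_f, w_m = weights
    total = w_r + w_f + w_m
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Dividing by the total lets callers pass weights that do not sum to 1.
    return (w_r * r_score + w_f * f_score + w_m * m_score) / total


equal = weighted_rfm_score(5, 3, 1)
recency_heavy = weighted_rfm_score(5, 3, 1, weights=(0.5, 0.25, 0.25))
```

Shifting weight toward recency raises the score of a customer who bought recently but spends little, which is the kind of trade-off configurable weights would expose.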

Pipeline integrations

  • dbt macro — expose segment_profiles() as a dbt model after each run
  • Airflow operator — AudienceSegmenterOperator for scheduled retraining
  • Kafka input — ingest streaming events and update segments without batch jobs
  • Export to CRM — direct push of segment assignments to Salesforce, HubSpot, Braze

Ad platform exports

  • Google Ads — upload segment lists to Google Customer Match (CRM, engagement, similar audiences)
  • Meta Ads — export to Meta Conversions API and Audience Manager (pixel events, custom audiences)
  • Amazon Ads — push audiences to Amazon DSP and sponsored ads platform
  • TikTok Ads — segment export to TikTok Custom Audience API
  • LinkedIn Ads — matched audience export for B2B account-based marketing

Data warehouse & CDP

  • Segment.com integration — push segments to 100+ downstream tools via Segment protocol
  • mParticle — native export with segment attributes and metadata
  • Customer Data Platforms (Treasure Data, Tealium, Lytics) — streaming and batch sync
  • Snowflake native app — segment creation and management within Snowflake UI

Marketing automation & email

  • Klaviyo — export audiences for email campaigns, SMS, push notifications with segment traits
  • Mailchimp — segment sync with dynamic tag assignment
  • SendGrid — audience upload for email marketing workflows
  • Iterable — streaming segment membership updates for triggered campaigns

Advanced audience features

  • Lookalike audience generation — find similar high-value customers outside current segments
  • Segment expansion — identify upsell/cross-sell opportunities within each segment
  • Churn prediction within segments — risk-score customers by segment for retention campaigns
  • Real-time segment API — serve customer_id → segment membership via REST/gRPC

Privacy & compliance

  • Differential privacy — add noise to segments to protect individual privacy
  • K-anonymity enforcement — ensure segments contain at least k customers
  • GDPR/CCPA cleanup — auto-remove opted-out customer IDs from segments
  • Audit logs — track all segment exports and access for compliance
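The k-anonymity item above comes down to a size check before export: any segment with fewer than k customers should be merged or withheld. A minimal sketch of that check, with a made-up helper name and data layout:

```python
from collections import Counter


def segments_below_k(assignments, k):
    """Return the sorted segment labels that contain fewer than k customers.

    assignments: dict mapping customer_id -> segment label.
    """
    sizes = Counter(assignments.values())
    return sorted(segment for segment, size in sizes.items() if size < k)


assignments = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 2, "f": 2}
too_small = segments_below_k(assignments, k=3)
```

What to do with the flagged segments (merge into a neighbour, drop from the export) is a policy decision that this check leaves open.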

Output and observability

  • Segment naming — auto-label segments ("high-value loyalists", "at-risk") based on profile stats
  • Cohort tracking — compare how individual customers move between segments over time
  • HTML segment report — shareable one-page visual summary for marketing teams
  • Prometheus metrics — expose segment health and drift as scrapeable endpoints

Contributing

Bug reports and feature requests: GitHub Issues
Questions and discussion: GitHub Discussions
Pull requests: read CONTRIBUTING.md first.

Author

Georgi Mammen Mullassery · github.com/Mullassery

License

MIT
