StatusNow — Place Status Classification

Classifies whether a POI is Open or Closed based on its digital footprint and recency signals from Overture Maps releases.


How It Works

StatusNow ingests consecutive monthly Overture Maps releases and trains a binary classifier to predict whether a place of interest is still open.

1. Closure labels from time series. A place is labeled Closed (High-Quality Closed / HQC) if it appeared in two consecutive releases and then vanished from the next, which makes it a confirmed churner with trajectory history. A place is labeled Open if it appears in the latest release. With 4 monthly releases (Jan → Feb → Mar → Apr 2026) the pipeline builds 3 comparison windows, yielding 193,976 high-quality closed examples.
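The labeling rule can be sketched with plain sets of place IDs. This is a simplified illustration, not the pipeline's code: `label_places` and the toy IDs are hypothetical, and the real pipeline builds one row per place and comparison pair rather than a single label per place.

```python
def label_places(releases):
    """Label place IDs from an ordered list of per-release ID sets.

    closed (HQC): present in releases i and i+1, absent from release i+2.
    open: present in the latest release.
    """
    labels = {}
    # Each HQC check needs three consecutive releases.
    for i in range(len(releases) - 2):
        first, second, third = releases[i], releases[i + 1], releases[i + 2]
        for place_id in (first & second) - third:
            labels[place_id] = "closed"
    # Presence in the latest release wins (assumption for this sketch).
    for place_id in releases[-1]:
        labels[place_id] = "open"
    return labels


# Toy releases: "c" vanishes in Mar, "b" vanishes in Apr.
jan = {"a", "b", "c"}
feb = {"a", "b", "c", "d"}
mar = {"a", "b", "d"}
apr = {"a", "d", "e"}
labels = label_places([jan, feb, mar, apr])
print(sorted(labels.items()))
```

In this toy run "b" and "c" come out as closed and "a", "d", "e" as open.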

2. Feature engineering (65+ features, leak-free). Features are computed from the base snapshot (R_i) so no future data leaks into training:

  • Recency / staleness — days since last source update, staleness buckets, zombie score (sources ÷ avg staleness)
  • Digital presence — website, social, phone counts from the base snapshot
  • Delta signals — what changed between R_i and R_{i+1}: lost socials, gained website, identity changes (name / category / address)
  • Trajectory features (3+ releases) — consecutive_present, releases_seen, pre_closure_loss, social_trend
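A minimal sketch of the recency features named above, assuming each place carries a list of source update dates. The feature names come from this README; the exact formulas (for example the `+ 1` smoothing in `zombie_score` and the use of `log1p`) are assumptions for illustration.

```python
import math
from datetime import date


def recency_features(source_update_dates, release_date):
    # Staleness in days of each source relative to the base release date.
    staleness = [(release_date - d).days for d in source_update_dates]
    avg_staleness = sum(staleness) / len(staleness)
    return {
        "days_latest": min(staleness),
        "days_avg": avg_staleness,
        # Range between the oldest and newest source update.
        "recency_spread": max(staleness) - min(staleness),
        "log_days": math.log1p(min(staleness)),
        # Many sources with stale data give a low score; the +1 avoids
        # division by zero (assumed smoothing, not from the pipeline).
        "zombie_score": len(staleness) / (avg_staleness + 1.0),
    }


features = recency_features(
    [date(2026, 1, 1), date(2025, 12, 2), date(2025, 11, 2)],
    release_date=date(2026, 1, 31),
)
print(features)
```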

3. Ensemble model. Four models are trained on 421K rows with 5-fold out-of-fold (OOF) cross-validation: CatBoost-A/B/C and LightGBM-A. Ensemble weights and the classification threshold are chosen on OOF predictions only; the hold-out (Chicago + Miami) is never touched until final evaluation.
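The selection of blend weight and threshold on OOF predictions can be sketched as a small grid search. This is an illustration under assumptions: `search_blend` is a hypothetical helper, the two probability lists stand in for the OOF outputs of two models, and the grids are arbitrary.

```python
def balanced_accuracy(y_true, y_pred):
    # Mean of per-class recall for binary labels 0/1.
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / 2


def search_blend(y_train, oof_a, oof_b, weights, thresholds):
    # Only training labels and OOF predictions are used here;
    # the hold-out set never enters the search.
    best = (-1.0, None, None)
    for w in weights:
        blended = [w * a + (1 - w) * b for a, b in zip(oof_a, oof_b)]
        for t in thresholds:
            preds = [int(p >= t) for p in blended]
            score = balanced_accuracy(y_train, preds)
            if score > best[0]:
                best = (score, w, t)
    return best


y_train = [0, 0, 1, 1]
oof_a = [0.1, 0.4, 0.6, 0.9]
oof_b = [0.2, 0.7, 0.3, 0.8]
score, weight, threshold = search_blend(
    y_train, oof_a, oof_b, weights=[0.0, 0.5, 1.0], thresholds=[0.3, 0.5, 0.7]
)
print(score, weight, threshold)  # 1.0 1.0 0.5
```

The chosen weight and threshold are then frozen and applied once to the hold-out rows for reporting.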


Latest Model Results (V9 — Chicago + Miami Hold-out)

| Metric | Value |
| --- | --- |
| Balanced Accuracy | 85.90% |
| OOF AUC | 93.40% |
| Ensemble | CatBoost-A × 0.8 + LightGBM-A × 0.2 |
| Threshold | 0.51 |

Evaluated on 63,754 hold-out rows (Chicago + Miami, never seen during training). Trained on 4 monthly releases: Jan → Feb → Mar → Apr 2026.

| Model | Hold-out Balanced Acc |
| --- | --- |
| CatBoost-B | 85.96% |
| CatBoost-C | 85.93% |
| CatBoost-A | 85.92% |
| CB+LGBM ensemble | 85.90% |
| LightGBM-A | 85.83% |

Top Features (CatBoost-A, V9):

| Rank | Feature | Importance | Description |
| --- | --- | --- | --- |
| 1 | recency_spread | 29.4% | Range between oldest and newest source update timestamps |
| 2 | zombie_score | 16.2% | Source count / avg staleness ("database purgatory" signal) |
| 3 | identity_change_score | 15.7% | Sum of name, category, and address changes |
| 4 | recency_pca | 11.7% | PCA of recency metrics (fit on training rows only) |
| 5 | log_days | 7.4% | Log of days since most recent source update |
| 6 | source_has_msft | 3.4% | Microsoft / Bing as a data source |
| 7 | has_phone | 3.2% | Phone number present in base snapshot |
| 8 | is_brand | 2.8% | Place matches a known brand chain |
| 9 | category_primary | 1.6% | Business category (CatBoost native encoding) |
| 10 | has_facebook | 1.3% | Facebook social link present |

V8 → V9 comparison:

| Version | Releases | HQC Closed | Hold-out Rows | Balanced Acc |
| --- | --- | --- | --- | --- |
| V8 | Jan, Feb, Mar (3) | 142,931 | 46,907 | 89.29% |
| V9 | Jan, Feb, Mar, Apr (4) | 193,976 | 63,754 | 85.90% |

V9 adds April as a closure oracle and a third training pair (Mar→Apr). The −3.4 pp drop likely reflects distribution shift: Pair 2 contributes only 51K HQC closed (vs 142K in Pair 1), and April churners may have different characteristics than the Feb/Mar cohort.


V8 Pipeline (HQC Labels + Full Leak Audit)

What's New

High-Quality Closed (HQC) labels + 60/40 rebalancing

  • Closed label now requires a place to be present in 2 consecutive past releases and absent in the next — confirmed churners with trajectory history.
  • All 142,931 HQC places kept (no cap). Open downsampled globally to 60/40.
  • Previous pipeline capped closed at 3,000/pair and discarded 97% of available signal.
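The 60/40 rebalancing keeps every closed row and downsamples open rows. A minimal sketch, assuming rows are held in plain lists; `open_rows_needed` and `rebalance` are hypothetical helpers, not functions from the pipeline.

```python
import random


def open_rows_needed(n_closed, target_open_rate=0.6):
    # Solve n_open / (n_open + n_closed) = target_open_rate for n_open.
    return round(n_closed * target_open_rate / (1.0 - target_open_rate))


def rebalance(open_rows, closed_rows, target_open_rate=0.6, seed=0):
    # Keep every closed row; randomly downsample the open rows.
    n_open = min(open_rows_needed(len(closed_rows), target_open_rate), len(open_rows))
    rng = random.Random(seed)
    return rng.sample(open_rows, n_open), list(closed_rows)


open_rows = [f"open_{i}" for i in range(100)]
closed_rows = [f"closed_{i}" for i in range(40)]
kept_open, kept_closed = rebalance(open_rows, closed_rows)
print(len(kept_open), len(kept_closed))  # 60 40
```

With 193,976 closed rows the same formula gives 290,964 open rows, which is consistent with the V9 dataset sizes reported elsewhere in this README.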

Leak fixes applied on top of V7:

| Issue | Root Cause | Fix |
| --- | --- | --- |
| PCA fitted on full dataset | recency_pca was computed before train/test split; hold-out data influenced PCA direction | PCA now fit on training rows only in step3 (after split); days_latest/days_avg passed as passthrough columns |
| Hold-out used for optimisation | Ensemble weights + threshold searched over 918 combinations against y_test, then reported as accuracy | Weights and threshold now chosen via OOF (cross_val_predict on y_train); hold-out used only for final unbiased reporting |
| Single reference date across pairs | Staleness computed against the newest release date for all rows; pair-0 places appeared ~28 days older than pair-1 with identical update dates | Recency computed per release_date_current group so each pair uses its own prediction-window endpoint |
| Digital presence used post-event values | has_website, num_socials, etc. used COALESCED (R_{i+1}) data for open places but R_i data for churned places, an asymmetric measurement window | All presence features now use base_* columns (R_i) for both classes |
| LightGBM missing category feature | LGBM received numeric features only, missing the 7.7%-importance category_primary | LabelEncoder fitted on training rows; LGBM receives category_encoded |
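The "single reference date" issue can be reproduced with two rows that are equally stale relative to their own release. In this sketch the row layout and the February release date are illustrative assumptions; only the `release_date_current` column name and the March 18 release date appear in this README.

```python
from datetime import date


def staleness_global(rows, newest_release):
    # Buggy variant: every row is measured against the newest release date.
    return [(newest_release - r["last_update"]).days for r in rows]


def staleness_per_pair(rows):
    # Fixed variant: each row is measured against its own pair's release date.
    return [(r["release_date_current"] - r["last_update"]).days for r in rows]


rows = [
    # Pair 0 row (hypothetical February release date).
    {"last_update": date(2026, 1, 10), "release_date_current": date(2026, 2, 18)},
    # Pair 1 row, measured against the March release.
    {"last_update": date(2026, 2, 7), "release_date_current": date(2026, 3, 18)},
]
global_days = staleness_global(rows, newest_release=date(2026, 3, 18))
per_pair_days = staleness_per_pair(rows)
print(global_days)    # [67, 39]
print(per_pair_days)  # [39, 39]
```

With the global reference date the pair-0 row looks 28 days staler even though both rows were updated 39 days before their own release.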

V7 leak fixes (still applied):

| Issue | Fix |
| --- | --- |
| Double-encoded JSON zeroed digital presence, sources, recency | CAST(AS VARCHAR) in step1 SQL |
| releases_seen=2 was a proxy for label=0 | Anchor both future churners AND equal-size stable-open sample in pair-0 |
| COALESCE-induced staleness asymmetry | Staleness from base_sources for all places |
| Overture confidence signal | Removed; 5 confidence-derived features dropped |
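The double-encoding failure mode is easy to reproduce in isolation. The sketch below is a Python illustration of why a doubly encoded column silently yields empty features; it is not the pipeline's SQL fix, and the `sources` payload and helper names are made up for the example.

```python
import json

sources = [{"dataset": "meta", "update_time": "2026-01-05"}]
encoded_once = json.dumps(sources)
# The bug: encoding a value that is already a JSON string.
encoded_twice = json.dumps(encoded_once)


def count_sources(raw):
    parsed = json.loads(raw)
    # A double-encoded value parses to a str, not a list, so it counts as empty.
    return len(parsed) if isinstance(parsed, list) else 0


def decode_fully(raw):
    # Defensive decoder for list/dict columns: keep decoding while the
    # result is still a string.
    value = raw
    while isinstance(value, str):
        value = json.loads(value)
    return value


print(count_sources(encoded_once), count_sources(encoded_twice))  # 1 0
print(decode_fully(encoded_twice) == sources)  # True
```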

V5 leak fixes (still applied):

| Issue | Fix |
| --- | --- |
| confidence NaN-fill → perfect closed signal | Use base_confidence only; drop delta_confidence, confidence_momentum |
| category_churn_risk computed globally from all labels | Removed; category_primary passed as CatBoost native categorical |

Contributor Pipeline

Drop Overture release parquets into overture_releases/ (see overture_releases/README.md for naming convention) then run:

# Build release files from raw per-city parquets (one-time setup)
python scripts/data_processing/build_release_files.py

# Run the full pipeline: data → features → training
# Default holdout: Chicago + Miami; default balance: 60% open / 40% closed
python pipeline/run_pipeline.py

# Trained models → pipeline_output/models/

Key options:

python pipeline/run_pipeline.py \
  --holdout-cities chicago miami \
  --target-open-rate 0.6 \
  --cv-folds 5

With 4 releases (Jan → Feb → Mar → Apr 2026) the pipeline creates 3 comparison pairs and activates trajectory features (pre_closure_loss, social_trend, releases_seen, consecutive_present) that capture pre-closure behaviour — directly addressing the 2-release limitation where all delta features are 0 for churned places by construction.
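The default hold-out amounts to a split by city name. A minimal sketch, assuming rows are dicts that carry the `_city` column; `split_by_city` is a hypothetical helper and not the pipeline's own function.

```python
def split_by_city(rows, holdout_cities=("chicago", "miami")):
    # Rows from hold-out cities never appear in the training set.
    holdout = {c.lower() for c in holdout_cities}
    train = [r for r in rows if r["_city"].lower() not in holdout]
    test = [r for r in rows if r["_city"].lower() in holdout]
    return train, test


rows = [
    {"id": 1, "_city": "Chicago"},
    {"id": 2, "_city": "Seattle"},
    {"id": 3, "_city": "miami"},
    {"id": 4, "_city": "Denver"},
]
train_rows, holdout_rows = split_by_city(rows)
print([r["id"] for r in train_rows], [r["id"] for r in holdout_rows])  # [2, 4] [1, 3]
```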

See pipeline/README.md for the full guide.


V6 Agent Layer

Sits on top of the classifier and researches low-confidence predictions (default threshold: 0.65) via targeted web search + LLM verdict.

| Mode | Script | Use Case |
| --- | --- | --- |
| Sync (interactive) | scripts/agent/main.py | Approval-gated: review the research plan before execution |
| Async (high-throughput) | scripts/agent/async_main.py | 3 parallel research workers + live dashboard |

Requires GROQ_API_KEY and TAVILY_API_KEY. See docs/v6_agent_architecture.md.
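Routing predictions to the agent can be sketched as a confidence filter. The definition of confidence used here (probability of the predicted class) and the names `needs_research` and `p_open` are assumptions; only the 0.65 default comes from this README.

```python
def needs_research(p_open, threshold=0.65):
    # Confidence is taken as the probability of the predicted class (assumption).
    confidence = max(p_open, 1.0 - p_open)
    return confidence < threshold


predictions = {"place_a": 0.55, "place_b": 0.90, "place_c": 0.30, "place_d": 0.40}
to_research = [pid for pid, p in predictions.items() if needs_research(p)]
print(to_research)  # ['place_a', 'place_d']
```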


Repository Structure

StatusNow/
├── overture_releases/           ← Drop Overture parquet releases here
│   └── README.md
│
├── pipeline/                    ← Training pipeline (start here)
│   ├── run_pipeline.py          ← Single command to train a new model
│   ├── step1_build_training_data.py
│   ├── step2_feature_engineering.py
│   ├── step3_train.py
│   └── README.md
│
├── scripts/
│   ├── data_processing/
│   │   ├── build_release_files.py       ← Build overture_releases/ parquets
│   │   ├── fetch_overture_expanded.py   ← Fetch any city from Overture S3
│   │   ├── build_truth_expanded.py      ← Build + merge multi-city truth datasets
│   │   └── merge_cities.py
│   │
│   ├── experiments/
│   │   ├── v5_train_best.py             ← Train best model, export predictions
│   │   ├── v5_full_benchmark.py         ← Full CV + all models + ensemble search
│   │   ├── v6_enrichment_experiment.py
│   │   └── exp_predictive_labels.py     ← R2-oracle experiment (see below)
│   │
│   ├── agent/                           ← V6 AI agent layer
│   │   ├── main.py
│   │   ├── async_main.py
│   │   ├── config.py
│   │   ├── llm/interface.py
│   │   ├── ingest.py
│   │   ├── planner.py
│   │   ├── executor.py
│   │   └── schemas.py
│   │
│   ├── research/                        ← Research history (V3 → V5)
│   │   ├── README.md
│   │   ├── v5_holdout_eval.py
│   │   ├── process_data_v5.py
│   │   └── ...
│   │
│   └── archived/                        ← V1/V2 era scripts
│
└── data/
    └── combined_truth_dataset_expanded.parquet   ← V4 gold standard (123k rows, 12 cities)

Project History & Journey Summary

This section chronicles our progress from the initial baseline to the current pipeline.

Phase 1: V1 Delta Features (Baseline)

  • Goal: Establish a baseline using "Delta Features" (comparing historical baseline vs current data).
  • Method: Calculated net change in websites, socials, and phones.
  • Key Insight: has_gained_social (r=+0.26) was the strongest single predictor. has_any_loss (r=-0.17) was a reliable closure signal.
  • Result: 67.3% Balanced Accuracy. Knowing that something changed was good, but not enough.

Phase 2: V2 Advanced Engineering (Context)

  • Goal: Capture nuance with Interaction Features and PCA.
  • Innovation:
    • Zombie Score: Identified places with many sources but stale data ("Database Purgatory").
    • Category Risk: Modeled that gas stations close less often (10% churn) than boutiques (45% churn).
    • PCA: Reduced redundancy between correlated recency features (98% variance explained).
  • Result: 70.65% Balanced Accuracy. Temporal context ("when did it change?") proved critical.

Phase 3: V3 Label Refinement (Noise Reduction)

  • Goal: Tackle label noise in the manually labeled dataset.
  • Innovation: "Dynamic Label Refinement" using 5-fold cross-validation.
  • Findings: Identified 65 samples (2.2%) where the model was >90% confident the human label was wrong.
  • Result: Removing these likely errors boosted accuracy to 72.09%.
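The flagging step of the label refinement can be sketched as follows, assuming out-of-fold probabilities for the positive class already exist from the 5-fold cross-validation. The helper name and the 0.5 decision cut are illustrative; the 0.9 confidence bar is the one quoted above.

```python
def flag_suspect_labels(labels, oof_prob_positive, min_confidence=0.9):
    # Return indices where the model confidently disagrees with the given label.
    suspects = []
    for i, (y, p) in enumerate(zip(labels, oof_prob_positive)):
        predicted = int(p >= 0.5)
        confidence = p if predicted == 1 else 1.0 - p
        if predicted != y and confidence > min_confidence:
            suspects.append(i)
    return suspects


labels = [1, 0, 1, 0]
oof_prob_positive = [0.95, 0.97, 0.03, 0.60]
suspects = flag_suspect_labels(labels, oof_prob_positive)
print(suspects)  # [1, 2]
```

Because the probabilities are out-of-fold, each sample is scored by a model that never saw its label, which keeps the disagreement check from being self-confirming.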

Phase 4: Overture Truth Dataset (The 93% Breakthrough)

  • Goal: Validate concepts on a larger, cleaner, ground-truth dataset.
  • Replication Method (Script: scripts/data_processing/build_truth_dataset.py):
    1. Fetch Data: Used fetch_overture_data.py to download places from Overture S3 (Jan 2026 vs Feb 2026) for NYC BBox.
    2. Define Closed: A place is considered closed if:
      • It existed in the Previous release but is missing ID in the Current release (churned).
      • OR it exists in Current but explicitly has operating_status = 'closed'.
    3. Define Open: Exists in Current and operating_status != 'closed'.
    4. Balance: Downsampled to 3k Open / 3k Closed to match Season 2 distribution.
  • Result: 92.87% Balanced Accuracy.
  • Major Lesson: The V3 features were highly effective, but the original dataset's noise and size were holding them back.
  • Warning: We discovered a massive performance gap between Brands (97% Accuracy) and Small Businesses (67% Accuracy), suggesting future work should treat them as separate problems.

Phase 5: San Francisco Expansion (Generalization)

  • Goal: Validate if the model works beyond NYC.
  • Method: Replicated the pipeline for San Francisco (SF) and created a combined dataset.
  • Results:
    • SF Accuracy: 91.39% (despite fewer closed samples).
    • Combined Model: 85.21% Balanced Accuracy on 18,619 samples.
  • Key Insight: The initial 95% result was inflated by a data leak (the confidence score). After fixing it, the model stabilized at ~85% and, notably, the Brand Gap disappeared (Brands and Non-Brands now perform equally).

Phase 6: V4 Research — Leakage Audit + 12-City Expansion (Mar 2026)

  • Goal: Improve from 85% → 90% Balanced Accuracy.
  • Leakage Discovery: processed_for_ml_testing.parquet was built with confidence = 0 for 3,000 churned NYC places (NaN-fill bug). This gave the model a near-perfect closed signal — true leak-free baseline was 80.5%. category_churn_risk (computed globally from labels) also contributed minor leakage.
  • Strategy: Scale the dataset dramatically across diverse cities using Overture S3.
  • Data Expansion: Fetched 10 new US cities (Chicago, LA, Houston, Phoenix, Philadelphia, Seattle, Denver, Boston, Miami, Atlanta) → 123,082 samples from 12 cities.
  • V4 Features: Extended to 95 features — added identity-change signals (name_changed, website_domain_changed, identity_change_score), richer per-channel gain/loss flags, and interaction terms.
  • Results (leaky CV): CatBoost + LightGBM ensemble: 89.18%
  • Key Insight: More data >> better models. HPO added only ~0.1 pp; going from 12k → 123k added ~8.7 pp.

Phase 7: V5 Research — Full Leakage Fix + Geographic Hold-Out (Mar 2026)

  • Goal: Produce an honest, production-grade evaluation with all leakages fixed.
  • Leakage Audit:
    1. confidence NaN-fill: churned places (93.7% of closed) had confidence=null → filled with 0 → near-perfect closed signal. Fix: use base_confidence (Jan 2026 value) only. Drop delta_confidence and confidence_momentum.
    2. category_churn_risk computed globally from all 123k labels before CV → 0.50 correlation with target. Fix: removed; replaced with category_primary as CatBoost native categorical feature (fold-safe internal target encoding).
    3. Evaluation: all CV was on the same 12 cities. Fix: geographic hold-out — Chicago + Miami held out completely.
  • Data Architecture Insight: In the 2-release dataset (Jan 2026 = base, Feb 2026 = current), churned places (closed by disappearing) have current = COALESCE(null, prev) = prev, so all delta features are 0 by construction for 93.7% of closed places. This is a structural limitation of 2-release data. A 3rd release would provide legitimate pre-closure deltas.
  • Operating Status Note: operating_status = 'closed' appears in only 1–2 places per city in current Overture data. Closures are expressed as churning (disappearance between releases), not explicit status flags. Using operating_status alone as the closed label is not viable with current Overture data.
  • Results: CB+LGBM ensemble on Chicago + Miami hold-out: 89.41% (w_CB=0.7, thresh=0.52).
  • Scripts: scripts/research/process_data_v5.py, scripts/research/v5_holdout_eval.py.
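The structural limitation described in the Data Architecture Insight can be shown in a few lines. This is a toy illustration with hypothetical helper names: when a churned place has no current value, COALESCE falls back to the base value and every delta collapses to zero.

```python
def coalesce(current, previous):
    # Same semantics as SQL COALESCE for two arguments.
    return current if current is not None else previous


def delta(base_value, current_value):
    # Delta feature: current minus base, with current backfilled from base.
    return coalesce(current_value, base_value) - base_value


# Open place: lost two social links between releases.
print(delta(base_value=3, current_value=1))     # -2
# Churned place: absent from the current release, so current is None.
print(delta(base_value=3, current_value=None))  # 0
```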

Phase 8: V7 — 3rd Release, Trajectory Features, Full Leak Audit (Mar 2026)

  • Goal: Break the 2-release structural ceiling (all delta features = 0 for churned places) and fix remaining data leaks.
  • 3rd Release: Added Overture 2026-03-18.0 for all 12 cities via scripts/data_processing/build_release_files.py. With 3 releases → 2 consecutive comparison pairs → trajectory features activated.
  • Leak Fixes:
    1. Double-encoded JSON (to_json() on VARCHAR columns): all digital presence, sources, and recency features were silently zeroed out. Fix: CAST(AS VARCHAR) in step1 SQL.
    2. Constructed releases_seen leak: only future churners were force-included in pair 0's open set, making releases_seen=2 a near-perfect proxy for label=0. Fix: also anchor a matching sample of future non-churners so releases_seen=2 occurs for both classes.
    3. COALESCE-induced staleness leak: log_days was computed from the COALESCED sources column. Closed places (sources from prior release) appeared more stale than open places (sources from current release) by construction. Fix: compute staleness from base_sources for all places.
    4. Overture confidence removed: 5 confidence-derived features dropped (external quality signal with unclear provenance).
  • City column propagated: _city from release parquets flows through step1 → step2 → step3, enabling city-name holdout (default: Chicago + Miami).
  • Results: CatBoost-C: CV 97.15%, hold-out 97.00%. Top features: recency_spread (19.6%), zombie_score (16.7%), recency_pca (11.6%), log_days (9.3%).
  • Key Insight: The 89.41% V5 result was partially suppressed by silently zeroed features (the JSON double-encoding bug was present from the start). The true signal in Overture recency metadata is much stronger than previously measured.

Phase 10: R2-Oracle Experiment — Genuine Signal Validation (Mar 2026)

  • Question: Is the model learning real closure signals, or just detecting "this place is absent from the latest release?"
  • Setup: Features built from R0→R1 window only (Jan→Feb). R2 (Mar) used exclusively as a label oracle — label=1 if present in R2, label=0 (HQC) if in R0+R1 but not R2. R2 data never touches the feature matrix.
  • Result:
    | Metric | Current Pipeline (R2 in features) | Experiment (R2 labels only) | Delta |
    | --- | --- | --- | --- |
    | Balanced Accuracy | 89.29% | 71.02% | −18.3 pp |
    | AUC | 95.30% | 80.28% | −15.0 pp |
  • Conclusion: The significant drop confirms the model is not simply memorising R2 presence. Genuine predictive signals exist in the R0→R1 feature window (name changes, digital presence shifts, source volatility, recency). The additional ~18 pp in the current pipeline comes from the multi-release feature window giving the model more temporal evidence — not from target leakage. Script: scripts/experiments/exp_predictive_labels.py.

Phase 11: V9 — April 2026 Release + 4-Release Time Series (Apr 2026, Current)

  • Goal: Extend the pipeline to 4 consecutive monthly releases, using April as the closure oracle and Jan–Mar as the training window. This gives genuine 3-month trajectory data for every closed place.
  • New release: 2026-04-15.0 fetched from Overture S3 for all 12 cities (1,708,170 rows). Placed in overture_releases/ alongside Jan, Feb, and Mar.
  • Pipeline now creates 3 comparison pairs:
    • Pair 0 (Jan→Feb): open rows only (no prior history for HQC)
    • Pair 1 (Feb→Mar): 142,931 HQC closed + open
    • Pair 2 (Mar→Apr): 51,045 HQC closed + open ← April used as closure signal
  • Dataset: 484,940 rows after 60/40 global rebalancing (193,976 closed, 290,964 open). Train: 421,186 / Hold-out: 63,754 (Chicago + Miami).
  • Results: CB+LGBM ensemble hold-out 85.90% balanced accuracy (OOF AUC 93.40%). Individual models cluster tightly at 85.83–85.96%. Top features remain recency_spread, zombie_score, and identity_change_score; recency_pca dropped from rank 2 to rank 4 as the staleness interaction features gained relative weight with the longer time window. The −3.4 pp drop vs V8 is consistent with distribution shift: the Mar→Apr pair contributes only 51K HQC closed vs 142K from Feb→Mar, and April churners may behave differently than the prior cohort.

Phase 9: V8 — HQC Labels + Remaining Leak Fixes (Mar 2026)

  • Goal: Tighten the closed label definition, fix remaining leaks found in a full audit, and improve dataset balance.
  • HQC Closed Labels: Redefined closed as places present in 2 consecutive past releases and absent in the next — confirmed churners with trajectory history. This yields 142,931 high-quality closed examples vs. the old 3,000/pair cap that discarded 97% of available signal. Dataset rebalanced globally to 60/40 (357k rows total).
  • Leak Fixes:
    1. PCA fitted on full dataset: recency_pca was computed before the train/test split, so hold-out data influenced the PCA direction. Fix: PCA now fit on training rows only in step3 after the split; days_latest/days_avg passed as passthrough columns from step2.
    2. Hold-out used for optimisation: ensemble weights and threshold were searched over 918 combinations against y_test, then reported as the hold-out accuracy — a form of test-set overfitting. Fix: weights and threshold now chosen via OOF predictions (cross_val_predict on y_train); hold-out used only for final unbiased reporting.
    3. Single reference date across pairs: staleness was computed against the newest release date for all rows, making pair-0 places appear ~28 days older than pair-1 places with identical update dates. Fix: recency computed per release_date_current group.
    4. Digital presence used post-event values: has_website, num_socials, etc. used COALESCED (R_{i+1}) data for open places but R_i data for churned places — an asymmetric measurement window. Fix: all presence features now use base_* columns (R_i) for both classes.
    5. LightGBM missing category feature: LGBM received numeric features only, missing the 7.7%-importance category_primary. Fix: LabelEncoder fitted on training rows; LGBM receives category_encoded.
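Fitting the category encoding on training rows only can be sketched without any library. This is an illustration under assumptions: the helper names are hypothetical, and mapping unseen hold-out categories to -1 is a choice made for the sketch (scikit-learn's LabelEncoder raises an error on unseen labels instead).

```python
def fit_category_encoder(train_categories):
    # Map each category seen in training to a stable integer code.
    return {c: i for i, c in enumerate(sorted(set(train_categories)))}


def encode(categories, mapping, unseen=-1):
    # Categories that never appeared in training get the `unseen` code.
    return [mapping.get(c, unseen) for c in categories]


mapping = fit_category_encoder(["cafe", "bar", "cafe", "gym"])
train_encoded = encode(["cafe", "bar", "cafe", "gym"], mapping)
holdout_encoded = encode(["gym", "spa"], mapping)
print(mapping)          # {'bar': 0, 'cafe': 1, 'gym': 2}
print(holdout_encoded)  # [2, -1]
```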
