Predict whether a business is OPEN or CLOSED using a multi-source signal stack, external enrichment, and an MCC-optimized XGBoost pipeline.
- Project Overview
- Directory Structure
- Setup Instructions
- Running the Pipeline
- API Integration
- Features & Model
- Output Files
- Troubleshooting
| Aspect | Details |
|---|---|
| Project Name | Project Finale (Project C) |
| Goal | Predict business operational status (open/closed) |
| Primary Metric | Matthews Correlation Coefficient (MCC) |
| Approach | Multi-source enrichment + leak-aware XGBoost classification |
| Data Sources | Overture (embedded) + Google Places + Foursquare OS + OpenCorporates + Microsoft signals |
| Model | XGBoost with calibration, 5-fold stratified cross-validation |
- Multi-source enrichment pipeline
  - Google Places API: Real-time business status
  - OpenCorporates: Legal entity status & incorporation dates
  - Microsoft signals: Data freshness & staleness indicators
- Leak-aware feature engineering
  - Temporal recency signals (update dates, staleness)
  - Presence signals (websites, phones, socials, branding)
  - Geographic and category features
- MCC-optimized threshold tuning
  - Sweeps decision thresholds 0.1–0.7
  - Selects best threshold based on 5-fold OOF MCC
  - Reduces false positives in class-imbalanced data
- Model calibration
  - Isotonic calibration on full training set
  - Probability outputs suitable for downstream decision-making
Project-Finale/
├── data/ # Raw and processed datasets
│   ├── raw/ # Original data files
│   │   ├── project_c_samples.parquet
│   │   └── sample-open-prediction.parquet
│   └── processed/ # Processed datasets (after enrichment)
│
├── src/ # Core source code
│   ├── train_competition.py # Train XGBoost on labeled data
│   ├── predict.py # Inference on sample-open-prediction
│   └── __init__.py
│
├── api/ # External API integrations
│   ├── external_enrichment.py # Google + OpenCorporates + Foursquare enrichment
│   ├── foursquare_enrichment.py # Foursquare-only CLI
│   └── foursquare_data.py # Foursquare dataset downloader
│
├── run_pipeline.py # Orchestrator: enrich → train → predict
│
├── utils/ # Shared utilities
│   ├── paths.py # Centralized path management
│   ├── schema.py # record_id / names normalization
│   ├── features.py # Overture + enrichment feature engineering
│   ├── foursquare_enrich.py # Foursquare DuckDB enrichment
│   ├── api_utils.py # API helpers (OpenCorporates, Google Places)
│   └── __init__.py
│
├── models/ # Saved models and checkpoints
│   └── competition_model.json # XGBoost model (after training)
│
├── outputs/ # Results and artifacts
│   ├── predictions_sample_open.csv # Predictions for sample-open-prediction
│   ├── enrichment_train.parquet # Enrichment for training set
│   ├── enrichment_predict.parquet # Enrichment for prediction set
│   ├── artifacts.json # Inference threshold + encoders
│   ├── enrichment_combined.parquet # Legacy combined enrichment path
│   ├── opencorp_features.parquet # OpenCorporates enrichment results
│   ├── google_features.parquet # Google Places enrichment results
│   ├── mcc_threshold_curve.png # MCC sweep visualization
│   ├── shap_importance.png # Feature importance
│   ├── shap_beeswarm.png # SHAP interaction plot
│   └── metrics_competition.json # Training metrics & config
│
├── configs/ # Configuration files
│   └── (API keys stored locally, not in git)
│
├── tests/ # Test and diagnostic scripts
│   ├── diagnose_extraction.py # Data structure analysis
│   ├── diagnose_api.py # API diagnostic
│   ├── test_google_places.py # Google Places API test
│   └── validate_fixes.py # Validation checks
│
├── notebooks/ # Jupyter notebooks (if any)
│
├── fsq_data/ # Foursquare dataset (downloaded on demand)
│   └── release/
│       └── dt=2026-05-14/
│
├── .git/ # Version control
├── .venv/ # Python virtual environment
├── LICENSE # License file
├── README.md # This file
└── requirements.txt # Python dependencies
- Python 3.9+
- Git
- pip or conda
git clone https://github.com/your-repo/project-finale.git
cd project-finale

# Using venv
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# or
.venv\Scripts\activate # Windows

# Or using conda
conda create -n project-c python=3.10
conda activate project-c

pip install -r requirements.txt

Key packages:
pandas==1.5.0
xgboost==1.7.0
scikit-learn==1.2.0
shap==0.41.0
matplotlib==3.6.0
requests==2.28.0
rapidfuzz==2.14.0
pyarrow==10.0.0
shapely==2.0.0
huggingface_hub==0.11.0
Create a .env file in the project root (add to .gitignore):
# .env
GOOGLE_PLACES_API_KEY=your_api_key_here
OPENCORP_API_KEY=your_api_key_here

Or export in terminal:
export GOOGLE_PLACES_API_KEY="your_key"
export OPENCORP_API_KEY="your_key"

Place both parquet files under data/:
- data/project_c_samples.parquet: labeled training data (Overture schema)
- data/sample-open-prediction.parquet: businesses to predict
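Before running anything, it can help to confirm that the keys and files described above are actually in place. The sketch below is one way to do that; `preflight` is a hypothetical helper, not part of the repository, and since both API keys are optional for some runs its messages are best read as warnings.

```python
import os
from pathlib import Path

def preflight(env, data_files):
    """Return a list of human-readable problems (empty list means all good)."""
    problems = []
    # Environment variable names are the ones used in the .env example.
    for key in ("GOOGLE_PLACES_API_KEY", "OPENCORP_API_KEY"):
        if not env.get(key):
            problems.append(f"missing env var: {key}")
    for path in data_files:
        if not Path(path).is_file():
            problems.append(f"missing data file: {path}")
    return problems

issues = preflight(
    os.environ,
    ["data/project_c_samples.parquet", "data/sample-open-prediction.parquet"],
)
for issue in issues:
    print(issue)
```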
Trains on labeled data, predicts open/closed for sample-open-prediction.parquet:
export GOOGLE_PLACES_API_KEY="your_key" # optional but recommended
python run_pipeline.py --google-key $GOOGLE_PLACES_API_KEY

Flags: --skip-google, --skip-foursquare, --fsq-skip-download, --train-only, --predict-only.
# 1. Enrich training rows (Google + OpenCorporates + Foursquare)
python api/external_enrichment.py \
--data data/project_c_samples.parquet \
--enrichment-out outputs/enrichment_train.parquet \
--google-key $GOOGLE_PLACES_API_KEY
# 2. Enrich prediction rows
python api/external_enrichment.py \
--data data/sample-open-prediction.parquet \
--enrichment-out outputs/enrichment_predict.parquet \
--google-key $GOOGLE_PLACES_API_KEY \
--fsq-skip-download
# 3. Train on labeled data
python src/train_competition.py \
--data data/project_c_samples.parquet \
--enrichment outputs/enrichment_train.parquet \
--n-folds 5
# 4. Predict
python src/predict.py \
--data data/sample-open-prediction.parquet \
--enrichment outputs/enrichment_predict.parquet

Outputs:
| File | Description |
|---|---|
| outputs/enrichment_train.parquet | External features for training set |
| outputs/enrichment_predict.parquet | External features for prediction set |
| models/competition_model.json | Trained XGBoost model |
| outputs/artifacts.json | Threshold + encoders for inference |
| outputs/predictions_sample_open.csv | Final record_id, open, probability_open |
python src/train_competition.py \
--data data/project_c_samples.parquet \
--n-folds 3

python api/external_enrichment.py \
--data data/raw/project_c_samples.parquet \
--google-key YOUR_KEY \
--skip-opencorp

python api/external_enrichment.py \
--data data/raw/project_c_samples.parquet \
--skip-google

python api/external_enrichment.py \
--data data/raw/project_c_samples.parquet \
--google-key YOUR_GOOGLE_KEY \
--opencorp-key YOUR_OC_KEY

python api/external_enrichment.py \
--data data/raw/project_c_samples.parquet \
--google-key YOUR_KEY \
--debug

Provides: Real-time business status (OPERATIONAL, CLOSED_PERMANENTLY, CLOSED_TEMPORARILY)
Setup:
- Go to Google Cloud Console
- Create a new project
- Enable the "Places API"
- Create an API key (restrict to Places API)
- Use: --google-key YOUR_KEY
Cost: ~$0.017 per request. At 2 requests per record, 10K records come to about $340 before the $200 free-tier credit is applied.
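The cost estimate is plain arithmetic, and it is worth recomputing for your own record count. A small sketch, using the per-request price and two requests per record quoted above, and treating the free tier as a flat dollar credit (an assumption). `estimate_google_cost` is a hypothetical helper, not part of the repository.

```python
def estimate_google_cost(n_records, price_per_request=0.017,
                         requests_per_record=2, free_credit=0.0):
    """Estimated Google Places spend in dollars, never below zero."""
    gross = n_records * requests_per_record * price_per_request
    return max(gross - free_credit, 0.0)

gross = estimate_google_cost(10_000)                   # before any credit
net = estimate_google_cost(10_000, free_credit=200.0)  # assumed flat credit
print(f"gross ${gross:.2f}, after credit ${net:.2f}")
```

The --google-limit flag caps the number of calls, so the same function can be used to budget a run by passing the limit in place of the record count.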
Troubleshooting:
- 401 Unauthorized: API key invalid or API not enabled
- 429 Rate Limited: Automatic backoff in script; wait and retry
- 0 matches: Business not indexed by Google; common for small/new businesses
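The script's own retry logic is not reproduced here. As a general illustration of what automatic backoff on a 429 response looks like, the sketch below retries a callable with exponentially growing delays. `call_with_backoff`, `RateLimited`, and the delay values are all hypothetical.

```python
import time

class RateLimited(Exception):
    """Raised by the caller-supplied function when the API answers 429."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on RateLimited wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimited:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)

# Demo with a fake endpoint that is rate limited twice, then succeeds.
calls = {"n": 0}

def fake_request():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimited()
    return "OPERATIONAL"

delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(fake_request, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the demo instant and makes the delay schedule easy to inspect.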
Provides: Legal entity status (active/dissolved), incorporation date, company age
Setup:
- Go to OpenCorporates.com
- Create an account and get an API key (optional for free tier)
- Use: --opencorp-key YOUR_KEY (for higher rate limits)
Cost: Free tier available, rate limited to ~4 requests/second
Troubleshooting:
- No API key needed for free tier, but rate limited
- 401 Unauthorized: Invalid API key
- 0 matches: Company not in OpenCorporates database (common for non-US entities)
Download the full Foursquare Open Source Places dataset:
python api/foursquare_data.py

Requires a Hugging Face token (set the HF_TOKEN environment variable for private datasets).
Temporal features:
- msft_age_days: Days since Microsoft last updated (staleness signal)
- days_since_latest_update: Latest update across all sources
- update_span_days: Time span between oldest and latest update
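The three temporal features above are date differences. A minimal sketch of how they could be derived, assuming each record carries one last-update date per source. The input layout, the source names `msft` and `meta`, and the `temporal_features` helper are illustrative, not the layout used in utils/features.py.

```python
from datetime import date

def temporal_features(update_dates, msft_date, today):
    """update_dates: dict mapping source name -> last update date."""
    latest = max(update_dates.values())
    oldest = min(update_dates.values())
    return {
        "msft_age_days": (today - msft_date).days,
        "days_since_latest_update": (today - latest).days,
        "update_span_days": (latest - oldest).days,
    }

feats = temporal_features(
    {"msft": date(2024, 1, 1), "meta": date(2025, 1, 1)},
    msft_date=date(2024, 1, 1),
    today=date(2025, 1, 31),
)
```

Passing `today` explicitly, rather than reading the clock inside the function, keeps training and inference on the same reference date.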
Presence features:
- has_phone, has_website, has_social: Binary presence
- presence_score: Sum of all presence signals (0–4)
- presence_ratio: Presence completeness (0–1)
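As a rough illustration of how the presence score and ratio relate, the sketch below sums four binary flags and divides by their count. Only three flags are named above; the fourth, `has_brand`, is an assumption based on the "branding" signal mentioned in the overview, and `presence_features` is a hypothetical helper.

```python
def presence_features(record, signals=("has_phone", "has_website",
                                       "has_social", "has_brand")):
    """Sum binary presence flags and normalize by the number of flags."""
    score = sum(1 for key in signals if record.get(key))
    return {
        "presence_score": score,                 # 0..len(signals)
        "presence_ratio": score / len(signals),  # 0..1
    }

feats = presence_features(
    {"has_phone": True, "has_website": True, "has_social": False, "has_brand": False}
)
```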
Contact completeness:
- addr_completeness: Address field count (0–4)
- missing_count: Total missing metadata (0–5)
Naming features:
- name_len: Character length
- name_words: Word count
- name_has_digit: Contains numbers
- name_has_hash: Contains "#" (store numbers)
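The naming features are simple string statistics. A sketch of how they might be computed from a business name follows; `name_features` is a hypothetical helper, and the real implementation may normalize names first.

```python
def name_features(name):
    """Basic string statistics for a business name."""
    return {
        "name_len": len(name),
        "name_words": len(name.split()),
        "name_has_digit": int(any(ch.isdigit() for ch in name)),
        "name_has_hash": int("#" in name),  # "#" often marks a store number
    }

feats = name_features("Joe's Pizza #12")
```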
Geographic:
- lat, lon: Coordinates (rounded)
- geo_density: Business density by lat/lon grid
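One common way to get a grid density is to round coordinates to a fixed number of decimals and count how many records fall in each cell. The sketch below does that; the two-decimal precision and the `geo_density` function are assumptions, since the grid size is not stated here.

```python
from collections import Counter

def geo_density(coords, precision=2):
    """For each (lat, lon), count the records sharing its rounded grid cell."""
    cells = [(round(lat, precision), round(lon, precision)) for lat, lon in coords]
    counts = Counter(cells)
    return [counts[cell] for cell in cells]

density = geo_density([
    (40.7128, -74.0060),    # same cell as the next point
    (40.7131, -74.0059),
    (34.0522, -118.2437),   # alone in its cell
])
```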
Category:
- primary_cat_enc: Encoded primary category
- alt_cat_count: Alternate category count
External signals (from enrichment APIs):
- google_is_permanently_closed: Google status = CLOSED_PERMANENTLY
- google_is_operational: Google status = OPERATIONAL
- oc_is_dissolved: Legal status from OpenCorporates = dissolved
- google_signal_strength: Match confidence × name score
- google_engagement_score: log(rating_count) × rating
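The engagement score needs a guard, because the logarithm is undefined when a place has no ratings. A sketch of the formula as described above follows. Returning 0.0 for missing or empty ratings is an assumption, and the function is illustrative rather than the repository's implementation.

```python
import math

def google_engagement_score(rating, rating_count):
    """log(rating_count) * rating, or 0.0 when there is nothing to score."""
    if rating is None or rating_count is None or rating_count < 1:
        return 0.0
    return math.log(rating_count) * rating

unrated = google_engagement_score(None, 0)
popular = google_engagement_score(4.0, 100)
```

With a natural logarithm, a place with a single rating also scores 0.0, since log(1) is zero.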
n_estimators=600
max_depth=5
learning_rate=0.04
subsample=0.8
colsample_bytree=0.75
scale_pos_weight=<neg/pos ratio> # Handles class imbalance

- Metric: Matthews Correlation Coefficient (MCC)
- Threshold sweep: 0.1–0.7 in 0.05 increments
- Selection: Best threshold from 5-fold OOF MCC
- Calibration: Isotonic calibration on full train set
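To make the threshold sweep concrete, the sketch below computes MCC from confusion-matrix counts and evaluates it on the 0.10–0.70 grid in 0.05 steps. It is a pure-Python illustration, not the code in src/train_competition.py: `probs` stands in for pooled out-of-fold probabilities, the toy labels are made up, and the choice of which class is labeled 1 is left open.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def sweep_thresholds(y_true, probs, lo=0.10, hi=0.70, step=0.05):
    """Score every threshold on the grid and return (best, all_scores)."""
    n_steps = int(round((hi - lo) / step))
    grid = [round(lo + i * step, 2) for i in range(n_steps + 1)]
    scores = {t: mcc(y_true, [int(p >= t) for p in probs]) for t in grid}
    best = max(scores, key=scores.get)
    return best, scores

# Toy data for illustration only.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
probs = [0.9, 0.8, 0.65, 0.6, 0.2, 0.55, 0.3, 0.7]
best_threshold, scores = sweep_thresholds(y_true, probs)
```

When every prediction lands in one class the denominator is zero, and the function returns 0.0 instead of raising, which matches the usual convention for MCC.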
| File | Contents |
|---|---|
| outputs/opencorp_features.parquet | OpenCorporates results (OC status, age, etc.) |
| outputs/google_features.parquet | Google Places results (business status, rating, etc.) |
| outputs/enrichment_combined.parquet | Merged enrichment features (join on record_id) |
| File | Contents |
|---|---|
| outputs/submission.csv | Predictions: record_id, open ("open"/"closed") |
| models/competition_model.json | Trained XGBoost model (saveable/loadable) |
| outputs/metrics_competition.json | Training metrics, features, thresholds, config |
| outputs/mcc_threshold_curve.png | MCC vs decision threshold (with ±1 std envelope) |
| outputs/shap_importance.png | Feature importance (mean \|SHAP\|) |
| outputs/shap_beeswarm.png | SHAP interaction plot (top 20 features) |
{
"n_train": 5000,
"n_val": 2000,
"n_features": 43,
"features": ["confidence", "source_count", ...],
"class_balance_train": {"open": 3200, "closed": 1800},
"best_threshold": 0.35,
"oof_mcc": 0.4567,
"mcc_by_threshold": {
"0.10": 0.3421,
"0.15": 0.4123,
...
"0.35": 0.4567
},
"top_10_features_by_shap": [
"google_is_permanently_closed",
"msft_age_days",
...
],
"model_params": {...},
"has_external_signal": false
}

python tests/diagnose_extraction.py

Displays:
- Column names, dtypes, shapes
- Names extraction quality
- Region extraction & state mapping
- API gate pass rate
python tests/validate_fixes.py --opencorp-key YOUR_KEY

Tests:
- Data preprocessing
- Name and region extraction
- OpenCorporates API connectivity
- Sample API call
python tests/test_google_places.py

Tests first 10 Google Places API calls (text search + place details).
python src/train_competition.py \
--data data/raw/my_dataset.parquet \
--n-folds 10 \
--seed 123 \
--out outputs/

python api/external_enrichment.py \
--data data/raw/my_data.parquet \
--google-key YOUR_KEY \
--google-limit 5000 \
--conf-max 0.8 \
--msft-age-min 180 \
--debug

| Option | Default | Description |
|---|---|---|
| --google-limit | 6000 | Max Google API calls (budget guard) |
| --conf-max | 0.75 | Confidence ceiling for "uncertain" records |
| --msft-age-min | 365 | Min MSFT staleness (days) for uncertainty |
| --debug | False | Print API requests/responses (first 10) |
Solution:
# Ensure data files exist
ls -la data/raw/
# Or specify full path
python src/train_competition.py --data /full/path/to/project_c_samples.parquet

Solution:
# Ensure you're running from project root
cd /path/to/project-finale
python src/train_competition.py --data data/raw/project_c_samples.parquet
# Or add project to PYTHONPATH
export PYTHONPATH="${PYTHONPATH}:/path/to/project-finale"

Google Places:
- Verify API key is correct and not expired
- Check that Places API is enabled in Google Cloud Console
- Ensure API key isn't restricted to a different service
OpenCorporates:
- Free tier doesn't require a key, but has rate limits
- With key, verify it's correct and not expired
Causes:
- Company not registered in OpenCorporates
- Company name fuzzy match score < 60
- Region/state mapping failed
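To see why a name falls below the match cutoff, it helps to score a pair by hand. requirements.txt lists rapidfuzz, which the pipeline presumably uses for this. The sketch below swaps in the standard library's `difflib` so it runs without extra installs, which means its scores will differ from rapidfuzz's. The helper names and the lowercase/strip normalization are assumptions.

```python
from difflib import SequenceMatcher

def name_match_score(a, b):
    """Similarity on a 0-100 scale (difflib stand-in for a fuzzy matcher)."""
    a_norm, b_norm = a.lower().strip(), b.lower().strip()
    return 100.0 * SequenceMatcher(None, a_norm, b_norm).ratio()

def is_match(a, b, min_score=60.0):
    """Apply the cutoff of 60 mentioned in the causes above."""
    return name_match_score(a, b) >= min_score

same = is_match("Acme Corp", "ACME Corp ")
different = is_match("Acme Corp", "Zebra Holdings LLC")
```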
Debug:
python tests/validate_fixes.py
python tests/diagnose_extraction.py
python api/external_enrichment.py --debug

Possible causes:
- Insufficient external enrichment (run enrichment first)
- Class imbalance too severe (check class_balance in metrics JSON)
- Features not informative (check SHAP importance)
Solutions:
- Ensure enrichment file exists: outputs/enrichment_combined.parquet
- Try different thresholds: Check mcc_by_threshold in metrics JSON
- Add external signals: Re-run enrichment with Google Places API
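Checking the threshold curve by hand is a short dictionary lookup. The sketch below assumes the metrics file has the `mcc_by_threshold` layout shown in the example JSON, with thresholds as string keys. It uses an inline stand-in for outputs/metrics_competition.json, and `best_threshold_from_metrics` is a hypothetical helper.

```python
import json

def best_threshold_from_metrics(metrics):
    """Return (threshold, mcc) for the highest-scoring threshold."""
    curve = metrics["mcc_by_threshold"]
    best = max(curve, key=curve.get)
    return float(best), curve[best]

# Inline stand-in for the metrics file, using values from the example JSON.
raw = '{"mcc_by_threshold": {"0.10": 0.3421, "0.15": 0.4123, "0.35": 0.4567}}'
metrics = json.loads(raw)
threshold, score = best_threshold_from_metrics(metrics)
```

To inspect a real run, replace the inline string with the contents of the metrics file, for example via `json.load` on an opened file handle.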
On sample data (Project C dataset):
| Metric | Value |
|---|---|
| OOF MCC | 0.42β0.47 |
| Train samples | ~5,000 |
| Validation samples | ~2,000 |
| Top feature | google_is_permanently_closed |
| Best threshold | 0.30β0.40 |
Performance varies based on:
- Data quality & freshness
- External enrichment success rate
- Class imbalance
- Temporal signals
Centralized path management. All file paths resolved relative to project root.
from utils.paths import get_sample_data_path, get_enrichment_output_path
data_path = get_sample_data_path()
enrichment_path = get_enrichment_output_path()

Shared API functions for OpenCorporates and Google Places.
from utils.api_utils import query_opencorporates, google_text_search
result = query_opencorporates(name, jurisdiction, api_key)
place_id, name, status = google_text_search(name, address, api_key)

Main enrichment pipeline. Handles schema detection, API calls, and feature merging.
python api/external_enrichment.py --google-key KEY

Main training pipeline. Feature engineering, cross-validation, threshold optimization.
python src/train_competition.py --data data/raw/project_c_samples.parquet

For bug reports or feature requests, please open an issue or submit a pull request.
See LICENSE file for details.
For questions or issues:
- Check Troubleshooting section
- Review diagnostic output: python tests/diagnose_extraction.py
- Check logs in outputs/ directory
Last Updated: June 2026
Version: 1.0.0
Status: Production-ready