project-terraforma/Project-Finale

Project Finale: Open/Closed Business Prediction System

Predict whether a business is OPEN or CLOSED using a multi-source signal stack, external enrichment, and an MCC-optimized XGBoost pipeline.


📊 Project Overview

| Aspect | Details |
| --- | --- |
| Project Name | Project Finale (Project C) |
| Goal | Predict business operational status (open/closed) |
| Primary Metric | Matthews Correlation Coefficient (MCC) |
| Approach | Multi-source enrichment + leak-aware XGBoost classification |
| Data Sources | Overture (embedded) + Google Places + Foursquare OS + OpenCorporates + Microsoft signals |
| Model | XGBoost with calibration, 5-fold stratified cross-validation |

Key Features

  1. Multi-source enrichment pipeline

    • Google Places API: Real-time business status
    • OpenCorporates: Legal entity status & incorporation dates
    • Microsoft signals: Data freshness & staleness indicators
  2. Leak-aware feature engineering

    • Temporal recency signals (update dates, staleness)
    • Presence signals (websites, phones, socials, branding)
    • Geographic and category features
  3. MCC-optimized threshold tuning

    • Sweeps decision thresholds 0.1–0.7
    • Selects best threshold based on 5-fold OOF MCC
    • Reduces false positives in class-imbalanced data
  4. Model calibration

    • Isotonic calibration on full training set
    • Probability outputs suitable for downstream decision-making

📁 Directory Structure

Project-Finale/
├── data/                          # Raw and processed datasets
│   ├── raw/                       # Original data files
│   │   ├── project_c_samples.parquet
│   │   └── sample-open-prediction.parquet
│   └── processed/                 # Processed datasets (after enrichment)
│
├── src/                           # Core source code
│   ├── train_competition.py       # Train XGBoost on labeled data
│   ├── predict.py                 # Inference on sample-open-prediction
│   └── __init__.py
│
├── api/                           # External API integrations
│   ├── external_enrichment.py     # Google + OpenCorporates + Foursquare enrichment
│   ├── foursquare_enrichment.py   # Foursquare-only CLI
│   └── foursquare_data.py         # Foursquare dataset downloader
│
├── run_pipeline.py                # Orchestrator: enrich → train → predict
│
├── utils/                         # Shared utilities
│   ├── paths.py                   # Centralized path management
│   ├── schema.py                  # record_id / names normalization
│   ├── features.py                # Overture + enrichment feature engineering
│   ├── foursquare_enrich.py       # Foursquare DuckDB enrichment
│   ├── api_utils.py               # API helpers (OpenCorporates, Google Places)
│   └── __init__.py
│
├── models/                        # Saved models and checkpoints
│   └── competition_model.json     # XGBoost model (after training)
│
├── outputs/                       # Results and artifacts
│   ├── predictions_sample_open.csv # Predictions for sample-open-prediction
│   ├── enrichment_train.parquet   # Enrichment for training set
│   ├── enrichment_predict.parquet # Enrichment for prediction set
│   ├── artifacts.json             # Inference threshold + encoders
│   ├── enrichment_combined.parquet # Legacy combined enrichment path
│   ├── opencorp_features.parquet  # OpenCorporates enrichment results
│   ├── google_features.parquet    # Google Places enrichment results
│   ├── mcc_threshold_curve.png    # MCC sweep visualization
│   ├── shap_importance.png        # Feature importance
│   ├── shap_beeswarm.png          # SHAP interaction plot
│   └── metrics_competition.json   # Training metrics & config
│
├── configs/                       # Configuration files
│   └── (API keys stored locally, not in git)
│
├── tests/                         # Test and diagnostic scripts
│   ├── diagnose_extraction.py     # Data structure analysis
│   ├── diagnose_api.py            # API diagnostic
│   ├── test_google_places.py      # Google Places API test
│   └── validate_fixes.py          # Validation checks
│
├── notebooks/                     # Jupyter notebooks (if any)
│
├── fsq_data/                      # Foursquare dataset (downloaded on demand)
│   └── release/
│       └── dt=2026-05-14/
│
├── .git/                          # Version control
├── .venv/                         # Python virtual environment
├── LICENSE                        # License file
├── README.md                      # This file
└── requirements.txt               # Python dependencies

🔧 Setup Instructions

1. Prerequisites

  • Python 3.9+
  • Git
  • pip or conda

2. Clone Repository

git clone https://github.com/project-terraforma/Project-Finale.git
cd Project-Finale

3. Create Virtual Environment

# Using venv
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# or
.venv\Scripts\activate     # Windows

# Or using conda
conda create -n project-c python=3.10
conda activate project-c

4. Install Dependencies

pip install -r requirements.txt

Key packages:

pandas==1.5.0
xgboost==1.7.0
scikit-learn==1.2.0
shap==0.41.0
matplotlib==3.6.0
requests==2.28.0
rapidfuzz==2.14.0
pyarrow==10.0.0
shapely==2.0.0
huggingface_hub==0.11.0

5. Environment Variables

Create a .env file in the project root (add to .gitignore):

# .env
GOOGLE_PLACES_API_KEY=your_api_key_here
OPENCORP_API_KEY=your_api_key_here

Or export in terminal:

export GOOGLE_PLACES_API_KEY="your_key"
export OPENCORP_API_KEY="your_key"
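
For callers that drive the scripts from Python, the two variables above can be read with the standard library. This is a minimal sketch, not code from this repository: `load_api_keys` is a hypothetical helper, and it assumes the values are already exported, since Python does not read a .env file on its own.

```python
import os


def load_api_keys(env=None):
    """Return the enrichment API keys from the environment (None when unset or empty).

    `load_api_keys` is a hypothetical helper used for illustration only.
    """
    env = os.environ if env is None else env
    return {
        "google": env.get("GOOGLE_PLACES_API_KEY") or None,
        "opencorp": env.get("OPENCORP_API_KEY") or None,
    }


# Example with an explicit mapping instead of the real process environment.
keys = load_api_keys({"GOOGLE_PLACES_API_KEY": "your_key"})
```

A missing key comes back as None, which a caller could map onto the --skip-google flag or onto the keyless OpenCorporates free tier.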

🚀 Running the Pipeline

Prerequisites

Place both parquet files under data/:

  • data/project_c_samples.parquet: labeled training data (Overture schema)
  • data/sample-open-prediction.parquet: businesses to predict

One-command pipeline (recommended)

Trains on labeled data, predicts open/closed for sample-open-prediction.parquet:

export GOOGLE_PLACES_API_KEY="your_key"   # optional but recommended

python run_pipeline.py --google-key $GOOGLE_PLACES_API_KEY

Flags: --skip-google, --skip-foursquare, --fsq-skip-download, --train-only, --predict-only.

Step-by-step

# 1. Enrich training rows (Google + OpenCorporates + Foursquare)
python api/external_enrichment.py \
    --data data/project_c_samples.parquet \
    --enrichment-out outputs/enrichment_train.parquet \
    --google-key $GOOGLE_PLACES_API_KEY

# 2. Enrich prediction rows
python api/external_enrichment.py \
    --data data/sample-open-prediction.parquet \
    --enrichment-out outputs/enrichment_predict.parquet \
    --google-key $GOOGLE_PLACES_API_KEY \
    --fsq-skip-download

# 3. Train on labeled data
python src/train_competition.py \
    --data data/project_c_samples.parquet \
    --enrichment outputs/enrichment_train.parquet \
    --n-folds 5

# 4. Predict
python src/predict.py \
    --data data/sample-open-prediction.parquet \
    --enrichment outputs/enrichment_predict.parquet

Outputs:

| File | Description |
| --- | --- |
| outputs/enrichment_train.parquet | External features for training set |
| outputs/enrichment_predict.parquet | External features for prediction set |
| models/competition_model.json | Trained XGBoost model |
| outputs/artifacts.json | Threshold + encoders for inference |
| outputs/predictions_sample_open.csv | Final record_id, open, probability_open |
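
A quick sanity check on the predictions file is to count the predicted labels and average the probabilities. The sketch below assumes the three columns named above (record_id, open, probability_open) and that `open` holds the strings "open"/"closed"; `summarize_predictions` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def summarize_predictions(handle):
    """Count predicted labels and average probability_open over all rows."""
    counts = Counter()
    total_prob = 0.0
    n_rows = 0
    for row in csv.DictReader(handle):
        counts[row["open"]] += 1
        total_prob += float(row["probability_open"])
        n_rows += 1
    mean_prob = total_prob / n_rows if n_rows else float("nan")
    return dict(counts), mean_prob


# Hypothetical rows standing in for outputs/predictions_sample_open.csv.
sample = io.StringIO(
    "record_id,open,probability_open\n"
    "a1,open,0.91\n"
    "b2,closed,0.12\n"
    "c3,open,0.67\n"
)
counts, mean_prob = summarize_predictions(sample)
```

For the real file, pass `open("outputs/predictions_sample_open.csv", newline="")` instead of the in-memory sample.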

Quick Test (training only, no enrichment)

python src/train_competition.py \
    --data data/project_c_samples.parquet \
    --n-folds 3

API Enrichment Options

Google Places API Only

python api/external_enrichment.py \
    --data data/raw/project_c_samples.parquet \
    --google-key YOUR_KEY \
    --skip-opencorp

OpenCorporates API Only

python api/external_enrichment.py \
    --data data/raw/project_c_samples.parquet \
    --skip-google

Both APIs

python api/external_enrichment.py \
    --data data/raw/project_c_samples.parquet \
    --google-key YOUR_GOOGLE_KEY \
    --opencorp-key YOUR_OC_KEY

Debug Mode (Show API Requests/Responses)

python api/external_enrichment.py \
    --data data/raw/project_c_samples.parquet \
    --google-key YOUR_KEY \
    --debug

🌐 API Integration

Google Places API

Provides: Real-time business status (OPERATIONAL, CLOSED_PERMANENTLY, CLOSED_TEMPORARILY)

Setup:

  1. Go to Google Cloud Console
  2. Create a new project
  3. Enable the "Places API"
  4. Create an API key (restrict to Places API)
  5. Use: --google-key YOUR_KEY

Cost: ~$0.017 per request. At 2 requests per record (text search + place details), 10K records cost ~$340 before the $200 free-tier credit.
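
The cost figure is plain arithmetic and can be rechecked for other dataset sizes. The sketch below uses the per-request price and the 2 requests per record quoted above; `estimate_google_cost` is a hypothetical helper, and treating "200" as a dollar credit is an assumption.

```python
def estimate_google_cost(n_records, requests_per_record=2,
                         price_per_request=0.017, free_credit=0.0):
    """Estimated Google Places spend in dollars, never below zero."""
    gross = n_records * requests_per_record * price_per_request
    return round(max(gross - free_credit, 0.0), 2)


# 10K records at 2 requests each, before any credit.
gross = estimate_google_cost(10_000)
# Same run, assuming a $200 credit applies (assumption, see lead-in).
net = estimate_google_cost(10_000, free_credit=200.0)
# Upper bound when calls are capped at 6000 (the --google-limit default).
budget_cap = estimate_google_cost(6_000, requests_per_record=1)
```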

Troubleshooting:

  • 401 Unauthorized: API key invalid or API not enabled
  • 429 Rate Limited: Automatic backoff in script; wait and retry
  • 0 matches: Business not indexed by Google; common for small/new businesses
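
The backoff on 429 responses can be pictured as retrying with exponentially growing waits. This is a generic sketch of that pattern, not the script's actual implementation: `RateLimited`, `call_with_backoff`, and the delay values are all hypothetical.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical signal that the API answered with HTTP 429."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimited with exponential delay plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimited:
            if attempt == max_retries:
                raise  # Give up after the final retry.
            # Delay doubles each attempt: about 1s, 2s, 4s, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1 * base_delay)
            sleep(delay)


# Demo: a fake request that is rate limited twice, then succeeds.
attempts = {"n": 0}


def flaky_request():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited()
    return "OPERATIONAL"


waits = []  # Record the delays instead of actually sleeping.
result = call_with_backoff(flaky_request, sleep=waits.append)
```

Injecting `sleep` keeps the demo instant; in real use the default `time.sleep` applies the waits.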

OpenCorporates API

Provides: Legal entity status (active/dissolved), incorporation date, company age

Setup:

  1. Go to OpenCorporates.com
  2. Create an account and get an API key (optional for free tier)
  3. Use: --opencorp-key YOUR_KEY (for higher rate limits)

Cost: Free tier available, rate limited to ~4 requests/second

Troubleshooting:

  • No API key needed for free tier, but rate limited
  • 401 Unauthorized: Invalid API key
  • 0 matches: Company not in OpenCorporates database (common for non-US entities)

Foursquare Dataset

Download the full Foursquare Open Source Places (FSQ OS Places) dataset:

python api/foursquare_data.py

Requires a Hugging Face token (set the HF_TOKEN environment variable for private datasets).


🎯 Features & Model

Feature Engineering

Temporal features:

  • msft_age_days: Days since Microsoft last updated (staleness signal)
  • days_since_latest_update: Latest update across all sources
  • update_span_days: Time span between oldest and latest update
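
The two multi-source temporal features reduce to date arithmetic over the per-source update dates. A minimal sketch, assuming each record carries one update date per source; `temporal_features` is a hypothetical helper and the dates are made up.

```python
from datetime import date


def temporal_features(update_dates, today):
    """Derive recency features from a list of per-source update dates."""
    if not update_dates:
        return {"days_since_latest_update": None, "update_span_days": None}
    latest, oldest = max(update_dates), min(update_dates)
    return {
        "days_since_latest_update": (today - latest).days,
        "update_span_days": (latest - oldest).days,
    }


# Two sources, last updated in January 2024 and March 2025.
feats = temporal_features(
    [date(2024, 1, 10), date(2025, 3, 1)],
    today=date(2025, 6, 1),
)
```

msft_age_days follows the same pattern with only the Microsoft update date as input.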

Presence features:

  • has_phone, has_website, has_social: Binary presence
  • presence_score: Sum of all presence signals (0–4)
  • presence_ratio: Presence completeness (0–1)
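
The presence features can be sketched as a sum of binary flags and that sum divided by the number of flags. The fourth flag name (`has_brand`) is an assumption inferred from the 0–4 range and the "branding" signal in Key Features; `presence_features` itself is hypothetical.

```python
def presence_features(record,
                      signals=("has_phone", "has_website", "has_social", "has_brand")):
    """Sum binary presence flags and express them as a completeness ratio."""
    # Missing or falsy flags count as absent.
    flags = [1 if record.get(name) else 0 for name in signals]
    score = sum(flags)
    return {"presence_score": score, "presence_ratio": score / len(signals)}


feats = presence_features({"has_phone": True, "has_website": True, "has_social": False})
```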

Contact completeness:

  • addr_completeness: Address field count (0–4)
  • missing_count: Total missing metadata (0–5)

Naming features:

  • name_len: Character length
  • name_words: Word count
  • name_has_digit: Contains numbers
  • name_has_hash: Contains "#" (store numbers)
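
The naming features are simple string statistics. A minimal sketch, assuming the business name is available as a plain string; `name_features` is a hypothetical helper.

```python
def name_features(name):
    """Compute the four naming features from a raw business name."""
    name = (name or "").strip()
    return {
        "name_len": len(name),
        "name_words": len(name.split()),
        "name_has_digit": int(any(ch.isdigit() for ch in name)),
        "name_has_hash": int("#" in name),
    }


feats = name_features("Joe's Pizza #12")
```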

Geographic:

  • lat, lon: Coordinates (rounded)
  • geo_density: Business density by lat/lon grid
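
One way to compute a grid-based density is to bucket coordinates into fixed-size cells and count records per cell. The sketch below assumes a 0.01 degree cell, which is not stated in this README; `grid_cell` and `geo_density` are hypothetical names.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg=0.01):
    """Map a coordinate to an integer grid cell of size cell_deg degrees."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def geo_density(points, cell_deg=0.01):
    """For each (lat, lon) point, count how many points share its grid cell."""
    counts = Counter(grid_cell(lat, lon, cell_deg) for lat, lon in points)
    return [counts[grid_cell(lat, lon, cell_deg)] for lat, lon in points]


# Two nearby points in one cell, plus one far away.
points = [(37.7749, -122.4194), (37.7751, -122.4190), (40.7128, -74.0060)]
density = geo_density(points)
```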

Category:

  • primary_cat_enc: Encoded primary category
  • alt_cat_count: Alternate category count

External signals (from enrichment APIs):

  • google_is_permanently_closed: Google status = CLOSED_PERMANENTLY ⭐⭐
  • google_is_operational: Google status = OPERATIONAL
  • oc_is_dissolved: Legal status from OpenCorporates = dissolved ⭐
  • google_signal_strength: Match confidence × name score
  • google_engagement_score: log(rating_count) × rating
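
The engagement score combines review volume and rating. The sketch below is one possible reading of the formula above: it uses log(1 + rating_count) so that zero reviews stay defined, which is an assumption rather than the confirmed implementation, and `google_engagement_score` here is an illustrative function.

```python
import math


def google_engagement_score(rating, rating_count):
    """log(1 + rating_count) * rating, or 0.0 when Google returned no rating data."""
    if rating is None or not rating_count:
        return 0.0
    # log1p is an assumption; the README only says "log(rating_count)".
    return math.log1p(rating_count) * rating


busy = google_engagement_score(4.0, 99)      # many reviews
unrated = google_engagement_score(None, 10)  # no rating returned
```

The log damps the effect of very large review counts, so a place with 10,000 reviews does not dwarf one with 100.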

Model Parameters

n_estimators=600
max_depth=5
learning_rate=0.04
subsample=0.8
colsample_bytree=0.75
scale_pos_weight=<neg/pos ratio>  # Handles class imbalance
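
The neg/pos ratio for scale_pos_weight can be derived from the training labels. A minimal sketch, assuming labels are encoded as 1 for the positive class and 0 for the negative class; which of open/closed is positive is not stated here, so the example shows both readings, and the helper name is hypothetical.

```python
def scale_pos_weight(labels, positive=1):
    """Ratio of negative to positive examples, as XGBoost's scale_pos_weight expects."""
    pos = sum(1 for y in labels if y == positive)
    neg = len(labels) - pos
    if pos == 0:
        raise ValueError("no positive examples in labels")
    return neg / pos


# Class balance from the metrics JSON example: 3200 open, 1800 closed.
labels = [1] * 3200 + [0] * 1800          # assumes open = 1
weight_open_positive = scale_pos_weight(labels)
weight_closed_positive = scale_pos_weight(labels, positive=0)
```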

Optimization

  • Metric: Matthews Correlation Coefficient (MCC)
  • Threshold sweep: 0.1–0.7 in 0.05 increments
  • Selection: Best threshold from 5-fold OOF MCC
  • Calibration: Isotonic calibration on full train set
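
The threshold sweep can be sketched in pure Python: compute MCC at each candidate threshold over the out-of-fold probabilities and keep the best one. This illustration pools all OOF predictions into one list, which may differ from how the training script aggregates folds; the function names and toy data are hypothetical.

```python
import math


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0.0 when undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def sweep_thresholds(y_true, oof_prob, lo=0.10, hi=0.70, step=0.05):
    """Return (best_threshold, {threshold: mcc}) over the inclusive range."""
    n_steps = int(round((hi - lo) / step))
    results = {}
    for i in range(n_steps + 1):
        thr = round(lo + i * step, 2)
        preds = [1 if p >= thr else 0 for p in oof_prob]
        results[thr] = mcc(y_true, preds)
    best = max(results, key=results.get)
    return best, results


# Toy out-of-fold probabilities for eight records.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
oof_prob = [0.90, 0.81, 0.66, 0.47, 0.42, 0.31, 0.22, 0.37]
best_threshold, mcc_by_threshold = sweep_thresholds(y_true, oof_prob)
```

Returning the full curve alongside the best threshold mirrors the mcc_by_threshold entry in the metrics JSON.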

📤 Output Files

After Enrichment (api/external_enrichment.py)

| File | Contents |
| --- | --- |
| outputs/opencorp_features.parquet | OpenCorporates results (OC status, age, etc.) |
| outputs/google_features.parquet | Google Places results (business status, rating, etc.) |
| outputs/enrichment_combined.parquet | Merged enrichment features (join on record_id) |

After Training (src/train_competition.py)

| File | Contents |
| --- | --- |
| outputs/submission.csv | Predictions: record_id, open ("open"/"closed") |
| models/competition_model.json | Trained XGBoost model (saveable/loadable) |
| outputs/metrics_competition.json | Training metrics, features, thresholds, config |
| outputs/mcc_threshold_curve.png | MCC vs decision threshold (with ±1 std envelope) |
| outputs/shap_importance.png | Feature importance (mean \|SHAP\|) |
| outputs/shap_beeswarm.png | SHAP interaction plot (top 20 features) |

Metrics JSON Structure

{
  "n_train": 5000,
  "n_val": 2000,
  "n_features": 43,
  "features": ["confidence", "source_count", ...],
  "class_balance_train": {"open": 3200, "closed": 1800},
  "best_threshold": 0.35,
  "oof_mcc": 0.4567,
  "mcc_by_threshold": {
    "0.10": 0.3421,
    "0.15": 0.4123,
    ...
    "0.35": 0.4567
  },
  "top_10_features_by_shap": [
    "google_is_permanently_closed",
    "msft_age_days",
    ...
  ],
  "model_params": {...},
  "has_external_signal": false
}
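
Once training has written the metrics file, the threshold curve can be inspected programmatically. The sketch below assumes the key names shown in the structure above and uses a trimmed inline copy of that example instead of the real file; `best_threshold_from_metrics` is a hypothetical helper.

```python
import json


def best_threshold_from_metrics(metrics):
    """Return (threshold, mcc) for the highest entry in mcc_by_threshold."""
    curve = metrics["mcc_by_threshold"]
    best_key = max(curve, key=curve.get)
    return float(best_key), curve[best_key]


# Trimmed copy of the example structure; real runs would use
# json.load on outputs/metrics_competition.json instead.
metrics = json.loads(
    '{"best_threshold": 0.35,'
    ' "mcc_by_threshold": {"0.10": 0.3421, "0.15": 0.4123, "0.35": 0.4567}}'
)
threshold, score = best_threshold_from_metrics(metrics)
```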

🔍 Diagnostics & Testing

Data Structure Analysis

python tests/diagnose_extraction.py

Displays:

  • Column names, dtypes, shapes
  • Names extraction quality
  • Region extraction & state mapping
  • API gate pass rate

API Configuration Check

python tests/validate_fixes.py --opencorp-key YOUR_KEY

Tests:

  • Data preprocessing
  • Name and region extraction
  • OpenCorporates API connectivity
  • Sample API call

Google Places API Test

python tests/test_google_places.py

Tests first 10 Google Places API calls (text search + place details).


⚙️ Configuration

Advanced Training Options

python src/train_competition.py \
    --data data/raw/my_dataset.parquet \
    --n-folds 10 \
    --seed 123 \
    --out outputs/

Advanced Enrichment Options

python api/external_enrichment.py \
    --data data/raw/my_data.parquet \
    --google-key YOUR_KEY \
    --google-limit 5000 \
    --conf-max 0.8 \
    --msft-age-min 180 \
    --debug

| Option | Default | Description |
| --- | --- | --- |
| --google-limit | 6000 | Max Google API calls (budget guard) |
| --conf-max | 0.75 | Confidence ceiling for "uncertain" records |
| --msft-age-min | 365 | Min MSFT staleness (days) for uncertainty |
| --debug | False | Print API requests/responses (first 10) |
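
The three numeric options suggest a gate that sends only "uncertain" records to Google, up to a call budget. The sketch below is a guess at that logic, not the script's implementation: whether the two conditions are combined with OR or AND is unknown, so it is exposed as a `require_both` switch, and `select_for_google` and the record fields are hypothetical.

```python
def select_for_google(records, conf_max=0.75, msft_age_min=365,
                      limit=6000, require_both=False):
    """Pick record_ids considered uncertain, stopping at the call budget."""
    combine = all if require_both else any  # AND vs OR is an assumption
    picked = []
    for rec in records:
        if len(picked) >= limit:
            break  # Budget guard, analogous to --google-limit.
        checks = (
            rec["confidence"] <= conf_max,          # low source confidence
            rec["msft_age_days"] >= msft_age_min,   # stale Microsoft signal
        )
        if combine(checks):
            picked.append(rec["record_id"])
    return picked


records = [
    {"record_id": "a", "confidence": 0.90, "msft_age_days": 30},
    {"record_id": "b", "confidence": 0.60, "msft_age_days": 30},
    {"record_id": "c", "confidence": 0.95, "msft_age_days": 400},
    {"record_id": "d", "confidence": 0.50, "msft_age_days": 500},
]
either = select_for_google(records)
both = select_for_google(records, require_both=True)
capped = select_for_google(records, limit=2)
```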

🐛 Troubleshooting

"FileNotFoundError: No data file found"

Solution:

# Ensure data files exist
ls -la data/raw/

# Or specify full path
python src/train_competition.py --data /full/path/to/project_c_samples.parquet

"ModuleNotFoundError: No module named 'utils'"

Solution:

# Ensure you're running from project root
cd /path/to/project-finale
python src/train_competition.py --data data/raw/project_c_samples.parquet

# Or add project to PYTHONPATH
export PYTHONPATH="${PYTHONPATH}:/path/to/project-finale"

"RequestException: HTTPError: 401 Client Error"

Google Places:

  • Verify API key is correct and not expired
  • Check that Places API is enabled in Google Cloud Console
  • Ensure API key isn't restricted to a different service

OpenCorporates:

  • Free tier doesn't require a key, but has rate limits
  • With key, verify it's correct and not expired

"0 matches found" (OpenCorporates)

Causes:

  • Company not registered in OpenCorporates
  • Company name fuzzy match score < 60
  • Region/state mapping failed
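
To see why a name fails the score cutoff, it can help to score a pair by hand. The sketch below uses the standard library's difflib as a stand-in; the project lists rapidfuzz in its requirements, whose scorers can give different numbers, so treat the values as indicative only. `name_match_score` and `is_match` are hypothetical helpers.

```python
from difflib import SequenceMatcher


def _normalize(name):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(name.lower().split())


def name_match_score(a, b):
    """Similarity on a 0-100 scale (difflib stand-in, not rapidfuzz)."""
    return 100.0 * SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()


def is_match(a, b, min_score=60):
    """True when the pair clears the score cutoff cited above."""
    return name_match_score(a, b) >= min_score


same_business = is_match("Joe's Pizza", "JOE'S  PIZZA")
different_business = is_match("Joe's Pizza", "Acme Hardware Supply")
```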

Debug:

python tests/validate_fixes.py
python tests/diagnose_extraction.py
python api/external_enrichment.py --debug

Low Model Performance (MCC < 0.3)

Possible causes:

  • Insufficient external enrichment (run enrichment first)
  • Class imbalance too severe (check class_balance in metrics JSON)
  • Features not informative (check SHAP importance)

Solutions:

  1. Ensure enrichment file exists: outputs/enrichment_combined.parquet
  2. Try different thresholds: Check mcc_by_threshold in metrics JSON
  3. Add external signals: Re-run enrichment with Google Places API

📈 Expected Performance

On sample data (Project C dataset):

| Metric | Value |
| --- | --- |
| OOF MCC | 0.42–0.47 |
| Train samples | ~5,000 |
| Validation samples | ~2,000 |
| Top feature | google_is_permanently_closed |
| Best threshold | 0.30–0.40 |

Performance varies with:

  • Data quality & freshness
  • External enrichment success rate
  • Class imbalance
  • Temporal signals

📚 Code Structure

utils/paths.py

Centralized path management. All file paths resolved relative to project root.

from utils.paths import get_sample_data_path, get_enrichment_output_path
data_path = get_sample_data_path()
enrichment_path = get_enrichment_output_path()

utils/api_utils.py

Shared API functions for OpenCorporates and Google Places.

from utils.api_utils import query_opencorporates, google_text_search
result = query_opencorporates(name, jurisdiction, api_key)
place_id, name, status = google_text_search(name, address, api_key)

api/external_enrichment.py

Main enrichment pipeline. Handles schema detection, API calls, and feature merging.

python api/external_enrichment.py --google-key KEY

src/train_competition.py

Main training pipeline. Feature engineering, cross-validation, threshold optimization.

python src/train_competition.py --data data/raw/project_c_samples.parquet

🀝 Contributing

For bug reports or feature requests, please open an issue or submit a pull request.


📄 License

See LICENSE file for details.


📞 Support

For questions or issues:

  1. Check Troubleshooting section
  2. Review diagnostic output: python tests/diagnose_extraction.py
  3. Check logs in outputs/ directory

Last Updated: June 2026
Version: 1.0.0
Status: Production-ready
