Project Finale: Open/Closed Business Prediction System

Predict whether a business is OPEN or CLOSED using a multi-source signal stack, external enrichment, and an MCC-optimized XGBoost pipeline.

📋 Table of Contents

Project Overview
Directory Structure
Setup Instructions
Running the Pipeline
API Integration
Features & Model
Output Files
Diagnostics & Testing
Configuration
Troubleshooting
Expected Performance
Code Structure

📊 Project Overview

Aspect	Details
Project Name	Project Finale (Project C)
Goal	Predict business operational status (open/closed)
Primary Metric	Matthews Correlation Coefficient (MCC)
Approach	Multi-source enrichment + leak-aware XGBoost classification
Data Sources	Overture (embedded) + Google Places + Foursquare OS + OpenCorporates + Microsoft signals
Model	XGBoost with calibration, 5-fold stratified cross-validation

Key Features

Multi-source enrichment pipeline
- Google Places API: Real-time business status
- OpenCorporates: Legal entity status & incorporation dates
- Microsoft signals: Data freshness & staleness indicators
Leak-aware feature engineering
- Temporal recency signals (update dates, staleness)
- Presence signals (websites, phones, socials, branding)
- Geographic and category features
MCC-optimized threshold tuning
- Sweeps decision thresholds 0.1–0.7
- Selects best threshold based on 5-fold OOF MCC
- Reduces false positives in class-imbalanced data
Model calibration
- Isotonic calibration on full training set
- Probability outputs suitable for downstream decision-making
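
Isotonic calibration fits a non-decreasing mapping from raw model scores to observed outcome rates. The sketch below is a minimal pool-adjacent-violators implementation in pure Python, meant only to illustrate the idea; the pipeline presumably relies on a library implementation, and `fit_isotonic` and `calibrate` are hypothetical names. Library versions may interpolate between fitted points, whereas this one returns a plain step function.

```python
def fit_isotonic(scores, labels):
    """Pool Adjacent Violators: return sorted (upper_score, probability) steps."""
    blocks = []  # each block: [sum_of_labels, count, highest_score_in_block]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Merge backwards while the previous block's mean exceeds the current one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def calibrate(score, steps):
    """Map a raw score to the probability of the first step that covers it."""
    for upper, prob in steps:
        if score <= upper:
            return prob
    return steps[-1][1]  # scores above the training range use the last step


# Toy data: raw scores with binary outcomes (1 = positive class).
steps = fit_isotonic([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 1, 0, 1, 1, 1])
print(calibrate(0.25, steps))
```

The out-of-order pair at scores 0.2 and 0.3 is pooled into a single step, so the calibrated output never decreases as the raw score increases.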

📁 Directory Structure

Project-Finale/
├── data/                          # Raw and processed datasets
│   ├── raw/                       # Original data files
│   │   ├── project_c_samples.parquet
│   │   └── sample-open-prediction.parquet
│   └── processed/                 # Processed datasets (after enrichment)
│
├── src/                           # Core source code
│   ├── train_competition.py       # Train XGBoost on labeled data
│   ├── predict.py                 # Inference on sample-open-prediction
│   └── __init__.py
│
├── api/                           # External API integrations
│   ├── external_enrichment.py     # Google + OpenCorporates + Foursquare enrichment
│   ├── foursquare_enrichment.py   # Foursquare-only CLI
│   └── foursquare_data.py         # Foursquare dataset downloader
│
├── run_pipeline.py                # Orchestrator: enrich → train → predict
│
├── utils/                         # Shared utilities
│   ├── paths.py                   # Centralized path management
│   ├── schema.py                  # record_id / names normalization
│   ├── features.py                # Overture + enrichment feature engineering
│   ├── foursquare_enrich.py       # Foursquare DuckDB enrichment
│   ├── api_utils.py               # API helpers (OpenCorporates, Google Places)
│   └── __init__.py
│
├── models/                        # Saved models and checkpoints
│   └── competition_model.json     # XGBoost model (after training)
│
├── outputs/                       # Results and artifacts
│   ├── predictions_sample_open.csv # Predictions for sample-open-prediction
│   ├── enrichment_train.parquet   # Enrichment for training set
│   ├── enrichment_predict.parquet # Enrichment for prediction set
│   ├── artifacts.json             # Inference threshold + encoders
│   ├── enrichment_combined.parquet # Legacy combined enrichment path
│   ├── opencorp_features.parquet  # OpenCorporates enrichment results
│   ├── google_features.parquet    # Google Places enrichment results
│   ├── mcc_threshold_curve.png    # MCC sweep visualization
│   ├── shap_importance.png        # Feature importance
│   ├── shap_beeswarm.png          # SHAP beeswarm summary plot
│   └── metrics_competition.json   # Training metrics & config
│
├── configs/                       # Configuration files
│   └── (API keys stored locally, not in git)
│
├── tests/                         # Test and diagnostic scripts
│   ├── diagnose_extraction.py     # Data structure analysis
│   ├── diagnose_api.py            # API diagnostic
│   ├── test_google_places.py      # Google Places API test
│   └── validate_fixes.py          # Validation checks
│
├── notebooks/                     # Jupyter notebooks (if any)
│
├── fsq_data/                      # Foursquare dataset (downloaded on demand)
│   └── release/
│       └── dt=2026-05-14/
│
├── .git/                          # Version control
├── .venv/                         # Python virtual environment
├── LICENSE                        # License file
├── README.md                      # This file
└── requirements.txt               # Python dependencies

🔧 Setup Instructions

1. Prerequisites

Python 3.9+
Git
pip or conda

2. Clone Repository

git clone https://github.com/your-repo/project-finale.git
cd project-finale

3. Create Virtual Environment

# Using venv
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# or
.venv\Scripts\activate     # Windows

# Or using conda
conda create -n project-c python=3.10
conda activate project-c

4. Install Dependencies

pip install -r requirements.txt

Key packages:

pandas==1.5.0
xgboost==1.7.0
scikit-learn==1.2.0
shap==0.41.0
matplotlib==3.6.0
requests==2.28.0
rapidfuzz==2.14.0
pyarrow==10.0.0
shapely==2.0.0
huggingface_hub==0.11.0

5. Environment Variables

Create a .env file in the project root and add it to .gitignore so that keys are never committed:

# .env
GOOGLE_PLACES_API_KEY=your_api_key_here
OPENCORP_API_KEY=your_api_key_here

Or export in terminal:

export GOOGLE_PLACES_API_KEY="your_key"
export OPENCORP_API_KEY="your_key"
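
How the scripts read these variables is not shown here. As a rough sketch, a Python entry point could load the .env file and then read the keys from the environment; `load_env_file` is a hypothetical helper written with the standard library only, not a function from this repository.

```python
import os


def load_env_file(path=".env"):
    """Copy KEY=VALUE lines into os.environ without overriding existing values."""
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            # Skip blanks, comments, and anything that is not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


load_env_file()
google_key = os.environ.get("GOOGLE_PLACES_API_KEY")  # None when unset
opencorp_key = os.environ.get("OPENCORP_API_KEY")     # None when unset
```

Using `setdefault` means a value exported in the terminal takes precedence over the one in the file.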

🚀 Running the Pipeline

Prerequisites

Place both parquet files under data/:

data/project_c_samples.parquet — labeled training data (Overture schema)
data/sample-open-prediction.parquet — businesses to predict

One-command pipeline (recommended)

Trains on labeled data, predicts open/closed for sample-open-prediction.parquet:

export GOOGLE_PLACES_API_KEY="your_key"   # optional but recommended

python run_pipeline.py --google-key $GOOGLE_PLACES_API_KEY

Flags: --skip-google, --skip-foursquare, --fsq-skip-download, --train-only, --predict-only.

Step-by-step

# 1. Enrich training rows (Google + OpenCorporates + Foursquare)
python api/external_enrichment.py \
    --data data/project_c_samples.parquet \
    --enrichment-out outputs/enrichment_train.parquet \
    --google-key $GOOGLE_PLACES_API_KEY

# 2. Enrich prediction rows
python api/external_enrichment.py \
    --data data/sample-open-prediction.parquet \
    --enrichment-out outputs/enrichment_predict.parquet \
    --google-key $GOOGLE_PLACES_API_KEY \
    --fsq-skip-download

# 3. Train on labeled data
python src/train_competition.py \
    --data data/project_c_samples.parquet \
    --enrichment outputs/enrichment_train.parquet \
    --n-folds 5

# 4. Predict
python src/predict.py \
    --data data/sample-open-prediction.parquet \
    --enrichment outputs/enrichment_predict.parquet

Outputs:

File	Description
`outputs/enrichment_train.parquet`	External features for training set
`outputs/enrichment_predict.parquet`	External features for prediction set
`models/competition_model.json`	Trained XGBoost model
`outputs/artifacts.json`	Threshold + encoders for inference
`outputs/predictions_sample_open.csv`	Final `record_id`, `open`, `probability_open`

Quick Test (training only, no enrichment)

python src/train_competition.py \
    --data data/project_c_samples.parquet \
    --n-folds 3

API Enrichment Options

Google Places API Only

python api/external_enrichment.py \
    --data data/raw/project_c_samples.parquet \
    --google-key YOUR_KEY \
    --skip-opencorp

OpenCorporates API Only

python api/external_enrichment.py \
    --data data/raw/project_c_samples.parquet \
    --skip-google

Both APIs

python api/external_enrichment.py \
    --data data/raw/project_c_samples.parquet \
    --google-key YOUR_GOOGLE_KEY \
    --opencorp-key YOUR_OC_KEY

Debug Mode (Show API Requests/Responses)

python api/external_enrichment.py \
    --data data/raw/project_c_samples.parquet \
    --google-key YOUR_KEY \
    --debug

🌐 API Integration

Google Places API

Provides: Real-time business status (OPERATIONAL, CLOSED_PERMANENTLY, CLOSED_TEMPORARILY)

Setup:

Go to Google Cloud Console
Create a new project
Enable the "Places API"
Create an API key (restrict to Places API)
Use: --google-key YOUR_KEY

Cost: ~$0.017 per request. At 2 requests per record (text search + place details), 10K records cost ~$340 before the $200 free-tier credit is applied.
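
The arithmetic behind that figure can be checked with a few lines of Python. The per-request price and the two requests per record come from the line above; the function name is hypothetical.

```python
def estimate_google_cost(n_records, requests_per_record=2, price_per_request=0.017):
    """Rough Google Places cost in USD, before any free-tier credit."""
    return round(n_records * requests_per_record * price_per_request, 2)


full_run = estimate_google_cost(10_000)  # 10K records, 2 requests each
# Assumption: a budget guard counts individual API calls, so 6,000 calls
# are priced at one request each.
capped_run = estimate_google_cost(6_000, requests_per_record=1)
print(full_run, capped_run)
```

Running the estimate before a large enrichment job makes it easier to pick a sensible call limit.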

Troubleshooting:

401 Unauthorized: API key invalid or API not enabled
429 Rate Limited: Automatic backoff in script; wait and retry
0 matches: Business not indexed by Google; common for small/new businesses
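
The script's own backoff logic is not reproduced here. As an illustration of the general pattern for 429 responses, the sketch below retries a caller-supplied function with exponentially growing delays; `RateLimited` and `call_with_backoff` are hypothetical names, and the retry count and delays are assumptions.

```python
import time


class RateLimited(Exception):
    """Raised by the wrapped function when the API answers HTTP 429."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimited with delays of 1s, 2s, 4s, ..."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimited:
            if attempt == max_retries:
                raise  # give up after the final retry
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists so the delay can be replaced in tests; in normal use the default `time.sleep` applies.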

OpenCorporates API

Provides: Legal entity status (active/dissolved), incorporation date, company age

Setup:

Go to OpenCorporates.com
Create an account and get an API key (optional for free tier)
Use: --opencorp-key YOUR_KEY (for higher rate limits)

Cost: Free tier available, rate limited to ~4 requests/second

Troubleshooting:

No API key needed for free tier, but rate limited
401 Unauthorized: Invalid API key
0 matches: Company not in OpenCorporates database (common for non-US entities)

Foursquare Dataset

Download the full Foursquare Open Source Places (FSQ OS Places) dataset:

python api/foursquare_data.py

A Hugging Face token may be required: set the HF_TOKEN environment variable when the dataset is private.

🎯 Features & Model

Feature Engineering

Temporal features:

msft_age_days: Days since Microsoft last updated (staleness signal)
days_since_latest_update: Latest update across all sources
update_span_days: Time span between oldest and latest update
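
A minimal sketch of how these three values could be derived from per-source update dates. The input layout (a dict keyed by source name, with a `"msft"` entry) and the function name are assumptions made for illustration; the actual logic lives in utils/features.py.

```python
from datetime import date


def temporal_features(updates, today):
    """updates maps a source name to its last-update date; today is the reference date."""
    latest = max(updates.values())
    oldest = min(updates.values())
    return {
        "msft_age_days": (today - updates["msft"]).days,
        "days_since_latest_update": (today - latest).days,
        "update_span_days": (latest - oldest).days,
    }


# Hypothetical record with three sources.
feats = temporal_features(
    {"msft": date(2024, 1, 15), "source_a": date(2025, 3, 1), "source_b": date(2023, 6, 30)},
    today=date(2025, 6, 1),
)
print(feats)
```

Passing `today` explicitly keeps the features reproducible between training and inference runs.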

Presence features:

has_phone, has_website, has_social: Binary presence
presence_score: Sum of all presence signals (0–4)
presence_ratio: Presence completeness (0–1)
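
As a sketch, the score and ratio can be computed from the binary flags as follows. The fourth flag, `has_brand`, is an assumption based on the branding signal mentioned under Key Features; only three flags are named explicitly above.

```python
# Assumed set of four presence flags (has_brand is a hypothetical name).
PRESENCE_FLAGS = ("has_phone", "has_website", "has_social", "has_brand")


def presence_features(record):
    """Count the truthy presence flags and normalize by the number of flags."""
    score = sum(1 for flag in PRESENCE_FLAGS if record.get(flag))
    return {
        "presence_score": score,                        # 0-4
        "presence_ratio": score / len(PRESENCE_FLAGS),  # 0-1
    }


print(presence_features({"has_phone": True, "has_website": True, "has_social": False}))
```

Missing flags are treated as absent, so a record with no metadata scores 0.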

Contact completeness:

addr_completeness: Address field count (0–4)
missing_count: Total missing metadata (0–5)

Naming features:

name_len: Character length
name_words: Word count
name_has_digit: Contains numbers
name_has_hash: Contains "#" (store numbers)
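
These four features are simple string checks. A minimal sketch, assuming the business name arrives as a plain string (or None when missing) and that the binary flags are encoded as 0/1:

```python
def name_features(name):
    """Derive length, word count, and digit/hash flags from a business name."""
    name = name or ""
    return {
        "name_len": len(name),
        "name_words": len(name.split()),
        "name_has_digit": int(any(ch.isdigit() for ch in name)),
        "name_has_hash": int("#" in name),
    }


print(name_features("Subway #1234"))
```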

Geographic:

lat, lon: Coordinates (rounded)
geo_density: Business density by lat/lon grid
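
One simple way to approximate grid density is to round coordinates to a fixed number of decimals and count how many businesses share each cell. The cell size (two decimals, roughly 1 km) is an assumption, not a documented setting of this project.

```python
from collections import Counter


def geo_density(coords, decimals=2):
    """Return, for each (lat, lon) pair, the number of points in its grid cell."""
    cells = [(round(lat, decimals), round(lon, decimals)) for lat, lon in coords]
    counts = Counter(cells)
    return [counts[cell] for cell in cells]


coords = [(40.7128, -74.0060), (40.7131, -74.0059), (34.0522, -118.2437)]
print(geo_density(coords))
```

The first two points fall into the same cell, so each receives a density of 2, while the third point is alone in its cell.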

Category:

primary_cat_enc: Encoded primary category
alt_cat_count: Alternate category count

External signals (from enrichment APIs):

google_is_permanently_closed: Google status = CLOSED_PERMANENTLY ⭐⭐
google_is_operational: Google status = OPERATIONAL
oc_is_dissolved: Legal status from OpenCorporates = dissolved ⭐
google_signal_strength: Match confidence × name score
google_engagement_score: Log(rating_count) × rating
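
A sketch of the engagement score formula. It uses `log1p` rather than a plain logarithm so that a business with zero reviews is handled without error; that substitution, the function name, and the zero fallback for missing values are all assumptions.

```python
import math


def google_engagement_score(rating, rating_count):
    """log(1 + rating_count) * rating; returns 0.0 when either value is missing."""
    if rating is None or not rating_count:
        return 0.0
    return math.log1p(rating_count) * rating


print(round(google_engagement_score(4.0, 99), 2))
```

The logarithm compresses review counts, so a place with thousands of reviews does not dominate one with a few hundred.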

Model Parameters

n_estimators=600
max_depth=5
learning_rate=0.04
subsample=0.8
colsample_bytree=0.75
scale_pos_weight=<neg/pos ratio>  # Handles class imbalance
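
The `scale_pos_weight` value is the ratio of negative to positive training examples. A minimal sketch of that calculation, assuming "open" is encoded as the positive class (1); the counts reuse the illustrative class balance from the metrics JSON example, and the function name is hypothetical.

```python
def scale_pos_weight(labels, positive=1):
    """Ratio of negative to positive examples, as used for XGBoost class weighting."""
    pos = sum(1 for y in labels if y == positive)
    neg = len(labels) - pos
    if pos == 0:
        raise ValueError("no positive examples in labels")
    return neg / pos


labels = [1] * 3200 + [0] * 1800  # 3200 open, 1800 closed
params = {
    "n_estimators": 600,
    "max_depth": 5,
    "learning_rate": 0.04,
    "subsample": 0.8,
    "colsample_bytree": 0.75,
    "scale_pos_weight": scale_pos_weight(labels),
}
print(params["scale_pos_weight"])
```

If "closed" is the positive class instead, the ratio is inverted, so it is worth confirming the label encoding before reusing this value.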

Optimization

Metric: Matthews Correlation Coefficient (MCC)
Threshold sweep: 0.1–0.7 in 0.05 increments
Selection: Best threshold from 5-fold OOF MCC
Calibration: Isotonic calibration on full train set
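
The sweep can be illustrated with a pure-Python sketch: compute MCC at each threshold on the grid and keep the best one. The function names and toy data are hypothetical, and the input is assumed to be out-of-fold probabilities for the positive class; the training script may implement this differently.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def sweep_thresholds(y_true, oof_proba, lo=0.10, hi=0.70, step=0.05):
    """Return (best_threshold, {threshold: mcc}) over the inclusive grid lo..hi."""
    n_steps = int(round((hi - lo) / step))
    curve = {}
    for i in range(n_steps + 1):
        thr = round(lo + i * step, 2)  # rounding avoids float drift on the grid
        preds = [1 if p >= thr else 0 for p in oof_proba]
        curve[thr] = mcc(y_true, preds)
    best = max(curve, key=curve.get)  # ties resolve to the lowest threshold
    return best, curve


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
oof_proba = [0.90, 0.80, 0.65, 0.30, 0.45, 0.20, 0.15, 0.05]
best, curve = sweep_thresholds(y_true, oof_proba)
print(best, round(curve[best], 4))
```

Selecting the threshold on out-of-fold predictions rather than on training-set predictions keeps the choice from being tuned to data the model has already seen.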

📤 Output Files

After Enrichment (`api/external_enrichment.py`)

File	Contents
`outputs/opencorp_features.parquet`	OpenCorporates results (OC status, age, etc.)
`outputs/google_features.parquet`	Google Places results (business status, rating, etc.)
`outputs/enrichment_combined.parquet`	Merged enrichment features (join on `record_id`)

After Training (`src/train_competition.py`)

File	Contents
`outputs/submission.csv`	Predictions: `record_id`, `open` ("open"/"closed")
`models/competition_model.json`	Trained XGBoost model (saveable/loadable)
`outputs/metrics_competition.json`	Training metrics, features, thresholds, config
`outputs/mcc_threshold_curve.png`	MCC vs decision threshold (with ±1 std envelope)
`outputs/shap_importance.png`	Feature importance (mean |SHAP|)
`outputs/shap_beeswarm.png`	SHAP beeswarm summary plot (top 20 features)

Metrics JSON Structure

{
  "n_train": 5000,
  "n_val": 2000,
  "n_features": 43,
  "features": ["confidence", "source_count", ...],
  "class_balance_train": {"open": 3200, "closed": 1800},
  "best_threshold": 0.35,
  "oof_mcc": 0.4567,
  "mcc_by_threshold": {
    "0.10": 0.3421,
    "0.15": 0.4123,
    ...
    "0.35": 0.4567
  },
  "top_10_features_by_shap": [
    "google_is_permanently_closed",
    "msft_age_days",
    ...
  ],
  "model_params": {...},
  "has_external_signal": false
}
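
A short sketch of how the `mcc_by_threshold` block could be inspected after training. The inline sample reuses only the values shown above; in practice the dict would come from `json.load` on outputs/metrics_competition.json. The helper name is hypothetical.

```python
import json


def best_threshold_from_metrics(metrics):
    """Return (threshold, mcc) for the highest-MCC entry in mcc_by_threshold."""
    curve = {float(k): v for k, v in metrics["mcc_by_threshold"].items()}
    best = max(curve, key=curve.get)
    return best, curve[best]


# Inline sample standing in for the real metrics file.
sample = json.loads(
    '{"mcc_by_threshold": {"0.10": 0.3421, "0.15": 0.4123, "0.35": 0.4567}}'
)
print(best_threshold_from_metrics(sample))
```

The result should agree with the `best_threshold` and `oof_mcc` fields stored in the same file.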

🔍 Diagnostics & Testing

Data Structure Analysis

python tests/diagnose_extraction.py

Displays:

Column names, dtypes, shapes
Names extraction quality
Region extraction & state mapping
API gate pass rate

API Configuration Check

python tests/validate_fixes.py --opencorp-key YOUR_KEY

Tests:

Data preprocessing
Name and region extraction
OpenCorporates API connectivity
Sample API call

Google Places API Test

python tests/test_google_places.py

Tests first 10 Google Places API calls (text search + place details).

⚙️ Configuration

Advanced Training Options

python src/train_competition.py \
    --data data/raw/my_dataset.parquet \
    --n-folds 10 \
    --seed 123 \
    --out outputs/

Advanced Enrichment Options

python api/external_enrichment.py \
    --data data/raw/my_data.parquet \
    --google-key YOUR_KEY \
    --google-limit 5000 \
    --conf-max 0.8 \
    --msft-age-min 180 \
    --debug

Option	Default	Description
`--google-limit`	6000	Max Google API calls (budget guard)
`--conf-max`	0.75	Confidence ceiling for "uncertain" records
`--msft-age-min`	365	Min MSFT staleness (days) for uncertainty
`--debug`	False	Print API requests/responses (first 10)

🐛 Troubleshooting

"FileNotFoundError: No data file found"

Solution:

# Ensure data files exist
ls -la data/raw/

# Or specify full path
python src/train_competition.py --data /full/path/to/project_c_samples.parquet

"ModuleNotFoundError: No module named 'utils'"

Solution:

# Ensure you're running from project root
cd /path/to/project-finale
python src/train_competition.py --data data/raw/project_c_samples.parquet

# Or add project to PYTHONPATH
export PYTHONPATH="${PYTHONPATH}:/path/to/project-finale"

"RequestException: HTTPError: 401 Client Error"

Google Places:

Verify API key is correct and not expired
Check that Places API is enabled in Google Cloud Console
Ensure API key isn't restricted to a different service

OpenCorporates:

Free tier doesn't require a key, but has rate limits
With key, verify it's correct and not expired

"0 matches found" (OpenCorporates)

Causes:

Company not registered in OpenCorporates
Company name fuzzy match score < 60
Region/state mapping failed
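
To see why a name pair might fall below the cutoff of 60, a 0-100 similarity score can be approximated with the standard library. This sketch uses `difflib.SequenceMatcher` as a stand-in for the rapidfuzz scorer listed in the requirements, so the exact scores will differ from the pipeline's; the function names are hypothetical.

```python
from difflib import SequenceMatcher


def _normalize(name):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def name_match_score(a, b):
    """Similarity of two names on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()


def is_match(a, b, threshold=60):
    """True when the similarity reaches the match threshold."""
    return name_match_score(a, b) >= threshold


print(is_match("Joe's Pizza", "joe's  pizza"))
print(is_match("Joe's Pizza", "Acme Hardware LLC"))
```

Legal suffixes such as "LLC" or "Inc." often lower the score for otherwise identical names, so stripping them before comparison may recover some matches.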

Debug:

python tests/validate_fixes.py
python tests/diagnose_extraction.py
python api/external_enrichment.py --debug

Low Model Performance (MCC < 0.3)

Possible causes:

Insufficient external enrichment (run enrichment first)
Class imbalance too severe (check class_balance in metrics JSON)
Features not informative (check SHAP importance)

Solutions:

Ensure the enrichment file exists: outputs/enrichment_train.parquet (or the legacy outputs/enrichment_combined.parquet)
Try different thresholds: Check mcc_by_threshold in metrics JSON
Add external signals: Re-run enrichment with Google Places API

📈 Expected Performance

On sample data (Project C dataset):

Metric	Value
OOF MCC	0.42–0.47
Train samples	~5,000
Validation samples	~2,000
Top feature	`google_is_permanently_closed`
Best threshold	0.30–0.40

Performance varies with:

Data quality & freshness
External enrichment success rate
Class imbalance
Temporal signals

📚 Code Structure

utils/paths.py

Centralized path management. All file paths resolved relative to project root.

from utils.paths import get_sample_data_path, get_enrichment_output_path
data_path = get_sample_data_path()
enrichment_path = get_enrichment_output_path()

utils/api_utils.py

Shared API functions for OpenCorporates and Google Places.

from utils.api_utils import query_opencorporates, google_text_search
result = query_opencorporates(name, jurisdiction, api_key)
place_id, name, status = google_text_search(name, address, api_key)

api/external_enrichment.py

Main enrichment pipeline. Handles schema detection, API calls, and feature merging.

python api/external_enrichment.py --google-key KEY

src/train_competition.py

Main training pipeline. Feature engineering, cross-validation, threshold optimization.

python src/train_competition.py --data data/raw/project_c_samples.parquet

🤝 Contributing

For bug reports or feature requests, please open an issue or submit a pull request.

📄 License

See LICENSE file for details.

📞 Support

For questions or issues:

Check Troubleshooting section
Review diagnostic output: python tests/diagnose_extraction.py
Check logs in outputs/ directory

Last Updated: June 2026
Version: 1.0.0
Status: Production-ready
