Skip to content

KodeCharya/XGB-scratch

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

4 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

XGBoost from Scratch

A complete implementation of XGBoost (Extreme Gradient Boosting) built from the ground up in Python, featuring gradient boosted decision trees for tabular data with all the essential XGBoost features.

๐Ÿš€ Features

Core Algorithm

  • CART Trees with second-order Taylor approximation
  • Histogram-based split finding with weighted quantile bins
  • Exact greedy split finding (when max_bins=None)
  • Depth-wise tree growth with post-pruning
  • Missing value handling with learned default directions

Regularization & Control

  • L1/L2 regularization (reg_alpha, reg_lambda)
  • Learning rate shrinkage (learning_rate)
  • Tree pruning with minimum split gain (gamma)
  • Sample subsampling (subsample)
  • Feature subsampling (colsample_bytree, colsample_bylevel, colsample_bynode)
  • Minimum child weight constraint (min_child_weight)

Objectives & Tasks

  • Regression: reg:squarederror
  • Binary Classification: binary:logistic
  • Multi-class Classification: multi:softprob

Training Features

  • Early stopping with validation monitoring
  • Custom evaluation metrics (RMSE, LogLoss, AUC, etc.)
  • Class weights and sample weights support
  • Reproducible training with random seeds
  • Parallel processing support

Model Management

  • JSON model serialization for persistence
  • Feature importance (gain, cover, frequency)
  • Training history and evaluation tracking
  • scikit-learn compatible API

๐Ÿ“ฆ Installation

# Clone the repository
git clone https://github.com/KodeCharya/xgb-scratch.git
cd xgb-scratch

# Install in development mode
pip install -e .

# Or install with optional dependencies
pip install -e ".[numba,test,dev,docs]"

๐Ÿƒ Quick Start

Binary Classification

from xgb_scratch import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate sample data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model
model = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    random_state=42
)

model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=True)

# Make predictions
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)

print(f"Accuracy: {model.score(X_test, y_test):.4f}")

Regression

from xgb_scratch import XGBRegressor
from sklearn.datasets import make_regression

# Generate sample data
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model
model = XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=6,
    reg_lambda=1.0,
    reg_alpha=0.1,
    random_state=42
)

model.fit(X_train, y_train, eval_set=[(X_test, y_test)])

# Make predictions
predictions = model.predict(X_test)

Multi-class Classification

from xgb_scratch import XGBClassifier
from sklearn.datasets import make_classification

# Generate multi-class data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=3, 
                         n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model
model = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=6,
    objective="multi:softprob",
    random_state=42
)

model.fit(X_train, y_train, eval_set=[(X_test, y_test)])
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)  # Shape: (n_samples, n_classes)

๐Ÿ–ฅ๏ธ Command Line Interface

The package includes a powerful CLI for training and inference:

Training a Model

# Binary classification
xgbs fit --train data/train.csv --target target --task binary \
         --valid data/valid.csv --model-out model.json \
         --n-estimators 300 --learning-rate 0.05 --max-depth 8

# Regression
xgbs fit --train data/train.csv --target price --task regression \
         --model-out model.json --n-estimators 200 --max-bins 256

# Multi-class classification
xgbs fit --train data/train.csv --target species --task multiclass \
         --valid data/valid.csv --model-out model.json \
         --early-stopping-rounds 20

Making Predictions

# Predictions
xgbs predict --model model.json --data data/test.csv --output predictions.csv

# Probabilities (for classification)
xgbs predict --model model.json --data data/test.csv \
             --output predictions.csv --proba-out probabilities.npy

Visualization

# Feature importance plot
xgbs plot-importance --model model.json --type gain --out importance.png

# Decision tree visualization
xgbs plot-tree --model model.json --tree-index 0 --out tree.png --max-depth 3

# Model information
xgbs info --model model.json

๐Ÿ”ง Advanced Usage

Early Stopping

model = XGBClassifier(
    n_estimators=1000,
    early_stopping_rounds=50,
    eval_metric="logloss"
)

model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
print(f"Best iteration: {model.best_iteration_}")
print(f"Best score: {model.best_score_}")

Custom Evaluation Metrics

model = XGBClassifier(eval_metric="auc")  # or "logloss", "error"
model = XGBRegressor(eval_metric="rmse")  # or "mae"

Feature Importance

# After training
importance_gain = model.feature_importances_  # Normalized gain importance
booster = model.get_booster()

# Different importance types
importance_dict = booster.get_feature_importance("gain")     # Split gain
cover_dict = booster.get_feature_importance("cover")         # Sample coverage
freq_dict = booster.get_feature_importance("frequency")      # Split frequency

Model Persistence

# Save model
model.save_model("my_model.json")

# Load model
new_model = XGBClassifier()
new_model.load_model("my_model.json")

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 6, 9],
    'reg_lambda': [0.1, 1.0, 10.0],
    'subsample': [0.8, 0.9, 1.0]
}

model = XGBClassifier(random_state=42)
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")

๐Ÿงฎ Algorithm Details

Mathematical Foundation

XGBoost optimizes the following objective function:

L = ฮฃ l(yi, ลทi) + ฮฃ ฮฉ(fk)

Where:

  • l(yi, ลทi) is the loss function
  • ฮฉ(fk) = ฮณT + ยฝฮป||w||ยฒ + ฮฑ||w||โ‚ is the regularization term

Split Finding

For each potential split, the gain is computed as:

Gain = ยฝ [GLยฒ/(HL+ฮป) + GRยฒ/(HR+ฮป) - (GL+GR)ยฒ/(HL+HR+ฮป)] - ฮณ

Where:

  • GL, GR are gradient sums for left and right children
  • HL, HR are hessian sums for left and right children
  • ฮป is L2 regularization, ฮณ is minimum split gain

Leaf Weight Calculation

Optimal leaf weights are computed as:

w* = -Gฬƒ/(H+ฮป)

Where Gฬƒ is the L1-regularized gradient sum:

Gฬƒ = sign(G) * max(|G| - ฮฑ, 0)

๐ŸŽฏ Performance & Benchmarks

Our implementation achieves competitive performance compared to other gradient boosting libraries:

Speed Comparison (relative to sklearn's HistGradientBoosting)

  • Small datasets (< 10K samples): ~2-3x slower
  • Medium datasets (10K-100K samples): ~1.5-2x slower
  • Large datasets (> 100K samples): ~2-4x slower

Accuracy

  • Matches sklearn's HistGradientBoosting within 1-2% on standard benchmarks
  • Supports all major XGBoost features for optimal model performance

Memory Usage

  • Efficient histogram-based split finding
  • Optional numba acceleration for performance-critical paths
  • Configurable memory usage via max_bins parameter

๐Ÿ”ฌ Architecture

xgb-scratch/
โ”œโ”€โ”€ src/xgb_scratch/
โ”‚   โ”œโ”€โ”€ core/                    # Core algorithm implementation
โ”‚   โ”‚   โ”œโ”€โ”€ objectives.py        # Loss functions & gradients
โ”‚   โ”‚   โ”œโ”€โ”€ tree.py             # Tree structures & operations
โ”‚   โ”‚   โ”œโ”€โ”€ hist.py             # Histogram-based split finding
โ”‚   โ”‚   โ”œโ”€โ”€ builder.py          # Tree building logic
โ”‚   โ”‚   โ”œโ”€โ”€ booster.py          # Training loop & ensemble
โ”‚   โ”‚   โ”œโ”€โ”€ metrics.py          # Evaluation metrics
โ”‚   โ”‚   โ””โ”€โ”€ utils.py            # Utility functions
โ”‚   โ”œโ”€โ”€ api/                    # Public API
โ”‚   โ”‚   โ”œโ”€โ”€ xgb_classifier.py   # Classifier interface
โ”‚   โ”‚   โ””โ”€โ”€ xgb_regressor.py    # Regressor interface
โ”‚   โ””โ”€โ”€ viz/                    # Visualization tools
โ”‚       โ”œโ”€โ”€ tree_plot.py        # Tree plotting
โ”‚       โ””โ”€โ”€ importance_plot.py  # Feature importance plots
โ”œโ”€โ”€ tests/                      # Comprehensive test suite
โ””โ”€โ”€ docs/                       # Documentation

๐Ÿงช Testing

Run the comprehensive test suite:

# Run all tests
pytest

# Run with coverage
pytest --cov=src/xgb_scratch --cov-report=html

# Run specific test categories
pytest tests/test_objectives.py  # Test objective functions
pytest tests/test_tree.py        # Test tree operations
pytest tests/test_api_smoke.py   # Test public API

๐Ÿ“Š Visualization

Feature Importance

from xgb_scratch.viz import plot_importance

# Plot feature importance
fig = plot_importance(model.get_booster(), importance_type="gain", max_num_features=15)
fig.savefig("importance.png")

Decision Trees

from xgb_scratch.viz import plot_tree

# Plot first tree
booster = model.get_booster()
fig = plot_tree(booster.trees[0], feature_names=feature_names, max_depth=3)
fig.savefig("tree.png")

Training History

from xgb_scratch.viz import plot_training_history

# Plot training curves
fig = plot_training_history(model.evals_result_)
fig.savefig("training_history.png")

Development Setup

# Clone and install in development mode
git clone https://github.com/KodeCharya/xgb-scratch.git
cd xgb-scratch
pip install -e ".[dev,test]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest

# Run linting
ruff check src/ tests/
black --check src/ tests/
mypy src/

๐Ÿ“š Documentation

  • API Reference: Complete API documentation with examples
  • Tutorials: Step-by-step guides for common use cases
  • Algorithm Details: In-depth explanation of the XGBoost algorithm
  • Performance Guide: Tips for optimal performance and memory usage

๐Ÿ”ฎ Roadmap

Planned Features

  • Categorical feature handling with target encoding
  • Monotonic constraints for feature relationships
  • GPU acceleration (CUDA backend)
  • Distributed training for large datasets
  • Model interpretation tools (SHAP integration)
  • AutoML integration for hyperparameter optimization

Performance Improvements

  • Optimized histogram building with SIMD instructions
  • Memory-mapped datasets for large data handling
  • Incremental learning support
  • Model compression techniques

๐Ÿ™ Acknowledgments

  • XGBoost Team for the original algorithm and implementation
  • scikit-learn for API design inspiration
  • LightGBM for histogram-based optimization techniques
  • CatBoost for categorical feature handling ideas

Built with โค๏ธ for the machine learning community

About

XGBoost (Extreme Gradient Boosting) built from the ground up in Python, featuring gradient boosted decision trees for tabular data with all the essential XGBoost features.

Topics

Resources

License

Stars

Watchers

Forks

Contributors

Languages