
CredTech - Explainable Credit Intelligence Platform

πŸš€ Overview

CredTech is a real-time explainable credit intelligence platform that continuously ingests multi-source financial data to generate dynamic creditworthiness scores. Unlike traditional credit rating agencies that update infrequently, our platform provides real-time, explainable credit assessments with transparent feature-level explanations.

πŸ—οΈ System Architecture

High-Level Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Data Sources  │───▢│  Feature Engine  │───▢│  ML Pipeline    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                       β”‚                       β”‚
         β–Ό                       β–Ό                       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β€’ Alpha Vantage β”‚    β”‚ β€’ Financial      β”‚    β”‚ β€’ CatBoost      β”‚
β”‚ β€’ News API      β”‚    β”‚   Metrics        β”‚    β”‚ β€’ Neural Nets   β”‚
β”‚ β€’ Finnhub       β”‚    β”‚ β€’ Sentiment      β”‚    β”‚ β€’ SHAP          β”‚
β”‚ β€’ FMP           β”‚    β”‚   Analysis       β”‚    β”‚ β€’ Ensemble      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚                        β”‚
                                β–Ό                        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Streamlit Dashboard                          β”‚
β”‚  β€’ Risk Gauges  β€’ SHAP Waterfall  β€’ News Sentiment Timeline     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Econometric Foundation: Beyond Traditional Scoring

Black-Cox Structural Credit Risk Model

What makes us different: We implement the Black-Cox first-passage structural model, a sophisticated econometric approach that models default as the first time a firm's asset value drops below a time-dependent barrier.

import numpy as np
from scipy.stats import norm

def black_cox_pod(V, B, mu, sigma, T=1.0):
    """
    Black-Cox Probability of Default (Structural Credit Risk).

    First-passage probability that asset value V hits the default barrier B
    within horizon T, given asset drift mu and volatility sigma (V, B > 0).
    """
    A0 = np.log(V / B)          # log-distance to the default barrier
    at = mu - (sigma**2) / 2    # drift of log-assets
    denom = sigma * np.sqrt(T)
    d1 = (-A0 + at * T) / denom
    d2 = (-A0 - at * T) / denom
    # First-passage probability of geometric Brownian motion: the plain
    # Gaussian term uses d2; the reflection term carries the exponential tilt.
    pod = norm.cdf(d2) + np.exp((-2 * at * A0) / (sigma**2)) * norm.cdf(d1)
    return pod

Why this matters: Unlike FICO scores that are primarily backward-looking statistical constructs, the Black-Cox model provides an economic foundation by relating default probability to fundamental asset dynamics and market volatility. This approach:

  • Captures continuous default risk rather than point-in-time assessments
  • Incorporates market volatility and asset dynamics
  • Provides theoretical grounding in option pricing theory
  • Enables scenario analysis and stress testing
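The scenario-analysis angle can be sketched directly with the first-passage formula: shock an input and watch the default probability respond. Parameter values below are purely illustrative.

```python
import numpy as np
from scipy.stats import norm

def first_passage_pod(V, B, mu, sigma, T=1.0):
    """First-passage default probability under geometric Brownian motion."""
    A0 = np.log(V / B)            # log-distance to the barrier
    at = mu - sigma**2 / 2        # drift of log-assets
    denom = sigma * np.sqrt(T)
    d1 = (-A0 + at * T) / denom
    d2 = (-A0 - at * T) / denom
    return norm.cdf(d2) + np.exp(-2 * at * A0 / sigma**2) * norm.cdf(d1)

# Stress scenario: double asset volatility and compare default probabilities
base = first_passage_pod(V=120.0, B=100.0, mu=0.05, sigma=0.20)
stressed = first_passage_pod(V=120.0, B=100.0, mu=0.05, sigma=0.40)
```

The volatility shock raises the one-year default probability substantially, which is exactly the kind of sensitivity a point-in-time score cannot express.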

Comprehensive Financial Metrics Engineering

Corporate Risk Assessment (30+ metrics):

  • Cash-flow Quality: FCF/NI ratio, CapEx/Depreciation ratio
  • Leverage & Solvency: Market leverage, Debt/EBITDA, Interest coverage
  • Liquidity: Cash runway, Quick ratio, Working capital cycle
  • Market Signals: Yield spreads, Beta, Short interest ratios

Sovereign Risk Assessment (15+ metrics):

  • Fiscal Health: Debt/GDP, Primary balance/GDP
  • External Stability: Current account/GDP, Import coverage
  • Debt Dynamics: External debt/exports, Debt service/revenue

Econometric Advantage: These metrics go beyond traditional ratios by incorporating:

  1. Dynamic relationships between cash flows and capital structure
  2. Market-based signals that reflect real-time sentiment
  3. Cross-sectional and time-series analysis capabilities
  4. Scenario-based stress testing frameworks
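Several of the corporate ratios above are direct functions of statement line items. A minimal sketch, with illustrative field names (the real pipeline's schema may differ):

```python
def corporate_metrics(f: dict) -> dict:
    """Compute a handful of cash-flow, leverage, and liquidity ratios.

    `f` is a flat dict of statement line items (field names are illustrative).
    """
    return {
        "fcf_to_ni": f["free_cash_flow"] / f["net_income"],
        "debt_to_ebitda": f["total_debt"] / f["ebitda"],
        "interest_coverage": f["ebit"] / f["interest_expense"],
        "quick_ratio": (f["current_assets"] - f["inventory"]) / f["current_liabilities"],
        "cash_runway_months": f["cash"] / max(f["monthly_burn"], 1e-9),
    }

sample = {
    "free_cash_flow": 80.0, "net_income": 100.0,
    "total_debt": 300.0, "ebitda": 150.0,
    "ebit": 120.0, "interest_expense": 20.0,
    "current_assets": 250.0, "inventory": 50.0, "current_liabilities": 100.0,
    "cash": 60.0, "monthly_burn": 10.0,
}
metrics = corporate_metrics(sample)
```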

πŸ€– Advanced Machine Learning Architecture

Multi-Modal Ensemble Approach

Technical Innovation: Our system combines four distinct ML approaches:

  1. Risk Score ANN: 32β†’16β†’1 neural network for base risk assessment
  2. CatBoost Classifier: Gradient boosting optimized for categorical features
  3. Main Neural Network: 128β†’64β†’1 with dropout and batch normalization
  4. Graph Neural Networks: Relationship modeling via GCN layers

Why this is superior:

  • Ensemble robustness: Multiple models reduce single-point-of-failure risk
  • Feature complementarity: Different models capture different aspects of risk
  • Adaptive learning: Neural networks adapt to changing market conditions
  • Graph relationships: Captures systemic risk through network effects
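The main network's shape can be sketched in PyTorch. This follows the 128β†’64β†’1 layout with dropout and batch normalization listed above; the activation, dropout rate, and output squashing are assumptions, not the tuned configuration.

```python
import torch
from torch import nn

class MainNet(nn.Module):
    """128 -> 64 -> 1 feed-forward scorer with dropout and batch normalization."""
    def __init__(self, input_dim: int, p_drop: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(x)).squeeze(-1)  # per-row default probability

model = MainNet(input_dim=32).eval()
with torch.no_grad():
    probs = model(torch.randn(4, 32))
```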

Graph Neural Networks for Systemic Risk

Innovation: Integration of PyTorch Geometric for relationship modeling:

import torch.nn.functional as F
from torch import nn
from torch_geometric.nn import GCNConv

class GNN(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int = 64, output_dim: int = 16):
        super().__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))  # neighborhood aggregation
        return self.conv2(x, edge_index)       # 16-dim relationship embedding

Business Impact: Traditional models ignore interconnectedness. Our GNN approach:

  • Models counterparty relationships and supply chain dependencies
  • Captures contagion effects during market stress
  • Identifies systemic risk clusters in portfolios
  • Enables portfolio-level optimization

BERT-Based Sentiment Integration

Technical Implementation: Multi-modal learning combining numerical and textual data:

  • 768-dimensional BERT embeddings for news sentiment
  • Real-time processing of market communications
  • Attention mechanisms for relevant information extraction

Advantage over competitors: Most credit models ignore unstructured data. Our approach:

  • Incorporates forward-looking sentiment vs. backward-looking financials
  • Processes real-time news flow for immediate risk updates
  • Handles multiple languages and financial jargon
  • Provides interpretable sentiment contributions
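Turning per-token BERT outputs into a single document-level feature is typically a pooling step. This numpy sketch assumes the 768-dimensional token embeddings have already been produced by the encoder; masked mean pooling is one common choice, not necessarily the one used here.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors into one 768-dim document vector, ignoring padding.

    token_embeddings: (seq_len, 768) hidden states from a BERT-style encoder
    attention_mask:   (seq_len,) 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(float)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

# Illustrative: 6 token vectors, last two positions are padding
tokens = np.random.randn(6, 768)
mask = np.array([1, 1, 1, 1, 0, 0])
doc_vec = mean_pool(tokens, mask)
```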

πŸ” Explainable AI Without Black Boxes

SHAP-Based Model Interpretability

Technical Implementation:

def explain_prediction(self, X: pd.DataFrame) -> Dict:
    if self.shap_explainer is None:
        return {"error": "SHAP explainer not available"}

    shap_values = self.shap_explainer.shap_values(X)
    if isinstance(shap_values, list):
        shap_values = shap_values[1]  # contributions toward the positive (default) class

    shap_row = shap_values[0]  # explain the first (single) input row
    contributions = {f: float(v) for f, v in zip(self.feature_names, shap_row)}
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    return {
        "feature_contributions": contributions,
        "top_risk_factors": [f for f, v in ranked if v > 0][:5],
        "top_protective_factors": [f for f, v in ranked[::-1] if v < 0][:5],
    }

Regulatory Advantage: Unlike LLM-generated explanations, which can hallucinate, SHAP explanations are:

  • Mathematically consistent, grounded in cooperative game theory (Shapley values)
  • Additive: feature contributions sum to the final prediction
  • Regulation-friendly, supporting GDPR's "right to explanation"
  • Stakeholder-friendly, visualized as waterfall charts

Fairness-Aware Machine Learning

Implementation: Built-in bias detection and mitigation:

from fairlearn.metrics import MetricFrame, selection_rate

def evaluate_fairness(self, y_true: np.ndarray, y_pred: np.ndarray,
                      sensitive_features: np.ndarray):
    # Selection rate (share of positive predictions) per demographic group
    metric_frame = MetricFrame(
        metrics=selection_rate,
        y_true=y_true,
        y_pred=y_pred,
        sensitive_features=sensitive_features
    )
    return metric_frame.by_group

Competitive Advantage: Proactive fairness assessment vs. reactive compliance:

  • Algorithmic auditing across protected characteristics
  • Disparate impact analysis built into model pipeline
  • Fairness-accuracy tradeoff optimization
  • Continuous monitoring for model drift
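The disparate-impact analysis above reduces to comparing selection rates across groups; the four-fifths rule is a common yardstick. A minimal sketch (threshold and data are illustrative):

```python
import numpy as np

def disparate_impact_ratio(y_pred: np.ndarray, groups: np.ndarray) -> float:
    """Min/max ratio of per-group selection rates; < 0.8 flags potential bias."""
    rates = [y_pred[groups == g].mean() for g in np.unique(groups)]
    return min(rates) / max(rates)

y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
ratio = disparate_impact_ratio(y_pred, groups)  # group A: 0.75, group B: 0.25
```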

Component Details

Data Ingestion Layer:

  • Structured Data: Alpha Vantage (financial overview), Finnhub (market data), FMP (financial statements)
  • Unstructured Data: News API (sentiment analysis), real-time news processing
  • Rate Limiting: Built-in fallback mechanisms and caching to handle API limits

Feature Engineering:

  • Financial ratios calculation (FCF/NI, Debt/EBITDA, Quick Ratio, etc.)
  • Sentiment analysis using VADER and TextBlob
  • Corporate metrics computation with Black-Cox probability of default
  • Time-series feature extraction

ML Pipeline:

  • Ensemble Model: CatBoost + Neural Networks + Risk Score ANN
  • Explainability: SHAP TreeExplainer for feature importance
  • Architecture: Multi-modal learning with graph embeddings and text embeddings (BERT)
  • Training: Incremental learning capability with model persistence

Presentation Layer:

  • Interactive Streamlit dashboard
  • Real-time score updates with explanations
  • Visualization suite (Plotly-based gauges, waterfalls, timelines)

πŸ› οΈ Tech Stack

Backend & ML

  • Python 3.11: Core runtime environment
  • PyTorch: Deep learning framework for neural networks
  • CatBoost: Gradient boosting for structured data
  • Transformers: BERT embeddings for text analysis
  • PyTorch Geometric: Graph neural networks
  • SHAP: Model explainability
  • scikit-learn: Feature preprocessing and metrics

Data Processing

  • Pandas: Data manipulation and analysis
  • NumPy: Numerical computations
  • Requests: HTTP API calls
  • TextBlob & VADER: Sentiment analysis
  • Alpha Vantage, News API, Finnhub: External data sources

Frontend & Visualization

  • Streamlit: Web application framework
  • Plotly: Interactive visualizations
  • Custom CSS: Enhanced UI/UX

DevOps & Deployment

  • Docker: Containerization
  • Docker Compose: Multi-service orchestration
  • Joblib: Model serialization
  • Python-dotenv: Environment management

Rationale for Tech Stack Selection

  1. Streamlit over Flask/FastAPI: Rapid prototyping for ML dashboards with built-in interactivity
  2. CatBoost + PyTorch Ensemble: CatBoost excels at tabular data while PyTorch handles multi-modal inputs
  3. SHAP for Explainability: Industry standard for model interpretability without LLM dependency
  4. Docker: Ensures reproducible deployments across environments
  5. Multiple API Sources: Diversified data pipeline reduces single-point-of-failure risk

🐳 Installation & Setup

Docker Installation (Recommended)

Prerequisites

  • Docker Engine 20.0+
  • Docker Compose 1.29+
  • Git

Quick Start

  1. Clone the repository

git clone https://github.com/yourusername/credtech.git
cd credtech

  2. Set up environment variables

# Create .env file with your API keys
cat > .env << EOF
ALPHA_VANTAGE_API_KEY=your_alpha_vantage_key
NEWS_API_KEY=your_news_api_key
FINNHUB_API_KEY=your_finnhub_key
FMP_KEY=your_fmp_key
TWELVEDATA_API_KEY=your_twelvedata_key
EOF

  3. Build and run with Docker Compose

docker-compose up --build

  4. Access the application

  • Open a browser to http://localhost:8501
  • The application will automatically train models on first run

Docker Configuration Details

Dockerfile Structure:

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py"]

Health Checks:

  • Built-in health monitoring via Streamlit's /_stcore/health
  • Automatic container restart on failure
  • 30s interval health checks with 3 retries

Volume Mounts:

  • ./models:/app/models - Persistent model storage
  • ./data:/app/data - Data cache directory

Local Installation (Alternative)

Prerequisites

  • Python 3.11 (3.9-3.11 supported, avoid 3.13 due to dependency conflicts)
  • pip or conda

Setup Steps

  1. Create a virtual environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

  2. Install dependencies

pip install -r requirements.txt

  3. Configure API keys

cp .env.example .env
# Edit the .env file with your API keys

  4. Run the application

streamlit run app.py

πŸ”‘ API Configuration

Required API Keys

| Service       | Purpose              | Free Tier Limits     | Signup URL                |
|---------------|----------------------|----------------------|---------------------------|
| Alpha Vantage | Company financials   | 5 calls/min, 500/day | alphavantage.co           |
| News API      | News sentiment       | 1000 requests/day    | newsapi.org               |
| Finnhub       | Market data          | 60 calls/min         | finnhub.io                |
| FMP           | Financial statements | 250 calls/day        | financialmodelingprep.com |

Configuration Options

For Streamlit Cloud:

  • Add keys to Streamlit secrets management
  • Access via st.secrets["KEY_NAME"]

For Local Development:

  • Use .env file with python-dotenv
  • Environment variables loaded automatically

For Docker:

  • Pass via environment variables in docker-compose.yml
  • Supports .env file in project root

πŸ€– Model Architecture

Advanced Credit Scoring Pipeline

Multi-Model Ensemble:

  1. Risk Score ANN: 32β†’16β†’1 neural network for base risk assessment
  2. CatBoost Classifier: Gradient boosting on engineered features
  3. Main Neural Network: 128β†’64β†’1 with dropout and batch normalization
  4. Ensemble Averaging: Weighted combination of model predictions
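The weighted combination in step 4 is a convex blend of the three model outputs. A sketch, with illustrative weights rather than the tuned values:

```python
import numpy as np

def ensemble_score(p_ann: np.ndarray, p_catboost: np.ndarray, p_nn: np.ndarray,
                   weights=(0.2, 0.4, 0.4)) -> np.ndarray:
    """Convex combination of per-model default probabilities."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the blend remains a probability
    return w[0] * p_ann + w[1] * p_catboost + w[2] * p_nn

score = ensemble_score(np.array([0.30]), np.array([0.50]), np.array([0.40]))
```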

Feature Engineering:

  • Financial Metrics: FCF/NI ratio, Debt/EBITDA, Quick Ratio, Market Leverage
  • Structural Model: Black-Cox probability of default calculation
  • Sentiment Features: News sentiment aggregation and volatility
  • Graph Embeddings: Company relationship networks (16-dim)
  • Text Embeddings: BERT-based document representations (768-dim)

Explainability Layer:

  • SHAP Values: Feature contribution analysis
  • Waterfall Charts: Visual impact breakdown
  • Risk Factor Identification: Top positive/negative contributors
  • Plain Language Summaries: Non-technical explanations

Model Performance Metrics

  • Ensemble AUC: >0.85 on validation set
  • Training Time: ~2-3 minutes on CPU
  • Inference Time: <100ms per prediction
  • Model Size: ~50MB serialized

πŸ“Š Key Features

Real-Time Credit Scoring

  • Dynamic Updates: Scores react to market events within minutes
  • Multi-Factor Analysis: 30+ engineered features from diverse data sources
  • Risk Categorization: Low/Medium/High risk classification with confidence scores

Explainable AI

  • SHAP Integration: Feature-level impact analysis without black-box explanations
  • Visual Explanations: Interactive charts showing "why this score"
  • Trend Analysis: Historical risk evolution tracking
  • Event Attribution: Links score changes to specific news/market events

Interactive Dashboard

  • Risk Gauges: Real-time creditworthiness visualization
  • News Sentiment Timeline: Market event impact tracking
  • Feature Importance: Dynamic ranking of risk factors
  • Company Comparison: Side-by-side risk analysis

Data Integration

  • Multi-Source Fusion: Combines financial statements, market data, and news
  • Rate Limit Handling: Intelligent caching and fallback mechanisms
  • Data Quality: Automated cleaning and normalization pipelines
  • Scalability: Designed for dozens of entities across sectors
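The caching-and-fallback pattern for rate-limited APIs can be sketched as a small TTL cache: serve fresh data when available, and fall back to stale data rather than failing when a refresh is rejected. This is a sketch; the real pipeline's cache policy may differ.

```python
import time

class TTLCache:
    """Cache API responses for `ttl` seconds; serve stale data if a refresh fails."""
    def __init__(self, ttl: float = 300.0):
        self.ttl = ttl
        self._store = {}  # key -> (timestamp, value)

    def get(self, key, fetch):
        now = time.time()
        if key in self._store and now - self._store[key][0] < self.ttl:
            return self._store[key][1]          # fresh hit: no API call
        try:
            value = fetch()                     # e.g. a rate-limited HTTP request
            self._store[key] = (now, value)
            return value
        except Exception:
            if key in self._store:              # fallback: stale beats nothing
                return self._store[key][1]
            raise

cache = TTLCache(ttl=60.0)
price = cache.get("AAPL", lambda: 189.5)  # illustrative fetch
```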

πŸ”§ System Requirements

Minimum Requirements

  • CPU: 2 cores, 2.0 GHz
  • RAM: 4GB (8GB recommended)
  • Storage: 2GB free space
  • Network: Stable internet for API calls

Recommended for Production

  • CPU: 4+ cores, 3.0 GHz
  • RAM: 16GB
  • Storage: 10GB SSD
  • Network: Low-latency connection (< 100ms to API endpoints)

🚨 Known Limitations & Trade-offs

API Dependencies

  • Rate Limits: Free tier APIs limit real-time capabilities
  • Data Quality: Dependent on external API reliability
  • Cost Scaling: Production usage requires paid API tiers

Model Limitations

  • Training Data: Uses synthetic data for demonstration
  • Cold Start: New entities require initial data accumulation
  • Market Coverage: Optimized for US equity markets

Technical Constraints

  • Python 3.13 Incompatibility: UMAP/Numba dependencies limit Python version
  • Memory Usage: BERT models require significant RAM
  • Compute Requirements: Real-time inference needs adequate CPU

Architectural Trade-offs

Ensemble vs Single Model:

  • βœ… Chosen: Ensemble approach for better accuracy and robustness
  • ❌ Rejected: Single model for simplicity (sacrifices performance)

Streamlit vs Custom Frontend:

  • βœ… Chosen: Streamlit for rapid prototyping and ML-focused UI
  • ❌ Rejected: React/Vue for production-grade UX (development time)

Docker vs Native Deployment:

  • βœ… Chosen: Docker for reproducible, portable deployments
  • ❌ Rejected: Native installation (environment conflicts)

πŸ“ˆ Future Enhancements

Planned Features

  • Real-time WebSocket Updates: Live score streaming
  • Advanced ML Models: Transformer-based time series models
  • Extended Market Coverage: International markets and bonds
  • Alert System: Configurable risk threshold notifications
  • Historical Backtesting: Strategy performance analysis

Scalability Roadmap

  • Microservices Architecture: Separate data, model, and UI services
  • Database Integration: PostgreSQL/TimescaleDB for historical data
  • Kubernetes Deployment: Container orchestration for production
  • CDN Integration: Global content delivery optimization

🀝 Contributors

We would like to thank all the amazing contributors who have been part of this project.

πŸ† Hackathon Context

Developed for the CredTech Hackathon organized by The Programming Club, IIT Kanpur, and powered by Deep Root Investments. This platform addresses the challenge of creating transparent, real-time credit intelligence to replace opaque traditional rating methodologies.


Built with ❀️ for transparent financial intelligence
