🏗️ Git Summarizer Architecture

This document describes the architecture, design decisions, and data flow of Git Summarizer.

📖 Table of Contents

Overview
System Architecture
Module Design
Data Flow
Provider System
Data Models
Design Decisions

Overview

Git Summarizer is designed with the following principles:

Modularity: Each component has a single responsibility
Extensibility: Easy to add new providers and analyzers
Performance: Efficient processing of large repositories
Usability: Beautiful terminal output with Rich

System Architecture

High-Level Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                            USER                                      │
│                              │                                       │
│                              ▼                                       │
│                    ┌─────────────────┐                              │
│                    │     CLI Layer    │                              │
│                    │  (cli.py + Typer)│                              │
│                    └────────┬────────┘                              │
│                             │                                        │
│              ┌──────────────┼──────────────┐                        │
│              ▼              ▼              ▼                        │
│    ┌─────────────┐  ┌─────────────┐  ┌─────────────┐               │
│    │ GitAnalyzer │  │RiskAnalyzer │  │ Contributor │               │
│    │             │  │             │  │  Analyzer   │               │
│    └──────┬──────┘  └──────┬──────┘  └──────┬──────┘               │
│           │                │                │                        │
│           └────────────────┼────────────────┘                        │
│                            ▼                                         │
│                   ┌─────────────────┐                               │
│                   │  AnalysisResult │                               │
│                   │   (Pydantic)    │                               │
│                   └────────┬────────┘                               │
│                            │                                         │
│                            ▼                                         │
│                   ┌─────────────────┐                               │
│                   │  LLM Provider   │                               │
│                   │    (llm.py)     │                               │
│                   └────────┬────────┘                               │
│                            │                                         │
│              ┌─────────────┴─────────────┐                          │
│              ▼                           ▼                          │
│    ┌─────────────────┐         ┌─────────────────┐                 │
│    │  BaseProvider   │         │ OpenAIProvider  │                 │
│    │   (Abstract)    │◄────────│                 │                 │
│    └─────────────────┘         └─────────────────┘                 │
│                                                                      │
│                            OUTPUT                                    │
│              ┌─────────────┴─────────────┐                          │
│              ▼                           ▼                          │
│    ┌─────────────────┐         ┌─────────────────┐                 │
│    │  Rich Terminal  │         │   JSON Export   │                 │
│    │     Output      │         │                 │                 │
│    └─────────────────┘         └─────────────────┘                 │
└─────────────────────────────────────────────────────────────────────┘

Component Diagram

┌──────────────────────────────────────────────────────────────────┐
│                         gitsum package                            │
├──────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │   cli.py    │───▶│ analyzer.py │───▶│     risk.py         │  │
│  │             │    │             │    │                     │  │
│  │ • Commands  │    │ • Load repo │    │ • Score files       │  │
│  │ • Options   │    │ • Process   │    │ • Identify factors  │  │
│  │ • Output    │    │ • Analyze   │    │                     │  │
│  └─────────────┘    └─────────────┘    └─────────────────────┘  │
│         │                  │                                      │
│         │                  ▼                                      │
│         │           ┌─────────────────────┐                      │
│         │           │   contributors.py   │                      │
│         │           │                     │                      │
│         │           │ • Track ownership   │                      │
│         │           │ • Calculate hotspots│                      │
│         │           │ • Bus factor        │                      │
│         │           └─────────────────────┘                      │
│         │                                                         │
│         ▼                                                         │
│  ┌─────────────┐    ┌─────────────────────────────────────────┐ │
│  │   llm.py    │───▶│            providers/                   │ │
│  │             │    │  ┌──────────────┐  ┌──────────────────┐ │ │
│  │ • Orchestrate│   │  │  base.py     │  │ openai_provider  │ │ │
│  │ • Summarize │    │  │  (Abstract)  │◄─│                  │ │ │
│  └─────────────┘    │  └──────────────┘  └──────────────────┘ │ │
│                     └─────────────────────────────────────────┘ │
│                                                                   │
│  ┌─────────────┐    ┌─────────────────────────────────────────┐ │
│  │  utils.py   │    │            models/                      │ │
│  │             │    │  ┌──────────────────────────────────┐  │ │
│  │ • Helpers   │    │  │       history_model.py           │  │ │
│  │ • Stats     │    │  │                                  │  │ │
│  │ • Formatting│    │  │ • CommitInfo                     │  │ │
│  └─────────────┘    │  │ • HistorySummary                 │  │ │
│                     │  │ • RiskFile                       │  │ │
│                     │  │ • ContributorInfo                │  │ │
│                     │  │ • AnalysisResult                 │  │ │
│                     │  └──────────────────────────────────┘  │ │
│                     └─────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

Module Design

cli.py - Command Line Interface

Responsibility: User interaction and output formatting

# Key components:
- app: Typer application instance
- summarize(): Main command handler
- print_*(): Output formatting functions
- save_json_output(): JSON export

Dependencies:

typer for CLI framework
rich for terminal formatting
GitAnalyzer, LLMSummarizer

analyzer.py - Git Analysis Core

Responsibility: Repository analysis and data extraction

class GitAnalyzer:
    # Signature outline only; bodies are elided
    def load(self) -> bool: ...
    def analyze(self) -> AnalysisResult: ...
    def _process_commits(self): ...
    def _analyze_history(self) -> HistorySummary: ...
    def _find_major_changes(self) -> list[MajorChange]: ...
    def _analyze_commit_patterns(self) -> list[CommitPattern]: ...
    def _detect_anomalies(self) -> list[AnomalyInfo]: ...

Dependencies:

gitpython for Git access
RiskAnalyzer, ContributorAnalyzer
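
As a rough illustration of what `_analyze_commit_patterns()` might compute, the sketch below counts how many commit messages mention each keyword and reports the `keyword`, `count`, and `percentage` fields that `CommitPattern` carries. The keyword list, the `count_keyword_patterns` name, the dataclass stand-in, and the whole-word matching rule are all assumptions made for illustration, not the project's actual implementation.

```python
import re
from dataclasses import dataclass


@dataclass
class CommitPattern:
    """Stand-in for the Pydantic model of the same name."""
    keyword: str
    count: int
    percentage: float


# Hypothetical keyword list; the real analyzer may use a different one.
KEYWORDS = ("fix", "feat", "refactor", "test", "docs")


def count_keyword_patterns(messages, keywords=KEYWORDS):
    """Count commits whose message contains each keyword as a whole word."""
    total = len(messages)
    patterns = []
    for keyword in keywords:
        regex = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
        count = sum(1 for message in messages if regex.search(message))
        if count:
            percentage = round(100 * count / total, 1)
            patterns.append(CommitPattern(keyword, count, percentage))
    # Most frequent keywords first
    return sorted(patterns, key=lambda p: p.count, reverse=True)


messages = [
    "Fix login bug",
    "feat: add JSON export",
    "fix: typo in README",
    "Update deps",
]
patterns = count_keyword_patterns(messages)
```

Keywords that never occur are dropped here, so the result only lists patterns that were actually observed.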

risk.py - Risk Analysis

Responsibility: File risk scoring and factor identification

class RiskAnalyzer:
    # Relative weight of each risk factor; the weights sum to 1.0
    WEIGHTS = {
        "change_frequency": 0.30,
        "lines_changed": 0.20,
        "author_count": 0.15,
        "recency": 0.20,
        "complexity": 0.15,
    }

    def analyze(self) -> list[RiskFile]: ...
    def _calculate_recency_score(self) -> float: ...
    def _identify_risk_factors(self) -> list[str]: ...
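
One plausible way to combine these weights is a weighted sum of per-factor scores. The sketch below assumes each factor has already been normalized to the 0..1 range; the function names and the 90-day half-life used for the recency score are hypothetical choices, not taken from `risk.py`.

```python
from datetime import datetime, timedelta

# Weights copied from RiskAnalyzer.WEIGHTS above.
WEIGHTS = {
    "change_frequency": 0.30,
    "lines_changed": 0.20,
    "author_count": 0.15,
    "recency": 0.20,
    "complexity": 0.15,
}


def recency_score(last_changed, now, half_life_days=90.0):
    """Hypothetical decay: 1.0 for a file changed just now, 0.5 after one half-life."""
    age_days = max((now - last_changed).total_seconds() / 86400, 0.0)
    return 0.5 ** (age_days / half_life_days)


def weighted_risk_score(factors):
    """Weighted sum of factor scores, each clamped to the 0..1 range."""
    return sum(
        weight * min(max(factors.get(name, 0.0), 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )


now = datetime(2024, 1, 1)
factors = {
    "change_frequency": 1.0,
    "lines_changed": 0.5,
    "author_count": 0.0,
    "recency": recency_score(now - timedelta(days=90), now),
    "complexity": 0.0,
}
score = weighted_risk_score(factors)
```

Because the weights sum to 1.0, the resulting score also stays within 0..1, which keeps thresholds easy to reason about.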

contributors.py - Contributor Analysis

Responsibility: Developer tracking and knowledge mapping

class ContributorAnalyzer:
    def analyze(self) -> list[ContributorInfo]: ...
    def get_file_ownership(self) -> dict: ...
    def get_knowledge_concentration(self) -> list[dict]: ...
    def get_contributor_timeline(self) -> list[dict]: ...

# Module-level helper
def get_bus_factor(contributors, threshold) -> int: ...
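
The document does not spell out how the bus factor is computed. A common definition, sketched below, is the smallest number of top contributors whose combined commits reach a given share of all commits. The `bus_factor` name, the `{name: commit_count}` input shape, and the 0.5 default threshold are assumptions for illustration; the real `get_bus_factor(contributors, threshold)` may differ.

```python
def bus_factor(commit_counts, threshold=0.5):
    """Smallest number of top contributors covering `threshold` of all commits."""
    total = sum(commit_counts.values())
    if total == 0:
        return 0
    covered = 0
    ranked = sorted(commit_counts.values(), reverse=True)
    for rank, count in enumerate(ranked, start=1):
        covered += count
        if covered / total >= threshold:
            return rank
    return len(commit_counts)


counts = {"alice": 60, "bob": 25, "carol": 10, "dave": 5}
half = bus_factor(counts)            # alice alone covers 60%
most = bus_factor(counts, 0.9)       # alice + bob + carol cover 95%
```

A low value signals that knowledge is concentrated in very few people.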

llm.py - LLM Orchestration

Responsibility: Coordinate LLM-based summarization

class LLMSummarizer:
    def summarize(self, analysis: AnalysisResult) -> str: ...
    def is_available(self) -> bool: ...

providers/ - LLM Provider System

Responsibility: Abstraction layer for different LLM providers

# base.py
class BaseProvider(ABC):
    @abstractmethod
    def generate_summary(self, sections, style) -> str: ...

    @abstractmethod
    def is_available(self) -> bool: ...

    # Shared prompt-building helpers
    def get_system_prompt(self, style) -> str: ...
    def format_sections_as_text(self, sections) -> str: ...
    def build_user_prompt(self, sections, style) -> str: ...

# openai_provider.py
class OpenAIProvider(BaseProvider):
    def generate_summary(self, sections, style) -> str: ...
    def _generate_fallback_summary(self, sections, style) -> str: ...

Data Flow

Analysis Pipeline

Input: Repository Path
         │
         ▼
┌─────────────────┐
│   Load Repo     │  GitPython reads .git
│   (GitAnalyzer) │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Process Commits │  Extract CommitInfo for each commit
│                 │  Track file changes
└────────┬────────┘
         │
         ├─────────────────┐
         ▼                 ▼
┌─────────────────┐ ┌─────────────────┐
│ Analyze History │ │ Detect Patterns │
│ • Timeline      │ │ • Keywords      │
│ • Statistics    │ │ • Anomalies     │
└────────┬────────┘ └────────┬────────┘
         │                   │
         ├───────────────────┘
         │
         ├─────────────────┐
         ▼                 ▼
┌─────────────────┐ ┌─────────────────┐
│  Risk Analysis  │ │  Contributor    │
│  • Score files  │ │  Analysis       │
│  • Find factors │ │  • Hotspots     │
└────────┬────────┘ └────────┬────────┘
         │                   │
         └───────┬───────────┘
                 ▼
        ┌─────────────────┐
        │ AnalysisResult  │  Aggregated data model
        └────────┬────────┘
                 │
                 ▼
        ┌─────────────────┐
        │ LLM Summary     │  Optional AI enhancement
        │ (if available)  │
        └────────┬────────┘
                 │
                 ▼
           ┌───────────┐
           │  Output   │
           │ • Rich UI │
           │ • JSON    │
           └───────────┘

Data Transformation

Git Commits (raw)
       │
       ▼
┌──────────────────┐
│    CommitInfo    │  Normalized commit data
│  • sha           │
│  • author        │
│  • date          │
│  • lines_added   │
│  • impact_score  │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  File Changes    │  Aggregated per file
│  • count         │
│  • lines_added   │
│  • authors       │
│  • last_changed  │
└────────┬─────────┘
         │
    ┌────┴────┐
    ▼         ▼
┌────────┐ ┌───────────────┐
│RiskFile│ │ContributorInfo│
└────────┘ └───────────────┘
    │             │
    └──────┬──────┘
           ▼
   ┌───────────────┐
   │AnalysisResult │
   │ .to_sections()│───▶ LLM Input
   └───────────────┘
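
The "File Changes" stage above can be pictured as a single pass that folds commits into per-file aggregates. This is a minimal sketch under stated assumptions: the `FileChange` dataclass, the `aggregate_file_changes` name, and the dictionary shape of each commit are hypothetical, and only the field names (`count`, `lines_added`, `authors`, `last_changed`) come from the diagram.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class FileChange:
    """Per-file aggregate; field names follow the 'File Changes' box."""
    count: int = 0
    lines_added: int = 0
    authors: set = field(default_factory=set)
    last_changed: Optional[datetime] = None


def aggregate_file_changes(commits):
    """Fold an iterable of commits into a {path: FileChange} mapping."""
    changes = {}
    for commit in commits:
        for path, added in commit["files"].items():
            entry = changes.setdefault(path, FileChange())
            entry.count += 1
            entry.lines_added += added
            entry.authors.add(commit["author"])
            if entry.last_changed is None or commit["date"] > entry.last_changed:
                entry.last_changed = commit["date"]
    return changes


# Hypothetical commit records: author, date, and lines added per file
commits = [
    {"author": "alice", "date": datetime(2024, 1, 1),
     "files": {"cli.py": 10, "risk.py": 4}},
    {"author": "bob", "date": datetime(2024, 2, 1),
     "files": {"cli.py": 3}},
]
changes = aggregate_file_changes(commits)
```

Since the function accepts any iterable, commits can be streamed through it without holding the full history or any diffs in memory.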

Provider System

Provider Interface

from abc import ABC, abstractmethod
from typing import Optional

# SummaryStyle is defined alongside BaseProvider in providers/base.py

class BaseProvider(ABC):
    """Abstract interface for LLM providers."""

    def __init__(self, model: Optional[str] = None, **kwargs):
        self.model = model
        self.config = kwargs  # Provider-specific options

    @abstractmethod
    def generate_summary(
        self,
        sections: dict,
        style: SummaryStyle,
    ) -> str:
        """Generate summary from analysis sections."""
        pass

    @abstractmethod
    def is_available(self) -> bool:
        """Check if provider is configured."""
        pass

Adding a New Provider

# providers/anthropic_provider.py

import os

from gitsum.providers.base import BaseProvider, SummaryStyle

class AnthropicProvider(BaseProvider):
    DEFAULT_MODEL = "claude-3-opus"

    def __init__(self, model=None, api_key=None, **kwargs):
        # Forward extra options so BaseProvider stores them in self.config
        super().__init__(model or self.DEFAULT_MODEL, **kwargs)
        self.api_key = api_key or os.getenv("ANTHROPIC_API_KEY")

    def generate_summary(self, sections, style):
        # Implementation
        pass

    def is_available(self):
        # Available only when an API key was passed in or set in the environment
        return bool(self.api_key)

Registration

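# providers/__init__.py

from .anthropic_provider import AnthropicProvider
from .base import BaseProvider
from .openai_provider import OpenAIProvider

def get_provider(name: str, **kwargs) -> BaseProvider:
    providers = {
        "openai": OpenAIProvider,
        "anthropic": AnthropicProvider,  # Add new provider
    }
    # Fail with a clear message instead of a bare KeyError
    if name not in providers:
        raise ValueError(f"Unknown provider: {name!r}")
    return providers[name](**kwargs)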

Data Models

Model Hierarchy

┌─────────────────────────────────────────────────────────────┐
│                      AnalysisResult                          │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                   HistorySummary                     │    │
│  │  • total_commits     • daily_avg_commits            │    │
│  │  • first_commit_date • weekly_avg_commits           │    │
│  │  • last_commit_date  • longest_inactive_days        │    │
│  │  • total_authors     • total_lines_added/deleted    │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                              │
│  ┌──────────────┐  ┌──────────────┐  ┌─────────────────┐   │
│  │ MajorChange  │  │   RiskFile   │  │ ContributorInfo │   │
│  │ • commit_sha │  │ • path       │  │ • name          │   │
│  │ • author     │  │ • risk_score │  │ • commit_count  │   │
│  │ • impact     │  │ • factors    │  │ • expertise     │   │
│  └──────────────┘  └──────────────┘  └─────────────────┘   │
│                                                              │
│  ┌──────────────┐  ┌──────────────┐                         │
│  │CommitPattern │  │ AnomalyInfo  │                         │
│  │ • keyword    │  │ • type       │                         │
│  │ • count      │  │ • severity   │                         │
│  │ • percentage │  │ • details    │                         │
│  └──────────────┘  └──────────────┘                         │
└─────────────────────────────────────────────────────────────┘
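
To make a few of the `HistorySummary` fields concrete, the sketch below derives them from a list of commit timestamps. The `summarize_history` name, the dictionary return value, and the averaging convention (commits divided by the number of days between first and last commit, with a minimum of one day) are assumptions; the project may define these statistics differently.

```python
from datetime import datetime


def summarize_history(commit_dates):
    """Compute a few HistorySummary-style statistics from commit timestamps."""
    if not commit_dates:
        raise ValueError("commit_dates must not be empty")
    dates = sorted(commit_dates)
    first, last = dates[0], dates[-1]
    # Treat a single-day history as spanning one day to avoid dividing by zero.
    span_days = max((last - first).days, 1)
    # Whole days between each pair of consecutive commits
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    daily_avg = len(dates) / span_days
    return {
        "total_commits": len(dates),
        "first_commit_date": first,
        "last_commit_date": last,
        "daily_avg_commits": daily_avg,
        "weekly_avg_commits": daily_avg * 7,
        "longest_inactive_days": max(gaps, default=0),
    }


summary = summarize_history([
    datetime(2024, 1, 1),
    datetime(2024, 1, 2),
    datetime(2024, 1, 11),
])
```

Here the three commits span ten days, and the longest quiet period is the nine-day gap between the second and third commit.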

Pydantic Models

All data models use Pydantic for:

Type validation
JSON serialization
Default values
Documentation

Design Decisions

1. GitPython over Shell Commands

Decision: Use GitPython instead of subprocess git calls

Rationale:

Type safety and structured data
Cross-platform compatibility
Better error handling
Easier testing

2. Provider Abstraction

Decision: Abstract LLM providers behind a common interface

Rationale:

Easy to swap providers
Fallback when API unavailable
Testing without API calls
Future extensibility
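
The fallback and testing benefits can be illustrated with a small sketch. Every name here (`Provider`, `UnavailableProvider`, `StubProvider`, `summarize_with_fallback`) is a hypothetical stand-in, and the interface is simplified by omitting the `style` argument; it is not the project's `BaseProvider` or its actual fallback logic.

```python
from abc import ABC, abstractmethod


class Provider(ABC):
    """Simplified stand-in for BaseProvider (style argument omitted)."""

    @abstractmethod
    def generate_summary(self, sections: dict) -> str: ...

    @abstractmethod
    def is_available(self) -> bool: ...


class UnavailableProvider(Provider):
    """Simulates a provider with no API key configured."""

    def generate_summary(self, sections):
        raise RuntimeError("provider is not configured")

    def is_available(self):
        return False


class StubProvider(Provider):
    """Deterministic provider for tests; makes no network calls."""

    def generate_summary(self, sections):
        return "Sections: " + ", ".join(sorted(sections))

    def is_available(self):
        return True


def summarize_with_fallback(providers, sections):
    """Use the first available provider, or return None if there is none."""
    for provider in providers:
        if provider.is_available():
            return provider.generate_summary(sections)
    return None


sections = {"history": "...", "risk": "..."}
summary = summarize_with_fallback(
    [UnavailableProvider(), StubProvider()], sections
)
```

Because callers depend only on the two abstract methods, a test suite can pass in the stub and never touch a real API.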

3. Rich for Terminal Output

Decision: Use Rich library for all terminal output

Rationale:

Beautiful, consistent formatting
Tables, panels, progress bars
Color support
Markdown rendering

4. Pydantic for Data Models

Decision: Use Pydantic models for all data structures

Rationale:

Type validation
Easy JSON serialization
Self-documenting
IDE support

5. Weighted Risk Scoring

Decision: Use weighted factors for risk calculation

Rationale:

Configurable importance
Transparent scoring
Easy to explain
Adjustable thresholds

Performance Considerations

Large Repository Handling

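from rich.progress import Progress

# Commit limiting
analyzer = GitAnalyzer(path, limit_commits=1000)

# Progress tracking
with Progress() as progress:
    # Register a task so the progress bar knows the total amount of work
    task = progress.add_task("Processing commits", total=len(commits))
    for commit in commits:
        process(commit)
        progress.update(task, advance=1)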

Memory Efficiency

Stream processing for commits
Lazy loading of file stats
Aggregate data as we go
Don't store full diffs

Optimization Opportunities

Parallel processing: Analyze multiple files concurrently
Caching: Cache analysis results
Incremental analysis: Only process new commits
Sampling: Statistically sample for very large repos
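
Incremental analysis, for instance, could be approached by remembering the last analyzed commit and processing only what is newer. The sketch below is one possible shape, assuming commit SHAs arrive newest first (as in `git log` output); the state file name and all function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def new_commits(shas_newest_first, last_seen):
    """Return the SHAs that are newer than `last_seen`, oldest first."""
    fresh = []
    for sha in shas_newest_first:
        if sha == last_seen:
            break
        fresh.append(sha)
    return list(reversed(fresh))


def load_last_seen(state_file):
    """Read the last analyzed SHA, or None on the first run."""
    if not state_file.exists():
        return None
    return json.loads(state_file.read_text())["last_sha"]


def save_last_seen(state_file, sha):
    """Persist the newest analyzed SHA for the next run."""
    state_file.write_text(json.dumps({"last_sha": sha}))


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "gitsum_state.json"  # hypothetical cache location

    # First run: nothing cached, so every commit is processed
    first_run = new_commits(["c3", "c2", "c1"], load_last_seen(state_file))
    save_last_seen(state_file, "c3")

    # Second run: only commits added after c3 are processed
    second_run = new_commits(
        ["c5", "c4", "c3", "c2", "c1"], load_last_seen(state_file)
    )
```

If the cached SHA no longer appears in the history (for example after a rebase), this sketch falls back to processing every commit.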

Extension Points

Custom Analyzers

class SecurityAnalyzer:
    """Analyze security-related patterns."""
    
    def analyze(self, commits, file_changes):
        # Look for security keywords
        # Check for sensitive file patterns
        # Return SecurityReport
        pass

Custom Output Formats

class HTMLReporter:
    """Generate HTML report."""
    
    def render(self, analysis: AnalysisResult) -> str:
        # Render Jinja template
        pass

Webhooks/Notifications

class SlackNotifier:
    """Send summary to Slack."""
    
    def notify(self, analysis: AnalysisResult):
        # Post to Slack webhook
        pass

Testing Strategy

Unit Tests

Model validation
Risk score calculation
Contributor aggregation
Utility functions

Integration Tests

Full pipeline with real repo
CLI command execution
JSON output validation

Mock Tests

Provider API calls
Git repository operations

Future Considerations

Multi-repo analysis: Compare across repositories
Time-series tracking: Track metrics over time
CI/CD integration: GitHub Actions, GitLab CI
Web dashboard: Visual reporting interface
Custom rules: User-defined risk factors
