🏗️ Git Summarizer Architecture

This document describes the architecture, design decisions, and data flow of Git Summarizer.


Overview

Git Summarizer is designed with the following principles:

  1. Modularity: Each component has a single responsibility
  2. Extensibility: Easy to add new providers and analyzers
  3. Performance: Efficient processing of large repositories
  4. Usability: Beautiful terminal output with Rich

System Architecture

High-Level Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                            USER                                      │
│                              │                                       │
│                              ▼                                       │
│                    ┌─────────────────┐                              │
│                    │     CLI Layer    │                              │
│                    │  (cli.py + Typer)│                              │
│                    └────────┬────────┘                              │
│                             │                                        │
│              ┌──────────────┼──────────────┐                        │
│              ▼              ▼              ▼                        │
│    ┌─────────────┐  ┌─────────────┐  ┌─────────────┐               │
│    │ GitAnalyzer │  │RiskAnalyzer │  │ Contributor │               │
│    │             │  │             │  │  Analyzer   │               │
│    └──────┬──────┘  └──────┬──────┘  └──────┬──────┘               │
│           │                │                │                        │
│           └────────────────┼────────────────┘                        │
│                            ▼                                         │
│                   ┌─────────────────┐                               │
│                   │  AnalysisResult │                               │
│                   │   (Pydantic)    │                               │
│                   └────────┬────────┘                               │
│                            │                                         │
│                            ▼                                         │
│                   ┌─────────────────┐                               │
│                   │  LLM Provider   │                               │
│                   │    (llm.py)     │                               │
│                   └────────┬────────┘                               │
│                            │                                         │
│              ┌─────────────┴─────────────┐                          │
│              ▼                           ▼                          │
│    ┌─────────────────┐         ┌─────────────────┐                 │
│    │  BaseProvider   │         │ OpenAIProvider  │                 │
│    │   (Abstract)    │◄────────│                 │                 │
│    └─────────────────┘         └─────────────────┘                 │
│                                                                      │
│                            OUTPUT                                    │
│              ┌─────────────┴─────────────┐                          │
│              ▼                           ▼                          │
│    ┌─────────────────┐         ┌─────────────────┐                 │
│    │  Rich Terminal  │         │   JSON Export   │                 │
│    │     Output      │         │                 │                 │
│    └─────────────────┘         └─────────────────┘                 │
└─────────────────────────────────────────────────────────────────────┘

Component Diagram

┌──────────────────────────────────────────────────────────────────┐
│                         gitsum package                            │
├──────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │   cli.py    │───▶│ analyzer.py │───▶│     risk.py         │  │
│  │             │    │             │    │                     │  │
│  │ • Commands  │    │ • Load repo │    │ • Score files       │  │
│  │ • Options   │    │ • Process   │    │ • Identify factors  │  │
│  │ • Output    │    │ • Analyze   │    │                     │  │
│  └─────────────┘    └─────────────┘    └─────────────────────┘  │
│         │                  │                                      │
│         │                  ▼                                      │
│         │           ┌─────────────────────┐                      │
│         │           │   contributors.py   │                      │
│         │           │                     │                      │
│         │           │ • Track ownership   │                      │
│         │           │ • Calculate hotspots│                      │
│         │           │ • Bus factor        │                      │
│         │           └─────────────────────┘                      │
│         │                                                         │
│         ▼                                                         │
│  ┌─────────────┐    ┌─────────────────────────────────────────┐ │
│  │   llm.py    │───▶│            providers/                   │ │
│  │             │    │  ┌──────────────┐  ┌──────────────────┐ │ │
│  │ • Orchestrate│   │  │  base.py     │  │ openai_provider  │ │ │
│  │ • Summarize │    │  │  (Abstract)  │◄─│                  │ │ │
│  └─────────────┘    │  └──────────────┘  └──────────────────┘ │ │
│                     └─────────────────────────────────────────┘ │
│                                                                   │
│  ┌─────────────┐    ┌─────────────────────────────────────────┐ │
│  │  utils.py   │    │            models/                      │ │
│  │             │    │  ┌──────────────────────────────────┐  │ │
│  │ • Helpers   │    │  │       history_model.py           │  │ │
│  │ • Stats     │    │  │                                  │  │ │
│  │ • Formatting│    │  │ • CommitInfo                     │  │ │
│  └─────────────┘    │  │ • HistorySummary                 │  │ │
│                     │  │ • RiskFile                       │  │ │
│                     │  │ • ContributorInfo                │  │ │
│                     │  │ • AnalysisResult                 │  │ │
│                     │  └──────────────────────────────────┘  │ │
│                     └─────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

Module Design

cli.py - Command Line Interface

Responsibility: User interaction and output formatting

# Key components:
- app: Typer application instance
- summarize(): Main command handler
- print_*(): Output formatting functions
- save_json_output(): JSON export

Dependencies:

  • typer for CLI framework
  • rich for terminal formatting
  • GitAnalyzer, LLMSummarizer

analyzer.py - Git Analysis Core

Responsibility: Repository analysis and data extraction

class GitAnalyzer:
    # Interface outline; method bodies omitted.
    def load(self) -> bool: ...
    def analyze(self) -> AnalysisResult: ...
    def _process_commits(self): ...
    def _analyze_history(self) -> HistorySummary: ...
    def _find_major_changes(self) -> list[MajorChange]: ...
    def _analyze_commit_patterns(self) -> list[CommitPattern]: ...
    def _detect_anomalies(self) -> list[AnomalyInfo]: ...

Dependencies:

  • gitpython for Git access
  • RiskAnalyzer, ContributorAnalyzer
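
As an illustration of the kind of result `_analyze_commit_patterns()` produces, the sketch below counts how many commit messages mention each keyword and reports that as a share of all commits. The keyword list, the substring matching, and the standalone `CommitPattern` dataclass are assumptions made for this example (only the field names come from the model diagram); the actual implementation may differ.

```python
from dataclasses import dataclass

# Hypothetical keyword list; the set used by gitsum is not documented here.
KEYWORDS = ("fix", "feat", "refactor", "test", "docs")


@dataclass
class CommitPattern:
    keyword: str
    count: int
    percentage: float


def analyze_commit_patterns(messages: list[str]) -> list[CommitPattern]:
    """Count how many commit messages mention each keyword."""
    total = len(messages)
    patterns = []
    for keyword in KEYWORDS:
        # Simple case-insensitive substring match
        count = sum(1 for msg in messages if keyword in msg.lower())
        if count:
            patterns.append(CommitPattern(keyword, count, round(100 * count / total, 1)))
    # Most frequent keywords first
    return sorted(patterns, key=lambda p: p.count, reverse=True)


messages = ["Fix login bug", "feat: add export", "fix typo in docs", "Refactor parser"]
for pattern in analyze_commit_patterns(messages):
    print(pattern)
```

Substring matching is the simplest option but also matches inside longer words; a word-boundary regex would be stricter.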

risk.py - Risk Analysis

Responsibility: File risk scoring and factor identification

class RiskAnalyzer:
    WEIGHTS = {
        "change_frequency": 0.30,
        "lines_changed": 0.20,
        "author_count": 0.15,
        "recency": 0.20,
        "complexity": 0.15,
    }
    
    def analyze(self) -> list[RiskFile]: ...
    def _calculate_recency_score(self) -> float: ...
    def _identify_risk_factors(self) -> list[str]: ...
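
To make the weighting concrete, here is a minimal sketch of how the five weights could combine into a single score. It assumes each factor has already been normalized to the range [0, 1] and that the final score is a plain weighted sum; the normalization and the `risk_score` helper are illustrative assumptions, not the documented behavior of `RiskAnalyzer`.

```python
# Weights copied from RiskAnalyzer.WEIGHTS above; they sum to 1.0.
WEIGHTS = {
    "change_frequency": 0.30,
    "lines_changed": 0.20,
    "author_count": 0.15,
    "recency": 0.20,
    "complexity": 0.15,
}


def risk_score(factors: dict[str, float]) -> float:
    """Combine factor scores (assumed to be in [0, 1]) into one weighted score."""
    score = 0.0
    for name, weight in WEIGHTS.items():
        # Missing factors count as 0; out-of-range values are clamped to [0, 1]
        value = min(max(factors.get(name, 0.0), 0.0), 1.0)
        score += weight * value
    return round(score, 3)


example = {
    "change_frequency": 0.9,
    "lines_changed": 0.5,
    "author_count": 0.2,
    "recency": 1.0,
    "complexity": 0.4,
}
print(risk_score(example))
```

Because the weights sum to 1.0, the result stays in [0, 1], and each factor's contribution can be read off directly as weight times value.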

contributors.py - Contributor Analysis

Responsibility: Developer tracking and knowledge mapping

class ContributorAnalyzer:
    def analyze(self) -> list[ContributorInfo]: ...
    def get_file_ownership(self) -> dict: ...
    def get_knowledge_concentration(self) -> list[dict]: ...
    def get_contributor_timeline(self) -> list[dict]: ...

# Module-level helper
def get_bus_factor(contributors, threshold) -> int: ...
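
One common way to define a bus factor is the smallest number of contributors whose combined commits reach a given share of all commits. The sketch below follows that definition; it is an assumption about what `get_bus_factor` computes, and it takes a plain name-to-commit-count mapping (a hypothetical simplification) rather than `ContributorInfo` objects.

```python
def bus_factor(commit_counts: dict[str, int], threshold: float = 0.5) -> int:
    """Smallest number of contributors covering `threshold` of all commits."""
    total = sum(commit_counts.values())
    if total == 0:
        return 0
    covered = 0
    # Walk contributors from most to least active
    for rank, count in enumerate(sorted(commit_counts.values(), reverse=True), start=1):
        covered += count
        if covered / total >= threshold:
            return rank
    return len(commit_counts)


counts = {"alice": 60, "bob": 25, "carol": 10, "dave": 5}
print(bus_factor(counts, threshold=0.5))  # alice alone covers 60%
print(bus_factor(counts, threshold=0.9))  # alice + bob + carol cover 95%
```

A low value signals that knowledge is concentrated in very few people.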

llm.py - LLM Orchestration

Responsibility: Coordinate LLM-based summarization

class LLMSummarizer:
    def summarize(self, analysis: AnalysisResult) -> str: ...
    def is_available(self) -> bool: ...

providers/ - LLM Provider System

Responsibility: Abstraction layer for different LLM providers

# base.py
class BaseProvider(ABC):
    @abstractmethod
    def generate_summary(self, sections, style) -> str: ...

    @abstractmethod
    def is_available(self) -> bool: ...

    def get_system_prompt(self, style) -> str: ...
    def format_sections_as_text(self, sections) -> str: ...
    def build_user_prompt(self, sections, style) -> str: ...

# openai_provider.py
class OpenAIProvider(BaseProvider):
    def generate_summary(self, sections, style) -> str: ...
    def _generate_fallback_summary(self, sections, style) -> str: ...

Data Flow

Analysis Pipeline

Input: Repository Path
         │
         ▼
┌─────────────────┐
│   Load Repo     │  GitPython reads .git
│   (GitAnalyzer) │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Process Commits │  Extract CommitInfo for each commit
│                 │  Track file changes
└────────┬────────┘
         │
         ├─────────────────┐
         ▼                 ▼
┌─────────────────┐ ┌─────────────────┐
│ Analyze History │ │ Detect Patterns │
│ • Timeline      │ │ • Keywords      │
│ • Statistics    │ │ • Anomalies     │
└────────┬────────┘ └────────┬────────┘
         │                   │
         ├───────────────────┘
         │
         ├─────────────────┐
         ▼                 ▼
┌─────────────────┐ ┌─────────────────┐
│  Risk Analysis  │ │  Contributor    │
│  • Score files  │ │  Analysis       │
│  • Find factors │ │  • Hotspots     │
└────────┬────────┘ └────────┬────────┘
         │                   │
         └───────┬───────────┘
                 ▼
        ┌─────────────────┐
        │ AnalysisResult  │  Aggregated data model
        └────────┬────────┘
                 │
                 ▼
        ┌─────────────────┐
        │ LLM Summary     │  Optional AI enhancement
        │ (if available)  │
        └────────┬────────┘
                 │
                 ▼
           ┌───────────┐
           │  Output   │
           │ • Rich UI │
           │ • JSON    │
           └───────────┘

Data Transformation

Git Commits (raw)
       │
       ▼
┌──────────────────┐
│    CommitInfo    │  Normalized commit data
│  • sha           │
│  • author        │
│  • date          │
│  • lines_added   │
│  • impact_score  │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  File Changes    │  Aggregated per file
│  • count         │
│  • lines_added   │
│  • authors       │
│  • last_changed  │
└────────┬─────────┘
         │
    ┌────┴────┐
    ▼         ▼
┌────────┐ ┌───────────────┐
│RiskFile│ │ContributorInfo│
└────────┘ └───────────────┘
    │             │
    └──────┬──────┘
           ▼
   ┌───────────────┐
   │AnalysisResult │
   │ .to_sections()│───▶ LLM Input
   └───────────────┘
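
The middle step of this transformation, folding per-commit data into per-file aggregates, can be sketched as follows. The dataclasses are simplified stand-ins for the real models, and the `files` mapping on `CommitInfo` (path to lines added) is a hypothetical field introduced only for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class CommitInfo:
    sha: str
    author: str
    date: datetime
    # Hypothetical field: lines added per touched file path
    files: dict[str, int] = field(default_factory=dict)


@dataclass
class FileChange:
    count: int = 0
    lines_added: int = 0
    authors: set[str] = field(default_factory=set)
    last_changed: Optional[datetime] = None


def aggregate_file_changes(commits: Iterable[CommitInfo]) -> dict[str, FileChange]:
    """Fold commits into per-file totals in a single pass."""
    changes: dict[str, FileChange] = defaultdict(FileChange)
    for commit in commits:
        for path, added in commit.files.items():
            change = changes[path]
            change.count += 1
            change.lines_added += added
            change.authors.add(commit.author)
            if change.last_changed is None or commit.date > change.last_changed:
                change.last_changed = commit.date
    return dict(changes)


commits = [
    CommitInfo("a1", "alice", datetime(2024, 1, 5), {"app.py": 10, "README.md": 2}),
    CommitInfo("b2", "bob", datetime(2024, 2, 1), {"app.py": 4}),
]
changes = aggregate_file_changes(commits)
print(changes["app.py"])
```

Since the function accepts any iterable and keeps only running totals, it also works with a generator of commits and never needs the full diffs in memory.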

Provider System

Provider Interface

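from abc import ABC, abstractmethod
from typing import Optional

# SummaryStyle is defined alongside this class in providers/base.py
class BaseProvider(ABC):
    """Abstract interface for LLM providers."""

    def __init__(self, model: Optional[str] = None, **kwargs):
        self.model = model
        self.config = kwargs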
    
    @abstractmethod
    def generate_summary(
        self,
        sections: dict,
        style: SummaryStyle,
    ) -> str:
        """Generate summary from analysis sections."""
        pass
    
    @abstractmethod
    def is_available(self) -> bool:
        """Check if provider is configured."""
        pass

Adding a New Provider

# providers/anthropic_provider.py

import os

from gitsum.providers.base import BaseProvider, SummaryStyle

class AnthropicProvider(BaseProvider):
    DEFAULT_MODEL = "claude-3-opus"

    def __init__(self, model=None, api_key=None, **kwargs):
        # Forward extra options so BaseProvider stores them in self.config
        super().__init__(model or self.DEFAULT_MODEL, **kwargs)
        self.api_key = api_key or os.getenv("ANTHROPIC_API_KEY")

    def generate_summary(self, sections, style):
        # Call the provider's API here and return the summary text
        raise NotImplementedError

    def is_available(self):
        return bool(self.api_key)

Registration

# providers/__init__.py

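def get_provider(name: str, **kwargs) -> BaseProvider:
    providers = {
        "openai": OpenAIProvider,
        "anthropic": AnthropicProvider,  # Add new provider
    }
    if name not in providers:
        # Fail with a clear message instead of a bare KeyError
        raise ValueError(f"Unknown provider: {name!r}")
    return providers[name](**kwargs)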

Data Models

Model Hierarchy

┌─────────────────────────────────────────────────────────────┐
│                      AnalysisResult                          │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                   HistorySummary                     │    │
│  │  • total_commits     • daily_avg_commits            │    │
│  │  • first_commit_date • weekly_avg_commits           │    │
│  │  • last_commit_date  • longest_inactive_days        │    │
│  │  • total_authors     • total_lines_added/deleted    │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                              │
│  ┌──────────────┐  ┌──────────────┐  ┌─────────────────┐   │
│  │ MajorChange  │  │   RiskFile   │  │ ContributorInfo │   │
│  │ • commit_sha │  │ • path       │  │ • name          │   │
│  │ • author     │  │ • risk_score │  │ • commit_count  │   │
│  │ • impact     │  │ • factors    │  │ • expertise     │   │
│  └──────────────┘  └──────────────┘  └─────────────────┘   │
│                                                              │
│  ┌──────────────┐  ┌──────────────┐                         │
│  │CommitPattern │  │ AnomalyInfo  │                         │
│  │ • keyword    │  │ • type       │                         │
│  │ • count      │  │ • severity   │                         │
│  │ • percentage │  │ • details    │                         │
│  └──────────────┘  └──────────────┘                         │
└─────────────────────────────────────────────────────────────┘
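
Several `HistorySummary` fields can be derived from commit dates alone. The sketch below shows one plausible derivation; the exact definitions (an inclusive day span for the averages, and the largest gap between consecutive commits for `longest_inactive_days`) are assumptions, and `history_stats` is a hypothetical helper that returns a plain dict rather than the Pydantic model.

```python
from datetime import date


def history_stats(commit_dates: list[date]) -> dict:
    """Derive a few HistorySummary-style fields from commit dates."""
    if not commit_dates:
        return {"total_commits": 0}
    ordered = sorted(commit_dates)
    span_days = (ordered[-1] - ordered[0]).days + 1  # inclusive span
    # Days between each pair of consecutive commits
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return {
        "total_commits": len(ordered),
        "first_commit_date": ordered[0],
        "last_commit_date": ordered[-1],
        "daily_avg_commits": round(len(ordered) / span_days, 2),
        "weekly_avg_commits": round(7 * len(ordered) / span_days, 2),
        "longest_inactive_days": max(gaps, default=0),
    }


dates = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 2), date(2024, 1, 10)]
print(history_stats(dates))
```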

Pydantic Models

All data models use Pydantic for:

  • Type validation
  • JSON serialization
  • Default values
  • Documentation

Design Decisions

1. GitPython over Shell Commands

Decision: Use GitPython instead of subprocess git calls

Rationale:

  • Type safety and structured data
  • Cross-platform compatibility
  • Better error handling
  • Easier testing

2. Provider Abstraction

Decision: Abstract LLM providers behind a common interface

Rationale:

  • Easy to swap providers
  • Fallback when API unavailable
  • Testing without API calls
  • Future extensibility
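
The fallback rationale can be illustrated with a small, self-contained sketch: callers ask each provider whether it is available and use the first one that is. The `Provider` class is a stand-in for `BaseProvider` reduced to its two abstract methods, and `UnconfiguredProvider`, `LocalFallbackProvider`, and `summarize` are hypothetical names used only for this example.

```python
from abc import ABC, abstractmethod


class Provider(ABC):
    """Stand-in for BaseProvider, reduced to the two abstract methods."""

    @abstractmethod
    def generate_summary(self, sections: dict, style: str) -> str: ...

    @abstractmethod
    def is_available(self) -> bool: ...


class UnconfiguredProvider(Provider):
    """Simulates a remote provider with no API key set."""

    def generate_summary(self, sections, style):
        raise RuntimeError("no API key configured")

    def is_available(self):
        return False


class LocalFallbackProvider(Provider):
    """Builds a plain-text summary without any network call."""

    def generate_summary(self, sections, style):
        return "\n".join(f"{title}: {body}" for title, body in sections.items())

    def is_available(self):
        return True


def summarize(providers: list[Provider], sections: dict, style: str = "brief") -> str:
    """Use the first provider that reports itself as available."""
    for provider in providers:
        if provider.is_available():
            return provider.generate_summary(sections, style)
    raise RuntimeError("no provider available")


sections = {"History": "120 commits by 4 authors", "Risk": "2 high-risk files"}
print(summarize([UnconfiguredProvider(), LocalFallbackProvider()], sections))
```

Because callers depend only on the two-method interface, swapping or reordering providers requires no changes elsewhere.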

3. Rich for Terminal Output

Decision: Use Rich library for all terminal output

Rationale:

  • Beautiful, consistent formatting
  • Tables, panels, progress bars
  • Color support
  • Markdown rendering

4. Pydantic for Data Models

Decision: Use Pydantic models for all data structures

Rationale:

  • Type validation
  • Easy JSON serialization
  • Self-documenting
  • IDE support

5. Weighted Risk Scoring

Decision: Use weighted factors for risk calculation

Rationale:

  • Configurable importance
  • Transparent scoring
  • Easy to explain
  • Adjustable thresholds

Performance Considerations

Large Repository Handling

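from rich.progress import Progress

# Commit limiting
analyzer = GitAnalyzer(path, limit_commits=1000)

# Progress tracking (commits is the list of commits to process)
with Progress() as progress:
    task = progress.add_task("Processing commits", total=len(commits))
    for commit in commits:
        process(commit)
        progress.update(task, advance=1)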

Memory Efficiency

  • Stream processing for commits
  • Lazy loading of file stats
  • Aggregate data as we go
  • Don't store full diffs

Optimization Opportunities

  1. Parallel processing: Analyze multiple files concurrently
  2. Caching: Cache analysis results
  3. Incremental analysis: Only process new commits
  4. Sampling: Statistically sample for very large repos
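
Incremental analysis (item 3) could look roughly like the sketch below: remember the SHA of the last analyzed commit and, on the next run, stop as soon as it reappears. This assumes commits arrive newest-first on a linear history; `commits_since` is a hypothetical helper, and merge-heavy histories would need more care.

```python
from typing import Iterable, Optional


def commits_since(shas_newest_first: Iterable[str], last_analyzed: Optional[str]) -> list[str]:
    """Return SHAs newer than `last_analyzed`, stopping at the first known one."""
    fresh = []
    for sha in shas_newest_first:
        if sha == last_analyzed:
            break  # everything from here on was processed in an earlier run
        fresh.append(sha)
    return fresh


history = ["d4", "c3", "b2", "a1"]  # newest first
print(commits_since(history, "b2"))  # only the two new commits
print(commits_since(history, None))  # first run: everything
```

Combined with cached results (item 2), only the new commits need to be merged into the stored aggregates.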

Extension Points

Custom Analyzers

class SecurityAnalyzer:
    """Analyze security-related patterns."""
    
    def analyze(self, commits, file_changes):
        # Look for security keywords
        # Check for sensitive file patterns
        # Return SecurityReport
        pass
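
A minimal sketch of what such an analyzer might do is shown below. The keyword list, the file patterns, the dict-based commit shape, and the report layout are all assumptions chosen for illustration; a real `SecurityAnalyzer` would work with the project's own `CommitInfo` and file-change data.

```python
from fnmatch import fnmatch

# Hypothetical keyword and file-pattern lists, chosen only for illustration.
SECURITY_KEYWORDS = ("password", "secret", "token", "cve", "vulnerability")
SENSITIVE_PATTERNS = ("*.pem", "*.key", ".env", "*/.env", "*secrets*")


def scan_commits(commits: list[dict]) -> dict:
    """Flag commits whose message or touched files look security-related.

    Each commit is a dict with 'sha', 'message', and 'files' keys.
    """
    report = {"keyword_hits": [], "sensitive_files": []}
    for commit in commits:
        message = commit["message"].lower()
        if any(word in message for word in SECURITY_KEYWORDS):
            report["keyword_hits"].append(commit["sha"])
        for path in commit["files"]:
            if any(fnmatch(path, pattern) for pattern in SENSITIVE_PATTERNS):
                report["sensitive_files"].append((commit["sha"], path))
    return report


commits = [
    {"sha": "a1", "message": "Rotate API token", "files": ["config/.env"]},
    {"sha": "b2", "message": "Update README", "files": ["README.md"]},
    {"sha": "c3", "message": "Add deploy key", "files": ["deploy/id_rsa.pem"]},
]
print(scan_commits(commits))
```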

Custom Output Formats

class HTMLReporter:
    """Generate HTML report."""
    
    def render(self, analysis: AnalysisResult) -> str:
        # Render Jinja template
        pass

Webhooks/Notifications

class SlackNotifier:
    """Send summary to Slack."""
    
    def notify(self, analysis: AnalysisResult):
        # Post to Slack webhook
        pass

Testing Strategy

Unit Tests

  • Model validation
  • Risk score calculation
  • Contributor aggregation
  • Utility functions

Integration Tests

  • Full pipeline with real repo
  • CLI command execution
  • JSON output validation

Mock Tests

  • Provider API calls
  • Git repository operations
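
As a sketch of how provider calls can be mocked, the example below uses `unittest.mock.MagicMock` in place of a real provider. The `LLMSummarizer` here is a simplified stand-in that takes a sections dict directly (the real class accepts an `AnalysisResult`), and the "brief" style value and the fallback message are assumptions.

```python
from unittest.mock import MagicMock


class LLMSummarizer:
    """Simplified stand-in: delegates to a provider if one is available."""

    def __init__(self, provider):
        self.provider = provider

    def summarize(self, sections: dict) -> str:
        if not self.provider.is_available():
            return "LLM summary unavailable"
        return self.provider.generate_summary(sections, "brief")


def test_summarize_uses_provider():
    provider = MagicMock()
    provider.is_available.return_value = True
    provider.generate_summary.return_value = "mocked summary"

    result = LLMSummarizer(provider).summarize({"History": "3 commits"})

    assert result == "mocked summary"
    provider.generate_summary.assert_called_once_with({"History": "3 commits"}, "brief")


def test_summarize_skips_unavailable_provider():
    provider = MagicMock()
    provider.is_available.return_value = False

    assert LLMSummarizer(provider).summarize({}) == "LLM summary unavailable"
    provider.generate_summary.assert_not_called()


# Run directly here; under pytest these functions would be collected automatically.
test_summarize_uses_provider()
test_summarize_skips_unavailable_provider()
```

No network access or API key is needed, so tests like these can run in CI without secrets.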

Future Considerations

  1. Multi-repo analysis: Compare across repositories
  2. Time-series tracking: Track metrics over time
  3. CI/CD integration: GitHub Actions, GitLab CI
  4. Web dashboard: Visual reporting interface
  5. Custom rules: User-defined risk factors