Skip to content

[nightshift] bus-factor: contributor risk analysis #19

@Microck

Description

@Microck

Automated by Nightshift v3 (GLM 5.1).

Bus Factor Analysis Report: Microck/traccia

Repository: Microck/traccia
Language: Python (3.12+)
Size: ~385 KB source (~6,710 lines in src/traccia/, ~2,208 lines in tests/)
Date: 2026-04-21


Executive Summary

Bus Factor: 1 — The repository has a single contributor (Microck) who authored 100% of the codebase in what appears to be a single merged commit. Every module, every design decision, and every piece of domain logic exists in one person's head. This is the most extreme bus factor risk possible.

The codebase is well-structured, with clear separation of concerns across 17 source modules and strong architectural documentation (CLAUDE.md, docs/). However, several modules are large and complex enough that onboarding a second contributor would require significant effort, particularly around the pipeline orchestration, LLM backend integration, and scoring heuristics.


1. Contributor Analysis

Metric Value
Total contributors 1 (Microck)
Commits visible (shallow clone) 1 (squash-merge)
% of code by primary author 100%
External contributions None
CODEOWNERS file Absent
Contribution guidelines (CONTRIBUTING.md) Absent

Severity: P0 — Complete knowledge concentration in a single individual.


2. Module Dependency Graph

2.1 Fan-In (Most Depended-Upon — Highest Risk if Broken)

Module Depended On By Role
models 12 modules Pydantic data models (296 LOC, 28 classes)
config 8 modules YAML configuration loading (96 LOC)
utils 5 modules Shared utility functions (45 LOC)
taxonomy 4 modules Skill name/pattern definitions (123 LOC)
storage 3 modules SQLite persistence layer (517 LOC)
pipeline_support 2 modules Scoring heuristics (163 LOC)
rendering 2 modules Markdown/HTML/JSON generation (930 LOC)
llm 2 modules LLM backend abstraction (506 LOC)

2.2 Fan-Out (Most Complex — Highest Cognitive Load)

Module Dependencies LOC Functions
pipeline 10 modules 1,251 59
cli 6 modules 432 31
llm 5 modules 506 33
parsers 5 modules 728 27
rendering 4 modules 930 26

2.3 Critical Dependency Chains

cli → pipeline → {storage, llm, parsers, source_detection, rendering, pipeline_support, taxonomy}
                   llm → {extraction, pipeline_support, taxonomy}
                   parsers → {document_normalizer, family_normalizer}
                   rendering → {storage}

The pipeline module is the central orchestrator and has the highest coupling (10 direct dependencies). Any failure in understanding this module blocks comprehension of the entire data flow.


3. Single-Point-of-Failure Modules

P0: pipeline.py (1,251 LOC, 59 functions)

Risk: This is the application's brain. It orchestrates:

  • Source discovery and import
  • Batch ingestion with progress tracking
  • Evidence extraction via LLM backends
  • Skill canonicalization (LLM-based deduplication)
  • Skill scoring (LLM-based level assignment)
  • Graph recomputation
  • Review queue management
  • Archive handling (ZIP)
  • File watching

Why it's critical: Every end-to-end workflow flows through this file. It directly depends on 10 other modules and contains the most complex state management in the codebase (e.g., recompute_graph() at ~170 lines, ingest_directory() at ~135 lines).

Mitigation:

  • Extract recompute_graph() into its own module (e.g., graph_builder.py)
  • Extract the ingestion loop into ingest_worker.py
  • Add integration-level documentation/architecture decision records
  • Target: No single file > 400 LOC

P0: rendering.py (930 LOC, 26 functions)

Risk: Contains both data transformation logic (graph/tree payloads) and output generation (Markdown, JSON, HTML viewer, Obsidian export). The embedded HTML/CSS/JS viewer (~500 lines of inline template) makes this file extremely difficult to navigate and review.

Mitigation:

  • Move the viewer HTML template to a separate file or template engine
  • Split Obsidian export into rendering/obsidian.py
  • Split node page generation into rendering/nodes.py

P1: parsers.py (728 LOC, 27 functions)

Risk: Multi-format parsing (Markdown, JSON, CSV, AI conversation exports, Reddit exports, Google Activity). Contains subtle detection heuristics (_looks_like_ai_conversation_export, _looks_like_reddit_export) that are fragile and domain-specific.

Mitigation:

  • Split format-specific parsers into parsers/ package (one file per format family)
  • Add golden test fixtures for each detection heuristic

P1: storage.py (517 LOC, 28 functions)

Risk: Single SQLite persistence layer with schema migrations inline. The _ensure_columns pattern is an ad-hoc migration system that only adds columns — no versioned migrations, no rollback capability.

Mitigation:

  • Introduce a proper migration framework (e.g., version-numbered SQL scripts)
  • Add schema version tracking

P1: llm.py (506 LOC, 33 functions)

Risk: Dual-backend implementation (FakeLLM + OpenAI-Compatible) with complex retry logic, JSON repair, curl fallback, and structured output handling. The curl subprocess-based HTTP client is unusual and could be a source of subtle bugs.

Mitigation:

  • Use httpx or requests instead of subprocess curl
  • Separate the HTTP transport layer from the LLM protocol layer

P2: family_normalizer.py (449 LOC, 18 functions)

Risk: Platform-specific normalization for 6 export families (Google Takeout, Twitter, Reddit, Instagram, Facebook, Discord). Each family has unique parsing rules. Only the original author would know the edge cases.

Mitigation:

  • One file per family under family_normalizer/ package
  • Document expected input schemas for each family

4. Knowledge Concentration Analysis

4.1 Domain-Specific Business Logic (High Risk)

Area Location Description
Skill scoring formula pipeline_support.py:106-153 The level-capping and scoring algorithm — this IS the product
Evidence weight tables pipeline_support.py:22-42, extraction.py:29-40 Numeric constants that define evidence quality
Source family detection source_detection.py (252 LOC) Marker-based heuristics for 6 platform export formats
Signal classification extraction.py:164-192 Maps source categories to signal classes
Reliability tiers extraction.py:129-142 Maps evidence types + categories to reliability

These are not standard library patterns — they encode domain-specific decisions about how to evaluate skill evidence. No one but the original author understands why these specific thresholds were chosen.

4.2 Undocumented Design Decisions

  • Why consumption_max_level = 2? (Config default — no ADR)
  • Why support_score >= 1.2 for non-taxonomy skills? (pipeline_support.py:75)
  • Why 0.85 printability threshold for text sniffing? (parsers.py:168)
  • Why MAX_EXTRACTION_SPANS = 6? (pipeline.py:91)
  • Why inline curl subprocess instead of Python HTTP? (llm.py:316)

4.3 Implicit Architecture Decisions

  • Single-file SQLite database (no concurrent access design)
  • No database migrations framework (only additive column additions)
  • Evidence extraction is per-file, never cross-file (CLAUDE.md rule)
  • LLM prompts are loaded from files at runtime (no prompt versioning)
  • Review queue is both SQLite-backed AND file-backed (JSONL sync)

5. Testing Coverage Analysis

Source Module LOC Test File Test LOC Ratio
pipeline.py 1,251 test_pipeline.py 773 0.62x
parsers.py 728 test_parsers.py 509 0.70x
models.py 296 test_models.py 32 0.11x
source_detection.py 252 test_source_detection.py 66 0.26x
storage.py 517 (none) 0 0x
llm.py 506 (none) 0 0x
rendering.py 930 (none) 0 0x
family_normalizer.py 449 (none) 0 0x
document_normalizer.py 287 (none) 0 0x
cli.py 432 (none) 0 0x
pipeline_support.py 163 (none) 0 0x
config.py 96 (via test_init.py) 478

Critical gaps: storage.py, llm.py, rendering.py, family_normalizer.py, and cli.py have zero direct test coverage. These contain 3,090 lines of code — 46% of the source base.


6. Risk Summary Table

ID Severity Module/File Risk Impact
BF-01 P0 Entire codebase Single contributor (bus factor = 1) Project dies if contributor leaves
BF-02 P0 pipeline.py (1,251 LOC) God-module: 10 deps, 59 functions, all workflows No one can modify the pipeline without deep study
BF-03 P0 pipeline_support.py:66-84 Scoring threshold constants unexplained Wrong tweaks silently corrupt skill levels
BF-04 P1 rendering.py (930 LOC) Mixed concerns: data transform + templates + export Hard to review, easy to break output
BF-05 P1 parsers.py (728 LOC) Multi-format parser with fragile heuristics New formats require understanding all existing ones
BF-06 P1 storage.py (517 LOC) No migration framework, no versioned schema Schema changes are risky and irreversible
BF-07 P1 llm.py:316-366 Subprocess curl for HTTP requests Security and reliability risk
BF-08 P1 6 modules (3,090 LOC) Zero test coverage Changes are unverified
BF-09 P2 family_normalizer.py 6 platform export normalizers in one file Each platform's edge cases are entangled
BF-10 P2 No CODEOWNERS / CONTRIBUTING.md No documented contribution process New contributors have no guidance
BF-11 P2 No ADR documents Design decisions only in the author's head Tribal knowledge, not institutional
BF-12 P3 taxonomy.py Only 12 skills defined Taxonomy growth requires understanding the pattern

7. Actionable Recommendations

Immediate (Before Adding Contributors)

  1. Add CODEOWNERS and CONTRIBUTING.md
    Define who can review, how to submit PRs, and coding conventions.

  2. Write Architecture Decision Records (ADRs)
    Create docs/decisions/ with records for:

    • Why SQLite (not Postgres/duckdb)
    • Why per-file extraction (not cross-file)
    • Why curl subprocess for HTTP
    • Scoring formula rationale
    • Evidence weight table rationale
  3. Split pipeline.py into a package

    pipeline/
      __init__.py       # re-exports Pipeline class
      core.py           # Pipeline class with add/ingest methods
      graph_builder.py  # recompute_graph()
      ingest.py         # _ingest_material, _discover_materials
      manifest.py       # _write_ingest_manifest, BatchResult
    
  4. Extract the HTML viewer from rendering.py
    Move the inline HTML/CSS/JS to viewer_template.html and load it at runtime.

Short-Term (First Month of Multi-Contributor Phase)

  1. Add tests for untested critical modules
    Priority: storage.pyllm.pyrendering.pypipeline_support.py

  2. Replace curl subprocess with httpx
    Use httpx with retry/timeout configuration. This removes a subprocess dependency, simplifies error handling, and enables async future use.

  3. Split parsers.py into a package

    parsers/
      __init__.py         # re-exports parse_document
      core.py             # parse_document, _parse_source_content
      text.py             # _read_text, sniff_text_bytes
      json_exports.py     # AI conversation, Reddit, Google Activity parsers
      csv_parser.py       # CSV handling
      segmentation.py     # _segment_text
    
  4. Introduce schema versioning for SQLite
    Add a schema_version table and numbered migration scripts.

Medium-Term

  1. Add integration test suite
    Golden-file tests: ingest a corpus → verify evidence JSON → verify graph JSON → verify rendered markdown.

  2. Document scoring formula with examples
    Create a notebook or markdown doc showing example evidence items and their computed scores, explaining why each threshold was chosen.


8. Module Criticality Ranking

Rank Module Why Critical
1 pipeline.py Central orchestrator; all workflows pass through it
2 pipeline_support.py Contains the scoring formula — the product's core IP
3 models.py Data contracts; breaking changes cascade to 12 modules
4 storage.py Persistence layer; data loss here is catastrophic
5 extraction.py Evidence classification logic; quality depends on correct heuristics
6 llm.py LLM integration; correctness of structured output parsing is fragile
7 parsers.py Input handling; format support defines what the tool can ingest
8 rendering.py Output generation; user-facing artifact quality
9 taxonomy.py Skill definitions; must grow correctly
10 family_normalizer.py Platform-specific parsing; breadth of supported sources
11 config.py Configuration schema; relatively simple
12 source_detection.py Archive/format detection; mostly data-driven
13 document_normalizer.py PDF/DOCX handling; well-isolated
14 utils.py Pure utilities; easily understood
15 cli.py Thin wrapper; easily regenerated from API
16 bootstrap.py Project initialization; runs once

Conclusion

Traccia has a bus factor of 1 with extreme knowledge concentration. The codebase quality is high — clean architecture, strict Pydantic models, good separation of parsing from logic. However, the combination of single-authorship, several large monolithic modules (especially pipeline.py at 1,251 LOC), undocumented scoring thresholds, and 46% of source code lacking direct tests creates significant resilience risk.

The highest-priority actions are: (1) decompose pipeline.py, (2) document the scoring rationale, and (3) add tests for storage.py and llm.py. These three actions would meaningfully reduce the onboarding time for a second contributor from weeks to days.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions