Pre-synthesized content and typed ontology for curaitor knowledge base #1

@jdidion

Motivation

Currently, curaitor re-reads full note text every session. Topic notes (e.g., Improving LLM Guidance) can be 500+ lines. Pre-synthesizing key claims, relationships, and summaries would reduce token usage and enable faster reasoning.

Inspired by obsidian-nerv, a typed ontology system for Obsidian vaults with write-time validation. See the MCP support issue filed to enable integration.

Design

Pre-synthesis: hash-gated frontmatter + derived index

For article notes (small), add to frontmatter:
```yaml
content_hash: "a3f7b2c1"
llm_summary: "Novel method linking chromatin accessibility to gene expression..."
key_claims:
  - "Outperforms existing multiome prediction by 15%"
relationships:
  - {type: "extends", target: "[[scMultiPreDICT]]"}
```
LLM reads frontmatter first. If `content_hash` matches body, trust the synthesis. If stale, re-synthesize.
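The hash gate described above could be sketched roughly as follows. This is a minimal sketch, not the project's implementation: the helper names (`split_frontmatter`, `body_hash`, `is_stale`) are hypothetical, and the 8-character truncation is an assumption based on the length of the example hash.

```python
import hashlib


def split_frontmatter(text: str) -> tuple[str, str]:
    """Split a note into (frontmatter, body); frontmatter is '' if absent."""
    if text.startswith("---\n"):
        # Start at index 3 so an empty frontmatter block ("---\n---\n") is found.
        end = text.find("\n---\n", 3)
        if end != -1:
            return text[4:end], text[end + 5:]
    return "", text


def body_hash(body: str, length: int = 8) -> str:
    """Hex SHA-256 of the body, truncated to `length` chars (assumed length)."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()[:length]


def is_stale(note_text: str, stored_hash: str) -> bool:
    """True if the stored content_hash no longer matches the note body."""
    _, body = split_frontmatter(note_text)
    return body_hash(body) != stored_hash
```

Hashing only the body (not the frontmatter) matters here: otherwise writing a new `llm_summary` into the frontmatter would itself invalidate the hash.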

For topic notes (large, relationship-heavy): derive a compiled index:
```yaml
# config/knowledge-index.yaml
topics:
  Improving LLM Guidance:
    hash: "b4e8c3d2"
    summary: "Patterns for making AI agents reliable..."
    key_themes: [harness-design, context-management, verification]
    relationships:
      - {to: "AI-Assisted Development", type: "see-also"}
      - {to: "Agent Memory Strategies", type: "depends-on"}
```
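One benefit of a compiled index is that relationships can be looked up without opening any topic note. A rough sketch of such a lookup, assuming the index has already been loaded into a plain dict (written as a literal here to avoid a YAML dependency); the function name `related_topics` is hypothetical:

```python
from typing import Optional

# The index as it might look after loading config/knowledge-index.yaml.
index = {
    "topics": {
        "Improving LLM Guidance": {
            "hash": "b4e8c3d2",
            "summary": "Patterns for making AI agents reliable...",
            "key_themes": ["harness-design", "context-management", "verification"],
            "relationships": [
                {"to": "AI-Assisted Development", "type": "see-also"},
                {"to": "Agent Memory Strategies", "type": "depends-on"},
            ],
        }
    }
}


def related_topics(index: dict, topic: str, rel_type: Optional[str] = None) -> list[str]:
    """Return the targets of a topic's relationships, optionally filtered by type."""
    entry = index.get("topics", {}).get(topic, {})
    return [
        rel["to"]
        for rel in entry.get("relationships", [])
        if rel_type is None or rel["type"] == rel_type
    ]
```

For example, `related_topics(index, "Improving LLM Guidance", "depends-on")` returns only the `depends-on` targets, while omitting `rel_type` returns all of them.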

Typed ontology for notes

Enforce note types and relationships:

  • Article types: article, paper, tool, post
  • Topic types: topic, idea, reference
  • Relationship types: extends, compares-to, depends-on, replaces, contains
  • Write-time validation: triage-write.py validates frontmatter schema
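A write-time check along these lines might look like the sketch below. It assumes the frontmatter is already parsed into a dict and that the note type lives in a `type` field; both the field name and the `validate_frontmatter` function are assumptions for illustration, not the actual behavior of triage-write.py.

```python
ARTICLE_TYPES = {"article", "paper", "tool", "post"}
TOPIC_TYPES = {"topic", "idea", "reference"}
RELATIONSHIP_TYPES = {"extends", "compares-to", "depends-on", "replaces", "contains"}


def validate_frontmatter(fm: dict) -> list[str]:
    """Return a list of human-readable schema errors (empty if valid)."""
    errors = []
    note_type = fm.get("type")  # assumed field name
    if note_type not in ARTICLE_TYPES | TOPIC_TYPES:
        errors.append(f"unknown note type: {note_type!r}")
    for i, rel in enumerate(fm.get("relationships", [])):
        if rel.get("type") not in RELATIONSHIP_TYPES:
            errors.append(f"relationships[{i}]: unknown type {rel.get('type')!r}")
        if not rel.get("target"):
            errors.append(f"relationships[{i}]: missing target")
    return errors
```

Returning a list of errors rather than raising on the first one lets the writer report every problem in a note at once. Note that the `see-also` type used in the index example is not in the relationship list above, so this sketch would reject it unless the list is extended.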

Sync mechanism

  • content_hash in frontmatter/index detects stale synthesis
  • Pre-session script (scripts/synthesize.py) validates hashes, re-synthesizes stale entries
  • Extend existing prefetch-review.py to check synthesis freshness
  • Re-synthesis runs LLM pass only on changed notes (not all)
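The "only changed notes" behavior could be sketched as a loop that compares hashes and calls the LLM only on mismatches. This is a sketch under assumptions: `resynthesize_stale` and `short_hash` are hypothetical names, the `synthesize` callable stands in for whatever LLM call scripts/synthesize.py ends up making, and notes are passed in as a name-to-body mapping rather than read from disk.

```python
import hashlib
from typing import Callable


def short_hash(body: str, length: int = 8) -> str:
    """Truncated hex SHA-256 of the note body (truncation length is assumed)."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()[:length]


def resynthesize_stale(
    notes: dict[str, str],
    index: dict[str, dict],
    synthesize: Callable[[str], dict],
) -> list[str]:
    """Refresh index entries whose hash is missing or outdated.

    notes: note name -> current body text.
    index: note name -> entry containing at least a "hash" key (mutated in place).
    Returns the names of the notes that were re-synthesized.
    """
    refreshed = []
    for name, body in notes.items():
        current = short_hash(body)
        entry = index.get(name)
        if entry is not None and entry.get("hash") == current:
            continue  # synthesis is fresh; skip the LLM pass
        index[name] = {**synthesize(body), "hash": current}
        refreshed.append(name)
    return refreshed
```

Passing the synthesizer in as a callable keeps the hash logic testable without an LLM, and the same function works whether it runs from a cron job or on demand before a review session.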

Implementation steps

  1. Add content_hash + llm_summary + key_claims to triage-write.py note template
  2. Create scripts/synthesize.py — hash-check + LLM synthesis for stale notes
  3. Create config/knowledge-index.yaml — derived index for topic notes
  4. Extend prefetch-review.py to load synthesis data
  5. Add relationship types to reading-prefs or a new ontology config
  6. Optional: integrate with obsidian-nerv if MCP support lands

Open questions

  • Is YAML frontmatter sufficient, or do we need sidecar files for large syntheses?
  • Should relationships be stored per-note or in the central index?
  • How to handle synthesis for notes that are mostly links (e.g., Tools & Projects)?
  • Should the synthesis script run as a cron job or on-demand before review sessions?

Sync problem: three approaches

The core challenge with pre-synthesized content is keeping it in sync with the source text. Three approaches with different tradeoffs:

Option A: Hash-gated frontmatter synthesis

Store synthesis inline in YAML frontmatter alongside a content hash:

```yaml
content_hash: "a3f7b2c1"  # SHA256 of body text
llm_summary: "Novel method linking chromatin accessibility to gene expression..."
key_claims:
  - "Outperforms existing multiome prediction by 15%"
relationships:
  - {type: extends, target: "[[scMultiPreDICT]]"}
```
  • LLM reads frontmatter first; if content_hash matches body, trusts the synthesis
  • If hash is stale, re-synthesize on demand
  • Pro: co-located, human can inspect synthesis, single file
  • Con: frontmatter gets large; YAML isn't great for long text

Option B: Sidecar files (.synthesis.yaml)

Separate file alongside each note:

```
Curaitor/Inbox/SPEAR.md            <- human reads this
Curaitor/Inbox/.SPEAR.synth.yaml   <- LLM reads this
```
  • Pro: clean separation, no frontmatter bloat, richer structures
  • Con: two files to manage, Obsidian doesn't index hidden files, can drift
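If sidecars were adopted, the naming convention and the drift risk could be handled with something like the following sketch. The functions `sidecar_path` and `orphaned_sidecars` are hypothetical, and the `.<stem>.synth.yaml` pattern is inferred from the single example above.

```python
from pathlib import PurePosixPath

SIDECAR_SUFFIX = ".synth.yaml"


def sidecar_path(note_path: str) -> str:
    """Map a note path to its hidden sidecar path in the same directory."""
    note = PurePosixPath(note_path)
    return str(note.with_name(f".{note.stem}{SIDECAR_SUFFIX}"))


def orphaned_sidecars(paths: list[str]) -> list[str]:
    """Return sidecar paths whose corresponding .md note is not in `paths`."""
    present = set(paths)
    orphans = []
    for p in paths:
        name = PurePosixPath(p).name
        if name.startswith(".") and name.endswith(SIDECAR_SUFFIX):
            stem = name[1:-len(SIDECAR_SUFFIX)]
            note = str(PurePosixPath(p).with_name(stem + ".md"))
            if note not in present:
                orphans.append(p)
    return orphans
```

A check like `orphaned_sidecars` would catch one form of drift (a note renamed or deleted without its sidecar); detecting stale sidecar contents would still need the content hash.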

Option C: Derived index (single compiled file)

One file for the entire knowledge base:

```yaml
# config/knowledge-index.yaml
articles:
  SPEAR:
    path: Curaitor/Inbox/SPEAR.md
    hash: a3f7b2c1
    summary: "..."
    claims: [...]
    relationships: [...]
```
  • Pro: single file for LLM to read, efficient token-wise, queryable
  • Con: another artifact to keep in sync; a pre-session script has to validate all hashes

Recommendation

Option A (hash-gated frontmatter) for article notes (small, structured). Option C (derived index) for topic notes (large, relationship-heavy). A pre-session script (scripts/synthesize.py) validates hashes and re-synthesizes stale entries. Extend existing prefetch-review.py to check synthesis freshness.
