Author: Morpheus (SDK Dev)
Date: 2026-04-06
Status: Research document — proposals only, no implementations.
Andrej Karpathy's tweet outlined a paradigm shift: using LLMs not just for Q&A, but as knowledge compilers — tireless librarians that ingest raw data and produce structured, navigable knowledge.
```
raw/ folder       →  LLM "compilation"  →  Markdown wiki  →  Obsidian as IDE
(papers, code,       (summarize, link,     (structured,      (graph view,
 images, tweets)      backlink, index)      interlinked)      search, navigate)
                                                                    ↓
                                             Q&A & search   →  New outputs
                                             (LLM queries      (slides, reports,
                                              its own wiki)     new articles)
```
- LLM as compiler, not assistant — Raw data goes in, structured knowledge comes out. The LLM does the filing, linking, and organizing autonomously.
- Obsidian as the IDE for knowledge — Just as VS Code is the IDE for code, Obsidian (with its graph view and backlinks) becomes the IDE for navigating compiled knowledge.
- No RAG needed at PKB scale — At ~100 articles / 400K words, structured Markdown with indexes is sufficient. RAG and vector DBs are only needed at much larger scale.
- Health checks for consistency — LLMs run periodic "lint" passes over the wiki to fix broken links, merge duplicates, and ensure consistency.
- Knowledge manipulation > code manipulation — The fundamental shift: we're moving from manipulating instructions (code) to manipulating understanding (knowledge graphs).
- Minimal human intervention — Humans steer direction; LLMs handle the tedium of organizing, linking, and maintaining.
The original Python project has evolved significantly since inspiring this .NET port:
- Two-pass architecture: Deterministic AST extraction (tree-sitter) + parallel LLM subagents (Claude) for semantic relationships
- Auto-rebuild / watch mode: File watcher that rebuilds the graph on code changes without requiring an LLM
- Token efficiency: 70x fewer tokens per query vs. reading raw files — critical for cost-effective LLM usage
- Vis.js interactive HTML: Click-through graph visualization with community filtering and search
- Coding agent skill: Available as a drop-in skill for Claude Code, Codex, OpenCode, and OpenClaw
- PyPI distribution: `pip install graphifyy` (namespace reclaim in progress)
- Confidence tagging: EXTRACTED / INFERRED / AMBIGUOUS on every relationship (we already have this in our schema)
What we can learn: Watch mode, agent skill packaging, and token efficiency benchmarks are features we should prioritize.
The Karpathy tweet spawned a wave of tools and curated resources:
| Project | Description | Relevance to graphify-dotnet |
|---|---|---|
| awesome-llm-knowledge-bases | Curated list: ingestion, wiki compilation, linting, RAG, agents | Reference for ecosystem positioning |
| awesome-llm-knowledge-systems | RAG, context engineering, agent memory, MCP | Architecture inspiration for MCP integration |
| AnythingLLM | Local/private ChatGPT with personal KB, workspace-based RAG | Shows demand for local-first, private knowledge systems |
| knowledge-base-builder | Transform docs to structured Markdown via LLM summarization | Validates the "LLM as compiler" pattern |
| cognee | Python repo → knowledge graph pipeline | Direct competitor in the code-to-graph space |
Key trend: The community is converging on graph-structured knowledge + LLM compilation + local-first privacy. graphify-dotnet is well-positioned in this space as the .NET native option.
The .NET AI landscape has matured rapidly:
- Microsoft.Extensions.AI is now the foundation layer (`IChatClient`, `IEmbeddingGenerator`) — we already use this
- Semantic Kernel sits above ME.AI as an orchestration layer with planning, memory, and multi-agent workflows
- Microsoft Agent Framework (MAF) is the newest unified framework, built directly on `IChatClient`
- OllamaSharp is the recommended .NET Ollama client (replacing the `Microsoft.Extensions.AI.Ollama` preview)
- .NET Aspire 9.4 offers interactive dashboards with custom commands, OTEL metrics, and AI integration
- Tree-sitter .NET bindings have matured: `TreeSitter.DotNet` (28+ grammars) and `TreeSitterLanguagePack` (248+ languages)
- Description: Package Graphify.Cli as a global .NET tool so users can install with `dotnet tool install -g graphify` and run with just `graphify run .`
- Rationale: Eliminates the clone-and-build workflow. The Python version is on PyPI (`pip install graphifyy`); we need parity. Global tools are the standard .NET distribution mechanism for CLI apps.
- How: Add `<PackAsTool>true</PackAsTool>` and `<ToolCommandName>graphify</ToolCommandName>` to Graphify.Cli.csproj. Consider Native AOT for self-contained packaging.
- Difficulty: Easy
- Priority: High
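The csproj change is small; a minimal sketch using the standard .NET SDK packaging properties (the `PackageOutputPath` value here is just illustrative):

```xml
<!-- Graphify.Cli.csproj — opt in to global-tool packaging -->
<PropertyGroup>
  <PackAsTool>true</PackAsTool>
  <ToolCommandName>graphify</ToolCommandName>
  <PackageOutputPath>./nupkg</PackageOutputPath>
</PropertyGroup>
```

After `dotnet pack`, the resulting .nupkg installs via `dotnet tool install` and exposes the `graphify` command on PATH.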
- Description: Add IChatClient configuration for Azure OpenAI endpoints alongside the existing AI provider support.
- Rationale: Enterprise users often have Azure OpenAI deployments with managed access, private endpoints, and compliance guarantees. This is table-stakes for enterprise adoption.
- How: Use `Microsoft.Extensions.AI.OpenAI` with Azure OpenAI endpoint configuration. The `IChatClient` abstraction means this is mostly configuration, not new extraction logic.
- Difficulty: Easy
- Priority: High
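Because everything downstream only sees `IChatClient`, the Azure path is mostly client construction. A sketch assuming the `Azure.AI.OpenAI` and `Microsoft.Extensions.AI.OpenAI` packages (endpoint, key variable, and deployment name are placeholders; the adapter method name follows recent ME.AI previews):

```csharp
using System;
using System.ClientModel;
using Azure.AI.OpenAI;
using Microsoft.Extensions.AI;

// Sketch: adapt an Azure OpenAI deployment to the ME.AI abstraction.
IChatClient chatClient =
    new AzureOpenAIClient(
            new Uri("https://my-resource.openai.azure.com/"),   // your Azure OpenAI endpoint
            new ApiKeyCredential(Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!))
        .GetChatClient("gpt-4o")   // Azure deployment name, not model name
        .AsIChatClient();          // bridge into Microsoft.Extensions.AI

// The extraction pipeline is unchanged — it only ever sees IChatClient.
```

For managed identity instead of keys, the same constructor accepts an Azure `TokenCredential`, which is usually what enterprise deployments want.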
- Description: Support local LLM inference via Ollama for offline, private, and cost-free semantic extraction.
- Rationale: Aligns with Karpathy's vision of local-first knowledge systems. Privacy-sensitive codebases (healthcare, finance, defense) cannot send code to cloud APIs. OllamaSharp is now the recommended .NET Ollama client and implements `IChatClient` natively.
- How: Add the `OllamaSharp` NuGet package. Configure `IChatClient` with `new OllamaApiClient(new Uri("http://localhost:11434/"), modelId)`. Same extraction pipeline, different provider.
- Difficulty: Easy
- Priority: High
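Since `OllamaApiClient` implements `IChatClient` directly, the provider swap is a one-liner. A sketch assuming the `OllamaSharp` package and a locally pulled model (the model tag is a placeholder):

```csharp
using System;
using Microsoft.Extensions.AI;
using OllamaSharp;

// OllamaApiClient implements IChatClient; the model must already be pulled locally
// (e.g. `ollama pull llama3.1` on the host).
IChatClient chatClient = new OllamaApiClient(
    new Uri("http://localhost:11434/"),
    "llama3.1");

// Same extraction pipeline, different provider — no code leaves the machine.
```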
- Description: File watcher that monitors the target directory and re-processes only changed files, updating the knowledge graph incrementally.
- Rationale: The Python version already has this. For large codebases, full re-extraction is expensive (time + tokens). SHA256 caching infrastructure already exists in our pipeline — this builds on it.
- How: Use `FileSystemWatcher` to detect changes. Leverage the existing SHA256 cache to identify changed files. Re-run extraction only for modified files, then merge into the existing graph.
- Difficulty: Medium
- Priority: High
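The core loop is: watch, hash, compare against the cache, and re-extract only on real content changes. A self-contained sketch — `ReExtract` stands in for the existing per-file extraction stage, and the in-memory dictionary stands in for the existing SHA256 cache:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Security.Cryptography;

var hashCache = new ConcurrentDictionary<string, string>();   // path -> SHA256 hex

static string Sha256Of(string path)
{
    using var sha = SHA256.Create();
    using var stream = File.OpenRead(path);
    return Convert.ToHexString(sha.ComputeHash(stream));
}

using var watcher = new FileSystemWatcher("/path/to/repo")
{
    IncludeSubdirectories = true,
    EnableRaisingEvents = true,
};
watcher.Changed += (_, e) =>
{
    var hash = Sha256Of(e.FullPath);
    if (hashCache.TryGetValue(e.FullPath, out var old) && old == hash)
        return;                        // touch/save with no content change — skip
    hashCache[e.FullPath] = hash;
    // ReExtract(e.FullPath);          // hypothetical: re-run extraction, merge into graph
};
```

In practice `Changed` fires several times per save, so a short debounce window before hashing is worth adding.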
- Description: Replace regex-based C# code extractors with proper Roslyn syntax/semantic analysis for type-safe, accurate AST extraction.
- Rationale: Regex is brittle with C# syntax (generics, nested types, attributes, LINQ expressions). Roslyn provides the full compilation model — type resolution, symbol binding, call graph analysis. This is a unique advantage the .NET port has over the Python version for C# codebases.
- How: Use `Microsoft.CodeAnalysis.CSharp` to parse syntax trees. Walk the semantic model to extract classes, methods, interfaces, inheritance, dependency injection registrations, and call graphs. Output as `ExtractionResult` using the existing schema.
- Difficulty: Hard
- Priority: High
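To make the Roslyn step concrete, here is a syntax-only sketch that lists classes and their base types from a single file. The real extractor would additionally build a `Compilation` to get the semantic model (symbol binding, type resolution) before emitting `ExtractionResult` nodes and edges:

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

// Parse one file and walk its class declarations.
var source = File.ReadAllText("Sample.cs");
var tree = CSharpSyntaxTree.ParseText(source);
var root = tree.GetCompilationUnitRoot();

foreach (var cls in root.DescendantNodes().OfType<ClassDeclarationSyntax>())
{
    // BaseList covers both base classes and implemented interfaces at the syntax level;
    // distinguishing them reliably needs the semantic model.
    var bases = cls.BaseList?.Types.Select(t => t.ToString()) ?? Enumerable.Empty<string>();
    Console.WriteLine($"{cls.Identifier.Text} : {string.Join(", ", bases)}");
}
```

Unlike regex, this handles generics, nested types, and attributes for free, because the parser understands the grammar rather than matching text.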
- Description: Upgrade from `TreeSitter.Bindings` to `TreeSitter.DotNet` or `TreeSitterLanguagePack` for robust multi-language AST parsing with 28-248+ language grammars.
- Rationale: Current tree-sitter integration is basic. `TreeSitter.DotNet` (by mariusgreuel) ships with 28+ pre-compiled grammars, cross-platform native libraries, and predicate support. `TreeSitterLanguagePack` bundles 248+ parsers. This would give us best-in-class polyglot parsing.
- How: Evaluate `TreeSitter.DotNet` (NuGet, v1.3.0) vs `TreeSitterLanguagePack`. Replace the current bindings while maintaining the same `ExtractionResult` output contract.
- Difficulty: Medium
- Priority: Medium
- Description: Offer Semantic Kernel or Microsoft Agent Framework (MAF) as an alternative orchestration layer for extraction, enabling multi-step reasoning, planning, and tool use during graph building.
- Rationale: Current extraction uses single-shot `IChatClient` calls. For complex codebases, multi-step agent reasoning could improve extraction quality — e.g., an agent that reads a class, follows its dependencies, and builds a richer subgraph. SK's plugin system and MAF's agent-first design both support this pattern.
- How: Create an `AgentExtractor` that wraps an SK Kernel or MAF agent with extraction-specific tools (read file, query existing graph, resolve type). Register as an alternative `IPipelineStage`.
- Difficulty: Hard
- Priority: Medium
- Description: Use the built knowledge graph as a retrieval source for Retrieval-Augmented Generation pipelines, enabling LLM-powered Q&A grounded in graph structure.
- Rationale: This is the natural next step after building a knowledge graph. Instead of flat document search, RAG queries traverse the graph to find relevant nodes, their connections, and community context. This aligns directly with Karpathy's "Q&A and search" phase.
- How: Implement an `IEmbeddingGenerator`-powered index over graph nodes. Queries combine graph traversal (BFS/DFS from relevant nodes) with semantic similarity. Return structured context (node + edges + community) to the LLM.
- Difficulty: Hard
- Priority: Medium
- Description: Provide a GitHub Action that runs graphify in CI pipelines to detect architecture drift, report new dependencies, and flag structural changes between commits.
- Rationale: Knowledge graphs are most valuable when kept current. Running in CI ensures the graph stays fresh and can detect when architectural decisions are violated. Teams can set policies like "no new god nodes" or "all services must belong to a community."
- How: Create a Docker-based or composite GitHub Action. Compare graph snapshots between commits. Output a diff report as a PR comment or check annotation.
- Difficulty: Medium
- Priority: Medium
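A composite action could be little more than a wrapper around the CLI. A sketch that assumes the global-tool packaging proposal ships first; the `diff` subcommand and its flags are hypothetical and would need to exist in Graphify.Cli:

```yaml
# action.yml — hypothetical composite action; `graphifyy` is the PyPI-parity package id
# and `graphify diff` is a proposed subcommand, not a shipped one.
name: graphify-diff
runs:
  using: composite
  steps:
    - run: dotnet tool install -g graphifyy
      shell: bash
    - run: graphify run . --output graph.json
      shell: bash
    - run: graphify diff graph.json baseline/graph.json --format markdown >> "$GITHUB_STEP_SUMMARY"
      shell: bash
```

The baseline snapshot would come from the target branch (e.g. restored from a cache or artifact), so the diff report reflects only what the PR changed.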
- Description: Interactive knowledge graph explorer inside VS Code — click a node to navigate to its source file, visualize communities, search by concept, and see architecture at a glance.
- Rationale: Developers live in VS Code. Bringing the graph into the editor eliminates context switching. The MCP server already exposes the right operations (query, explain, path, communities, analyze) — the extension would be a visual frontend.
- How: Build a VS Code webview extension that loads graph.json (or connects to the MCP server). Use vis.js or D3.js for visualization. Implement go-to-definition for graph nodes.
- Difficulty: Hard
- Priority: Medium
- Description: Aspire AppHost integration that provides a dashboard for monitoring graph build pipelines — live progress, extraction metrics, token usage, and error rates.
- Rationale: Aspire 9.4 supports custom commands, OTEL metrics, and interactive dashboards. For large codebase graph builds (which can take minutes and thousands of LLM calls), real-time visibility into the pipeline is essential. Custom business metrics (nodes extracted, edges inferred, communities detected) would surface in the Aspire dashboard.
- How: Create a `Graphify.Aspire` hosting package. Emit OTEL metrics from pipeline stages. Register custom dashboard commands for triggering rebuilds, querying nodes, and viewing stats.
- Difficulty: Medium
- Priority: Low
- Description: Build knowledge graphs that span multiple repositories, connecting cross-repo dependencies, shared libraries, and API contracts into a unified graph.
- Rationale: Real-world systems are rarely single-repo. Microservice architectures, monorepo-to-polyrepo migrations, and shared library ecosystems all benefit from cross-repo structural understanding. This is where graphify-dotnet could differentiate significantly from the Python version.
- How: Accept multiple repo paths or GitHub org + repo list. Build per-repo subgraphs, then merge with cross-repo edge detection (shared NuGet packages, API contracts, proto files, shared types).
- Difficulty: Hard
- Priority: Low
- Description: Implement Karpathy's "health check" concept for code knowledge — periodic LLM-driven consistency checks that identify stale relationships, orphaned nodes, conflicting concepts, and documentation drift.
- Rationale: Directly implements Karpathy's insight that LLMs should maintain knowledge bases, not just build them. A graph that was accurate last month may have drifted as code evolved. Automated lint catches this.
- How: Schedule or trigger a "lint" pass that walks the graph, samples nodes, and asks the LLM to verify relationships against current source. Flag stale or contradicted edges. Generate a health report.
- Difficulty: Medium
- Priority: Low
- Description: Enhance the Obsidian export to create a fully bidirectional workflow — changes in Obsidian (annotations, new links, corrections) feed back into the knowledge graph, and graph updates push new pages to the vault.
- Rationale: Karpathy's vision positions Obsidian as the IDE for knowledge. Our current Obsidian export is one-way. A bidirectional sync would enable developers to annotate, correct, and extend the graph using Obsidian's excellent editing UX, then have those changes persist in the structured graph.
- How: Watch the Obsidian vault directory for changes. Parse Markdown frontmatter and wikilinks to detect new/modified relationships. Merge back into KnowledgeGraph. Use Obsidian's graph view plugin for visualization.
- Difficulty: Hard
- Priority: Low
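The feed-back half of the sync hinges on parsing wikilinks out of vault Markdown. A minimal, self-contained sketch — frontmatter parsing and the actual merge into `KnowledgeGraph` are omitted:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Extract [[wikilink]] targets from an Obsidian note body.
// Each (note, target) pair becomes a candidate edge to merge into the graph.
// The pattern stops at ']', '|' (alias separator), and '#' (heading anchor).
static IEnumerable<string> WikiLinks(string markdown) =>
    Regex.Matches(markdown, @"\[\[([^\]\|#]+)")
         .Select(m => m.Groups[1].Value.Trim());

var note = "Depends on [[OrderService]] and [[IPaymentGateway|the gateway]].";
foreach (var target in WikiLinks(note))
    Console.WriteLine(target);   // prints OrderService, then IPaymentGateway
```

The harder problem is conflict resolution: a human-added link that contradicts an extracted edge needs a policy (human wins, flag for review, or re-verify with the LLM).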
- Description: Package graphify-dotnet as a drop-in skill for coding agents (GitHub Copilot, Claude Code, Cursor) — similar to how the Python version works as a Claude Code skill.
- Rationale: The Python graphify's biggest adoption vector is its coding agent integration. Our MCP server already provides the protocol; packaging it as a discoverable skill lowers the adoption barrier.
- Difficulty: Medium
- Priority: High
- Description: Implement and publish benchmarks showing token usage per query vs. reading raw source files, following the Python version's "70x fewer tokens" methodology.
- Rationale: Token cost is a key decision factor for LLM-powered tools. Demonstrating efficiency builds trust and justifies the graph-building investment.
- Difficulty: Easy
- Priority: Medium
- Description: Direct Neo4j database integration for pushing graphs to a live Neo4j instance (beyond the current Cypher export file).
- Rationale: Neo4j's visualization and query capabilities (Cypher) are industry-standard for graph exploration. Live push enables real-time graph querying without file intermediaries.
- Difficulty: Medium
- Priority: Low
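Pushing nodes live is a thin layer over the official driver. A sketch assuming the `Neo4j.Driver` NuGet package; the node label and property names here loosely mirror our schema and are illustrative:

```csharp
using Neo4j.Driver;

// Connect to a local Neo4j instance (credentials are placeholders).
await using var driver = GraphDatabase.Driver(
    "neo4j://localhost:7687", AuthTokens.Basic("neo4j", "password"));
await using var session = driver.AsyncSession();

// MERGE keeps re-pushes idempotent — re-running a build updates rather than duplicates.
await session.RunAsync(
    "MERGE (n:CodeNode {id: $id}) SET n.name = $name, n.kind = $kind",
    new { id = "MyApp.OrderService", name = "OrderService", kind = "class" });
```

For whole-graph pushes, batching nodes and edges with `UNWIND` over parameter lists is far faster than one statement per element.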
- Description: Build a graphify agent using MAF that can autonomously explore, explain, and maintain codebases as a conversational AI agent — not just a pipeline tool.
- Rationale: MAF is Microsoft's latest unified agent framework built on `IChatClient`. A graphify MAF agent would be a natural evolution: from "build a graph" to "be a persistent, queryable codebase expert."
- Difficulty: Hard
- Priority: Low
Karpathy described a workflow for knowledge manipulation — ingesting raw data, compiling it into structured wikis, and navigating knowledge through graph views. graphify-dotnet is the code-specific implementation of this vision:
| Karpathy's PKB Workflow | graphify-dotnet Equivalent |
|---|---|
| Raw data ingestion (raw/ folder) | File detection pipeline (code, docs, images) |
| LLM "compilation" to Markdown wiki | Semantic extraction → knowledge graph → Wiki/Obsidian export |
| Obsidian as the knowledge IDE | Obsidian vault export with backlinks and graph view |
| Graph view for navigation | Interactive HTML vis.js graph, MCP query tools |
| Q&A and search over the wiki | MCP server (query, explain, path, communities, analyze) |
| Health checks for consistency | Proposed: Knowledge graph lint mode (§3.13) |
| Output as new Markdown/slides | Wiki export, report generation |
graphify-dotnet bridges code understanding and personal knowledge bases by treating a codebase as a knowledge domain:
- Code is knowledge — Classes, functions, and their relationships are concepts and connections in a knowledge graph, just like topics in a wiki.
- The graph is the compiled wiki — Where Karpathy's LLM compiles articles into interlinked Markdown, graphify compiles code into an interlinked knowledge graph.
- MCP is the query layer — Where Karpathy uses LLM search over the wiki, graphify uses MCP tools to let AI assistants query the graph.
- Obsidian is the shared IDE — Both workflows use Obsidian as the human-navigable frontend for graph-structured knowledge.
The long-term vision: a developer's personal knowledge base that spans code, documentation, architecture decisions, and runtime behavior — all compiled, linked, and maintained by LLMs, navigated through graph views, and queryable by AI assistants.
| Priority | Items |
|---|---|
| High | dotnet tool (§3.1), Azure OpenAI (§3.2), Ollama (§3.3), Watch mode (§3.4), Roslyn AST (§3.5), Agent skill (§3.15) |
| Medium | Tree-sitter native (§3.6), SK/MAF integration (§3.7), RAG (§3.8), GitHub Actions (§3.9), VS Code extension (§3.10), Token benchmarks (§3.16) |
| Low | Aspire (§3.11), Multi-repo (§3.12), Health checks (§3.13), Obsidian bidirectional (§3.14), Neo4j live (§3.17), MAF agent (§3.18) |
- Karpathy's original tweet
- safishamsi/graphify (Python)
- awesome-llm-knowledge-bases
- awesome-llm-knowledge-systems
- Microsoft.Extensions.AI documentation
- Semantic Kernel + ME.AI integration
- Microsoft Agent Framework migration guide
- TreeSitter.DotNet
- OllamaSharp (.NET Ollama client)
- .NET Aspire 9.4 announcement
- Roslyn Source Generator Cookbook
- Building PKBs with LLMs: The Karpathy Method
- VentureBeat: Karpathy's LLM Knowledge Base Architecture
- DAIR.AI: LLM Knowledge Bases