feat:Hierarchical Knowledge Aggregation for Graphify #264

Open
ljinshuan wants to merge 8 commits into safishamsi:v4 from ljinshuan:feature/team_version
@ljinshuan commented on Apr 12, 2026

Closes #265

Hierarchical Knowledge Aggregation for Graphify

Issue Origin: A single flat knowledge graph cannot represent a hierarchical project structure. Large codebases need layered knowledge with bottom-up aggregation and intelligent query routing.

1. Problem Statement: Why Hierarchical Knowledge Aggregation?

Graphify originally supported only a single flat knowledge graph — all source files (code, docs, images) are extracted, merged into one graph.json, and served via MCP as a flat graph.

This reveals three core problems in large-scale projects:

  1. Information Overload 🤯: A 500+ file project produces a graph with thousands of nodes. LLM query token budgets are wasted on irrelevant details, and key architectural information is buried.

  2. Lack of Abstraction Levels 🏗️: Microservice architectures are naturally hierarchical (services → domains → system), but a flat graph cannot represent this. Answering "what is the system architecture?" and "how does an auth function call work?" requires completely different abstraction levels.

  3. Inefficient Queries 🐌: Every query searches the entire graph because there is no hierarchical structure to narrow the scope.

Core Insight: Knowledge should be layered like geographic maps: bottom layers hold street-level detail, upper layers hold city-level overviews. Each upper-layer graph = its own content + a summary of the layers below it, forming a strict layered DAG.
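
To make the "own content + lower layer summary" rule concrete, here is a minimal sketch using plain dicts instead of the real graph objects. The function name `merge_layer`, the `summary:<layer>:` prefix format, and the `source_layer` attribute are illustrative assumptions, not the PR's actual `merge_graphs()` signature.

```python
def merge_layer(own, child_summary, child_id):
    """Return an upper-layer graph: own content plus a prefixed child summary.

    Graphs are plain dicts: {"nodes": {id: attrs}, "edges": [(src, dst), ...]}.
    The prefix format and the "source_layer" key are assumptions for this sketch.
    """
    prefix = f"summary:{child_id}:"
    # Copy the upper layer's own content so the inputs are not mutated.
    merged = {"nodes": dict(own["nodes"]), "edges": list(own["edges"])}
    # Summary nodes are namespaced so they cannot collide with own nodes,
    # and each one records which layer it came from (provenance).
    for node_id, attrs in child_summary["nodes"].items():
        merged["nodes"][prefix + node_id] = {**attrs, "source_layer": child_id}
    for src, dst in child_summary["edges"]:
        merged["edges"].append((prefix + src, prefix + dst))
    return merged
```

Namespacing the summary nodes keeps a drill-down path open: given a summary node, the prefix says which lower layer holds the detail.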

lijinshuan and others added 5 commits April 12, 2026 23:14
Phase 1 - layer-config-foundation:
- Add pyyaml as optional dependency (layers extras group)
- Create layer_config.py: LayerConfig dataclass, load_layers(), DAG
  validation (duplicate IDs, unknown parents, cycle detection),
  topological sort, level computation, LayerRegistry
- Add merge_graphs() in build.py with summary: prefix + provenance
- Add aggregate.py stub with strategy='none'
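
The Phase 1 validation steps (duplicate IDs, unknown parents, cycle detection, topological sort, level computation) can be sketched with the standard library's `graphlib`. The `Layer` dataclass and `validate_and_order` below are hypothetical stand-ins, not the PR's `LayerConfig` or `load_layers()`. The sketch also assumes `parents` names the lower layers a layer aggregates from, so a layer with no parents sits at level 0.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter


@dataclass
class Layer:
    # Hypothetical minimal layer record for this sketch.
    id: str
    parents: list[str] = field(default_factory=list)


def validate_and_order(layers):
    """Validate the layer DAG; return (build order, level per layer)."""
    ids = [layer.id for layer in layers]
    duplicates = {i for i in ids if ids.count(i) > 1}
    if duplicates:
        raise ValueError(f"duplicate layer IDs: {sorted(duplicates)}")

    known = set(ids)
    for layer in layers:
        unknown = [p for p in layer.parents if p not in known]
        if unknown:
            raise ValueError(f"layer {layer.id!r} has unknown parents: {unknown}")

    # TopologicalSorter takes node -> predecessors, so parents come first.
    sorter = TopologicalSorter({layer.id: layer.parents for layer in layers})
    try:
        order = list(sorter.static_order())
    except CycleError as exc:
        raise ValueError(f"cycle in layer graph: {exc.args[1]}") from exc

    # Level = 0 for layers without parents, else 1 + the deepest parent.
    by_id = {layer.id: layer for layer in layers}
    levels = {}
    for layer_id in order:
        parent_levels = (levels[p] for p in by_id[layer_id].parents)
        levels[layer_id] = 1 + max(parent_levels, default=-1)
    return order, levels
```

For a three-layer chain such as code, docs, overview, this yields a bottom-up build order and levels 0, 1, 2. Layers that share a level have no dependency on each other.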

Phase 2 - aggregation-engine:
- Implement topk_filter strategy: top-K nodes by degree, hub exclusion,
  confidence filtering
- Implement community_collapse strategy: community detection, collapsed
  nodes with _collapsed/_community_id attrs, bridge edge preservation
- Implement llm_summary strategy: LLM-powered summarization with
  structured prompt, JSON parsing, fallback to topk_filter
- Implement composite strategy: community_collapse -> llm_summary pipeline
- Update aggregate() dispatcher for all 5 strategies
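
As an illustration of the `topk_filter` idea (top-K nodes by degree, hub exclusion, confidence filtering), here is a rough sketch over plain dicts. The parameter names `hub_degree` and `min_confidence`, the `confidence` node attribute, and the tie-breaking rule are assumptions for this example and may differ from the code in aggregate.py.

```python
from collections import Counter


def topk_filter(nodes, edges, k, hub_degree=None, min_confidence=0.0):
    """Keep the k best-connected nodes, skipping hubs and low-confidence nodes.

    nodes: {id: {"confidence": float, ...}}, edges: [(src, dst), ...].
    """
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1

    candidates = [
        node_id
        for node_id, attrs in nodes.items()
        if attrs.get("confidence", 1.0) >= min_confidence
        # Hub exclusion: utility nodes touching everything add little signal.
        and (hub_degree is None or degree[node_id] < hub_degree)
    ]
    # Highest degree first; break ties by ID so the output is deterministic.
    candidates.sort(key=lambda node_id: (-degree[node_id], node_id))
    top = candidates[:k]

    kept = set(top)
    kept_nodes = {node_id: nodes[node_id] for node_id in top}
    # Only edges with both endpoints kept survive into the summary.
    kept_edges = [(s, d) for s, d in edges if s in kept and d in kept]
    return kept_nodes, kept_edges
```

A fallback from `llm_summary` to a filter like this one needs no network access, so a summary can still be produced when the LLM call fails.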

Phase 3 - query-routing:
- Create query_router.py: QueryRouter with keyword scoring, level-weighted
  abstraction heuristics, CJK substring matching
- Implement auto-zoom: sparse result drill-down to child layers
- Add layer_info/drill_down MCP tools in serve.py
- Add --layers/--layer/--auto-zoom flags to CLI query command
- Multi-layer mode in serve() with QueryRouter integration
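
The routing heuristics in Phase 3 could look roughly like the sketch below: keyword scoring, a level-weighted bonus for abstract or detailed wording, substring matching for CJK keywords, and an auto-zoom that drills into child layers when results are sparse. The hint word lists, the weights, the layer dict shape, and the `min_hits` threshold are all assumptions made for illustration. This is not the `QueryRouter` class from query_router.py.

```python
import re

# Hypothetical hint vocabularies for the abstraction heuristic.
ABSTRACT_HINTS = {"architecture", "overview", "system", "design"}
DETAIL_HINTS = {"function", "call", "method", "bug"}


def _has_cjk(text):
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def score_layer(query, layer, max_level):
    """Score one layer: {"id": str, "level": int, "keywords": [str]}."""
    q = query.lower()
    tokens = set(re.findall(r"[a-z0-9_]+", q))
    score = 0.0
    for keyword in layer["keywords"]:
        kw = keyword.lower()
        # CJK text has no word boundaries, so fall back to substring matching.
        if (kw in q) if _has_cjk(kw) else (kw in tokens):
            score += 1.0
    if tokens & ABSTRACT_HINTS:
        score += 0.5 * layer["level"]  # abstract wording favors upper layers
    if tokens & DETAIL_HINTS:
        score += 0.5 * (max_level - layer["level"])  # detail favors lower layers
    return score


def route(query, layers):
    """Return the ID of the best layer; ties go to the lower level."""
    max_level = max(layer["level"] for layer in layers)
    best = max(layers, key=lambda l: (score_layer(query, l, max_level), -l["level"]))
    return best["id"]


def auto_zoom(layer_id, children, search, min_hits=3):
    """Search a layer; if results are sparse, also drill into child layers."""
    hits = list(search(layer_id))
    if len(hits) >= min_hits:
        return hits
    for child in children.get(layer_id, []):
        hits += auto_zoom(child, children, search, min_hits)
    return hits
```

With layers for code, docs, and overview, a query like "what is the system architecture?" would land on the overview layer under these weights, while "how does the auth function call work?" would land on the code layer.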

Phase 4 - cli-polish:
- Add graphify layer-info --layers <path> command (table format)
- Add graphify layer-tree --layers <path> command (ASCII tree)
- Add graphify layer-diff <id1> <id2> --layers <path> command
- Add graph_diff() in build.py for structural comparison
- Save aggregation provenance as from_<parent>.json
- Parallel same-depth layer builds with ProcessPoolExecutor
- Auto-detect multi-layer mode in MCP server from output directory
- Update CLI help text with all new commands
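
For the structural comparison behind `layer-diff`, a set-based diff is one plausible approach. The sketch below is named `structural_diff` to make clear it is not the PR's `graph_diff()`. The dict graph shape and the result keys are assumptions.

```python
def structural_diff(a, b):
    """Compare two graphs by node IDs and edge pairs.

    Graphs are plain dicts: {"nodes": {id: attrs}, "edges": [(src, dst), ...]}.
    Edges loaded from JSON arrive as lists, so they are normalized to tuples.
    """
    nodes_a, nodes_b = set(a["nodes"]), set(b["nodes"])
    edges_a = {tuple(edge) for edge in a["edges"]}
    edges_b = {tuple(edge) for edge in b["edges"]}
    return {
        "nodes_added": sorted(nodes_b - nodes_a),
        "nodes_removed": sorted(nodes_a - nodes_b),
        "edges_added": sorted(edges_b - edges_a),
        "edges_removed": sorted(edges_a - edges_b),
    }
```

This ignores attribute changes on nodes present in both graphs. A fuller diff would compare attributes as well.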

Tests: 109 new tests across 8 test files, all passing

Ref: safishamsi#263
- README_TEAM.md: English documentation for hierarchical knowledge aggregation
- README_TEAM.zh-CN.md: Chinese version
- worked_team/: real data validation with 3 corpora (example, httpx, mixed-corpus)
  - layers.yaml: 3-layer config (Code → Docs → Overview)
  - graphify-out/: build output with 5.3x compression from L0 to L2

Ref: safishamsi#263
- Move pyyaml from optional 'layers' extras to core dependencies
- Fix typo: 'graphifyy' -> 'graphify' in ImportError message
- Update error message for missing pyyaml (now a required dep)
- This fixes CI test failures where pyyaml was not installed

Ref: safishamsi#263
@ljinshuan changed the title from "Feature/team version" to "feat:Hierarchical Knowledge Aggregation for Graphify" on Apr 12, 2026
- pyproject.toml: keep both pyyaml (ours) and tree-sitter-verilog (theirs)
- __main__.py: keep --layers query routing (ours) + upstream's try/except
  error handling for graph loading
- aggregate.py: fix god_nodes() key change (edges → degree) from upstream

Ref: safishamsi#263
@ljinshuan
Contributor Author

👋 Friendly reminder — this PR has been sitting here for a while and we've already resolved merge conflicts multiple times to keep it in sync with upstream/v4.

At this point, we'd really appreciate a decision on whether this feature aligns with the project's direction:

If this is something you want — a quick review and merge would be amazing. The feature is ready and has been kept up-to-date through multiple conflict resolutions.

If this isn't the right fit — no hard feelings at all! Just let us know and we'll gladly close this PR to keep the backlog clean.

Either way, a response would be greatly appreciated so we can stop burning cycles on conflict resolution and move forward. 🙏

Thanks for your time!

#265

Development

Successfully merging this pull request may close these issues.

feat: Hierarchical Knowledge Aggregation — layered DAG build, multi-strategy summarization, and intelligent query routing