[bug] Route extractor over-fits: GraphQL query literals, whole config-file contents, and regex literals become Route nodes #598

Description

@ecosuper2025

Version: codebase-memory-mcp v0.8.1 (Windows, compiled .exe)
Project: real-world repo, 93,315 nodes, 996 Route nodes

Summary

Route extraction classifies many non-route strings as routes. On our repo, 46 of 996 routes are clearly junk, and the empty-source bucket mixes real HTTP endpoints with regex/path string literals.

Repro

search_graph(label="Route"), then group by source:

| source | count | what they actually are |
|---|---|---|
| decorator | 41 | ✅ real Flask/Django route decorators (correct) |
| "" (empty) | 909 | real API paths from frontend HTTP calls + regex/path literals |
| graphql | 34 | ❌ entire GraphQL query string literals (query { categoryList {...} }) turned into one Route node each |
| infra | 12 | ❌ the entire text of .codex/agents/*.toml files turned into a single Route node |
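
For anyone reproducing the breakdown, the grouping step can be sketched in Python. This assumes the Route nodes have already been exported as a list of dicts with a `source` key; that shape and the helper name `count_by_source` are assumptions for illustration, not the tool's documented output format.

```python
from collections import Counter


def count_by_source(routes: list[dict]) -> dict[str, int]:
    """Count Route nodes per `source` value; a missing source counts as ""."""
    return dict(Counter(route.get("source") or "" for route in routes))
```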

Empty-source junk samples (method=ANY, degree 0):
/.{2}/g, /<table/i (JS regex literals), /_VBA_PROJECT_CUR/VBA/dir, /::/, /.well-known/openid-configuration.

Minimal example

// a JS regex literal — NOT a route
const re = /<table/i;
# a query string passed to a client — NOT a route
query { products(search: "x") { items { sku } } }

Both currently produce Route nodes.

Expected

  • Route extraction should require an actual route declaration (decorator/router registration/framework binding), not "string contains slashes" or "file looks like config".
  • GraphQL query literals should not be Route (a GraphQLOperation label would be fine).
  • Config/agent files (.codex/*.toml, *.md) should not have their whole contents emitted as a single Route.
  • At minimum: don't classify regex literals (/…/[gimsuy]) as routes.
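
As a rough illustration of the last point, here is a minimal string-level sketch in Python. It is not the extractor's actual code: the function name and the set of "regex-only" characters are assumptions, and a check on the parser's AST node type (regex literal vs. string literal) would be more reliable than any string heuristic.

```python
import re

# /body/flags, with the flag set taken from the pattern suggested above.
_REGEX_SHAPE = re.compile(r"/(?P<body>.+)/(?P<flags>[gimsuy]*)")
# Characters common in regex bodies but unusual in URL path templates (assumption).
_REGEX_META = set("\\^$*+?()[]|<>")
# Numeric quantifier such as {2} or {1,3}; a path parameter like {id} does not match.
_QUANTIFIER = re.compile(r"\{\d+(,\d*)?\}")


def looks_like_regex_literal(text: str) -> bool:
    """Heuristic: True if `text` is shaped like a JS regex literal, not a URL path."""
    m = _REGEX_SHAPE.fullmatch(text)
    if m is None:
        return False
    flags = m.group("flags")
    if len(set(flags)) != len(flags):
        return False  # JS rejects repeated flags
    body = m.group("body")
    return any(ch in _REGEX_META for ch in body) or bool(_QUANTIFIER.search(body))
```

With this sketch, `/.{2}/g` and `/<table/i` are flagged, while `/.well-known/openid-configuration` and `/users/{id}/` are not. It deliberately stays conservative, so a literal such as `/::/` (no flags, no metacharacters) still slips through; that case needs syntactic context rather than string inspection.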

Workaround (consumer side)

Filter with WHERE r.source='decorator' OR (r.method IS NOT NULL AND r.method <> 'ANY'). This cuts the 996 routes to 347 and removes the junk, but it also drops some real decorator-less routes inconsistently, so a clean extractor is preferable.
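
The same filter can be applied after export, sketched here in Python. The property names `source` and `method` come from the query above; representing each Route as a dict and the helper name `keep_route` are assumptions for illustration.

```python
def keep_route(route: dict) -> bool:
    """Mirror the WHERE clause: keep decorator routes or routes with a concrete method."""
    method = route.get("method")
    return route.get("source") == "decorator" or (method is not None and method != "ANY")


def filter_routes(routes: list[dict]) -> list[dict]:
    """Return only the routes that pass keep_route, preserving order."""
    return [route for route in routes if keep_route(route)]
```

Like the query, this discards any decorator-less route whose method was recorded as ANY, so it shares the same false-negative problem.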

Related

Distinct from the exclude-config issues (#500/#510) — this is mis-classification of content, not indexing of build artifacts.
