fix(cross-repo): emit HTTP_CALLS for unindexed client libs and normalize URLs for route matching (#523) #536

Open
RithvikReddy0-0 wants to merge 5 commits into DeusData:main from RithvikReddy0-0:fix/523-cross-repo-http-edges
@RithvikReddy0-0

Fixes two root causes of cross-repo-intelligence returning 0 edges (#523).

pass_calls.c

HTTP client calls (requests, httpx, axios, etc.) were silently dropped when
the client library wasn't indexed (external pip/npm dep). The callee resolved
to a QN but cbm_gbuf_find_by_qn returned NULL, so the call was discarded
before HTTP classification.

Fix: detect known HTTP/async patterns via cbm_service_pattern_match and emit
the edge even without a target node in the graph.

pass_cross_repo.c

Three issues in match_http_routes:

  1. Consumer url_path carried full URL (scheme+host+port); provider Route has
    bare path. Added cr_url_path() to strip scheme+authority before QN lookup.

  2. Concrete paths (/v2/orders/123) never matched templated routes
    (/v2/orders/{id}). Added cr_path_matches_template() and
    find_route_handler_fuzzy() for segment-level template matching.

  3. match_http_routes only searched HTTP_CALLS in the src project. When
    cross-repo is run from the provider side, HTTP_CALLS live in the consumer
    DB. Added reverse direction call so both orientations are covered.
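
To make the first item concrete, here is a minimal sketch of stripping scheme and authority from a consumer URL. The PR's actual cr_url_path() is not shown in this thread, so url_path_of below is a hypothetical stand-in with an assumed signature, and it deliberately leaves any query string untouched.

```c
#include <string.h>

/* Hypothetical stand-in for the PR's cr_url_path(); name and signature are
 * assumptions. Returns a pointer into `url` at the start of the path,
 * skipping "scheme://host:port" when present. A URL that has an authority
 * but no path yields "/". */
const char *url_path_of(const char *url) {
    if (url[0] == '/') {
        return url; /* already a bare path such as "/v2/orders/123" */
    }
    const char *sep = strstr(url, "://");
    if (!sep) {
        return url; /* no scheme separator: leave the input unchanged */
    }
    const char *path = strchr(sep + 3, '/'); /* first '/' after the authority */
    return path ? path : "/";
}
```

With this shape, url_path_of("http://order-service:8080/v2/orders/123") points at "/v2/orders/123", which is the form a provider Route path would be compared against.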

Repro

Verified manually: FastAPI provider + requests consumer, cross-repo-intelligence
now returns cross_http_calls: 1 where it previously returned 0.

Checklist

  • Every commit is signed off (git commit -s) — required, CI rejects
    unsigned commits (DCO, see CONTRIBUTING.md)
  • Tests pass locally (make -f Makefile.cbm test)
  • Lint passes (make -f Makefile.cbm lint-ci)
  • New behavior is covered by a test (reproduce-first for bug fixes)

Note: build and manual repro verified in WSL (cross_http_calls: 1 confirmed).
Test suite and lint require Unix toolchain — ran in WSL but full make test
timed out; happy to add a regression test if the maintainer points me to the
right test file.

…ize URLs for route matching (DeusData#523)

- pass_calls.c: detect known HTTP/async client patterns (requests, httpx,
  axios, etc.) by service pattern match even when the target node doesn't
  exist in the graph (external dep not indexed). Fixes zero-edge output
  on normal repos where HTTP clients are pip/npm dependencies.

- pass_cross_repo.c: strip scheme+host+port from consumer url_path before
  QN lookup (cr_url_path). Add path-param template matching so concrete
  paths (/v2/orders/123) match provider route templates (/v2/orders/{id}).
  Add reverse-direction match so HTTP_CALLS in the consumer DB are found
  when cross-repo is run from the provider side.

Signed-off-by: RithvikReddy0-0 <rithvikreddymukkara@gmail.com>
@azuttre

azuttre commented Jun 20, 2026

Built this branch on macOS (arm64) and traced resolve_single_call to see exactly what happens. The two halves behave differently:

pass_cross_repo.c (URL normalize + template match): works. Holding an HTTP_CALLS edge constant (url_path = the full URL http://order-service:8080/v2/orders/123), main links nothing and this branch links it to the templated route /v2/orders/{id}. Clean before/after.
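
For readers who have not opened the diff, segment-level template matching can be sketched roughly as follows. This is not the PR's cr_path_matches_template(); path_matches_template is a hypothetical name, and the rule that a {param} segment matches exactly one non-empty concrete segment is an assumption about the intended semantics.

```c
#include <stdbool.h>
#include <string.h>

/* Length of the segment starting at s (up to the next '/' or end of string). */
static size_t segment_len(const char *s) {
    return strcspn(s, "/");
}

/* Hypothetical sketch: compares a concrete path against a route template
 * segment by segment. A "{name}" template segment matches any single
 * non-empty concrete segment; every other segment must be byte-identical. */
bool path_matches_template(const char *path, const char *tmpl) {
    while (*path == '/' && *tmpl == '/') {
        path++;
        tmpl++;
        size_t pl = segment_len(path);
        size_t tl = segment_len(tmpl);
        bool is_param = tl >= 2 && tmpl[0] == '{' && tmpl[tl - 1] == '}';
        if (is_param) {
            if (pl == 0) {
                return false; /* a parameter needs a non-empty value */
            }
        } else if (pl != tl || memcmp(path, tmpl, pl) != 0) {
            return false; /* literal segments must match exactly */
        }
        path += pl;
        tmpl += tl;
    }
    /* Both strings must be fully consumed, so segment counts must agree. */
    return *path == '\0' && *tmpl == '\0';
}
```

Under these assumptions /v2/orders/123 matches /v2/orders/{id}, while /v2/orders and /v2/orders/123/items do not.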

pass_calls.c (emit without the client lib indexed): doesn't fire for a genuinely external requests. The call reaches resolve_single_call and first_string_arg holds the URL, so that part is fine. But with requests not installed or vendored anywhere indexable, the import map is empty (imp_count=0), so cbm_registry_resolve returns an empty qualified_name and the function returns at if (!res.qualified_name) return 0; before your new svc==HTTP emit. There is no QN for cbm_service_pattern_match to match. Trace:

ENTER    callee='requests.get' first_arg='http://svc:8080/v2/orders/123' imp_count=0
RESOLVED callee='requests.get' -> res_qn='(EMPTY)' svc=-1   # returns here

The moment requests is locally resolvable (a vendored stub or an installed venv), imp_count=1, res_qn becomes '...requests.get', svc=1, and it emits, but that is the resolved path, which main emits on too.

So the emit-without-target path seems to help only when the callee resolves to a QN that has no node, not when the external client resolves to nothing. Was requests pip-installed in your WSL repro? If so, cross_http_calls: 1 is the matcher fix plus a resolvable call, and the index-just-my-service case (the #523 scenario) is still 0. Happy to share the repro.

DeusData#523)

The previous emit-without-target path sat after the empty-QN early return,
so a genuinely external client (requests/axios not installed or vendored)
bailed at the empty-QN return before reaching it. The import map is empty,
cbm_registry_resolve returns no QN, and there was nothing for
cbm_service_pattern_match to classify.

Move the detection into the empty-QN branch and classify from the raw
callee name (requests.get -> HTTP, GET) instead of the resolved QN. Verified
without any vendored stub: HTTP_CALLS now fires and cross-repo links the
call to the provider templated route (cross_http_calls: 1).

Signed-off-by: RithvikReddy0-0 <rithvikreddymukkara@gmail.com>
@RithvikReddy0-0

Author

Good catch, you were exactly right. The emit sat after the empty-QN early return, so a genuinely external requests (empty import map → empty resolved QN) bailed before reaching it. My WSL repro had a vendored requests stub, which created a node and went through the resolved path, so it never actually exercised the index-just-my-service case. That's on me.

Fixed in the latest commit: the detection now lives in the empty-QN branch and classifies from the raw callee name (requests.get → HTTP, GET) rather than the resolved QN. Re-verified with no stub and no install (requests external, consumer indexes only its own service):

  • HTTP_CALLS edge now appears on the consumer (fetch_order → http://order-service:8080/v2/orders/123, method GET)
  • cross-repo links it to the provider's templated route: cross_http_calls: 1
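
As a rough illustration of classifying from the raw callee name, the sketch below maps a string like requests.get to an HTTP method without needing a resolved QN. The module list and http_method_from_callee are hypothetical; the real pattern table sits behind cbm_service_pattern_match and is not reproduced in this thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical list of well-known HTTP client modules (an assumption; the
 * project's real patterns are not shown here). */
static const char *const k_http_clients[] = {"requests", "httpx", "axios"};

/* Returns the upper-case HTTP method for raw callees such as "requests.get",
 * or NULL when the name does not look like a known HTTP client call. */
const char *http_method_from_callee(const char *callee) {
    static const struct { const char *fn; const char *method; } verbs[] = {
        {"get", "GET"}, {"post", "POST"}, {"put", "PUT"},
        {"patch", "PATCH"}, {"delete", "DELETE"},
    };
    const char *dot = strchr(callee, '.');
    if (!dot) {
        return NULL; /* no "module.function" shape */
    }
    size_t mod_len = (size_t)(dot - callee);
    for (size_t i = 0; i < sizeof k_http_clients / sizeof k_http_clients[0]; i++) {
        if (strlen(k_http_clients[i]) != mod_len ||
            strncmp(callee, k_http_clients[i], mod_len) != 0) {
            continue; /* module prefix is not this client */
        }
        for (size_t j = 0; j < sizeof verbs / sizeof verbs[0]; j++) {
            if (strcmp(dot + 1, verbs[j].fn) == 0) {
                return verbs[j].method;
            }
        }
    }
    return NULL;
}
```

The point of the design is that nothing here depends on the import map, so an empty imp_count no longer prevents classification.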

One caveat worth flagging separately: a single-file provider still returns 0, but for an unrelated reason. FastAPI route extraction (@app.get → Route node) only fires on the parallel pipeline path (>50 files); on the sequential path the decorator is captured but no Route node is created. That's the same "decorator captured but no route" gap you noted originally, and it's independent of this PR's matcher/emission changes. Happy to open a separate issue for it.

Thanks for tracing this on your end ;)

@azuttre

azuttre commented Jun 21, 2026

Re-validated 0a8a44f on macOS (arm64), with a genuinely external requests (no stub, no install, consumer indexes only its own service):

  • consumer now emits HTTP_CALLS for both calls (was 0 on main)
  • cross-repo links them to the provider: cross_http_calls: 2
    • fetch_order -> get_order (/v2/orders/123 -> /v2/orders/{id})
    • place_order -> create_order (/v2/orders)

So the empty-QN path is fixed. Confirmed on my end.

On the single-file caveat: I'm not reproducing it here. My provider is a single app.py with two routes (@app.get + @app.post), both Route nodes extracted fine, which is why the end-to-end run above links. So the no-route-on-single-file behavior may be platform-specific or a narrower trigger than file count, rather than a general <50-file thing. Didn't block the cross-repo case for me, but happy to share details if you open a separate issue for it.

Nice work turning this around so fast.

@RithvikReddy0-0

Author

Thanks for re-validating ;)
Glad it holds on macOS too, and good to see both calls linking.

You're right to push back on the single-file theory: if your single app.py with two routes extracts both Route nodes fine, then it is not a general parallel-vs-sequential thing. On my WSL setup the single-file provider consistently produced no Route node, but that's clearly a narrower or platform-specific trigger, not the file-count threshold I guessed at. I will dig into what is actually different on my end before claiming a cause, and open a separate issue with a proper repro if it is real rather than a local artifact.

Appreciate the thorough trace throughout; it made both fixes tighter.

@DeusData

Owner

Thanks @RithvikReddy0-0 — the unindexed-client + URL-normalization direction is right. Two things before this can land:

  1. Duplicate edges (blocking). The entry point now calls match_http_routes in both directions (src→tgt and tgt→src), but delete_cross_edges runs once and emit_cross_route_bidirectional already writes both sides — so the reverse pass re-emits the same CROSS_HTTP_CALLS pair, producing duplicates and inflating http_edges. Please dedupe (single direction, or guard against re-emitting an existing edge).
  2. Test. Add a reproduce-first test (refs #523, "cross-repo-intelligence returns 0 edges for a byte-identical call/route"): a byte-identical client call/route that must yield exactly one edge, asserting no duplicates.

Also, emit_http_async_edge(ctx, call, source_node, source_node, ...) passes the source node as both source and target — please add a comment confirming that's intentional for the unindexed-external case. 🙏

…DeusData#523)

Addresses review on DeusData#536.

insert_cross_edge now skips insertion when an identical
(source_id, target_id, type) edge already exists. The pass reaches the same
caller/route pair from both directions and emit_cross_route_bidirectional
writes both DBs, so without this guard the same CROSS_HTTP_CALLS pair was
re-emitted and inflated http_edges. Verified idempotent: repeated runs and
runs from either project side both yield cross_http_calls: 1 with exactly one
edge per DB.

Documented why emit_http_async_edge is called with source_node as both source
and target in the unindexed-external-client path.

Signed-off-by: RithvikReddy0-0 <rithvikreddymukkara@gmail.com>
@RithvikReddy0-0

Author

Thanks for the review. Addressed all three in 4817d79.

1. Duplicate edges. Fixed via an idempotency guard in insert_cross_edge: it skips insertion when an identical (source_id, target_id, type) edge already exists. Verified no inflation across both repeated runs and runs initiated from either project:

  • run twice from provider side → cross_http_calls: 1 both times
  • run from consumer side → cross_http_calls: 1
  • exactly one edge per DB in all cases (fetch_order → url in consumer, get_order → route in provider)
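
The guard described above can be pictured with a small in-memory model. This is only a sketch of the idea: edge_store_t and insert_edge_once are hypothetical names, and the real insert_cross_edge works against the project's store rather than a fixed array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    long source_id;
    long target_id;
    const char *type;
} edge_t;

/* Hypothetical in-memory store; fixed capacity keeps the sketch
 * allocation-free. */
typedef struct {
    edge_t edges[64];
    size_t count;
} edge_store_t;

/* Inserts the edge unless an identical (source_id, target_id, type) triple
 * is already stored. Returns true only when a new edge was added. */
bool insert_edge_once(edge_store_t *store, long src, long tgt, const char *type) {
    for (size_t i = 0; i < store->count; i++) {
        const edge_t *e = &store->edges[i];
        if (e->source_id == src && e->target_id == tgt &&
            strcmp(e->type, type) == 0) {
            return false; /* duplicate: skip, which makes emission idempotent */
        }
    }
    if (store->count == sizeof store->edges / sizeof store->edges[0]) {
        return false; /* full; a real store would grow or report an error */
    }
    store->edges[store->count++] = (edge_t){src, tgt, type};
    return true;
}
```

Calling it twice with the same triple leaves one edge in the store, which is the behavior the repeated-run checks above rely on. The linear scan also shows where the per-candidate lookup cost comes from.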

One thing worth raising on the "single direction" suggestion: I don't think dropping the reverse match_http_routes works on its own. match_http_routes reads HTTP_CALLS from the source project, and emit_cross_route_bidirectional only runs inside a successful match. In the reporter's invocation (run from the provider with target_projects=[consumer]), the provider has no outbound HTTP_CALLS — it serves routes — so the forward pass matches nothing and the bidirectional write never fires. The reverse pass is what lets a provider-initiated run discover the consumer's calls. So I kept both directions and made emission idempotent instead.

Tradeoff I want to flag: the guard adds a find_edges_by_source_type lookup per candidate edge. On large graphs that's a non-trivial cost. If you'd prefer, I can switch to deduping once at the end of the pass (collect, unique, then insert) or add a dedicated cbm_store_edge_exists to avoid pulling the full edge list. Happy to go whichever way fits the codebase.
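
For comparison, the "collect, unique, then insert" alternative could look roughly like this. cand_edge_t and dedupe_candidates are hypothetical names; the sketch only removes duplicates within one pass and assumes edges left over from earlier runs are handled separately.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    long source_id;
    long target_id;
    const char *type;
} cand_edge_t;

/* Orders candidates by (source_id, target_id, type) so duplicates end up
 * adjacent after sorting. */
static int cand_cmp(const void *a, const void *b) {
    const cand_edge_t *x = a;
    const cand_edge_t *y = b;
    if (x->source_id != y->source_id) {
        return x->source_id < y->source_id ? -1 : 1;
    }
    if (x->target_id != y->target_id) {
        return x->target_id < y->target_id ? -1 : 1;
    }
    return strcmp(x->type, y->type);
}

/* Sorts the candidates and compacts duplicates in place; returns the number
 * of unique edges left at the front of the array. */
size_t dedupe_candidates(cand_edge_t *cands, size_t n) {
    if (n == 0) {
        return 0;
    }
    qsort(cands, n, sizeof *cands, cand_cmp);
    size_t out = 1;
    for (size_t i = 1; i < n; i++) {
        if (cand_cmp(&cands[out - 1], &cands[i]) != 0) {
            cands[out++] = cands[i];
        }
    }
    return out;
}
```

This trades one lookup per candidate for a single O(n log n) sort over the collected batch, at the cost of holding all candidates in memory until the end of the pass.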

3. Self-pass comment. Added. It documents that the external client has no graph node, so source_node is passed as both source and target, and that the URL/topic path links source → Route without dereferencing the duplicated arg.

2. Test. This is the one I'd like a pointer on. cbm_cross_repo_match resolves project DBs via cbm_resolve_cache_dir() + project name rather than an explicit path, so the existing test_pipeline.c harness (which passes explicit temp db_paths) doesn't sandbox it cleanly: a naive test would write into the real cache dir. Is there an existing pattern for overriding the cache dir in tests, or would you prefer I add the dedup assertion at the store level instead? Happy to write the reproduce-first byte-identical test once I know which approach fits.
