feat: iroh DNS discovery controller#147
Merged
zachsmith1 merged 9 commits intomainfrom May 1, 2026
Merged
Conversation
Adds IrohDNSReconciler, gated by connector.iroh.dnsEnabled, that watches Connector resources across project clusters and maintains a 1-1 DNSRecordSet in a separately-configured downstream cluster carrying the iroh-format TXT record at "<recordPrefix>.<z32-endpoint-id>.<baseDomain>". For each Connector the reconciler: - resolves spec.connectorClassName and only acts when the class controllerName is networking.datumapis.com/datum-connect (legacy) or networking.datumapis.com/iroh-quic-tunnel (current); - adds a networking.datumapis.com/iroh-dns-cleanup finalizer so cross- cluster DNSRecordSet cleanup runs even though OwnerReferences can't cross clusters; - reads status.connectionDetails.publicKey, hex-decodes the EndpointId and z-base-32 encodes it for the DNS lookup name; - builds a DNSRecordSet with one TXT entry per RFC 1464 attribute (relay=<url>, addr=<sockaddr>...). IPv6 addresses are bracketed via net.JoinHostPort. Empty relay or empty address list → that entry is omitted; only when both are missing do we skip publishing entirely. - server-side-applies the DNSRecordSet via the downstream client, field-managed under "network-services-operator/iroh-dns"; - deletes the DNSRecordSet when the Connector is deleted, when the class no longer routes to iroh, or when the agent's status drops the connection details (so stale TXT doesn't linger). Wires into cmd/main.go behind the dnsEnabled gate, building the downstream client from IrohConnectorConfig.DownstreamRestConfig (empty path falls back to in-cluster, matching DiscoveryConfig). The dns.networking.miloapis.com/v1alpha1 scheme is already registered on the operator's scheme so it's reused for the downstream client. Tests cover the status-gating decision table, the full record-set contents (z32 vs hex boundary, IPv6 bracketing, labels, TTL, multiple addresses), the relay-only/addr-only branches, and the invalid hex error path.
joinIrohAddresses now sorts by (address, port) before joining. pk.Addresses originates from iroh-base's iter-over-set, which is not order-stable, so the same set of endpoints can show up in different orders across heartbeats. Without sorting we'd produce different "addr=" TXT content on each reconcile, every server-side apply would trip a write, dns-operator would re-PATCH PowerDNS — pure churn for no behavior change. The sort runs on a clone so the caller's slice is preserved (it's shared with watchers and other reconciler passes). Tests cover the normalization (reversed input → canonical output), same-address-different-port ordering, and non-mutation of the input.
The previous baseDomain field was documented as "the suffix appended
to the prefix and z32 EndpointId to form the full lookup name" but
the controller never actually consumed it — record names were built
solely from recordPrefix + z32 and dns-operator picked the origin
from the DNSZone. baseDomain was a redundant config knob that, if
disagreeing with the zone, would silently misconfigure.
Replace it with recordSuffix: optional labels appended after the z32
EndpointId, before the zone origin. The DNSZone's spec.domainName
remains the only source of authoritative origin truth (and the zone's
metadata.name need not match its actual domain — they often won't).
For "_iroh.<z32>.connectors.example.com" with a DNSZone whose
spec.domainName is example.com, set recordSuffix=connectors. For
records directly under the zone root, leave recordSuffix empty.
Validation drops the baseDomain check; only dnsZoneRef.{name,namespace}
remain required when dnsEnabled is true. Tests updated; new test
covers the empty-suffix branch so we don't regress to the old
"<prefix>.<z32>." naming.
The previous design (1 DNSRecordSet per Connector UID) created
duplicate downstream RecordSets when multiple Connectors shared an
iroh keypair — observed in staging where four Connectors with the
same z32 produced four DNSRecordSets fighting over the same DNS name,
three stuck at PROGRAMMED=False. Multiple Connectors per keypair is
legitimate (same agent registered against two projects, multi-device
reuse) so the controller has to handle it.
This change:
- DNSRecordSet name is now iroh-<z32>, deterministic on the endpoint
id rather than the Connector UID. Multiple Connectors that share a
keypair collapse to one record at the iroh-DNS name they all need.
- Reconcile loop is claim-based:
Get downstream record →
not found : Create with our claim (label
iroh-dns-claimed-by-uid=<connector.UID>)
found, ours : SSA refresh content
found, foreign : defer; surface IrohDNSPublished=False with
Reason=DeferredToOwner naming the foreign
Connector via cluster/namespace/name labels.
AlreadyExists on Create is treated as "sibling raced"; the next
reconcile lands in the foreign-claim branch.
- On Connector deletion (or class-no-longer-routes-to-iroh, or
status loses connection details), we delete the DNSRecordSet only
if we hold the claim. Foreign-owned records are untouched.
- IrohDNSPublished status condition added on Connector with reasons
Owner / DeferredToOwner / Pending so users can see at the resource
level which Connector is publishing.
- Downstream DNSRecordSet is now exposed via cluster.Cluster (was
client.Client), wired through the manager's errgroup like the
existing downstreamCluster. A WatchesRawSource via mcsource.Kind
enqueues the owner Connector identified in the labels when the
DNSRecordSet changes — primarily catching drift / external delete
for the owner. Sibling handover (when an owner's record disappears,
another Connector should take over) rides the existing Connector
watch — multicluster-runtime's manager exposes GetCluster(name) but
not enumeration, so a downstream event can't fan out to siblings
across project clusters. Connector status updates (lease renewals)
fire frequently enough that handover converges within seconds.
Documented inline.
Tests: status gating, full record contents (asserts new label set:
iroh-dns-claimed-by-uid plus connector cluster/namespace/name),
empty/non-empty recordSuffix, partial record (relay-only), invalid
hex id, and a new test verifying two distinct Connectors with the
same key compute the same DNSRecordSet name (the load-bearing claim
property). Address sorting / IPv6 bracketing / non-mutation tests
preserved.
Existing iroh-<UID> records from the v1 design are left orphaned and
will be cleaned up manually.
4 tasks
Two related fixes pushed to staging by the earlier deploy were both
broken:
1. The downstream DNSRecordSet Create kept failing with
`metadata.labels: Invalid value: "/test-project-xp3tg3": a valid
label must be ... [A-Za-z0-9][-A-Za-z0-9_.]* ...`. Multicluster-
runtime cluster names start with "/", which is invalid as a raw
k8s label value. Apply the same `cluster-<name-with-/-as-_>`
encoding pattern that downstreamclient/mappednamespace.go uses
inline. Decode in the watch handler and the foreign-claim status
message.
2. The reconciler couldn't make handover converge. When an owner
Connector is deleted, sibling Connectors that share the same iroh
keypair need to reconcile and take over. The Connector watch
alone wasn't enough — the agent updates the Lease on every
heartbeat but doesn't necessarily touch the Connector itself, so
the Connector watch stayed silent for long stretches. Mirror the
pattern from connector_controller.go and add
`Watches(&Lease{}, EnqueueRequestForOwner(&Connector{},
OnlyControllerOwner()))`. Lease renewals now fire iroh-dns
reconciles on the lease cadence (≤30s in staging), bounding
handover time.
Tests cover the encoded-form expectation in the labels assertion and
add a round-trip TestEncodeDecodeIrohClusterLabel.
CI lint flagged releaseIfOwner's cl parameter as unused — the function only writes to the downstream client, not the upstream cluster. Drop it from the signature and the three call sites. Also extracts a few repeated test literals (relay URL, sample IPv4 and IPv6 addresses) into package-level constants, since the new test cases pushed their occurrence counts past goconst's threshold.
Switch app.kubernetes.io/managed-by from "network-services-operator-iroh-dns" to "networking.datumapis.com" so iroh DNS records show up in the same selector queries as gateway-managed records (gateway_dns_controller.go uses the same value).
scotwells
previously approved these changes
May 1, 2026
iroh's DNS parser at iroh-relay-0.95.1/src/endpoint_info.rs:307-312
runs SocketAddr::from_str on every IrohAttr::Addr value as-is and
silently drops failures via .ok():
let ip_addrs = attrs
.get(&IrohAttr::Addr)
.into_iter()
.flatten()
.filter_map(|s| SocketAddr::from_str(s).ok())
.map(TransportAddr::Ip);
It does not split on whitespace. iroh's serializer (line 511-520)
emits one (Addr, addr.to_string()) attribute per IP. So the canonical
wire form is multiple "addr=<single-sockaddr>" TXT lines, not one
"addr=<sockaddr> <sockaddr>..." line.
Our controller was packing all addresses into one space-separated
TXT line. SocketAddr::from_str("A B C") fails, so iroh dropped the
addr= record entirely and only saw the relay= entry. Confirmed in
staging iroh-gateway logs: a successful DNS lookup against our
origin returned EndpointInfo { addrs: {Relay(...)} } — direct
addresses missing despite the TXT record carrying them.
Fix: emit one RecordEntry per address. Sort order preserved via
sortIrohAddresses so SSA stays a no-op when nothing changed.
Tests updated: now asserting 3 records (relay + 2× addr) instead of 2.
joinIrohAddresses retired in favor of sortIrohAddresses. Round-trip
test now operates on sorted addresses, not joined string.
CI golangci-lint flagged the var declaration as preallocatable now that we append in a loop. Capacity is 1 (relay) + len(pk.Addresses) — overallocates by 1 when relay is empty, which is fine.
scotwells
approved these changes
May 1, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
IrohDNSReconciler(gated byconnector.iroh.dnsEnabledfrom feat: add connectors.iroh config block and validation #146) watchesConnectoracross project clusters and maintains a 1-1DNSRecordSetin a configured downstream cluster carrying the iroh-format TXT record at<recordPrefix>.<z32-endpoint-id>.<baseDomain>(lookup name iroh's resolver actually queries — seeiroh-0.95.1/src/discovery/dns.rs:22-25).spec.controllerName ∈ {networking.datumapis.com/datum-connect, networking.datumapis.com/iroh-quic-tunnel}(legacy + current).<character-string>per attribute. The reconciler emits arelay=<url>entry when present, anaddr=<sockaddr> [<sockaddr>...]entry when present, both joined as multipleRecordEntrys at the same name. IPv6 addresses are bracketed vianet.JoinHostPort.Idin iroh's hex Display form; we hex-decode and z-base-32 encode for the DNS lookup name (using feat: add z-base-32 encoder for iroh EndpointId hex form #145's helper).networking.datumapis.com/iroh-dns-cleanupsince OwnerReferences can't cross clusters. The DNSRecordSet is deleted on Connector deletion, on class change away from iroh, or when status drops the connection details (stale TXT prevention).network-services-operator/iroh-dns.Wiring
cmd/main.goonly constructs the downstream client and registers the controller whendnsEnabled. EmptydownstreamKubeconfigPathfalls back to in-cluster (single-cluster topology), matchingDiscoveryConfig/ProjectClientpatterns.dns.networking.miloapis.com/v1alpha1is already on the operator's scheme (cmd/main.go:72), reused for the downstream client.Tests
TestBuildDesiredRecordSet_StatusGating— table tests for which status shapes publish vs. skip (no details / no PK / id-only / id+relay / id+addrs / id+both).TestBuildDesiredRecordSet_RecordContents— full content assertion: deterministic nameiroh-<UID>, namespace from config, RecordType=TXT, both TXT lines, record name_iroh.<z32>, TTL, labels.TestBuildDesiredRecordSet_RelayOnlyOmitsAddrEntry— partial-record branch.TestBuildDesiredRecordSet_InvalidEndpointId— non-hex id surfaces error rather than silently writing.TestJoinIrohAddresses— IPv4 single, IPv6 single (bracketed), mixed, empty.go test ./...,go vet ./...,make manifests(no role diff — RBAC markers are subset of existing connector controller).make lintstill blocked by the pre-existing pinned-linter / go1.24 issue noted in #146; rango vet+gofmt -las substitutes (clean).Out of scope
GOLANGCI_LINT_VERSIONsomake lintruns against go1.24.