Background:
In the Shelby tenant in testnet, we investigated a user report of two hosts that could not communicate: 46.105.65.230 (Stakly) and 198.13.134.39 (tsams3-shelby-testnet-storage1). Packets from 198.13.134.39 arrived at 46.105.65.230, but 46.105.65.230 never sent a reply because it had no route to 198.13.134.39. The sequence of events appears to be:
doublezerod on 46.105.65.230 received the route to 198.13.134.39 over BGP and installed it in the Linux kernel routing table.
At some later point, an unknown process removed the route from the Linux kernel routing table.
Connectivity to that host then stays broken indefinitely: doublezerod does not reinstall a route that has been removed, and doublezero routes lists it as Kernel State = absent. For the route to be reinstalled, we'd need to restart doublezerod, or the route would have to be withdrawn and then re-advertised in BGP.
We both looked for doublezerod code paths that could delete the route, but could not find any that don't also log the delete action, and we didn't see that delete action in 46.105.65.230's doublezerod log.
It's still possible that this is a doublezerod problem, but it's also possible that some other process is removing the routes. Unfortunately we don't know how to reproduce the issue. If we keep the command below running (and logging) on the impacted hosts, then if and when it happens again we can at least learn which process (name, PID and UID) deleted the route.
sudo bpftrace -e 'kprobe:fib_table_delete { printf("%s [pid %d uid %d] deleted a route\n", comm, pid, uid); }'
Task:
Let's mitigate this issue in doublezerod by having it periodically scan the linux kernel routing table for missing routes, and reinstall any routes that it finds missing.
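The periodic scan described above could be structured roughly as follows. This is only a sketch under stated assumptions, not doublezerod's actual code: the Route type, the Kernel interface, Reconcile, RunReconciler and memKernel are all hypothetical names introduced for illustration, and Go is used purely as an example language. The idea is to compare the set of routes doublezerod believes should be installed against the kernel table and re-add only the missing ones.

```go
package main

import (
	"context"
	"time"
)

// Route is a hypothetical key for a kernel route, using the fields that
// appear in the liveness logs (table, dst, src, nh).
type Route struct {
	Table   int
	Dst     string
	Src     string
	NextHop string
}

// Kernel is a hypothetical abstraction over the kernel routing table.
// A real implementation would presumably be backed by netlink.
type Kernel interface {
	Has(r Route) (bool, error)
	Add(r Route) error
}

// Reconcile re-adds every desired route that is absent from the kernel
// and returns the routes it reinstalled. It stops at the first error.
func Reconcile(k Kernel, desired []Route) ([]Route, error) {
	var reinstalled []Route
	for _, r := range desired {
		present, err := k.Has(r)
		if err != nil {
			return reinstalled, err
		}
		if present {
			continue
		}
		if err := k.Add(r); err != nil {
			return reinstalled, err
		}
		reinstalled = append(reinstalled, r)
	}
	return reinstalled, nil
}

// RunReconciler calls Reconcile once per interval until ctx is cancelled.
// desired is called on every tick so that routes withdrawn in the meantime
// are not reinstalled. report receives each result, e.g. for logging.
func RunReconciler(ctx context.Context, k Kernel, desired func() []Route,
	interval time.Duration, report func([]Route, error)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			report(Reconcile(k, desired()))
		}
	}
}

// memKernel is an in-memory fake of Kernel, for tests only.
type memKernel map[Route]bool

func (m memKernel) Has(r Route) (bool, error) { return m[r], nil }
func (m memKernel) Add(r Route) error         { m[r] = true; return nil }
```

Logging each reinstalled route in the report callback would also leave a record of how often routes go missing, which complements the bpftrace capture. The desired set should contain only routes that doublezerod currently intends to have installed, so the scan does not undo a deliberate withdrawal.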
While we're at it, let's change the log level of the liveness session down logs so that they show up alongside the session up logs. The INFO output below is misleading as it stands: it looks like the liveness session came up 5 times in a row, but it must have gone down each time in order to come back up.
Apr 30 15:56:50 SHEL-VAL-TEST-OV-FR-02 doublezerod[1342]: {"time":"2026-04-30T15:56:50.820996452Z","level":"INFO","msg":"liveness: session up (global passive; no-op)","peer":"interface: doublezero0, localIP: 46.105.65.230, peerIP: 198.13.134.39","route":"table: 254, dst: 198.13.134.39/32, src: 46.105.65.230, nh: 169.254.3.128 protocol: bgp",">
Apr 30 15:56:47 SHEL-VAL-TEST-OV-FR-02 doublezerod[1342]: {"time":"2026-04-30T15:56:47.799877572Z","level":"INFO","msg":"liveness: session up (global passive; no-op)","peer":"interface: doublezero0, localIP: 46.105.65.230, peerIP: 198.13.134.39","route":"table: 254, dst: 198.13.134.39/32, src: 46.105.65.230, nh: 169.254.3.128 protocol: bgp",">
Apr 30 15:56:44 SHEL-VAL-TEST-OV-FR-02 doublezerod[1342]: {"time":"2026-04-30T15:56:44.762394702Z","level":"INFO","msg":"liveness: session up (global passive; no-op)","peer":"interface: doublezero0, localIP: 46.105.65.230, peerIP: 198.13.134.39","route":"table: 254, dst: 198.13.134.39/32, src: 46.105.65.230, nh: 169.254.3.128 protocol: bgp",">
Apr 30 15:56:26 SHEL-VAL-TEST-OV-FR-02 doublezerod[1342]: {"time":"2026-04-30T15:56:26.51670077Z", "level":"INFO","msg":"liveness: session up (global passive; no-op)","peer":"interface: doublezero0, localIP: 46.105.65.230, peerIP: 198.13.134.39","route":"table: 254, dst: 198.13.134.39/32, src: 46.105.65.230, nh: 169.254.3.128 protocol: bgp","c>
Apr 28 15:30:32 SHEL-VAL-TEST-OV-FR-02 doublezerod[1342]: {"time":"2026-04-28T15:30:32.187480511Z","level":"INFO","msg":"liveness: session up (global passive; no-op)","peer":"interface: doublezero0, localIP: 46.105.65.230, peerIP: 198.13.134.39","route":"table: 254, dst: 198.13.134.39/32, src: 46.105.65.230, nh: 169.254.3.128 protocol: bgp",">