stress: raise Linux RSS threshold to 384 MiB; add ExecRSSGrowthOverride #4

Merged: dmcgowan merged 1 commit into main from increase-stress-leak-threshold on Jun 26, 2026

@dmcgowan

The 64 MiB Linux threshold was calibrated for thin supervisor shims (runc-style), where the shim process has negligible in-process overhead. Shims that host a VM in-process (e.g. nerdbox running libkrun on Linux) exhibit a large one-time RSS step from VM initialization that saturates early and does not grow linearly with exec count.

Observed nerdbox data over a 19-minute / ~6000-iter run:

RSS after 30s / ~150 iters: +181 MiB
RSS after 60s / ~325 iters: +187 MiB (most growth happens here)
RSS after 11m / ~3500 iters: +194 MiB
RSS after 19m / ~6000 iters: +204 MiB (saturated)

Growth from 60s to 19 minutes is only +17 MiB despite ~18x more iterations; the average per-iteration rate drops from ~1.2 MiB over the first 30s (181 MiB / ~150 iters) to ~3 KiB at steady state (17 MiB / ~5700 iters). That is a clear saturation signature, not a leak.

Raise the Linux default to 384 MiB (matching Windows, which has the same one-time VM-state allocation pattern). 384 MiB is ~1.9x the observed peak (~204 MiB), giving headroom for CI variance while still catching genuine per-exec leaks: even at the observed steady-state rate of ~3 KiB/iter, a linear leak starting from the ~204 MiB peak would use up the remaining ~180 MiB of headroom and cross the threshold after ~60 000 iterations (~45 minutes at 22 iter/s).

Also add ExecRSSGrowthOverride to StressOptions so callers with unusually large or unusually small in-process overhead can tune the threshold without modifying the library.

Signed-off-by: Derek McGowan <derek@mcg.dev>
@dmcgowan merged commit bb4781e into main on Jun 26, 2026
5 of 6 checks passed
@dmcgowan deleted the increase-stress-leak-threshold branch on June 26, 2026 18:48