Summary
We ran two A/B experiments on `workingSet.enabled=true` vs `false` in an isolated `openclaw-mem-engine` evaluation lane.
Current result:
- we did not observe visible answer-quality lift from `workingSet.enabled=true`
- we did observe consistent extra context pressure from the working-set backbone
- current evidence favors `workingSet.enabled=false` as the cleaner default until the feature is improved or a stronger win case is demonstrated
This issue is not "delete the feature".
It is: the current backbone shape appears too expensive relative to the quality gain we can currently measure.
Why we looked at this
We were seeing a recurring pattern where a compact but somewhat generic backbone summary would show up early and consume budget.
Hypothesis:
- maybe `workingSet` improves long-session carryover and rule persistence
- or maybe it mostly adds cost / pressure without enough lift
We wanted receipts instead of vibes.
Comparison method
We normalized the evaluation harness before the final comparison and then ran:
- serialized arms: A then B
- isolated per-arm state
- read-only evaluation posture
- no auto-capture writeback
- explicit multi-turn session carryover when needed
This was done to compare the feature behavior itself rather than incidental test-run noise.
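The comparison loop above can be sketched as follows. This is a minimal illustration of the serialized, state-isolated posture, not the actual harness API: `run_probe` and its signature are hypothetical stand-ins.

```python
def run_arm(working_set_enabled, probes, run_probe):
    """Run one arm with fresh, isolated state and a read-only posture."""
    state = {}  # per-arm state: never shared across arms, never written back
    records = []
    for probe in probes:
        # run_probe is a hypothetical harness hook; it performs no auto-capture
        records.append(run_probe(probe, state, working_set_enabled))
    return records

def compare(probes, run_probe):
    """Serialized arms: A (enabled) fully completes before B (disabled) starts."""
    arm_a = run_arm(True, probes, run_probe)
    arm_b = run_arm(False, probes, run_probe)
    return arm_a, arm_b
```

The point of the serialization and the fresh `state` dict per arm is that neither arm can contaminate the other through shared memory or write-back.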
Experiment 1 — fixed probe set (7 probes)
Goal:
- test rule recall / constraint carryover / rule application
- compare answer quality vs context pressure
Result
Both arms answered the 7 probes correctly.
`workingSet.enabled=true`
- avg `beforeChars` ≈ 2946
- avg `droppedCount` = 2
- avg `truncatedChars` ≈ 1257
- the working set itself consumed about 900 chars per turn

`workingSet.enabled=false`
- avg `beforeChars` ≈ 2001
- avg `droppedCount` = 1
- avg `truncatedChars` ≈ 579
Interpretation
For this probe set:
- no visible quality lift from `workingSet=true`
- clear increase in budget pressure from `workingSet=true`
Experiment 2 — harder multi-turn stress set
We then ran a more demanding carryover test with:
- seeded rules
- seeded open loops
- distractor turns
- a mid-run rule update
- final recall check
- final application/plan check
What this tested
This was meant to give `workingSet` a fairer chance on the scenario it should theoretically help with:
- carrying constraints across turns
- preserving updated rules instead of stale ones
- surviving distractors without losing task shape
Result
Both arms correctly carried:
- all 4 active rules
- all 3 open loops
- the updated rule version (not the stale earlier version)
- the final applied mini-plan
`workingSet.enabled=true`
- avg `beforeChars` ≈ 2937
- avg `droppedCount` = 2
- avg `truncatedChars` ≈ 1263
- still paying ~900 chars / turn for the backbone

`workingSet.enabled=false`
- avg `beforeChars` ≈ 1991
- avg `droppedCount` = 1
- avg `truncatedChars` ≈ 551
Interpretation
Even on the harder multi-turn carryover test:
- we still saw no visible answer-quality win from `workingSet=true`
- we still saw materially higher context pressure from `workingSet=true`
That made the earlier result much less likely to be a weak-probe artifact.
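Putting the two runs side by side makes the overhead concrete: the per-turn `beforeChars` delta lands within ~50 chars of the ~900-char backbone in both experiments, and the extra truncation roughly doubles. A quick check from the averages reported above:

```python
# Averages reported above, per experiment: (beforeChars, truncatedChars)
enabled  = {"exp1": (2946, 1257), "exp2": (2937, 1263)}
disabled = {"exp1": (2001, 579),  "exp2": (1991, 551)}

overhead = {
    exp: {
        "before_delta": enabled[exp][0] - disabled[exp][0],       # 945 / 946
        "truncated_delta": enabled[exp][1] - disabled[exp][1],    # 678 / 712
    }
    for exp in enabled
}
```

The near-identical deltas across two different probe sets are why this reads as a fixed structural cost of the backbone rather than run-to-run noise.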
Current conclusion
With current shaping and current budget settings:
- `workingSet.enabled=true` appears to be costly but not yet sufficiently discriminative
- `workingSet.enabled=false` currently looks like the better default
In other words:
the feature is not obviously wrong, but its current compression/backbone strategy is not yet earning its keep.
Where optimization might actually help
Instead of treating this as a binary keep/remove question, I think the next iteration should target the backbone economics:
1) Make working set more selective
Right now the backbone feels too eager / too broad.
Potential fixes:
- stricter section admission
- stronger relevance thresholding
- fewer items per section
- tighter decay / recency weighting
- don't emit a section unless it materially changes action quality
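One possible shape for the stricter admission described above, as a sketch; the `relevance` score, threshold, and per-section cap are illustrative assumptions, not the current implementation:

```python
def admit(items, min_relevance=0.6, max_per_section=3):
    """Keep only the highest-relevance items, and never more than a few per section.

    items: list of dicts with a precomputed "relevance" score in [0, 1]
    (hypothetical shape; decay / recency weighting would feed into that score).
    """
    kept = sorted(
        (item for item in items if item["relevance"] >= min_relevance),
        key=lambda item: item["relevance"],
        reverse=True,
    )
    return kept[:max_per_section]
```

An empty return here would be the signal to skip emitting the section entirely, which covers the "don't emit unless it materially changes action quality" point.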
2) Make backbone adaptive to budget pressure
If the lane is already near budget pressure, the backbone should shrink aggressively or disappear.
Possible knobs:
- dynamic `maxChars` instead of a fixed ~900-char chunk
- percentage-of-budget cap
- skip working-set generation when recall/budget pressure is already high
- degrade from structured summary -> minimal delta summary
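A pressure-aware cap combining two of these knobs (percentage-of-budget cap plus a skip threshold) could look like the sketch below; all numbers are illustrative, not tuned values:

```python
def backbone_budget(total_budget, used_chars, cap_fraction=0.15, skip_below=200):
    """Shrink the working-set allowance as the lane fills; emit nothing under pressure.

    Returns a char allowance for the backbone: at most cap_fraction of the
    total budget, never more than what remains, and 0 (skip entirely) when
    the remainder is too small to render anything useful.
    """
    remaining = total_budget - used_chars
    allowance = min(int(total_budget * cap_fraction), remaining)
    return allowance if allowance >= skip_below else 0
```

With a 6000-char lane this yields the familiar ~900-char allowance when the lane is empty, degrades as it fills, and disappears entirely near the cap.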
3) Prefer delta summaries over restating the whole spine
A big reason the feature feels expensive is that it can restate too much stable information.
Potential direction:
- generate only the changed constraints / decisions / open loops since the prior turn
- or persist a compact internal representation and render only the needed slice for the current prompt
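Delta rendering could be as simple as diffing the tracked constraint/decision/open-loop map between turns and emitting only what changed. A sketch, where the flat key-to-value representation is an assumption about how the internal state might be kept:

```python
def delta(prev, curr):
    """Emit only entries that changed since the prior turn, plus anything dropped."""
    changed = {k: v for k, v in curr.items() if prev.get(k) != v}
    removed = sorted(k for k in prev if k not in curr)
    return {"changed": changed, "removed": removed}
```

On a quiet turn this renders to almost nothing, which is exactly the behavior the fixed ~900-char backbone lacks today.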
4) Measure on more adversarial long-horizon cases
We have not yet proven `workingSet` is never useful.
The next fair win-test would be something like:
- much longer horizon
- more distractor turns
- more late recall
- stale-rule replacement pressure
- multiple competing open loops / task branches
If `workingSet` can’t win there either, the case for default-off becomes much stronger upstream too.
5) Distinguish durable policy/context from task-working-set context
One suspicion: we may be mixing relatively durable carryover with ephemeral task-local state in one summary product.
That likely makes the summary fatter than needed.
Maybe these should be treated separately:
- durable policy / preferences
- task-local open loops
- turn-local scratchpad / delta
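Separating those tiers could mean keeping three distinct stores and rendering only the slice the current prompt needs. A sketch of the idea, not the engine's data model:

```python
from dataclasses import dataclass, field

@dataclass
class CarryoverState:
    durable: dict = field(default_factory=dict)  # policy / preferences: rarely re-emitted
    task: dict = field(default_factory=dict)     # open loops: emitted while the task is live
    scratch: dict = field(default_factory=dict)  # turn-local delta: emitted once, then cleared

    def render(self, include_durable=False):
        """Render only the needed slice; turn-local scratch is consumed on emit."""
        parts = dict(self.scratch)
        parts.update(self.task)
        if include_durable:
            parts.update(self.durable)
        self.scratch = {}
        return parts
```

Keeping the durable tier out of the default render path is what would stop the summary from restating stable policy every turn.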
Ask / next step for the repo
I’d treat this as a design follow-up issue:
- keep `workingSet` conceptually alive
- optimize for smaller, more discriminative, more adaptive summaries
- rerun the A/B with stronger long-horizon probes after the shaping changes
If useful, I can later convert this into a tighter benchmark card / reproducible eval harness summary for the repo.