
workingSet A/B: no observed quality lift yet; consistent context-pressure cost #67

@phenomenoner


Summary

We ran two A/B experiments on workingSet.enabled=true vs false in an isolated openclaw-mem-engine evaluation lane.

Current result:

  • we did not observe visible answer-quality lift from workingSet.enabled=true
  • we did observe consistent extra context pressure from the working-set backbone
  • current evidence favors workingSet.enabled=false as the cleaner default until the feature is improved or a stronger win case is demonstrated

This issue is not "delete the feature".
The claim is narrower: the current backbone shape appears too expensive relative to the quality gain we can currently measure.


Why we looked at this

We were seeing a recurring pattern where a compact but somewhat generic backbone summary would show up early in the context and consume budget.
Two competing hypotheses:

  • maybe workingSet improves long-session carryover and rule persistence
  • or maybe it mostly adds cost / pressure without enough lift

We wanted receipts instead of vibes.


Comparison method

We normalized the evaluation harness before the final comparison and then ran:

  • serialized arms: A then B
  • isolated per-arm state
  • read-only evaluation posture
  • no auto-capture writeback
  • explicit multi-turn session carryover when needed

This was done to compare the feature behavior itself rather than incidental test-run noise.
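The posture above can be sketched roughly as follows. This is a minimal illustration, not the actual harness: `run_arm`, `run_ab`, the `ask` callback, and the `autoCapture` key are hypothetical names; only `workingSet.enabled` comes from the experiment itself.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical callback: (config, history, probe) -> answer text.
AskFn = Callable[[dict, list, str], str]


@dataclass
class ArmResult:
    name: str
    answers: list = field(default_factory=list)


def run_arm(name: str, working_set_enabled: bool, probes: list, ask: AskFn) -> ArmResult:
    # Isolated per-arm state: fresh config and fresh history for every arm.
    config = {
        "workingSet.enabled": working_set_enabled,
        "autoCapture": False,  # assumed key name for "no auto-capture writeback"
    }
    history: list = []
    result = ArmResult(name)
    for probe in probes:
        answer = ask(config, history, probe)
        # Explicit multi-turn carryover: the harness, not the engine, extends history.
        history.extend([probe, answer])
        result.answers.append(answer)
    return result


def run_ab(probes: list, ask: AskFn):
    # Serialized arms: A runs to completion before B starts.
    arm_a = run_arm("A", True, probes, ask)
    arm_b = run_arm("B", False, probes, ask)
    return arm_a, arm_b
```

Because each arm builds its own `history`, arm B cannot see anything arm A produced, which is the property the isolation bullets are after.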


Experiment 1 — fixed probe set (7 probes)

Goal:

  • test rule recall / constraint carryover / rule application
  • compare answer quality vs context pressure

Result

Both arms answered the 7 probes correctly.

workingSet.enabled=true

  • avg beforeChars ≈ 2946
  • avg droppedCount = 2
  • avg truncatedChars ≈ 1257
  • working set itself consumed about 900 chars per turn

workingSet.enabled=false

  • avg beforeChars ≈ 2001
  • avg droppedCount = 1
  • avg truncatedChars ≈ 579

Interpretation

For this probe set:

  • no visible quality lift from workingSet.enabled=true
  • clear increase in budget pressure from workingSet.enabled=true

Experiment 2 — harder multi-turn stress set

We then ran a more demanding carryover test with:

  • seeded rules
  • seeded open loops
  • distractor turns
  • a mid-run rule update
  • final recall check
  • final application/plan check
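For readers who want the shape of such a run, here is a rough sketch of the turn sequence and a stale-rule check. The rule and loop texts are invented placeholders, not the probes we actually used, and `Turn`, `build_scenario`, `active_rules`, `carries_current_rules`, and `mentions_stale_rule` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    kind: str  # seed_rule, seed_loop, distractor, update_rule, recall, apply
    text: str


def build_scenario() -> list:
    # Placeholder content only; the real stress set had 4 rules and 3 open loops.
    return [
        Turn("seed_rule", "R1: reply in bullet points"),
        Turn("seed_rule", "R2: cite file paths"),
        Turn("seed_loop", "L1: confirm deploy window"),
        Turn("distractor", "What is a good name for a cat?"),
        Turn("update_rule", "R1: reply in numbered lists"),
        Turn("distractor", "Summarize yesterday's weather."),
        Turn("recall", "List all active rules and open loops."),
        Turn("apply", "Draft a mini-plan that follows the active rules."),
    ]


def active_rules(turns: list) -> dict:
    """Latest body per rule id; a later update replaces the earlier version."""
    rules = {}
    for turn in turns:
        if turn.kind in ("seed_rule", "update_rule"):
            rule_id, _, body = turn.text.partition(":")
            rules[rule_id.strip()] = body.strip()
    return rules


def carries_current_rules(answer: str, turns: list) -> bool:
    """True if the answer mentions every currently active rule body."""
    return all(body in answer for body in active_rules(turns).values())


def mentions_stale_rule(answer: str, turns: list) -> bool:
    """True if the answer repeats a rule body that was later replaced."""
    current = set(active_rules(turns).values())
    for turn in turns:
        if turn.kind in ("seed_rule", "update_rule"):
            body = turn.text.partition(":")[2].strip()
            if body not in current and body in answer:
                return True
    return False
```

Substring matching is a deliberately crude scorer; a real check would likely need fuzzier matching or manual grading.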

What this tested

This was meant to give workingSet a fairer chance on the scenario it should theoretically help with:

  • carrying constraints across turns
  • preserving updated rules instead of stale ones
  • surviving distractors without losing task shape

Result

Both arms correctly carried:

  • all 4 active rules
  • all 3 open loops
  • the updated rule version (not the stale earlier version)
  • the final applied mini-plan

workingSet.enabled=true

  • avg beforeChars ≈ 2937
  • avg droppedCount = 2
  • avg truncatedChars ≈ 1263
  • still paying ~900 chars / turn for the backbone

workingSet.enabled=false

  • avg beforeChars ≈ 1991
  • avg droppedCount = 1
  • avg truncatedChars ≈ 551

Interpretation

Even on the harder multi-turn carryover test:

  • we still saw no visible answer-quality win from workingSet.enabled=true
  • we still saw materially higher context pressure from workingSet.enabled=true

That made the earlier result much less likely to be a weak-probe artifact.
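To make the cost side concrete, the reported averages from both experiments can be differenced directly. The numbers below are copied from the results above (they are approximate averages); `pressure_delta` and `overhead_pct` are hypothetical helper names.

```python
# Reported per-arm averages ("on" = workingSet.enabled=true, "off" = false).
REPORTED = {
    "experiment_1": {
        "on": {"beforeChars": 2946, "droppedCount": 2, "truncatedChars": 1257},
        "off": {"beforeChars": 2001, "droppedCount": 1, "truncatedChars": 579},
    },
    "experiment_2": {
        "on": {"beforeChars": 2937, "droppedCount": 2, "truncatedChars": 1263},
        "off": {"beforeChars": 1991, "droppedCount": 1, "truncatedChars": 551},
    },
}


def pressure_delta(on: dict, off: dict) -> dict:
    """Per-metric extra cost of the 'on' arm over the 'off' arm."""
    return {metric: on[metric] - off[metric] for metric in on}


def overhead_pct(on: dict, off: dict, metric: str = "beforeChars") -> float:
    """Extra cost as a percentage of the 'off' baseline, rounded to 0.1."""
    return round(100 * (on[metric] - off[metric]) / off[metric], 1)


for name, arms in REPORTED.items():
    print(name, pressure_delta(arms["on"], arms["off"]),
          overhead_pct(arms["on"], arms["off"]))
```

The beforeChars delta comes out at 945 and 946 chars (roughly 47% over the baseline in both runs), which lines up with the ~900 chars per turn attributed to the backbone itself.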


Current conclusion

With current shaping and current budget settings:

  • workingSet.enabled=true appears to be costly but not yet sufficiently discriminative
  • workingSet.enabled=false currently looks like the better default

In other words:

the feature is not obviously wrong, but its current compression/backbone strategy is not yet earning its keep.


Where optimization might actually help

Instead of treating this as a binary keep/remove question, I think the next iteration should target the backbone economics:

1) Make working set more selective

Right now the backbone feels too eager / too broad.
Potential fixes:

  • stricter section admission
  • stronger relevance thresholding
  • fewer items per section
  • tighter decay / recency weighting
  • don't emit a section unless it materially changes action quality
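One way these knobs could combine, as a sketch only: `Candidate`, `admit`, the relevance score, and every default value here are assumptions for illustration, not existing engine options.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    section: str      # e.g. "rules", "open_loops"
    text: str
    relevance: float  # assumed 0..1 score from some upstream ranker
    age_turns: int    # turns since the item was last touched


def admit(candidates: list, min_score: float = 0.5,
          max_per_section: int = 2, half_life_turns: float = 4.0) -> dict:
    """Keep only items that clear a decayed-relevance threshold.

    Sections with no surviving items are not emitted at all.
    """
    scored_by_section = {}
    for cand in candidates:
        # Exponential recency decay: the score halves every half_life_turns.
        score = cand.relevance * 0.5 ** (cand.age_turns / half_life_turns)
        if score >= min_score:
            scored_by_section.setdefault(cand.section, []).append((score, cand))

    admitted = {}
    for section, scored in scored_by_section.items():
        scored.sort(key=lambda pair: pair[0], reverse=True)
        admitted[section] = [cand.text for _, cand in scored[:max_per_section]]
    return admitted
```

The last bullet (only emit a section if it changes action quality) is not captured by a score threshold alone; it would need an outcome-based measurement like the A/B runs above.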

2) Make backbone adaptive to budget pressure

If the lane is already near budget pressure, the backbone should shrink aggressively or disappear.
Possible knobs:

  • dynamic maxChars instead of a fixed ~900-char chunk
  • percentage-of-budget cap
  • skip working-set generation when recall/budget pressure is already high
  • degrade from structured summary -> minimal delta summary
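A rough sketch of how those knobs might interact. `working_set_budget`, `render_backbone`, and all thresholds are hypothetical; the only grounded number is the ~900-char chunk observed in the experiments.

```python
def working_set_budget(total_budget: int, used_chars: int,
                       fixed_cap: int = 900, share_percent: int = 15,
                       skip_at_percent: int = 85) -> int:
    """Chars the backbone may spend this turn (0 means skip it entirely).

    used_chars is the context already committed before the working set is added.
    """
    if total_budget <= 0:
        return 0
    # Skip generation when the lane is already under high pressure.
    if used_chars * 100 >= skip_at_percent * total_budget:
        return 0
    # Percentage-of-budget cap, never exceeding the old fixed chunk size.
    cap = min(fixed_cap, total_budget * share_percent // 100)
    remaining = total_budget - used_chars
    return max(0, min(cap, remaining))


def render_backbone(structured: str, delta: str, budget: int) -> str:
    """Degrade from structured summary to minimal delta to nothing."""
    if budget <= 0:
        return ""
    if len(structured) <= budget:
        return structured
    if len(delta) <= budget:
        return delta
    return ""
```

Integer percentages are used instead of floats so the cap is exact and easy to reason about in tests.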

3) Prefer delta summaries over restating the whole spine

A big reason the feature feels expensive is that it can restate too much stable information.
Potential direction:

  • generate only the changed constraints / decisions / open loops since the prior turn
  • or persist a compact internal representation and render only the needed slice for the current prompt
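The first option amounts to diffing two keyed snapshots. A minimal sketch, assuming the working set can be flattened to a dict of id to text (that representation and the `delta_summary` name are assumptions):

```python
def delta_summary(prev: dict, curr: dict) -> list:
    """Render only what changed between two working-set snapshots.

    '+' marks a new item, '~' a changed one, '-' a removed one.
    Unchanged items are not restated.
    """
    lines = []
    for key in sorted(curr):
        if key not in prev:
            lines.append(f"+ {key}: {curr[key]}")
        elif prev[key] != curr[key]:
            lines.append(f"~ {key}: {curr[key]}")
    for key in sorted(prev):
        if key not in curr:
            lines.append(f"- {key}")
    return lines


prev_state = {"R1": "bullets", "R2": "cite paths", "L1": "confirm window"}
curr_state = {"R1": "numbered", "R2": "cite paths", "L2": "ping QA"}
print(delta_summary(prev_state, curr_state))
# ['+ L2: ping QA', '~ R1: numbered', '- L1']
```

On a turn where nothing changed, the delta is empty and the backbone costs zero chars, which is the economic point of this direction.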

4) Measure on more adversarial long-horizon cases

We have not yet proven workingSet is never useful.
The next fair win-test would be something like:

  • much longer horizon
  • more distractor turns
  • more late recall
  • stale-rule replacement pressure
  • multiple competing open loops / task branches

If workingSet can’t win there either, the case for default-off becomes much stronger upstream too.

5) Distinguish durable policy/context from task-working-set context

One suspicion: we may be mixing relatively durable carryover with ephemeral task-local state in one summary product.
That likely makes the summary fatter than needed.
Maybe these should be treated separately:

  • durable policy / preferences
  • task-local open loops
  • turn-local scratchpad / delta
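As a sketch of what that separation could look like, each tier gets its own lifetime and the durable tier is only rendered on request. `CarryoverState` and its fields are hypothetical names, not an existing structure in the engine.

```python
from dataclasses import dataclass, field


@dataclass
class CarryoverState:
    durable_policy: list = field(default_factory=list)   # survives across tasks
    task_open_loops: list = field(default_factory=list)  # cleared when the task ends
    turn_delta: list = field(default_factory=list)       # cleared every turn

    def end_turn(self) -> None:
        self.turn_delta.clear()

    def end_task(self) -> None:
        self.task_open_loops.clear()
        self.turn_delta.clear()

    def render(self, include_policy: bool = False) -> str:
        """Render only non-empty tiers; policy is opt-in to avoid restating it."""
        parts = []
        if include_policy and self.durable_policy:
            parts.append("policy: " + "; ".join(self.durable_policy))
        if self.task_open_loops:
            parts.append("open: " + "; ".join(self.task_open_loops))
        if self.turn_delta:
            parts.append("delta: " + "; ".join(self.turn_delta))
        return "\n".join(parts)
```

When to pass `include_policy=True` (first turn only, after a policy change, or every N turns) is left open here; that choice is exactly the kind of thing the rerun A/B should measure.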

Ask / next step for the repo

I’d treat this as a design follow-up issue:

  • keep workingSet conceptually alive
  • optimize for smaller, more discriminative, more adaptive summaries
  • rerun the A/B with stronger long-horizon probes after the shaping changes

If useful, I can later convert this into a tighter benchmark card / reproducible eval harness summary for the repo.

Labels: enhancement (New feature or request)
