workingSet A/B: no observed quality lift yet; consistent context-pressure cost

## Summary

We ran two A/B experiments on `workingSet.enabled=true` vs `false` in an isolated `openclaw-mem-engine` evaluation lane.

Current result:

- we did **not** observe visible answer-quality lift from `workingSet.enabled=true`
- we **did** observe consistent extra context pressure from the working-set backbone
- current evidence favors **`workingSet.enabled=false` as the cleaner default** until the feature is improved or a stronger win case is demonstrated

This issue is **not** "delete the feature".
It is: the current backbone shape appears too expensive relative to the quality gain we can currently measure.

---

## Why we looked at this

We were seeing a recurring pattern where a compact but somewhat generic backbone summary would show up early and consume budget.
Hypothesis:

- maybe `workingSet` improves long-session carryover and rule persistence
- or maybe it mostly adds cost / pressure without enough lift

We wanted receipts instead of vibes.

---

## Comparison method

We normalized the evaluation harness before the final comparison and then ran:

- serialized arms: **A then B**
- isolated per-arm state
- read-only evaluation posture
- no auto-capture writeback
- explicit multi-turn session carryover when needed

This was done to compare the feature behavior itself rather than incidental test-run noise.

---

## Experiment 1 — fixed probe set (7 probes)

Goal:

- test rule recall / constraint carryover / rule application
- compare answer quality vs context pressure

### Result

Both arms answered the 7 probes correctly.

#### `workingSet.enabled=true`

- avg `beforeChars ≈ 2946`
- avg `droppedCount = 2`
- avg `truncatedChars ≈ 1257`
- working set itself consumed about `900 chars` per turn

#### `workingSet.enabled=false`

- avg `beforeChars ≈ 2001`
- avg `droppedCount = 1`
- avg `truncatedChars ≈ 579`

### Interpretation

For this probe set:

- **no visible quality lift** from `workingSet=true`
- **clear increase in budget pressure** from `workingSet=true`

---

## Experiment 2 — harder multi-turn stress set

We then ran a more demanding carryover test with:

- seeded rules
- seeded open loops
- distractor turns
- a mid-run rule update
- final recall check
- final application/plan check

### What this tested

This was meant to give `workingSet` a fairer chance on the scenario it should theoretically help with:

- carrying constraints across turns
- preserving updated rules instead of stale ones
- surviving distractors without losing task shape

### Result

Both arms correctly carried:

- all 4 active rules
- all 3 open loops
- the updated rule version (not the stale earlier version)
- the final applied mini-plan

#### `workingSet.enabled=true`

- avg `beforeChars ≈ 2937`
- avg `droppedCount = 2`
- avg `truncatedChars ≈ 1263`
- still paying ~`900 chars` / turn for the backbone

#### `workingSet.enabled=false`

- avg `beforeChars ≈ 1991`
- avg `droppedCount = 1`
- avg `truncatedChars ≈ 551`

### Interpretation

Even on the harder multi-turn carryover test:

- we still saw **no visible answer-quality win** from `workingSet=true`
- we still saw **materially higher context pressure** from `workingSet=true`

That made the earlier result much less likely to be a weak-probe artifact.

---

## Current conclusion

With current shaping and current budget settings:

- `workingSet.enabled=true` appears to be **costly but not yet sufficiently discriminative**
- `workingSet.enabled=false` currently looks like the better default

In other words:

> the feature is not obviously wrong, but its current compression/backbone strategy is not yet earning its keep.

---

## Where optimization might actually help

Instead of treating this as a binary keep/remove question, I think the next iteration should target the backbone economics:

### 1) Make working set more selective

Right now the backbone feels too eager / too broad.
Potential fixes:

- stricter section admission
- stronger relevance thresholding
- fewer items per section
- tighter decay / recency weighting
- don't emit a section unless it materially changes action quality

### 2) Make backbone adaptive to budget pressure

If the lane is already near budget pressure, the backbone should shrink aggressively or disappear.
Possible knobs:

- dynamic `maxChars` instead of a fixed ~900-char chunk
- percentage-of-budget cap
- skip working-set generation when recall/budget pressure is already high
- degrade from structured summary -> minimal delta summary

### 3) Prefer delta summaries over restating the whole spine

A big reason the feature feels expensive is that it can restate too much stable information.
Potential direction:

- generate only the *changed* constraints / decisions / open loops since the prior turn
- or persist a compact internal representation and render only the needed slice for the current prompt

### 4) Measure on more adversarial long-horizon cases

We have not yet proven `workingSet` is never useful.
The next fair win-test would be something like:

- much longer horizon
- more distractor turns
- more late recall
- stale-rule replacement pressure
- multiple competing open loops / task branches

If `workingSet` can’t win there either, the case for default-off becomes much stronger upstream too.

### 5) Distinguish durable policy/context from task-working-set context

One suspicion: we may be mixing relatively durable carryover with ephemeral task-local state in one summary product.
That likely makes the summary fatter than needed.
Maybe these should be treated separately:

- durable policy / preferences
- task-local open loops
- turn-local scratchpad / delta

---

## Ask / next step for the repo

I’d treat this as a design follow-up issue:

- keep `workingSet` conceptually alive
- optimize for **smaller, more discriminative, more adaptive** summaries
- rerun the A/B with stronger long-horizon probes after the shaping changes

If useful, I can later convert this into a tighter benchmark card / reproducible eval harness summary for the repo.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

workingSet A/B: no observed quality lift yet; consistent context-pressure cost #67

Summary

Why we looked at this

Comparison method

Experiment 1 — fixed probe set (7 probes)

Result

`workingSet.enabled=true`

`workingSet.enabled=false`

Interpretation

Experiment 2 — harder multi-turn stress set

What this tested

Result

`workingSet.enabled=true`

`workingSet.enabled=false`

Interpretation

Current conclusion

Where optimization might actually help

1) Make working set more selective

2) Make backbone adaptive to budget pressure

3) Prefer delta summaries over restating the whole spine

4) Measure on more adversarial long-horizon cases

5) Distinguish durable policy/context from task-working-set context

Ask / next step for the repo

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

workingSet A/B: no observed quality lift yet; consistent context-pressure cost #67

Description

Summary

Why we looked at this

Comparison method

Experiment 1 — fixed probe set (7 probes)

Result

workingSet.enabled=true

workingSet.enabled=false

Interpretation

Experiment 2 — harder multi-turn stress set

What this tested

Result

workingSet.enabled=true

workingSet.enabled=false

Interpretation

Current conclusion

Where optimization might actually help

1) Make working set more selective

2) Make backbone adaptive to budget pressure

3) Prefer delta summaries over restating the whole spine

4) Measure on more adversarial long-horizon cases

5) Distinguish durable policy/context from task-working-set context

Ask / next step for the repo

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

`workingSet.enabled=true`

`workingSet.enabled=false`

`workingSet.enabled=true`

`workingSet.enabled=false`