3 changes: 3 additions & 0 deletions evals/draft-to-design-doc/.gitignore
@@ -0,0 +1,3 @@
jobs/
.env
__pycache__/
64 changes: 64 additions & 0 deletions evals/draft-to-design-doc/README.md
@@ -0,0 +1,64 @@
# Draft → Design Doc

Quick eval comparing how Claude Code extracts a Design Doc JSON from a Markdown design draft, across three levels of scaffolding:

| Variant | Task | Prompt | Plugin / MCP | Expected output |
|---|---|---|---|---|
| `vanilla` | `threshold-discount-extraction` | minimal | none | `/app/output/design-doc.json` |
| `guided` | `threshold-discount-extraction` | extended (mirrors `analyze-design-draft` reference) | none | `/app/output/design-doc.json` |
| `noesis` | `threshold-discount-extraction-noesis` | "use the skill" | full plugin + `noesis-graph` MCP | `/app/noesis/design-docs/<slug>-<id>.json` |

## TDD note (read this first)

`tests/test.sh` is **TDD-style** — it asserts the *expected correct behaviour*, not the current behaviour of the noesis plugin. It is **identical** for all three variants (one spec, three implementations).

Currently `noesis-graph:save_design_doc` enforces a green-field ChangeSet shape: every collection's `added` is populated and `modified` / `removed` are empty. That's a bug — the Design Doc is supposed to express the diff against the implemented codebase. The DDD-starter-dotnet repo is mounted at `/app/repo/`; the agent should use it as the diff baseline.

The test asserts:

- `Sales` BC in `boundedContexts.modified` (it already exists in `/app/repo/Sources/Sales/`)
- `Sales.Pricing.Discounts` module in `Sales.modules.modified` (already exists)
- `ThresholdDiscount` in `buildingBlocks.added` under that module (genuinely new)
- `Discount` in `buildingBlocks.modified` (already exists; the draft adds a new `Threshold` factory behaviour)

Until the green-field constraint is lifted, **all three variants will FAIL**. That's the intended TDD signal — when the bug is fixed, runs that produce the correct shape start passing.
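
As a rough illustration, the four assertions could be expressed as the Python sketch below. It is not the actual `tests/test.sh`. The JSON layout (a `name` key on every entry, with `modules` and `buildingBlocks` nested as `added` / `modified` collections) is an assumption inferred from the bullet list above, and `find_by_name` / `check_design_doc` are hypothetical helpers.

```python
from typing import Any, Dict, List, Optional

Entry = Dict[str, Any]


def find_by_name(entries: List[Entry], name: str) -> Optional[Entry]:
    """Return the first entry whose (assumed) "name" key equals `name`."""
    return next((e for e in entries if e.get("name") == name), None)


def check_design_doc(doc: Entry) -> List[str]:
    """Return a list of failures; an empty list means all four assertions hold."""
    sales = find_by_name(doc.get("boundedContexts", {}).get("modified", []), "Sales")
    if sales is None:
        return ["Sales is not in boundedContexts.modified"]

    module = find_by_name(
        sales.get("modules", {}).get("modified", []), "Sales.Pricing.Discounts"
    )
    if module is None:
        return ["Sales.Pricing.Discounts is not in Sales.modules.modified"]

    failures = []
    blocks = module.get("buildingBlocks", {})
    if find_by_name(blocks.get("added", []), "ThresholdDiscount") is None:
        failures.append("ThresholdDiscount is not in buildingBlocks.added")
    if find_by_name(blocks.get("modified", []), "Discount") is None:
        failures.append("Discount is not in buildingBlocks.modified")
    return failures


# Diff-shaped document: existing elements are `modified`, only the new one is `added`.
diff_shaped = {
    "boundedContexts": {
        "added": [],
        "modified": [{
            "name": "Sales",
            "modules": {"modified": [{
                "name": "Sales.Pricing.Discounts",
                "buildingBlocks": {
                    "added": [{"name": "ThresholdDiscount"}],
                    "modified": [{"name": "Discount"}],
                },
            }]},
        }],
    }
}

# Green-field document: everything sits under `added`, the shape the bug enforces.
green_field = {"boundedContexts": {"added": [{"name": "Sales"}], "modified": []}}

assert check_design_doc(diff_shaped) == []
assert check_design_doc(green_field) == ["Sales is not in boundedContexts.modified"]
```

The green-field document fails on the very first assertion, which is why every variant is expected to fail while the constraint is in place.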

Each task ships its own copy of `test.sh`, but the assertions are byte-for-byte identical. The two paths exist only because Harbor binds `task.toml` → one Dockerfile → one verifier per task, and the noesis variant needs a different sandbox (Bun + plugin install + MCP server).

## Auth

```bash
source ~/Documents/git/noesis/sdlc-projects/nasde-toolkit/scripts/export_oauth_token.sh
```

## Run

The two non-noesis variants share a single task:

```bash
nasde run --variant vanilla --tasks threshold-discount-extraction --without-eval -C evals/draft-to-design-doc
nasde run --variant guided --tasks threshold-discount-extraction --without-eval -C evals/draft-to-design-doc
```

The noesis variant has a dedicated task (different sandbox: bun + plugin install + MCP server). Stage the plugin sources into the docker build context once, then run:

```bash
bash evals/draft-to-design-doc/tasks/threshold-discount-extraction-noesis/environment/prepare-context.sh
nasde run --variant noesis --tasks threshold-discount-extraction-noesis --without-eval -C evals/draft-to-design-doc
```

Re-run `prepare-context.sh` whenever the plugin sources under `src/agent_extensions/plugins/noesis/` change. It is an `rsync` mirror, so repeat runs stay cheap.

## Inspect output

```bash
VARIANT=vanilla  # or: guided, noesis
LATEST=$(ls -td evals/draft-to-design-doc/jobs/*__"${VARIANT}"__*/ | head -1)
DOC=$(find "$LATEST" \( -name "design-doc.json" -o \( -name "*.json" -path "*noesis/design-docs/*" \) \) | head -1)
jq '.' "$DOC" | less
```

## Notes

- `--without-eval` skips the LLM-as-Judge phase. `assessment_dimensions.json` is intentionally empty here — quality grading is owned by manual inspection until the dimensions are designed.
- The vanilla and guided variants do NOT pass `--with-opik` (no Opik tracking by default). Add the flag when comparing across runs.
- Per-variant results live under `jobs/<timestamp>__<variant>__<suffix>/<task>__<trial>/`. The produced design doc is in `artifacts/workspace/output/design-doc.json` (vanilla/guided) or `artifacts/workspace/noesis/design-docs/<file>.json` (noesis).
3 changes: 3 additions & 0 deletions evals/draft-to-design-doc/assessment_dimensions.json
@@ -0,0 +1,3 @@
{
"dimensions": []
}
20 changes: 20 additions & 0 deletions evals/draft-to-design-doc/nasde.toml
@@ -0,0 +1,20 @@
[project]
name = "draft-to-design-doc"
version = "1.0.0"

[defaults]
variant = "vanilla"
model = "claude-sonnet-4-6"
timeout_sec = 600

[docker]
base_image = "ubuntu:22.04"
build_commands = []

[evaluation]
model = "claude-sonnet-4-6"
dimensions_file = "assessment_dimensions.json"

[reporting]
platform = "opik"
project_name = "draft-to-design-doc"
@@ -0,0 +1,103 @@
# Threshold-activated percentage discount

Author: Sales Product Team + Architecture
Date: 2026-05-07
Status: Draft for review

## Background

Our sales reps already have two ways to grant discounts on a single product:

- **Percentage discount** — e.g. 10% off, applied to any price.
- **Value discount** — e.g. 50 PLN off, capped at the price (never goes negative).

Both live in `Sales.DeepModel.Pricing.Discounts` as immutable value objects, and they are exposed to the rest of the system through a single `Discount` discriminated-union type (so callers don't have to know which variant they hold).

A percentage discount applies regardless of the underlying price. Reps have been complaining that this is too blunt for premium products: a 10% promotion on a 50-PLN accessory doesn't move the needle, but the same promo on a 600-PLN device is a real margin hit. They want a discount type that **only kicks in once the price clears a certain bar** — below that bar, the customer pays the original price.

## Business requirement

Add a third kind of discount: a **threshold-activated percentage discount**.

It carries two parameters:

- a **percentage** (same shape as in `PercentageDiscount`),
- a **price threshold** (a `Money` amount that the price must *exceed* to activate).

Behaviour when applied to a price:

- if the price is **above** the threshold → return the price reduced by the percentage,
- if the price is **at** the threshold or **below** it → return the price unchanged.

The new discount must be usable everywhere the existing `Discount` type is used today — no new caller, no parallel union, no separate wiring. From the outside, a `Discount` should now have three variants instead of two.

## Worked examples

| Price | Threshold | Percentage | Result | Why |
|---|---|---|---|---|
| 600 PLN | 500 PLN | 10% | 540 PLN | price > threshold → discount applies |
| 500 PLN | 500 PLN | 10% | 500 PLN | price == threshold → no discount (strict greater-than) |
| 300 PLN | 500 PLN | 10% | 300 PLN | price < threshold → unchanged |
| 1000 PLN | 1 PLN | 25% | 750 PLN | a minimal positive threshold is cleared by any higher price |

## Acceptance criteria

1. The `Discount` discriminated union has a **third variant**. Existing call sites that build a percentage or value discount keep working unchanged — no breaking change to their public surface.
2. `Discount.ApplyOn(price)` dispatches correctly to the new variant.
3. Construction of the new variant is **fail-fast**: an instance with a percentage outside the legal range, or with a non-positive threshold, cannot exist.
4. Existing `PercentageDiscount` and `ValueDiscount` are **not modified** — their behaviour and tests stay green.
5. Tests cover at minimum: above-threshold, at-threshold (boundary), below-threshold, zero-price, factory rejection on invalid percentage, factory rejection on non-positive threshold, equality of two equivalent instances.

## Architectural decisions

These were resolved during the design review on 2026-05-06. They are not up for re-debate during implementation; if a constraint conflicts with one of them, escalate.

### A1. New value object: `ThresholdDiscount`

A new Building Block in the `Sales.DeepModel.Pricing.Discounts` module. Type: **value object**, mirroring the shape of `PercentageDiscount` and `ValueDiscount`:

- C# `readonly struct` annotated with `[DddValueObject]`.
- Implements the existing `PriceModifier` interface (provides `ApplyOn(Money price)`).
- Implements `IEquatable<ThresholdDiscount>`, with `Equals(object?)`, `GetHashCode()`, and a sensible `ToString()` consistent with the two existing discount value objects.
- Internal state is **two private fields**: a `Percentage` and a `Money` threshold. No public getters — equality, application and `ToString` are the only externally observable behaviours.
- Construction goes through a **static factory method** (`Of(Percentage value, Money threshold)`, mirroring `PercentageDiscount.Of` and `ValueDiscount.Of`). The constructor is private. The factory enforces the invariant from §A4.

### A2. `ApplyOn(Money price)` semantics

`ApplyOn` returns:

- `price * (Percentage.Of100 - percentage)` when `price > threshold`,
- `price` otherwise.

The `>` comparison is **strict** (price equal to threshold returns `price` unchanged). This matches the worked examples and the way reps describe the rule ("the price has to *exceed* the threshold").
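
The rule can be sketched in a few lines. Python is used here only for illustration (per A1 the target is a C# `readonly struct`), plain `Decimal` amounts stand in for `Money` and `Percentage`, and `apply_threshold_discount` is a hypothetical name.

```python
from decimal import Decimal


def apply_threshold_discount(price: Decimal, threshold: Decimal, percentage: Decimal) -> Decimal:
    """Reduce `price` by `percentage` percent only when it strictly exceeds `threshold`."""
    if price > threshold:
        return price * (Decimal(100) - percentage) / Decimal(100)
    return price


# The first three rows of the worked-examples table (amounts in PLN).
assert apply_threshold_discount(Decimal(600), Decimal(500), Decimal(10)) == Decimal(540)
assert apply_threshold_discount(Decimal(500), Decimal(500), Decimal(10)) == Decimal(500)
assert apply_threshold_discount(Decimal(300), Decimal(500), Decimal(10)) == Decimal(300)
```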

### A3. `Discount` union — third variant

The `Discount` discriminated union in `Discount.cs` is **modified**, not replaced or paralleled:

- A new private discriminator state must be introduced. The current `bool _isPercentage` is no longer sufficient; replace it with a small enum-like discriminator (e.g. a private nested enum or three named constants) so that the three cases are exhaustively distinguishable inside `ApplyOn`. Keep the field private.
- Add a third field `private readonly ThresholdDiscount _thresholdDiscount;` next to the two existing variant fields.
- Add a public static factory `Discount.Threshold(Percentage value, Money threshold)`, mirroring `Discount.Percentage(...)` and `Discount.Value(...)`.
- `ApplyOn(Money price)` dispatches on the discriminator across all three branches.
- `Equals`, `GetHashCode` and `ToString` continue to cover all three variants.
- The two existing factory methods (`Discount.Percentage`, `Discount.Value`) keep their signature exactly as today — call sites must not need any change.
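
A minimal sketch of the three-way discriminator, in Python for illustration only. It simplifies the real design: the union here stores raw `Decimal` values instead of wrapping the three value objects, and the names `_Kind`, `percentage`, `value` and `threshold` are assumptions for this example, not the C# API.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum, auto
from typing import Optional


class _Kind(Enum):
    """Private discriminator replacing a two-state boolean."""
    PERCENTAGE = auto()
    VALUE = auto()
    THRESHOLD = auto()


@dataclass(frozen=True)
class Discount:
    _kind: _Kind
    _percentage: Optional[Decimal] = None  # used by PERCENTAGE and THRESHOLD
    _value: Optional[Decimal] = None       # used by VALUE
    _threshold: Optional[Decimal] = None   # used by THRESHOLD

    @staticmethod
    def percentage(value: Decimal) -> "Discount":
        return Discount(_Kind.PERCENTAGE, _percentage=value)

    @staticmethod
    def value(amount: Decimal) -> "Discount":
        return Discount(_Kind.VALUE, _value=amount)

    @staticmethod
    def threshold(value: Decimal, threshold: Decimal) -> "Discount":
        return Discount(_Kind.THRESHOLD, _percentage=value, _threshold=threshold)

    def apply_on(self, price: Decimal) -> Decimal:
        """Dispatch on the discriminator across all three branches."""
        if self._kind is _Kind.PERCENTAGE:
            return price * (Decimal(100) - self._percentage) / Decimal(100)
        if self._kind is _Kind.VALUE:
            # Value discounts are capped at the price, so the result never goes negative.
            return max(price - self._value, Decimal(0))
        if self._kind is _Kind.THRESHOLD:
            if price > self._threshold:
                return price * (Decimal(100) - self._percentage) / Decimal(100)
            return price
        raise AssertionError(f"unhandled discount kind: {self._kind}")


assert Discount.percentage(Decimal(10)).apply_on(Decimal(600)) == Decimal(540)
assert Discount.value(Decimal(50)).apply_on(Decimal(30)) == Decimal(0)
assert Discount.threshold(Decimal(10), Decimal(500)).apply_on(Decimal(600)) == Decimal(540)
assert Discount.threshold(Decimal(10), Decimal(500)).apply_on(Decimal(500)) == Decimal(500)
```

The trailing `raise` makes a forgotten branch fail loudly if a fourth kind is ever added, which is the point of an exhaustive discriminator over a boolean.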

### A4. Construction invariants

`ThresholdDiscount.Of(Percentage value, Money threshold)` rejects invalid inputs at construction time:

- `value` outside `[0%, 100%]` → reject (this validation already lives in the `Percentage` value object; rely on it, do not duplicate).
- `threshold` non-positive (≤ zero in its currency) → reject with a clear domain error indicating the threshold must be strictly positive.

Apply-time code does **not** re-validate. The factory is the only invariant gatekeeper.
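
The gatekeeping can be sketched as follows, again in Python for illustration only. `DomainError`, the stand-in `Percentage` class and the use of `Decimal` for the threshold are assumptions made for this example; Python also cannot make the constructor private, so the factory-only rule is a convention here rather than something the language enforces.

```python
from dataclasses import dataclass
from decimal import Decimal


class DomainError(ValueError):
    """Hypothetical domain error type for this sketch."""


@dataclass(frozen=True)
class Percentage:
    value: Decimal

    @classmethod
    def of(cls, value) -> "Percentage":
        value = Decimal(value)
        if not Decimal(0) <= value <= Decimal(100):
            raise DomainError(f"percentage must be within [0, 100], got {value}")
        return cls(value)


@dataclass(frozen=True)
class ThresholdDiscount:
    _percentage: Percentage
    _threshold: Decimal

    @classmethod
    def of(cls, value: Percentage, threshold: Decimal) -> "ThresholdDiscount":
        # The percentage range is already guaranteed by Percentage.of,
        # so only the threshold is checked here.
        if threshold <= Decimal(0):
            raise DomainError("threshold must be strictly positive")
        return cls(value, threshold)

    def apply_on(self, price: Decimal) -> Decimal:
        # No re-validation: an instance can only exist in a valid state.
        if price > self._threshold:
            return price * (Decimal(100) - self._percentage.value) / Decimal(100)
        return price


def rejected(make) -> bool:
    """Return True when calling `make` raises DomainError."""
    try:
        make()
    except DomainError:
        return True
    return False


assert rejected(lambda: ThresholdDiscount.of(Percentage.of(10), Decimal(0)))
assert rejected(lambda: Percentage.of(101))
assert ThresholdDiscount.of(Percentage.of(10), Decimal(500)) == ThresholdDiscount.of(
    Percentage.of(10), Decimal(500)
)
```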

### A5. Persistence

The SQL repository in `Sales.Adapters/Pricing/Discounts/DiscountsSqlRepository.cs` already round-trips the `Discount` union. Extend it minimally to handle the third variant, with the same shape of change as for the two existing variants. No schema migration is in scope for this draft: the storage representation stays whatever the repository maps onto today, and if the existing serialisation is flexible enough, no DB change is needed.

### A6. Out of scope

- Multi-product / cart-level threshold discounts (this is per single product price).
- Time-bounded promotions (a discount valid only in a date range).
- Stacking rules between discount types — unchanged.
- Renaming or refactoring the existing `PercentageDiscount` or `ValueDiscount`, or changing the `Discount` discriminator field beyond the replacement required by §A3.
@@ -0,0 +1,6 @@
_plugin-staging/node_modules
_plugin-staging/.serena
_plugin-staging/noesis
_plugin-staging/.git
_plugin-staging/**/__pycache__
_plugin-staging/**/.DS_Store
@@ -0,0 +1 @@
_plugin-staging/
@@ -0,0 +1,68 @@
# Sandbox for the Noesis variant of the threshold-discount-extraction task.
#
# Layers:
# 1. Ubuntu base + apt deps (jq, curl, ca-certs, python3 for prepare.ts).
# 2. Bun install (cached unless the install URL changes).
# 3. Plugin source copy (re-runs only when plugin sources change).
# 4. bun install + plugin registration (re-runs only when sources change).
#
# The noesis-graph MCP server is started by Harbor via the [[environment.mcp_servers]]
# entry in task.toml: `cd /opt/noesis-plugin && bun run mcp/noesis-graph/server.ts`.
# The skill is registered as a Claude Code plugin under /root/.claude/plugins/noesis/.

# ubuntu:24.04 ships GLIBC 2.39 — required by `lbug` (ladybug-db) native module.
# Older ubuntu:22.04 has GLIBC 2.35 and fails to dlopen lbugjs.node.
FROM ubuntu:24.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
        jq \
        curl \
        ca-certificates \
        python3 \
        git \
        unzip \
    && rm -rf /var/lib/apt/lists/*

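# Install Bun. NOTE: no version is pinned here, so the install script fetches
# the latest release. To keep builds reproducible, pass a release tag to the
# script (e.g. `bash -s "bun-vX.Y.Z"`).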
RUN curl -fsSL https://bun.sh/install | BUN_INSTALL=/usr/local bash
ENV PATH="/usr/local/bin:${PATH}"
RUN bun --version

# Plugin sources are staged into ./_plugin-staging/ next to this Dockerfile by
# the prepare-context.sh helper before `nasde run` is invoked. Harbor's docker
# build context is the directory containing this Dockerfile, so a relative COPY
# is the only portable way to pull plugin sources in.
#
# .dockerignore (sibling file) excludes node_modules, .serena, noesis/ scratch
# dirs and other host-only artefacts so the COPY stays small (~6 MB).
COPY _plugin-staging/ /opt/noesis-plugin/

WORKDIR /opt/noesis-plugin
RUN bun install --production --frozen-lockfile

# Register the plugin so Claude Code in the sandbox finds the noesis:analyze-design-draft skill.
# Claude Code looks for plugins under ~/.claude/plugins/<name>/ with a .claude-plugin/plugin.json manifest.
RUN mkdir -p /root/.claude/plugins && \
    ln -s /opt/noesis-plugin /root/.claude/plugins/noesis && \
    ls /root/.claude/plugins/noesis/.claude-plugin/

# CLAUDE_PLUGIN_ROOT is referenced by SKILL.md scripts at runtime — set it so
# any "${CLAUDE_PLUGIN_ROOT}/..." path inside the skill resolves correctly.
ENV CLAUDE_PLUGIN_ROOT=/opt/noesis-plugin

# Workdir where the agent will operate. The skill will write Design Doc JSON
# under /app/noesis/design-docs/ via save_design_doc.
WORKDIR /app
RUN mkdir -p /app/noesis/design-docs /app/noesis/conversations /app/noesis/documents /app/noesis/topics /app/noesis/decisions /app/noesis/design-drafts

# Clone the target codebase so the agent (and the noesis-graph scan_to_tmp tool)
# can verify what is already implemented. The skill's Step 6 expects the diff
# baseline to be the implemented codebase. Without the repo the agent cannot
# tell that the existing `Sales` BC, `Sales.Pricing.Discounts` module and
# the `Discount` / `PercentageDiscount` / `ValueDiscount` value objects already
# exist — and therefore must be classified as `modified` rather than `added`.
RUN git clone --depth 1 https://github.com/itlibrium/DDD-starter-dotnet.git /app/repo

CMD ["/bin/bash"]
@@ -0,0 +1,32 @@
#!/bin/bash
# Stage the noesis plugin sources into ./_plugin-staging/ so the Dockerfile
# can COPY them. Run this once (or whenever plugin sources change) BEFORE
# `nasde run --variant noesis -C evals/draft-to-design-doc`.
#
# The staging dir is git-ignored.
#
# Usage (from anywhere):
# bash evals/draft-to-design-doc/tasks/threshold-discount-extraction-noesis/environment/prepare-context.sh
#
set -euo pipefail

HERE="$(cd "$(dirname "$0")" && pwd)"
STAGING="$HERE/_plugin-staging"
PLUGIN_SRC="$(cd "$HERE/../../../../../src/agent_extensions/plugins/noesis" && pwd)"

echo "[prepare-context] plugin source: $PLUGIN_SRC"
echo "[prepare-context] staging dir: $STAGING"

# Use rsync to mirror only what we need. Excludes match the .dockerignore so the
# Docker build context stays small.
mkdir -p "$STAGING"
rsync -a --delete \
  --exclude=node_modules \
  --exclude=.serena \
  --exclude=noesis \
  --exclude=.git \
  --exclude=__pycache__ \
  --exclude=.DS_Store \
  "$PLUGIN_SRC/" "$STAGING/"

echo "[prepare-context] staged $(du -sh "$STAGING" | cut -f1) into _plugin-staging/"
@@ -0,0 +1,61 @@
# Extract a Design Doc using the noesis:analyze-design-draft skill

The Noesis plugin is installed in this sandbox and exposes the `noesis:analyze-design-draft` skill plus the `noesis-graph` MCP server. Use them.

## Inputs

- `/app/draft.md` — the design draft to analyse.
- `/app/repo/` — the target codebase the Design Doc describes (DDD-starter-dotnet, cloned from upstream). The skill / MCP `scan_to_tmp` tool can use this to determine what is already implemented and what is genuinely new.

## Output

- The skill persists the extracted Design Doc as a JSON file under `/app/noesis/design-docs/`. The filename is derived from the design doc's name and id by `prepare_design_doc_path`. Do not write the JSON manually — let the skill do it via `save_design_doc`.

## How to invoke the skill

Invoke the skill with the draft path and a design-doc title:

```
@/app/draft.md title="Threshold-activated discount" date=2026-05-08 main_topic="Discount value objects in Sales pricing" design_doc_title="threshold-activated-discount"
```

When the skill asks (during Setup) for the `design_doc_path`, answer:

```
/app/noesis/design-docs/
```

(the skill computes the canonical filename from the title and id; just give it the directory under `noesis/`).

The skill will:

1. Run `prepare.ts` on the draft → working dir + section tree + fragment list.
2. Find existing topics in the graph (Goldilocks).
3. Categorise fragments + assign topics.
4. Find existing decisions.
5. Review topics, generate summaries, extract decisions.
6. Extract the design model (per `extract-design-model.md` + `design-doc-schema.md`) and persist via `save_design_doc`. The diff baseline is the implemented codebase — call `noesis-graph:scan_to_tmp` first if you need to see what already exists in `/app/repo/`.
7. Merge topics, fragments and decisions via `merge_document`.

Follow the skill's workflow exactly; don't shortcut steps. The MCP server is `noesis-graph` (already registered in the sandbox's Claude config).

## Diff baseline

Per `extract-design-model.md` and `design-doc-schema.md` Section 3, the Design Doc is a diff against the **currently implemented codebase** under `/app/repo/`. For each element, ask: *is this already in /app/repo/, and is the draft changing it?*

- **Already in /app/repo, draft changes it** → goes in `modified`.
- **Not in /app/repo** → goes in `added`.
- **Already in /app/repo, no change** → omit.
- **In /app/repo, draft removes it** → `removed` (by name).

For this draft specifically:
- `Sales` BC already exists in `/app/repo/Sources/Sales/`. → `boundedContexts.modified`.
- `Sales.Pricing.Discounts` module already exists. → `modules.modified` under Sales.
- `Discount`, `PercentageDiscount`, `ValueDiscount` value objects already exist. → `modified` if changed, else omitted.
- `ThresholdDiscount` is genuinely new. → `buildingBlocks.added`.
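
The decision rule above can be summarised as a small Python sketch. It is only an illustration of the classification logic; `classify` and its parameters are hypothetical and are not part of the skill or the MCP server.

```python
from typing import Optional


def classify(in_repo: bool, draft_changes: bool = False, draft_removes: bool = False) -> Optional[str]:
    """Return the ChangeSet collection for an element, or None when it should be omitted."""
    if not in_repo:
        return "added"
    if draft_removes:
        return "removed"
    if draft_changes:
        return "modified"
    return None


assert classify(in_repo=False) == "added"                        # e.g. ThresholdDiscount
assert classify(in_repo=True, draft_changes=True) == "modified"  # e.g. Discount
assert classify(in_repo=True) is None                            # e.g. an untouched value object
assert classify(in_repo=True, draft_removes=True) == "removed"
```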

## Rules

- Do not modify `/app/draft.md` or `/app/repo/`.
- Do not write JSON files outside the path the `save_design_doc` MCP tool returns.
- Do not bypass the skill — write the design doc through `save_design_doc`, never with `Write` directly.