diff --git a/autoplan/SKILL.md b/autoplan/SKILL.md index 597188975..62e7624c8 100644 --- a/autoplan/SKILL.md +++ b/autoplan/SKILL.md @@ -546,7 +546,7 @@ Override: every AskUserQuestion → auto-decide using the 6 principles. What alternatives were dismissed too quickly? What competitive or market risks are unaddressed? What scope decisions will look foolish in 6 months? Be adversarial. No compliments. Just the strategic blind spots. - File: " -s read-only --enable web_search_cached` + File: " -s read-only --search` Timeout: 10 minutes **Claude CEO subagent** (via Agent tool): @@ -657,7 +657,7 @@ Override: every AskUserQuestion → auto-decide using the 6 principles. accessibility requirements (keyboard nav, contrast, touch targets) specified or aspirational? Does the plan describe specific UI decisions or generic patterns? What design decisions will haunt the implementer if left ambiguous? - Be opinionated. No hedging." -s read-only --enable web_search_cached` + Be opinionated. No hedging." -s read-only --search` Timeout: 10 minutes **Claude design subagent** (via Agent tool): @@ -722,7 +722,7 @@ Override: every AskUserQuestion → auto-decide using the 6 principles. CEO: Design: - File: " -s read-only --enable web_search_cached` + File: " -s read-only --search` Timeout: 10 minutes **Claude eng subagent** (via Agent tool): @@ -737,7 +737,6 @@ Override: every AskUserQuestion → auto-decide using the 6 principles. NO prior-phase context — subagent must be truly independent. Error handling: same as Phase 1 (non-blocking, degradation matrix applies). - - Architecture choices: explicit over clever (P5). If codex disagrees with valid reason → TASTE DECISION. 
- Evals: always include all relevant suites (P1) - Test plan: generate artifact at `~/.gstack/projects/$SLUG/{user}-{branch}-test-plan-{datetime}.md` diff --git a/autoplan/SKILL.md.tmpl b/autoplan/SKILL.md.tmpl index c4e57441c..1c7bec804 100644 --- a/autoplan/SKILL.md.tmpl +++ b/autoplan/SKILL.md.tmpl @@ -203,7 +203,7 @@ Override: every AskUserQuestion → auto-decide using the 6 principles. What alternatives were dismissed too quickly? What competitive or market risks are unaddressed? What scope decisions will look foolish in 6 months? Be adversarial. No compliments. Just the strategic blind spots. - File: " -s read-only --enable web_search_cached` + File: " -s read-only --search` Timeout: 10 minutes **Claude CEO subagent** (via Agent tool): @@ -314,7 +314,7 @@ Override: every AskUserQuestion → auto-decide using the 6 principles. accessibility requirements (keyboard nav, contrast, touch targets) specified or aspirational? Does the plan describe specific UI decisions or generic patterns? What design decisions will haunt the implementer if left ambiguous? - Be opinionated. No hedging." -s read-only --enable web_search_cached` + Be opinionated. No hedging." -s read-only --search` Timeout: 10 minutes **Claude design subagent** (via Agent tool): @@ -379,7 +379,7 @@ Override: every AskUserQuestion → auto-decide using the 6 principles. CEO: Design: - File: " -s read-only --enable web_search_cached` + File: " -s read-only --search` Timeout: 10 minutes **Claude eng subagent** (via Agent tool): @@ -394,7 +394,6 @@ Override: every AskUserQuestion → auto-decide using the 6 principles. NO prior-phase context — subagent must be truly independent. Error handling: same as Phase 1 (non-blocking, degradation matrix applies). - - Architecture choices: explicit over clever (P5). If codex disagrees with valid reason → TASTE DECISION. 
- Evals: always include all relevant suites (P1) - Test plan: generate artifact at `~/.gstack/projects/$SLUG/{user}-{branch}-test-plan-{datetime}.md` diff --git a/bun.lockb b/bun.lockb new file mode 100755 index 000000000..38b501a7e Binary files /dev/null and b/bun.lockb differ diff --git a/codex/SKILL.md b/codex/SKILL.md index 0449990c1..d59392c1c 100644 --- a/codex/SKILL.md +++ b/codex/SKILL.md @@ -373,13 +373,13 @@ TMPERR=$(mktemp /tmp/codex-err-XXXXXX.txt) 2. Run the review (5-minute timeout): ```bash -codex review --base -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR" +codex exec review --base -c 'model_reasoning_effort="high"' --json 2>"$TMPERR" ``` Use `timeout: 300000` on the Bash call. If the user provided custom instructions (e.g., `/codex review focus on security`), pass them as the prompt argument: ```bash -codex review "focus on security" --base -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR" +codex exec review "focus on security" --base -c 'model_reasoning_effort="high"' --json 2>"$TMPERR" ``` 3. Capture the output. Then parse cost from stderr: @@ -517,7 +517,7 @@ With focus (e.g., "security"): 2. 
Run codex exec with **JSONL output** to capture reasoning traces and tool calls (5-minute timeout): ```bash -codex exec "" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>/dev/null | python3 -c " +codex exec "" -s read-only -c 'model_reasoning_effort="high"' --search --json 2>/dev/null | python3 -c " import sys, json for line in sys.stdin: line = line.strip() @@ -602,7 +602,7 @@ THE PLAN: For a **new session:** ```bash -codex exec "" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c " +codex exec "" -s read-only -c 'model_reasoning_effort="high"' --search --json 2>"$TMPERR" | python3 -c " import sys, json for line in sys.stdin: line = line.strip() @@ -635,7 +635,7 @@ for line in sys.stdin: For a **resumed session** (user chose "Continue"): ```bash -codex exec resume "" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c " +codex exec resume "" -s read-only -c 'model_reasoning_effort="high"' --search --json 2>"$TMPERR" | python3 -c " " ``` @@ -671,10 +671,10 @@ Session saved — run /codex again to continue this conversation. agentic coding model). This means as OpenAI ships newer models, /codex automatically uses them. If the user wants a specific model, pass `-m` through to codex. -**Reasoning effort:** All modes use `xhigh` — maximum reasoning power. When reviewing code, breaking code, or consulting on architecture, you want the model thinking as hard as possible. +**Reasoning effort:** All modes use `high` — the maximum reasoning power available. Valid values are `minimal`, `low`, `medium`, `high`. When reviewing code, breaking code, or consulting on architecture, you want the model thinking as hard as possible. -**Web search:** All codex commands use `--enable web_search_cached` so Codex can look up -docs and APIs during review. This is OpenAI's cached index — fast, no extra cost. 
+**Web search:** Challenge and consult modes use `--search` so Codex can look up +docs and APIs. Review mode (`codex exec review`) does not need `--search` — it has its own built-in review prompt. If the user specifies a model (e.g., `/codex review -m gpt-5.1-codex-max` or `/codex challenge -m gpt-5.2`), pass the `-m` flag through to codex. diff --git a/codex/SKILL.md.tmpl b/codex/SKILL.md.tmpl index 0aa7fec67..0249fcff9 100644 --- a/codex/SKILL.md.tmpl +++ b/codex/SKILL.md.tmpl @@ -79,13 +79,13 @@ TMPERR=$(mktemp /tmp/codex-err-XXXXXX.txt) 2. Run the review (5-minute timeout): ```bash -codex review --base -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR" +codex exec review --base -c 'model_reasoning_effort="high"' --json 2>"$TMPERR" ``` Use `timeout: 300000` on the Bash call. If the user provided custom instructions (e.g., `/codex review focus on security`), pass them as the prompt argument: ```bash -codex review "focus on security" --base -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR" +codex exec review "focus on security" --base -c 'model_reasoning_effort="high"' --json 2>"$TMPERR" ``` 3. Capture the output. Then parse cost from stderr: @@ -158,7 +158,7 @@ With focus (e.g., "security"): 2. 
Run codex exec with **JSONL output** to capture reasoning traces and tool calls (5-minute timeout): ```bash -codex exec "" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>/dev/null | python3 -c " +codex exec "" -s read-only -c 'model_reasoning_effort="high"' --search --json 2>/dev/null | python3 -c " import sys, json for line in sys.stdin: line = line.strip() @@ -243,7 +243,7 @@ THE PLAN: For a **new session:** ```bash -codex exec "" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c " +codex exec "" -s read-only -c 'model_reasoning_effort="high"' --search --json 2>"$TMPERR" | python3 -c " import sys, json for line in sys.stdin: line = line.strip() @@ -276,7 +276,7 @@ for line in sys.stdin: For a **resumed session** (user chose "Continue"): ```bash -codex exec resume "" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c " +codex exec resume "" -s read-only -c 'model_reasoning_effort="high"' --search --json 2>"$TMPERR" | python3 -c " " ``` @@ -312,10 +312,10 @@ Session saved — run /codex again to continue this conversation. agentic coding model). This means as OpenAI ships newer models, /codex automatically uses them. If the user wants a specific model, pass `-m` through to codex. -**Reasoning effort:** All modes use `xhigh` — maximum reasoning power. When reviewing code, breaking code, or consulting on architecture, you want the model thinking as hard as possible. +**Reasoning effort:** All modes use `high` — the maximum reasoning power available. Valid values are `minimal`, `low`, `medium`, `high`. When reviewing code, breaking code, or consulting on architecture, you want the model thinking as hard as possible. -**Web search:** All codex commands use `--enable web_search_cached` so Codex can look up -docs and APIs during review. This is OpenAI's cached index — fast, no extra cost. 
+**Web search:** Challenge and consult modes use `--search` so Codex can look up +docs and APIs. Review mode (`codex exec review`) does not need `--search` — it has its own built-in review prompt. If the user specifies a model (e.g., `/codex review -m gpt-5.1-codex-max` or `/codex challenge -m gpt-5.2`), pass the `-m` flag through to codex. diff --git a/design-consultation/SKILL.md b/design-consultation/SKILL.md index 4dafc63f9..f63c0bdad 100644 --- a/design-consultation/SKILL.md +++ b/design-consultation/SKILL.md @@ -453,7 +453,7 @@ codex exec "Given this product context, propose a complete design direction: - Differentiation: 2 deliberate departures from category norms - Anti-slop: no purple gradients, no 3-column icon grids, no centered everything, no decorative blobs -Be opinionated. Be specific. Do not hedge. This is YOUR design direction — own it." -s read-only -c 'model_reasoning_effort="medium"' --enable web_search_cached 2>"$TMPERR_DESIGN" +Be opinionated. Be specific. Do not hedge. This is YOUR design direction — own it." -s read-only -c 'model_reasoning_effort="medium"' --search 2>"$TMPERR_DESIGN" ``` Use a 5-minute timeout (`timeout: 300000`). After the command completes, read stderr: ```bash diff --git a/design-review/SKILL.md b/design-review/SKILL.md index 0fc6d0c73..ba5a3b0df 100644 --- a/design-review/SKILL.md +++ b/design-review/SKILL.md @@ -993,7 +993,7 @@ HARD REJECTION — flag if ANY apply: 6. Carousel with no narrative purpose 7. App UI made of stacked cards instead of layout -Be specific. Reference file:line for every finding." -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR_DESIGN" +Be specific. Reference file:line for every finding." -s read-only -c 'model_reasoning_effort="high"' --search 2>"$TMPERR_DESIGN" ``` Use a 5-minute timeout (`timeout: 300000`). 
After the command completes, read stderr: ```bash diff --git a/docs/skills.md b/docs/skills.md index afbac0d2d..6a99c976f 100644 --- a/docs/skills.md +++ b/docs/skills.md @@ -716,9 +716,9 @@ When `/review` catches bugs from Claude's perspective, `/codex` brings a complet ### Three modes -**Review** — run `codex review` against the current diff. Codex reads every changed file, classifies findings by severity (P1 critical, P2 high, P3 medium), and returns a PASS/FAIL verdict. Any P1 finding = FAIL. The review is fully independent — Codex doesn't see Claude's review. +**Review** — run `/codex review` against the current diff. Codex reads every changed file, classifies findings by severity (P1 critical, P2 high, P3 medium), and returns a PASS/FAIL verdict. Any P1 finding = FAIL. The review is fully independent — Codex doesn't see Claude's review. -**Challenge** — adversarial mode. Codex actively tries to break your code. It looks for edge cases, race conditions, security holes, and assumptions that would fail under load. Uses maximum reasoning effort (`xhigh`). Think of it as a penetration test for your logic. +**Challenge** — adversarial mode. Codex actively tries to break your code. It looks for edge cases, race conditions, security holes, and assumptions that would fail under load. Uses maximum reasoning effort (`high`). Think of it as a penetration test for your logic. **Consult** — open conversation with session continuity. Ask Codex anything about the codebase. Follow-up questions reuse the same session, so context carries over. Great for "am I thinking about this correctly?" moments. diff --git a/office-hours/SKILL.md b/office-hours/SKILL.md index fa4437fc5..c44625759 100644 --- a/office-hours/SKILL.md +++ b/office-hours/SKILL.md @@ -687,7 +687,7 @@ Write the full prompt (context block + instructions) to this file. 
Use the mode- ```bash TMPERR_OH=$(mktemp /tmp/codex-oh-err-XXXXXXXX) -codex exec "$(cat "$CODEX_PROMPT_FILE")" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_OH" +codex exec "$(cat "$CODEX_PROMPT_FILE")" -s read-only -c 'model_reasoning_effort="high"' --search 2>"$TMPERR_OH" ``` Use a 5-minute timeout (`timeout: 300000`). After the command completes, read stderr: @@ -838,7 +838,7 @@ If user chooses A, launch both voices simultaneously: 1. **Codex** (via Bash, `model_reasoning_effort="medium"`): ```bash TMPERR_SKETCH=$(mktemp /tmp/codex-sketch-XXXXXXXX) -codex exec "For this product approach, provide: a visual thesis (one sentence — mood, material, energy), a content plan (hero → support → detail → CTA), and 2 interaction ideas that change page feel. Apply beautiful defaults: composition-first, brand-first, cardless, poster not document. Be opinionated." -s read-only -c 'model_reasoning_effort="medium"' --enable web_search_cached 2>"$TMPERR_SKETCH" +codex exec "For this product approach, provide: a visual thesis (one sentence — mood, material, energy), a content plan (hero → support → detail → CTA), and 2 interaction ideas that change page feel. Apply beautiful defaults: composition-first, brand-first, cardless, poster not document. Be opinionated." -s read-only -c 'model_reasoning_effort="medium"' --search 2>"$TMPERR_SKETCH" ``` Use a 5-minute timeout (`timeout: 300000`). After completion: `cat "$TMPERR_SKETCH" && rm -f "$TMPERR_SKETCH"` diff --git a/plan-ceo-review/SKILL.md b/plan-ceo-review/SKILL.md index 89422bb09..461114912 100644 --- a/plan-ceo-review/SKILL.md +++ b/plan-ceo-review/SKILL.md @@ -1044,7 +1044,7 @@ THE PLAN: ```bash TMPERR_PV=$(mktemp /tmp/codex-planreview-XXXXXXXX) -codex exec "" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_PV" +codex exec "" -s read-only -c 'model_reasoning_effort="high"' --search 2>"$TMPERR_PV" ``` Use a 5-minute timeout (`timeout: 300000`). 
After the command completes, read stderr: diff --git a/plan-design-review/SKILL.md b/plan-design-review/SKILL.md index 8bc69bbc4..27283f7d9 100644 --- a/plan-design-review/SKILL.md +++ b/plan-design-review/SKILL.md @@ -467,7 +467,7 @@ HARD RULES — first classify as MARKETING/LANDING PAGE vs APP UI vs HYBRID, the - APP UI: Calm surface hierarchy, dense but readable, utility language, minimal chrome - UNIVERSAL: CSS variables for colors, no default font stacks, one job per section, cards earn existence -For each finding: what's wrong, what will happen if it ships unresolved, and the specific fix. Be opinionated. No hedging." -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR_DESIGN" +For each finding: what's wrong, what will happen if it ships unresolved, and the specific fix. Be opinionated. No hedging." -s read-only -c 'model_reasoning_effort="high"' --search 2>"$TMPERR_DESIGN" ``` Use a 5-minute timeout (`timeout: 300000`). After the command completes, read stderr: ```bash diff --git a/plan-eng-review/SKILL.md b/plan-eng-review/SKILL.md index 278af7085..aa23000ec 100644 --- a/plan-eng-review/SKILL.md +++ b/plan-eng-review/SKILL.md @@ -722,7 +722,7 @@ THE PLAN: ```bash TMPERR_PV=$(mktemp /tmp/codex-planreview-XXXXXXXX) -codex exec "" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_PV" +codex exec "" -s read-only -c 'model_reasoning_effort="high"' --search 2>"$TMPERR_PV" ``` Use a 5-minute timeout (`timeout: 300000`). After the command completes, read stderr: diff --git a/review/SKILL.md b/review/SKILL.md index dd3f482de..9c3137c3e 100644 --- a/review/SKILL.md +++ b/review/SKILL.md @@ -459,7 +459,7 @@ If Codex is available, run a lightweight design check on the diff: ```bash TMPERR_DRL=$(mktemp /tmp/codex-drl-XXXXXXXX) -codex exec "Review the git diff on this branch. Run 7 litmus checks (YES/NO each): 1. Brand/product unmistakable in first screen? 2. One strong visual anchor present? 3. 
Page understandable by scanning headlines only? 4. Each section has one job? 5. Are cards actually necessary? 6. Does motion improve hierarchy or atmosphere? 7. Would design feel premium with all decorative shadows removed? Flag any hard rejections: 1. Generic SaaS card grid as first impression 2. Beautiful image with weak brand 3. Strong headline with no clear action 4. Busy imagery behind text 5. Sections repeating same mood statement 6. Carousel with no narrative purpose 7. App UI made of stacked cards instead of layout 5 most important design findings only. Reference file:line." -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR_DRL" +codex exec "Review the git diff on this branch. Run 7 litmus checks (YES/NO each): 1. Brand/product unmistakable in first screen? 2. One strong visual anchor present? 3. Page understandable by scanning headlines only? 4. Each section has one job? 5. Are cards actually necessary? 6. Does motion improve hierarchy or atmosphere? 7. Would design feel premium with all decorative shadows removed? Flag any hard rejections: 1. Generic SaaS card grid as first impression 2. Beautiful image with weak brand 3. Strong headline with no clear action 4. Busy imagery behind text 5. Sections repeating same mood statement 6. Carousel with no narrative purpose 7. App UI made of stacked cards instead of layout 5 most important design findings only. Reference file:line." -s read-only -c 'model_reasoning_effort="high"' --search 2>"$TMPERR_DRL" ``` Use a 5-minute timeout (`timeout: 300000`). After the command completes, read stderr: @@ -802,7 +802,7 @@ Claude's structured review already ran. Now add a **cross-model adversarial chal ```bash TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX) -codex exec "Review the changes on this branch against the base branch. Run git diff origin/ to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. 
Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_ADV" +codex exec "Review the changes on this branch against the base branch. Run git diff origin/ to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="high"' --search 2>"$TMPERR_ADV" ``` Set the Bash tool's `timeout` parameter to `300000` (5 minutes). Do NOT use the `timeout` shell command — it doesn't exist on macOS. After the command completes, read stderr: @@ -847,11 +847,11 @@ Claude's structured review already ran. Now run **all three remaining passes** f **1. Codex structured review (if available):** ```bash TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX) -codex review --base -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR" +codex exec review --base -c 'model_reasoning_effort="high"' --json 2>"$TMPERR" ``` Set the Bash tool's `timeout` parameter to `300000` (5 minutes). Do NOT use the `timeout` shell command — it doesn't exist on macOS. Present output under `CODEX SAYS (code review):` header. -Check for `[P1]` markers: found → `GATE: FAIL`, not found → `GATE: PASS`. +Check for `[P0]` or `[P1]` markers: found → `GATE: FAIL`, not found → `GATE: PASS`. If GATE is FAIL, use AskUserQuestion: ``` @@ -861,7 +861,7 @@ A) Investigate and fix now (recommended) B) Continue — review will still complete ``` -If A: address the findings. Re-run `codex review` to verify. +If A: address the findings. Re-run the codex review to verify. Read stderr for errors (same error handling as medium tier). 
diff --git a/scripts/gen-skill-docs.ts b/scripts/gen-skill-docs.ts index e23bb532b..d75a67a27 100644 --- a/scripts/gen-skill-docs.ts +++ b/scripts/gen-skill-docs.ts @@ -933,7 +933,7 @@ If Codex is available, run a lightweight design check on the diff: \`\`\`bash TMPERR_DRL=$(mktemp /tmp/codex-drl-XXXXXXXX) -codex exec "Review the git diff on this branch. Run 7 litmus checks (YES/NO each): ${litmusList} Flag any hard rejections: ${rejectionList} 5 most important design findings only. Reference file:line." -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR_DRL" +codex exec "Review the git diff on this branch. Run 7 litmus checks (YES/NO each): ${litmusList} Flag any hard rejections: ${rejectionList} 5 most important design findings only. Reference file:line." -s read-only -c 'model_reasoning_effort="high"' --search 2>"$TMPERR_DRL" \`\`\` Use a 5-minute timeout (\`timeout: 300000\`). After the command completes, read stderr: @@ -2143,7 +2143,7 @@ If user chooses A, launch both voices simultaneously: 1. **Codex** (via Bash, \`model_reasoning_effort="medium"\`): \`\`\`bash TMPERR_SKETCH=$(mktemp /tmp/codex-sketch-XXXXXXXX) -codex exec "For this product approach, provide: a visual thesis (one sentence — mood, material, energy), a content plan (hero → support → detail → CTA), and 2 interaction ideas that change page feel. Apply beautiful defaults: composition-first, brand-first, cardless, poster not document. Be opinionated." -s read-only -c 'model_reasoning_effort="medium"' --enable web_search_cached 2>"$TMPERR_SKETCH" +codex exec "For this product approach, provide: a visual thesis (one sentence — mood, material, energy), a content plan (hero → support → detail → CTA), and 2 interaction ideas that change page feel. Apply beautiful defaults: composition-first, brand-first, cardless, poster not document. Be opinionated." 
-s read-only -c 'model_reasoning_effort="medium"' --search 2>"$TMPERR_SKETCH" \`\`\` Use a 5-minute timeout (\`timeout: 300000\`). After completion: \`cat "$TMPERR_SKETCH" && rm -f "$TMPERR_SKETCH"\` @@ -2202,7 +2202,7 @@ Write the full prompt (context block + instructions) to this file. Use the mode- \`\`\`bash TMPERR_OH=$(mktemp /tmp/codex-oh-err-XXXXXXXX) -codex exec "$(cat "$CODEX_PROMPT_FILE")" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_OH" +codex exec "$(cat "$CODEX_PROMPT_FILE")" -s read-only -c 'model_reasoning_effort="high"' --search 2>"$TMPERR_OH" \`\`\` Use a 5-minute timeout (\`timeout: 300000\`). After the command completes, read stderr: @@ -2286,7 +2286,7 @@ Claude's structured review already ran. Now add a **cross-model adversarial chal \`\`\`bash TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX) -codex exec "Review the changes on this branch against the base branch. Run git diff origin/ to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_ADV" +codex exec "Review the changes on this branch against the base branch. Run git diff origin/ to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="high"' --search 2>"$TMPERR_ADV" \`\`\` Set the Bash tool's \`timeout\` parameter to \`300000\` (5 minutes). Do NOT use the \`timeout\` shell command — it doesn't exist on macOS. 
After the command completes, read stderr: @@ -2331,11 +2331,11 @@ Claude's structured review already ran. Now run **all three remaining passes** f **1. Codex structured review (if available):** \`\`\`bash TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX) -codex review --base -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR" +codex exec review --base -c 'model_reasoning_effort="high"' --json 2>"$TMPERR" \`\`\` Set the Bash tool's \`timeout\` parameter to \`300000\` (5 minutes). Do NOT use the \`timeout\` shell command — it doesn't exist on macOS. Present output under \`CODEX SAYS (code review):\` header. -Check for \`[P1]\` markers: found → \`GATE: FAIL\`, not found → \`GATE: PASS\`. +Check for \`[P0]\` or \`[P1]\` markers: found → \`GATE: FAIL\`, not found → \`GATE: PASS\`. If GATE is FAIL, use AskUserQuestion: \`\`\` @@ -2345,7 +2345,7 @@ A) Investigate and fix now (recommended) B) Continue — review will still complete \`\`\` -If A: address the findings${isShip ? '. After fixing, re-run tests (Step 3) since code has changed' : ''}. Re-run \`codex review\` to verify. +If A: address the findings${isShip ? '. After fixing, re-run tests (Step 3) since code has changed' : ''}. Re-run the codex review to verify. Read stderr for errors (same error handling as medium tier). @@ -2441,7 +2441,7 @@ THE PLAN: \`\`\`bash TMPERR_PV=$(mktemp /tmp/codex-planreview-XXXXXXXX) -codex exec "" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_PV" +codex exec "" -s read-only -c 'model_reasoning_effort="high"' --search 2>"$TMPERR_PV" \`\`\` Use a 5-minute timeout (\`timeout: 300000\`). After the command completes, read stderr: @@ -2705,7 +2705,7 @@ which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE" 1. 
**Codex design voice** (via Bash): \`\`\`bash TMPERR_DESIGN=$(mktemp /tmp/codex-design-XXXXXXXX) -codex exec "${escapedCodexPrompt}" -s read-only -c 'model_reasoning_effort="${reasoningEffort}"' --enable web_search_cached 2>"$TMPERR_DESIGN" +codex exec "${escapedCodexPrompt}" -s read-only -c 'model_reasoning_effort="${reasoningEffort}"' --search 2>"$TMPERR_DESIGN" \`\`\` Use a 5-minute timeout (\`timeout: 300000\`). After the command completes, read stderr: \`\`\`bash diff --git a/ship/SKILL.md b/ship/SKILL.md index b79dc5374..691447962 100644 --- a/ship/SKILL.md +++ b/ship/SKILL.md @@ -1083,7 +1083,7 @@ If Codex is available, run a lightweight design check on the diff: ```bash TMPERR_DRL=$(mktemp /tmp/codex-drl-XXXXXXXX) -codex exec "Review the git diff on this branch. Run 7 litmus checks (YES/NO each): 1. Brand/product unmistakable in first screen? 2. One strong visual anchor present? 3. Page understandable by scanning headlines only? 4. Each section has one job? 5. Are cards actually necessary? 6. Does motion improve hierarchy or atmosphere? 7. Would design feel premium with all decorative shadows removed? Flag any hard rejections: 1. Generic SaaS card grid as first impression 2. Beautiful image with weak brand 3. Strong headline with no clear action 4. Busy imagery behind text 5. Sections repeating same mood statement 6. Carousel with no narrative purpose 7. App UI made of stacked cards instead of layout 5 most important design findings only. Reference file:line." -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR_DRL" +codex exec "Review the git diff on this branch. Run 7 litmus checks (YES/NO each): 1. Brand/product unmistakable in first screen? 2. One strong visual anchor present? 3. Page understandable by scanning headlines only? 4. Each section has one job? 5. Are cards actually necessary? 6. Does motion improve hierarchy or atmosphere? 7. Would design feel premium with all decorative shadows removed? 
Flag any hard rejections: 1. Generic SaaS card grid as first impression 2. Beautiful image with weak brand 3. Strong headline with no clear action 4. Busy imagery behind text 5. Sections repeating same mood statement 6. Carousel with no narrative purpose 7. App UI made of stacked cards instead of layout 5 most important design findings only. Reference file:line." -s read-only -c 'model_reasoning_effort="high"' --search 2>"$TMPERR_DRL" ``` Use a 5-minute timeout (`timeout: 300000`). After the command completes, read stderr: @@ -1198,7 +1198,7 @@ Claude's structured review already ran. Now add a **cross-model adversarial chal ```bash TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX) -codex exec "Review the changes on this branch against the base branch. Run git diff origin/ to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_ADV" +codex exec "Review the changes on this branch against the base branch. Run git diff origin/ to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="high"' --search 2>"$TMPERR_ADV" ``` Set the Bash tool's `timeout` parameter to `300000` (5 minutes). Do NOT use the `timeout` shell command — it doesn't exist on macOS. After the command completes, read stderr: @@ -1243,11 +1243,11 @@ Claude's structured review already ran. Now run **all three remaining passes** f **1. 
Codex structured review (if available):** ```bash TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX) -codex review --base -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR" +codex exec review --base -c 'model_reasoning_effort="high"' --json 2>"$TMPERR" ``` Set the Bash tool's `timeout` parameter to `300000` (5 minutes). Do NOT use the `timeout` shell command — it doesn't exist on macOS. Present output under `CODEX SAYS (code review):` header. -Check for `[P1]` markers: found → `GATE: FAIL`, not found → `GATE: PASS`. +Check for `[P0]` or `[P1]` markers: found → `GATE: FAIL`, not found → `GATE: PASS`. If GATE is FAIL, use AskUserQuestion: ``` @@ -1257,7 +1257,7 @@ A) Investigate and fix now (recommended) B) Continue — review will still complete ``` -If A: address the findings. After fixing, re-run tests (Step 3) since code has changed. Re-run `codex review` to verify. +If A: address the findings. After fixing, re-run tests (Step 3) since code has changed. Re-run the codex review to verify. Read stderr for errors (same error handling as medium tier). diff --git a/test/skill-validation.test.ts b/test/skill-validation.test.ts index c4bc99afe..de5cb7d5c 100644 --- a/test/skill-validation.test.ts +++ b/test/skill-validation.test.ts @@ -1332,7 +1332,7 @@ describe('Codex skill', () => { expect(content).toContain('fall back to the Claude adversarial subagent'); // Review log uses new skill name expect(content).toContain('adversarial-review'); - expect(content).toContain('xhigh'); + expect(content).toContain('model_reasoning_effort="high"'); expect(content).toContain('ADVERSARIAL REVIEW SYNTHESIS'); }); @@ -1342,7 +1342,7 @@ describe('Codex skill', () => { expect(content).toContain('< 50'); expect(content).toContain('200+'); expect(content).toContain('adversarial-review'); - expect(content).toContain('xhigh'); + expect(content).toContain('model_reasoning_effort="high"'); expect(content).toContain('Investigate and fix'); });
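Review note: the updated gate rule in this patch (a `[P0]` or `[P1]` marker means `GATE: FAIL`, otherwise `GATE: PASS`) can be sketched as a small check. This is a minimal illustration, not code from the patch: the helper name `gate_verdict` and the regex are hypothetical, and it assumes the severity markers appear literally in the captured review output.

```python
import re

# Hypothetical helper for illustration; not part of the patch.
# Matches the blocking severity markers named in the updated gate rule.
BLOCKING_MARKER = re.compile(r"\[P[01]\]")


def gate_verdict(review_output: str) -> str:
    """Return 'GATE: FAIL' if the text contains a [P0] or [P1] marker, else 'GATE: PASS'."""
    if BLOCKING_MARKER.search(review_output):
        return "GATE: FAIL"
    return "GATE: PASS"


if __name__ == "__main__":
    sample = "[P2] Minor naming issue\n[P1] Unchecked null dereference"
    print(gate_verdict(sample))
```

Lower-severity markers such as `[P2]` or `[P3]` do not trip the gate in this sketch, which mirrors the rule as worded in the hunk.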