Closed
83 commits
563a580
issue/baseline-instrumentation-coding-loop: Add trivial-flagging and …
AbirAbbas Feb 23, 2026
84efc64
issue/baseline-instrumentation-dag-metrics: Add agent metrics and tim…
AbirAbbas Feb 23, 2026
ff4f6d8
issue/baseline-instrumentation-app-timing: Add build phase timing ins…
AbirAbbas Feb 23, 2026
f3c7a0a
issue/per-role-turn-budget-schema: Add per-role turn limits and timeo…
AbirAbbas Feb 23, 2026
cf82442
issue/pass-rate-validation-baseline-script: Create benchmark suite ru…
AbirAbbas Feb 23, 2026
2169da0
issue/prompt-compression-coder: Compress coder.py prompt from 235 to …
AbirAbbas Feb 23, 2026
895ce8f
issue/prompt-compression-replanner: Compress replanner.py from 227 to…
AbirAbbas Feb 23, 2026
90bc5cf
issue/prompt-compression-sprint-planner: Compress sprint_planner.py b…
AbirAbbas Feb 23, 2026
ed447e4
issue/prompt-compression-verifier: Compress verifier.py by 32% (216→1…
AbirAbbas Feb 23, 2026
ecd9d19
issue/prompt-compression-issue-advisor: Compress issue_advisor.py by 25%
AbirAbbas Feb 23, 2026
1d44b4c
issue/trivial-schema-field: Add trivial field to IssueGuidance schema
AbirAbbas Feb 23, 2026
7535004
issue/pm-atomic-write: Implement atomic PRD file write in Product Man…
AbirAbbas Feb 23, 2026
2a0548c
issue/architect-prd-polling: Add PRD polling logic to Architect promp…
AbirAbbas Feb 23, 2026
526a25b
issue/advisor-confidence-escalation: Add low-confidence escalation to…
AbirAbbas Feb 23, 2026
189a6fd
issue/issue-advisor-iteration-threshold: Add iteration threshold gati…
AbirAbbas Feb 23, 2026
31b7853
issue/pm-atomic-write: Implement atomic PRD write in Product Manager
AbirAbbas Feb 23, 2026
20788ac
issue/advisor-confidence-escalation: Add low-confidence escalation to…
AbirAbbas Feb 23, 2026
5a72d12
issue/max-coding-iterations-increase: Increase max_coding_iterations …
AbirAbbas Feb 23, 2026
f187c64
issue/replanner-downstream-gate: Add downstream count gating to Repla…
AbirAbbas Feb 23, 2026
eef9d64
issue/model-downgrade-git-merger-issue-writer: Downgrade git, merger,…
AbirAbbas Feb 23, 2026
1a55aa5
issue/daffc8a2-18-model-downgrade-git-merger-issue-writer: Downgrade …
AbirAbbas Feb 23, 2026
f9701a1
Merge issue/daffc8a2-01-baseline-instrumentation-app-timing: Add stru…
AbirAbbas Feb 23, 2026
bccbd5e
Merge issue/daffc8a2-02-baseline-instrumentation-dag-metrics: Add age…
AbirAbbas Feb 23, 2026
3f29ea4
Merge issue/daffc8a2-03-baseline-instrumentation-coding-loop: Add tri…
AbirAbbas Feb 23, 2026
6553ba5
Merge issue/daffc8a2-04-per-role-turn-budget-schema: Add per-role tur…
AbirAbbas Feb 23, 2026
8ce9a09
Merge issue/daffc8a2-05-pass-rate-validation-baseline-script: Add ben…
AbirAbbas Feb 23, 2026
305aea1
Merge issue/daffc8a2-06-prompt-compression-sprint-planner: Compress s…
AbirAbbas Feb 23, 2026
5da183f
Merge issue/daffc8a2-07-prompt-compression-coder: Compress coder.py f…
AbirAbbas Feb 23, 2026
d1971dd
Merge issue/daffc8a2-08-prompt-compression-replanner: Compress replan…
AbirAbbas Feb 23, 2026
240d984
Merge issue/daffc8a2-09-prompt-compression-issue-advisor: Compress is…
AbirAbbas Feb 23, 2026
ac8fd76
Merge issue/daffc8a2-10-prompt-compression-verifier: Compress verifie…
AbirAbbas Feb 23, 2026
5334d69
Merge issue/daffc8a2-11-pm-atomic-write: Add atomic file write patter…
AbirAbbas Feb 23, 2026
f713c8d
Merge issue/daffc8a2-12-architect-prd-polling: Add PRD polling with e…
AbirAbbas Feb 23, 2026
e632276
Merge issue/daffc8a2-13-trivial-schema-field: Add trivial field to Is…
AbirAbbas Feb 23, 2026
250ec7c
Merge issue/daffc8a2-14-issue-advisor-iteration-threshold: Gate Issue…
AbirAbbas Feb 23, 2026
03b4aec
Resolve conflict: Use iteration_count for cumulative attempts in conf…
AbirAbbas Feb 23, 2026
59a0bbe
Merge issue/daffc8a2-16-max-coding-iterations-increase: Increase max_…
AbirAbbas Feb 23, 2026
71d8dc9
Merge issue/daffc8a2-17-replanner-downstream-gate: Add downstream cou…
AbirAbbas Feb 23, 2026
6d64a38
Resolve conflict: Use unified test format and keep comprehensive unit…
AbirAbbas Feb 23, 2026
5dfe37f
issue/sprint-planner-trivial-heuristic: Add trivial-flagging heuristi…
AbirAbbas Feb 23, 2026
7f49da0
issue/callsite-updates-per-role-budgets: Update agent callsites to us…
AbirAbbas Feb 23, 2026
adcd49b
issue/coding-loop-trivial-fast-path: Implement trivial fast-path in c…
AbirAbbas Feb 23, 2026
bce02bd
issue/planning-parallelization-refactor: Parallelize PM and Architect…
AbirAbbas Feb 23, 2026
4c7c3d2
Merge issue/daffc8a2-20-planning-parallelization-refactor: Refactor a…
AbirAbbas Feb 23, 2026
3a250fe
Merge issue/daffc8a2-21-sprint-planner-trivial-heuristic: Enhance spr…
AbirAbbas Feb 23, 2026
d74536a
Merge issue/daffc8a2-22-coding-loop-trivial-fast-path: Add fast-path …
AbirAbbas Feb 23, 2026
930389b
issue/model-downgrade-sprint-planner: Downgrade sprint_planner_model …
AbirAbbas Feb 23, 2026
68cfdca
Merge issue/daffc8a2-23-model-downgrade-sprint-planner: Add sprint_pl…
AbirAbbas Feb 24, 2026
79b343b
issue/integration-benchmark-pass-rate-validation: Implement comprehen…
AbirAbbas Feb 24, 2026
bfd5aba
issue/integration-benchmark-pass-rate-validation: Fix benchmark suite…
AbirAbbas Feb 24, 2026
f1dc208
Merge issue/daffc8a2-24-integration-benchmark-pass-rate-validation: i…
AbirAbbas Feb 24, 2026
5a2b943
issue/integration-build-time-measurement: Add build time measurement …
AbirAbbas Feb 24, 2026
d845ef7
issue/integration-build-time-measurement: Add build time measurement …
AbirAbbas Feb 24, 2026
27f128d
Merge issue/daffc8a2-25-integration-build-time-measurement: integrati…
AbirAbbas Feb 24, 2026
bfe95aa
chore: add integration test files created during development
AbirAbbas Feb 24, 2026
a5c3281
issue/fix-sprint-planner-prompt-compression: Compress sprint_planner.…
AbirAbbas Feb 24, 2026
32ff4e2
issue/fix-default-agent-max-turns-constant: Remove DEFAULT_AGENT_MAX_…
AbirAbbas Feb 24, 2026
51eecdb
issue/fix-agent-timeout-seconds-constant: Remove hardcoded 2700 default
AbirAbbas Feb 24, 2026
0298607
issue/fix-coding-loop-timeout-undefined: Remove unused timeout parame…
AbirAbbas Feb 24, 2026
7a3d8d4
issue/fix-test-model-selection-filename: Rename test_model_downgrade.…
AbirAbbas Feb 24, 2026
9252963
issue/fix-planning-parallelization-sprint-planner: Parallelize Sprint…
AbirAbbas Feb 24, 2026
99a80a2
issue/fix-agent-timeout-seconds-constant: Remove hardcoded 2700 defau…
AbirAbbas Feb 24, 2026
5efced0
issue/fix-build-cost-comparison-script: Create compare_build_costs.py…
AbirAbbas Feb 24, 2026
7a1d2fb
Resolve conflict: Remove DEFAULT_AGENT_MAX_TURNS constant, use litera…
AbirAbbas Feb 24, 2026
1d61162
Resolve conflict: Keep literal timeout value 1800, avoid DEFAULT_AGEN…
AbirAbbas Feb 24, 2026
b7101b5
Merge issue/00-fix-sprint-planner-prompt-compression: Compress sprint…
AbirAbbas Feb 24, 2026
13d621c
Merge issue/00-fix-planning-parallelization-sprint-planner: Paralleli…
AbirAbbas Feb 24, 2026
8c2d215
Merge issue/00-fix-coding-loop-timeout-undefined: Remove unused timeo…
AbirAbbas Feb 24, 2026
d4cd006
Merge issue/00-fix-test-model-selection-filename: Rename test_model_d…
AbirAbbas Feb 24, 2026
747e1ad
Merge issue/00-fix-build-cost-comparison-script: Add compare_build_co…
AbirAbbas Feb 24, 2026
1a8d8fc
issue/validate-advisor-gate: Validate issue advisor invocation gate e…
AbirAbbas Feb 24, 2026
13e0825
Merge issue/fbb4103b-02-validate-advisor-gate: Validate advisor gate
AbirAbbas Feb 24, 2026
6959b19
issue/fix-timeouts: Reduce per-role timeouts for lightweight agents
AbirAbbas Feb 24, 2026
1a52fe6
issue/fix-turn-budgets: Reduce per-role turn budgets in ExecutionConfig
AbirAbbas Feb 24, 2026
bde0dbc
issue/fix-timeouts: Reduce per-role timeouts in ExecutionConfig
AbirAbbas Feb 24, 2026
5a3448a
issue/fix-test-failure-fast-path: Require passing tests for trivial i…
AbirAbbas Feb 24, 2026
d47c80b
issue/compress-system-prompts: Reduce token count from 9946 to 5213
AbirAbbas Feb 24, 2026
4e9fb40
issue/fix-test-failure-fast-path: Update test to match new behavior
AbirAbbas Feb 24, 2026
4a386b1
Merge issue/00-fix-turn-budgets: Reduce turn budgets for lightweight …
AbirAbbas Feb 24, 2026
084781e
Merge issue/00-fix-timeouts: Reduce timeouts for lightweight agents
AbirAbbas Feb 24, 2026
1ffb642
Merge issue/00-compress-system-prompts: Compress all system prompts
AbirAbbas Feb 24, 2026
36bc826
Merge issue/00-fix-test-failure-fast-path: Fix trivial flag test hand…
AbirAbbas Feb 24, 2026
678f11e
chore: finalize repo for handoff
AbirAbbas Feb 24, 2026
318 changes: 318 additions & 0 deletions scripts/compare_build_costs.py
@@ -0,0 +1,318 @@
#!/usr/bin/env python3
"""Build cost comparison and validation.

Compares baseline build costs with optimized build costs to validate:
- Cost reduction ≥15% from haiku model downgrades (Component 7)
- Used in integration tests to gate Component 7 deployment

The script expects cost data in JSON format with agent durations and model assignments.
Cost reduction is calculated based on relative pricing differences between models.

Usage:
    python scripts/compare_build_costs.py --baseline baseline_costs.json --threshold 0.15
    python scripts/compare_build_costs.py --baseline examples/baseline_costs.json --threshold 0.20 --current optimized_costs.json
"""

from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path
from typing import Any


# Model cost multipliers (relative to haiku = 1.0)
# Based on typical Claude pricing: Sonnet is ~5x more expensive than Haiku
MODEL_COST_MULTIPLIERS = {
    "haiku": 1.0,
    "sonnet": 5.0,
    "opus": 15.0,  # For completeness, though not commonly used in this optimization
}


def load_costs_from_file(file_path: Path) -> dict[str, Any]:
    """Load cost data from JSON file.

    Args:
        file_path: Path to JSON file containing cost data

    Returns:
        Dict with agent_durations, model_assignments, and optionally total_cost

    Raises:
        FileNotFoundError: If file doesn't exist
        json.JSONDecodeError: If file is not valid JSON
        ValueError: If required fields are missing
    """
    if not file_path.exists():
        raise FileNotFoundError(f"Cost file not found: {file_path}")

    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    # Validate required fields
    if "agent_durations" not in data:
        raise ValueError(f"Missing required field 'agent_durations' in {file_path}")

    # model_assignments is optional - if not present, assume all sonnet
    if "model_assignments" not in data:
        data["model_assignments"] = {}

    return data


def calculate_total_cost(
    agent_durations: dict[str, list[float]],
    model_assignments: dict[str, str],
    default_model: str = "sonnet"
) -> float:
    """Calculate total cost based on agent durations and model assignments.

    Cost is calculated as:
        total_cost = sum(duration * model_cost_multiplier for each agent call)

    Args:
        agent_durations: Dict mapping agent role to list of durations (seconds)
        model_assignments: Dict mapping agent role to model name
        default_model: Default model to use if role not in model_assignments

    Returns:
        Total cost (arbitrary units, relative to haiku = 1.0)
    """
    total_cost = 0.0

    for role, durations in agent_durations.items():
        # Get model for this role; model names missing from
        # MODEL_COST_MULTIPLIERS are priced at the sonnet multiplier
        model = model_assignments.get(role, default_model)
        cost_multiplier = MODEL_COST_MULTIPLIERS.get(model, MODEL_COST_MULTIPLIERS["sonnet"])

        # Sum up costs for all calls of this role
        role_duration = sum(durations)
        role_cost = role_duration * cost_multiplier
        total_cost += role_cost

    return total_cost


def calculate_cost_reduction(baseline_cost: float, current_cost: float) -> float:
    """Calculate cost reduction as a fraction of the baseline cost.

    Args:
        baseline_cost: Baseline total cost
        current_cost: Current/optimized total cost

    Returns:
        Cost reduction as fraction (e.g., 0.15 = 15% reduction).
        Returns 0.0 if baseline_cost is 0, or if current_cost exceeds
        baseline_cost (a cost increase is clamped to 0.0, not reported
        as a negative reduction).
    """
    if baseline_cost == 0.0:
        return 0.0

    reduction = (baseline_cost - current_cost) / baseline_cost
    return max(0.0, reduction)  # Clamp to non-negative


def format_cost_breakdown(
    agent_durations: dict[str, list[float]],
    model_assignments: dict[str, str],
    default_model: str = "sonnet",
    top_n: int = 5
) -> str:
    """Format cost breakdown by role.

    Args:
        agent_durations: Dict mapping agent role to list of durations
        model_assignments: Dict mapping agent role to model name
        default_model: Default model to use if role not in assignments
        top_n: Number of top cost contributors to show

    Returns:
        Formatted string with cost breakdown
    """
    role_costs = []

    for role, durations in agent_durations.items():
        model = model_assignments.get(role, default_model)
        cost_multiplier = MODEL_COST_MULTIPLIERS.get(model, MODEL_COST_MULTIPLIERS["sonnet"])
        role_duration = sum(durations)
        role_cost = role_duration * cost_multiplier
        num_calls = len(durations)

        role_costs.append({
            "role": role,
            "model": model,
            "duration": role_duration,
            "cost": role_cost,
            "calls": num_calls,
        })

    # Sort by cost descending
    role_costs.sort(key=lambda x: x["cost"], reverse=True)

    # Format top N
    lines = []
    for i, item in enumerate(role_costs[:top_n], 1):
        lines.append(
            f" {i}. {item['role']:20s} ({item['model']:6s}): "
            f"{item['cost']:8.1f} cost units ({item['duration']:6.1f}s, {item['calls']} calls)"
        )

    total_cost = sum(item["cost"] for item in role_costs)
    lines.append(f" {'Total':>29s}: {total_cost:8.1f} cost units")

    return "\n".join(lines)


def main():
    """Main entry point for build cost comparison."""
    parser = argparse.ArgumentParser(
        description="Compare build costs and validate cost reduction"
    )
    parser.add_argument(
        "--baseline",
        type=str,
        required=True,
        help="Path to baseline costs JSON file",
    )
    parser.add_argument(
        "--current",
        type=str,
        help="Path to current/optimized costs JSON file (optional; if omitted, the baseline data is compared against itself)",
    )
    parser.add_argument(
        "--threshold",
        type=float,
        required=True,
        help="Minimum cost reduction threshold as fraction (e.g., 0.15 = 15%% reduction required)",
    )
    parser.add_argument(
        "--output",
        type=str,
        help="Output JSON file path for comparison results (optional)",
    )
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="Show detailed cost breakdown",
    )

    args = parser.parse_args()

    # Validate threshold
    if args.threshold < 0.0 or args.threshold > 1.0:
        print("Error: --threshold must be between 0.0 and 1.0", file=sys.stderr)
        sys.exit(1)

    try:
        # Load baseline costs
        baseline_path = Path(args.baseline)
        print(f"Loading baseline costs from: {args.baseline}", file=sys.stderr)
        baseline_data = load_costs_from_file(baseline_path)

        # Load current costs
        if args.current:
            current_path = Path(args.current)
            print(f"Loading current costs from: {args.current}", file=sys.stderr)
            current_data = load_costs_from_file(current_path)
        else:
            # If no current file is specified, compare the baseline against itself.
            # The reduction is then always 0.0; this only allows smoke-testing the
            # script with a single baseline file.
            print("No --current specified, using baseline data", file=sys.stderr)
            current_data = baseline_data

        # Calculate baseline cost (roles without an assignment default to sonnet)
        baseline_cost = calculate_total_cost(
            baseline_data["agent_durations"],
            baseline_data.get("model_assignments", {}),
            default_model="sonnet"
        )

        # Calculate current cost (with optimized model assignments)
        current_cost = calculate_total_cost(
            current_data["agent_durations"],
            current_data.get("model_assignments", {}),
            default_model="sonnet"
        )

        # Calculate cost reduction
        cost_reduction = calculate_cost_reduction(baseline_cost, current_cost)

        # Prepare results
        results = {
            "baseline_cost": baseline_cost,
            "current_cost": current_cost,
            "cost_reduction": cost_reduction,
            "threshold": args.threshold,
            "passed": cost_reduction >= args.threshold,
        }

        # Save results if output specified
        if args.output:
            output_path = Path(args.output)
            output_path.parent.mkdir(parents=True, exist_ok=True)
            with open(output_path, "w", encoding="utf-8") as f:
                json.dump(results, f, indent=2)
            print(f"Results written to: {args.output}", file=sys.stderr)

        # Print comparison summary
        print(f"\n{'='*60}", file=sys.stderr)
        print("Build Cost Comparison", file=sys.stderr)
        print(f"{'='*60}", file=sys.stderr)
        print(f" Baseline cost: {baseline_cost:10.1f} cost units", file=sys.stderr)
        print(f" Current cost: {current_cost:10.1f} cost units", file=sys.stderr)
        print(f" Cost reduction: {cost_reduction:10.2%} ({baseline_cost - current_cost:.1f} cost units)", file=sys.stderr)
        print(f" Threshold: {args.threshold:10.2%}", file=sys.stderr)
        print("", file=sys.stderr)

        # Show detailed breakdown if verbose
        if args.verbose:
            print("Baseline Cost Breakdown (top 5 roles):", file=sys.stderr)
            breakdown_baseline = format_cost_breakdown(
                baseline_data["agent_durations"],
                baseline_data.get("model_assignments", {}),
                default_model="sonnet"
            )
            print(breakdown_baseline, file=sys.stderr)
            print("", file=sys.stderr)

            print("Current Cost Breakdown (top 5 roles):", file=sys.stderr)
            breakdown_current = format_cost_breakdown(
                current_data["agent_durations"],
                current_data.get("model_assignments", {}),
                default_model="sonnet"
            )
            print(breakdown_current, file=sys.stderr)
            print("", file=sys.stderr)

        # Determine status
        if cost_reduction >= args.threshold:
            print(f"✓ PASS: Cost reduction {cost_reduction:.2%} >= threshold {args.threshold:.2%}", file=sys.stderr)
            print(" Component 7 (Model Downgrade) APPROVED for deployment", file=sys.stderr)
            print(f"{'='*60}", file=sys.stderr)
            sys.exit(0)
        else:
            print(f"✗ FAIL: Cost reduction {cost_reduction:.2%} < threshold {args.threshold:.2%}", file=sys.stderr)
            print(" Component 7 (Model Downgrade) REJECTED - cost savings insufficient", file=sys.stderr)
            shortfall = args.threshold - cost_reduction
            print(f" Shortfall: {shortfall:.2%} ({shortfall * baseline_cost:.1f} cost units)", file=sys.stderr)
            print(" ACTION: Review model assignments or adjust threshold", file=sys.stderr)
            print(f"{'='*60}", file=sys.stderr)
            sys.exit(1)

    except FileNotFoundError as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)
    except json.JSONDecodeError as e:
        print(f"Error: Invalid JSON in file: {e}", file=sys.stderr)
        sys.exit(1)
    except ValueError as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Unexpected error: {e}", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    main()
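
Worked example of the cost model in this script. It is a minimal, self-contained sketch rather than the script itself: the helpers `total_cost` and `cost_reduction` are condensed stand-ins for `calculate_total_cost` and `calculate_cost_reduction`, and the role names and durations are hypothetical sample data, not measurements from a real build.

```python
# Multipliers copied from the script (relative to haiku = 1.0).
MODEL_COST_MULTIPLIERS = {"haiku": 1.0, "sonnet": 5.0, "opus": 15.0}


def total_cost(agent_durations, model_assignments, default_model="sonnet"):
    """Sum duration * multiplier per role (condensed calculate_total_cost)."""
    fallback = MODEL_COST_MULTIPLIERS["sonnet"]  # unknown models priced as sonnet
    total = 0.0
    for role, durations in agent_durations.items():
        model = model_assignments.get(role, default_model)
        total += sum(durations) * MODEL_COST_MULTIPLIERS.get(model, fallback)
    return total


def cost_reduction(baseline_cost, current_cost):
    """Fractional reduction, clamped at 0.0 (condensed calculate_cost_reduction)."""
    if baseline_cost == 0.0:
        return 0.0
    return max(0.0, (baseline_cost - current_cost) / baseline_cost)


# Hypothetical durations in seconds; same shape as the "agent_durations" JSON field.
durations = {"coder": [100.0, 50.0], "git": [20.0], "merger": [30.0]}

# Baseline: no assignments, so every role defaults to sonnet.
baseline = total_cost(durations, {})
# Optimized: the two lightweight roles are moved to haiku.
optimized = total_cost(durations, {"git": "haiku", "merger": "haiku"})
reduction = cost_reduction(baseline, optimized)

print(f"baseline={baseline:.1f} optimized={optimized:.1f} reduction={reduction:.2%}")
```

With this sample data the baseline is 150 s + 20 s + 30 s at the sonnet multiplier (1000 cost units) and the optimized run is 750 + 20 + 30 = 800 cost units, a 20% reduction that would pass a `--threshold` of 0.15. Because the reduction is clamped at zero, a run that costs more than the baseline is reported as 0% rather than as a negative value.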