Fix CPU/Energy regressions for issue #139 #402

Open

bald-ai wants to merge 1 commit into steipete:main from bald-ai:codex/perf-issue-139

Conversation


@bald-ai bald-ai commented Feb 19, 2026

Fixes #139

This is my first PR on this codebase, and I tried to be diligent and explicit about what I changed and how I validated it.

What was causing the high CPU/energy usage

The issue was not one single bug; it was three independent performance problems that could stack:

  1. Codex CLI failure path could stay active too long (main culprit).
  2. OpenAI dashboard web fetch could spend too long retrying under bad auth/cookies.
  3. Menu bar blink task woke too often while idle.

What this PR changes

1) Main culprit: Codex CLI failed-run window is now short and bounded

File: Sources/CodexBarCore/Providers/Codex/CodexStatusProbe.swift

  • Reduced default timeout from 18s to 8s.
  • Changed retry policy:
    • before: retry on .parseFailed and .timedOut
    • now: retry only on .parseFailed
  • Added short parse retry timeout (4s).

Result: bad CLI states fail fast and wait for next scheduled refresh instead of burning CPU for long windows.
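The new policy can be sketched roughly as follows. All type and member names here are hypothetical, since the actual internals of `CodexStatusProbe.swift` are not shown in this PR description:

```swift
import Foundation

// Hypothetical sketch of the retry policy described above; the real
// names in CodexStatusProbe.swift may differ.
enum ProbeError {
    case parseFailed
    case timedOut
}

struct ProbePolicy {
    var primaryTimeout: TimeInterval = 8    // reduced from 18s
    var parseRetryTimeout: TimeInterval = 4 // short budget for the parse retry

    // Before: retry on .parseFailed and .timedOut.
    // Now: only .parseFailed retries; a timeout fails fast and waits
    // for the next scheduled refresh.
    func shouldRetry(after error: ProbeError) -> Bool {
        switch error {
        case .parseFailed: return true
        case .timedOut:    return false
        }
    }

    // Worst case: one full attempt plus one parse retry.
    var worstCaseWindow: TimeInterval { primaryTimeout + parseRetryTimeout }
}

let policy = ProbePolicy()
print(policy.shouldRetry(after: .timedOut), policy.worstCaseWindow) // false 12.0
```

Under this sketch the worst-case failed-run window is bounded at roughly 12s, which lines up with the ~12.7s measured mean reported in the impact table.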

2) OpenAI web dashboard fetch timeouts are capped lower

File: Sources/CodexBar/UsageStore.swift

  • Primary timeout: 15s
  • Retry timeout: 8s

Result: bad web session/cookie cases stop much sooner.
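The cap can be modeled as a simple two-tier budget; this is a sketch, not the actual request code in `UsageStore.swift`:

```swift
import Foundation

// Hypothetical sketch of the capped fetch timeouts described above.
struct FetchBudget {
    static let primary: TimeInterval = 15 // first attempt
    static let retry: TimeInterval = 8    // tighter cap for the retry

    // A bad web session/cookie cannot keep the fetch alive for long:
    // the retry attempt gets barely half the primary budget.
    static func timeout(isRetry: Bool) -> TimeInterval {
        isRetry ? retry : primary
    }
}

print(FetchBudget.timeout(isRetry: false), FetchBudget.timeout(isRetry: true)) // 15.0 8.0
```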

3) Idle blink loop is now adaptive

File: Sources/CodexBar/StatusItemController+Animation.swift

  • Removed fixed 75ms wakeups while idle.
  • Keep 75ms cadence only during active blink animation.

Result: less idle wakeup noise in normal usage.
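The idle/active split can be illustrated with a minimal loop sketch; the real `StatusItemController+Animation.swift` is likely structured quite differently:

```swift
import Foundation

// Hypothetical sketch of the adaptive blink loop.
final class BlinkLoop {
    var isAnimating = false
    private(set) var tickCount = 0

    // Drive up to `maxTicks` frames at the 75 ms cadence, but only while
    // a blink animation is active; when idle, return immediately so the
    // process takes zero timer wakeups.
    func pump(maxTicks: Int) {
        guard isAnimating else { return } // idle: no 75 ms wakeups
        for _ in 0..<maxTicks where isAnimating {
            tickCount += 1
            Thread.sleep(forTimeInterval: 0.075)
        }
    }
}

let loop = BlinkLoop()
loop.pump(maxTicks: 4)   // idle: returns at once, no ticks
loop.isAnimating = true
loop.pump(maxTicks: 4)   // animating: 4 ticks at 75 ms
print(loop.tickCount)    // 4
```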

4) Documentation updates

  • Updated Codex provider behavior notes: docs/codex.md
  • Added pre-fix simulation report: docs/perf-energy-issue-139-simulation-report-2026-02-19.md
  • Added post-fix validation report: docs/perf-energy-issue-139-main-fix-validation-2026-02-19.md

Measured impact (before vs after)

Main culprit comparison (Codex CLI failed path):

| Metric | Before | After | Delta |
| --- | --- | --- | --- |
| Failed-run window | 42.00s (18+24 code-path budget) | 12.67s measured mean | -69.8% |
| Avg child CPU during failed run | 113.32% | 89.34% | -21.2% |
| CPU-time exposure (CPU × duration) | 4759.44 | 1132.94 | -76.2% |
| Leftover child processes after failed run | not captured pre-fix | 0 | improved |
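The deltas can be recomputed from the raw before/after figures as a quick sanity check:

```swift
// Recompute the table's delta column from its raw before/after values.
func delta(_ before: Double, _ after: Double) -> Double {
    (after / before - 1) * 100
}

print(delta(42.00, 12.67))     // ≈ -69.8 (failed-run window)
print(delta(113.32, 89.34))    // ≈ -21.2 (avg child CPU)
print(delta(4759.44, 1132.94)) // ≈ -76.2 (CPU-time exposure)
```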

Validation run

Commands executed:

  • ./Scripts/lint.sh format
  • ./Scripts/lint.sh lint (strict swiftlint)
  • swift test
  • pnpm check
  • ./Scripts/compile_and_run.sh

All passed.

Attachments / transparency

I will upload these to the PR thread:

  • Activity Monitor before/after screenshots (CPU/Energy impact).
  • ai_conversation_full.jsonl (the conversation where I finalized and implemented this fix end-to-end).

I had earlier exploration chats too, but this attached one is the conversation where the final implementation was locked in.

AI assistance disclosure

This PR was prepared with AI assistance (analysis, implementation support, and test/report drafting), with manual review and validation by me before submission.


bald-ai commented Feb 19, 2026

ai_conversation_full.zip

@ratulsarna
Collaborator

A question before we merge: can you share how you landed on the new timeout values (8s/4s for Codex CLI and 15s/8s for OpenAI web), and whether you saw any increased “no data/stale” behavior in slower/flaky conditions? I’m aligned with the faster-fail direction, just want to explicitly confirm the tradeoff we’re accepting.

@ratulsarna ratulsarna added the question Further information is requested label Feb 20, 2026

bald-ai commented Feb 20, 2026

Hi, I wrote the answer and let AI format it for easier readability. If you want me to do some precise testing, no problem, but you will have to tell me exactly what you need. Also feel free to change the numbers if you can make a better guess. Ideally it would be better to get from Peter a better way to get the info without launching the entire CLI. Maybe he can hook you up with a solution from big token?


AI answer

Hi, I honestly didn’t know the exact “correct” way to pick those timeout values, so I made a practical guess to get the best fix without accidentally removing intended behaviour.

Reason:

  • Happy-path runs are usually in seconds.
  • These values still allow retries.
  • They hard-cap bad loops so we don’t burn CPU forever.

What I saw in my logs:

Codex RPC (happy path):

  • Median: 1.10s
  • P95: 2.95s
  • Max: 11.26s (rare outlier)

OpenAI web refresh (Feb 20, 2026):

  • Median: 3.40s
  • P95: 4.08s
  • Max: 8.65s
  • Runs over 8s: 1/35

No-data / stale behavior:

  • I didn’t run a dedicated flaky-network benchmark.
  • I did set up runtime logging yesterday.
  • Total samples: 767
  • Healthy samples: 754
  • Overall healthy rate: 98.31%
  • Feb 20 healthy rate: 99.79% (474/475)
  • Last 120 samples healthy rate: 99.17% (119/120)

So from what I collected, failures looked like short blips, not long degraded periods.
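The healthy-rate percentages follow directly from the raw sample counts above:

```swift
// Recompute the healthy-rate percentages from the raw sample counts.
func healthyRate(_ healthy: Int, _ total: Int) -> Double {
    Double(healthy) / Double(total) * 100
}

print(healthyRate(754, 767)) // ≈ 98.31 (overall)
print(healthyRate(474, 475)) // ≈ 99.79 (Feb 20)
print(healthyRate(119, 120)) // ≈ 99.17 (last 120 samples)
```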


bald-ai commented Feb 20, 2026

Oh, btw, the logs are in my private version; I shipped this PR without them and made a custom build with logging for myself. That felt like the right way to do it.

Development

Successfully merging this pull request may close these issues.

consuming too much power on my macbook