Skip to content

fix: always capture service logs during startup and show on failure#202

Merged
sohil-kshirsagar merged 10 commits intomainfrom
sohil/fix/service-logging-improvements
Mar 22, 2026
Merged

fix: always capture service logs during startup and show on failure#202
sohil-kshirsagar merged 10 commits intomainfrom
sohil/fix/service-logging-improvements

Conversation

@sohil-kshirsagar
Copy link
Contributor

@sohil-kshirsagar sohil-kshirsagar commented Mar 20, 2026

Summary

Service logs are now always captured during startup and automatically shown when the service fails to start. Previously, users had no visibility into startup failures unless they passed --enable-service-logs. This was especially painful in CI where iteration cycles are slow.

Also adds early process exit detection, print mode duration logging, TUI debug log cleanup, and fixes a pre-existing crash recovery bug.

image image

Changes

Always capture service logs during startup

  • Service stdout/stderr is always captured during startup — in an in-memory buffer (default) or to a file (--enable-service-logs)
  • On startup failure: logs are shown in the TUI service panel or printed to stderr in print mode
  • On startup success: in-memory buffer is discarded (switched to io.Discard) to avoid unbounded memory growth
  • Buffer is thread-safe (syncBuffer with mutex) since both stdout and stderr write concurrently
  • Logs persist across sandbox retry with a ⚠️ Retrying without sandbox... separator

Early process exit detection

  • After serviceCmd.Start(), a goroutine monitors serviceCmd.Wait() via processExitCh
  • waitForReadiness() uses select on processExitCh vs time.After — fails immediately instead of waiting for the full readiness timeout
  • Error message: "service process exited during startup: exit status 1"

Print mode duration logging

  • ➤ Loaded N tests (Xs) — test loading time
  • ↳ Loaded N suite spans (...) (Xs) — span preparation time
  • ✓ Environment ready (Xs) — server + service + SDK ack time
  • Total elapsed: Xs — overall time

TUI improvements

  • Custom slog handler (tuiSlogHandler): Formats debug slog output as clean message key=value instead of raw time=... level=DEBUG msg=...
  • stderr redirect (redirectStderrToTUI): Captures fence library's direct fmt.Fprintf(os.Stderr, ...) calls and routes them to the service log panel, preventing alt screen corruption
  • Unpatched dependency alerts: Downgraded from log.Info (always shown with full stack trace in service logs) to log.TestDebug (debug-only, routed to test panel)
  • Crash recovery fix: When the crashed test was the last in the batch, the server restart was skipped because hasMoreTests didn't account for tests queued for retry. Fixed by checking len(m.testsToRetry) > 0.

Print mode fallback for TestLog/ServiceLog

  • When no TUI is active, handleLogMessage routes through slog.Debug instead of silently dropping
  • --debug print mode now shows test-level log messages alongside other debug output

New log.TestDebug function

  • Debug-level variant of TestLog — only emits when --debug is enabled
  • Used for unpatched dependency alerts

Help message simplified

  • Removed the "Use --enable-service-logs" tip (logs are now always captured)
  • Only shows log file path when --enable-service-logs is active

Setup agent prompt updates

  • Startup logs are always shown on failure — no need for debug: true to see them
  • debug: true guidance preserved for runtime/replay failures
  • SDK logLevel: "debug" guidance added for SDK-level diagnostics
  • Sandbox fallback: agent adds a comment in config.yaml explaining why sandbox was disabled

TUI testing doc

  • Doc explaining how to use coding agents to test the TUI, capture screenshots, iterate

Future work

  • Debug log routing to test panels: In --debug TUI mode, log.Debug() calls with traceID attrs still route to service logs instead of the corresponding test panel. Fix: convert ~20 callsites in comparison.go, mock_matcher.go, dynamic_fields.go, server.go from log.Debug("...", "traceID", id) to log.TestDebug(id, "...")
  • SDK debug log level via CLI: When --debug is used, propagate debug log level to SDKs via env var so users don't need to manually set logLevel: "debug"
  • Deviation analysis with service logs: Interlace test markers ("X trace test started" / "X trace test completed") in service logs so service output can be programmatically extracted per-test and used in deviation analysis

@sohil-kshirsagar sohil-kshirsagar changed the title Sohil/fix/service logging improvements fix: always capture service logs during startup and show on failure Mar 20, 2026
@sohil-kshirsagar sohil-kshirsagar marked this pull request as ready for review March 20, 2026 21:22
@sohil-kshirsagar sohil-kshirsagar requested a review from jy-tan March 20, 2026 21:22
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 issues found across 15 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="internal/runner/environment_replay.go">

<violation number="1" location="internal/runner/environment_replay.go:54">
P1: Startup failure log lines are not shown in non-debug print mode because they are sent only to `ServiceLog`.</violation>
</file>

<file name="internal/tui/test_executor.go">

<violation number="1" location="internal/tui/test_executor.go:297">
P2: The stderr pipe reader assumes each `Read` returns whole lines, so log lines split across chunks are broken into incorrect entries. Buffer incomplete trailing data between reads before splitting on `\n`.</violation>
</file>

<file name="internal/runner/service.go">

<violation number="1" location="internal/runner/service.go:177">
P2: The goroutine captures `e.processExitCh` by reference through `e`. During sandbox retry, if the old process hasn't fully exited when the new `StartService()` runs and creates a new channel (`e.processExitCh = make(chan error, 1)`), the old goroutine's deferred send will target the **new** channel, causing a spurious early-exit detection for the new service. Capture the channel in a local variable before launching the goroutine to eliminate this race.</violation>
</file>

<file name="internal/runner/environment.go">

<violation number="1" location="internal/runner/environment.go:29">
P2: `savedLogFile` will be `nil` when the first `StartService()` fails at readiness, because `StartService()` internally calls `StopService()` (which nils `e.serviceLogFile`) before returning the error. The subsequent `e.serviceLogFile = nil` and restore are no-ops. Unlike the in-memory buffer path (which has a `e.startupLogBuffer == nil` guard to preserve across retries), the file path has no equivalent guard — `setupServiceLogging` unconditionally creates a new log file, losing the first attempt's logs.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@tusk-dev
Copy link

tusk-dev bot commented Mar 20, 2026

Generated 25 tests - 25 passed

Commit tests View tests

Tip

New to Tusk Unit Tests? Learn more here.

You are using "Only surface failing tests" mode, so Tusk will only surface failing tests in the PR comment but you can still see any passing tests in the Tusk UI.

Considered 25 test scenarios and 11 symbols in 4 files but found no failing tests.

Avg +47% line coverage gain across 4 files
Source file Line Branch
internal/log/log.go 54% (+54%) 0%
internal/runner/executor.go 57% (+57%) 0%
internal/runner/service.go 72% (+72%) 0%
internal/tui/test_executor.go 4% (+4%) 0%

Coverage is calculated by running tests (including passing) directly associated with each source file, learn more here.

View check history

Commit Status Output Created (UTC)
5d1d704 Generated 25 tests - 25 passed Tests Mar 20, 2026 9:23PM

Was Tusk helpful? Give feedback by reacting with 👍 or 👎

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 issues found across 15 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="internal/runner/service_test_helpers_windows.go">

<violation number="1" location="internal/runner/service_test_helpers_windows.go:15">
P2: The ping-based sleep uses `duration` instead of `duration+1`, so this helper under-sleeps by about one second and can make timing-dependent tests flaky.</violation>
</file>

<file name="internal/runner/environment.go">

<violation number="1" location="internal/runner/environment.go:29">
P2: The sandbox-retry path does not actually preserve file-based startup logs: `StartService()` recreates `serviceLogFile` on retry, so first-attempt logs/separator can be lost instead of accumulated.</violation>
</file>

<file name="internal/tui/test_executor.go">

<violation number="1" location="internal/tui/test_executor.go:301">
P2: Pipe reads are chunk-based, so splitting each read directly by `\n` can emit partial log lines. Preserve incomplete trailing data between reads (or use a line scanner) before appending to service logs.</violation>
</file>

<file name="internal/runner/service.go">

<violation number="1" location="internal/runner/service.go:178">
P1: Capture the started command in a local variable before launching the monitor goroutine; using `e.serviceCmd` inside the closure can race with retries and wait on the wrong process.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

@sohil-kshirsagar sohil-kshirsagar merged commit 2c0bbad into main Mar 22, 2026
14 checks passed
@sohil-kshirsagar sohil-kshirsagar deleted the sohil/fix/service-logging-improvements branch March 22, 2026 19:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants