[BUG] Langfuse sync retries indefinitely against stale refresh jobs with no circuit breaker #374

@jsbattig

Description

When LangfuseTraceSyncService triggers a refresh for a per-user Langfuse repo and the submission fails with DuplicateJobError (stale RUNNING job from Bug #372), the error is caught and logged as a warning but there is no circuit breaker, backoff, or escalation. The sync service retries on every sync cycle (typically every 60 seconds) indefinitely, generating the same warning log entry forever.

Reproduction Steps

  1. Have the condition from Bug #372 ("Stale RUNNING jobs not recovered after server restart block all subsequent refresh submissions") present: a stale RUNNING job left behind after a server restart
  2. Observe Langfuse sync service running on its configured interval
  3. Watch logs accumulate identical "already running" warnings every cycle

Expected Behavior

After N consecutive DuplicateJobError failures for the same repo, the sync service should do at least one of the following:

  • Detect and clean up stale jobs (jobs that have been "RUNNING" for longer than cidx_index_timeout + buffer)
  • Implement exponential backoff on retry frequency for that specific repo
  • Log at ERROR level with remediation guidance after a threshold (e.g., "Job has been running for >1 hour, likely stale -- consider manual cleanup")
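The backoff option could look roughly like the sketch below. RepoBackoff and its parameters are hypothetical names introduced here for illustration, not part of the codebase; the 60-second base simply mirrors the sync interval mentioned above, and the one-hour cap is an assumption.

```python
import time
from typing import Callable, Dict, Tuple


class RepoBackoff:
    """Per-alias exponential backoff (hypothetical helper, not in the codebase)."""

    def __init__(
        self,
        base_seconds: float = 60.0,    # assumed: matches the typical sync interval
        max_seconds: float = 3600.0,   # assumed cap on the delay
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._base = base_seconds
        self._max = max_seconds
        self._clock = clock
        # alias -> (consecutive failures, earliest time of the next attempt)
        self._state: Dict[str, Tuple[int, float]] = {}

    def should_attempt(self, alias: str) -> bool:
        """Return True if the alias is not currently backing off."""
        _, next_attempt = self._state.get(alias, (0, 0.0))
        return self._clock() >= next_attempt

    def record_failure(self, alias: str) -> float:
        """Double the delay for each consecutive failure, up to the cap."""
        failures, _ = self._state.get(alias, (0, 0.0))
        failures += 1
        delay = min(self._base * 2 ** (failures - 1), self._max)
        self._state[alias] = (failures, self._clock() + delay)
        return delay

    def record_success(self, alias: str) -> None:
        """Reset the alias so the next failure starts from the base delay."""
        self._state.pop(alias, None)
```

The sync loop would call should_attempt(alias) before triggering a refresh, record_failure(alias) when DuplicateJobError is caught, and record_success(alias) otherwise, so a healthy repo is never delayed.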

Actual Behavior

02:45:55 WARNING - Failed to trigger refresh for 'langfuse_Claude_Code_seba.battig-global': already running (job_id: 73a3d173)
02:52:06 WARNING - Failed to trigger refresh: already running (job_id: 372ae80d)
02:53:07 WARNING - Failed to trigger refresh: already running (job_id: 372ae80d)
02:54:13 WARNING - Failed to trigger refresh: already running (job_id: 372ae80d)
02:56:15 WARNING - Failed to trigger refresh: already running (job_id: 07ad650c)

Same warning every sync cycle, forever; a given stale job ID (e.g., 372ae80d) repeats across consecutive cycles.

Root Cause

In langfuse_trace_sync_service.py:253-259, the refresh trigger is wrapped in a broad except Exception handler that logs a warning and continues:

if self._refresh_scheduler is not None:
    for repo_folder_name in modified_repos:
        trigger_alias = f"{repo_folder_name}-global"
        try:
            self._refresh_scheduler.trigger_refresh_for_repo(trigger_alias)
        except Exception as e:
            logger.warning(f"Failed to trigger refresh for '{trigger_alias}': {e}")

No retry tracking, no backoff, no staleness detection.

Affected Files

  • src/code_indexer/server/services/langfuse_trace_sync_service.py:253-259 - Missing circuit breaker
  • src/code_indexer/server/repositories/background_jobs.py - No API to detect stale jobs

Proposed Fix

Primary (depends on Bug #372 fix): Once stale jobs are cleaned on startup, this bug becomes moot for the restart scenario.

Defense-in-depth: Add a staleness check before raising DuplicateJobError. In _check_operation_conflict(), check if the conflicting job's started_at is older than cidx_index_timeout + 300s (buffer). If so, transition it to FAILED automatically and allow the new submission.
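A minimal sketch of that staleness check, under stated assumptions: the Job dataclass and the free-standing check_operation_conflict function are simplified stand-ins, since the real method's signature and the job record's fields are not shown in this issue. Only the FAILED transition and the 300-second buffer come from the proposal itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

STALE_BUFFER_SECONDS = 300  # buffer from the proposal above


@dataclass
class Job:
    """Hypothetical stand-in for the real job record."""
    job_id: str
    status: str
    started_at: datetime


class DuplicateJobError(Exception):
    """Stand-in for the real exception class."""


def check_operation_conflict(
    conflicting: Job, index_timeout_seconds: int, now: datetime
) -> None:
    """Raise DuplicateJobError unless the conflicting job is stale.

    A RUNNING job older than index_timeout_seconds + buffer is marked
    FAILED, and the new submission is allowed to proceed.
    """
    age = now - conflicting.started_at
    limit = timedelta(seconds=index_timeout_seconds + STALE_BUFFER_SECONDS)
    if conflicting.status == "RUNNING" and age > limit:
        conflicting.status = "FAILED"  # assumed: persisted by the caller
        return
    raise DuplicateJobError(f"already running (job_id: {conflicting.job_id})")
```

Passing now in explicitly keeps the check deterministic in tests; a real implementation would presumably read the timeout from the cidx_index_timeout setting and persist the status change in the same transaction as the new submission.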

Optional: Add per-repo retry counter in LangfuseTraceSyncService that escalates from WARNING to ERROR after 3 consecutive failures for the same alias, with a suggestion to check job status.
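The per-repo counter might be sketched as follows. RefreshFailureTracker is a hypothetical name, and the sketch assumes escalation happens on the third consecutive failure itself (the proposal could also be read as escalating on the fourth).

```python
import logging
from collections import defaultdict
from typing import DefaultDict

logger = logging.getLogger(__name__)
ESCALATION_THRESHOLD = 3  # from the proposal above


class RefreshFailureTracker:
    """Counts consecutive refresh failures per alias (hypothetical helper)."""

    def __init__(self, threshold: int = ESCALATION_THRESHOLD) -> None:
        self._threshold = threshold
        self._failures: DefaultDict[str, int] = defaultdict(int)

    def record_failure(self, alias: str, error: Exception) -> int:
        """Log the failure and return the logging level that was used."""
        self._failures[alias] += 1
        count = self._failures[alias]
        if count >= self._threshold:
            level = logging.ERROR
            hint = " Check the job status; the blocking job may be stale."
        else:
            level = logging.WARNING
            hint = ""
        logger.log(
            level,
            "Failed to trigger refresh for '%s' (%d consecutive): %s.%s",
            alias, count, error, hint,
        )
        return level

    def record_success(self, alias: str) -> None:
        """Reset the counter once a refresh is accepted for the alias."""
        self._failures.pop(alias, None)
```

In the loop quoted under Root Cause, the except branch would call record_failure(trigger_alias, e) in place of the unconditional logger.warning, and the try branch would call record_success(trigger_alias) after a successful trigger.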

Impact

  • Medium: Log noise (thousands of identical warnings per day), but the actual data continues syncing to disk correctly
  • The warnings obscure real errors in the log stream
  • Secondary: each failed submission attempt adds SQLite read pressure from _check_operation_conflict() scanning all jobs
  • Contributes to "database is locked" contention observed in logs

Evidence (Staging Logs 2026-03-07)

Over 15 identical "already running" warnings across a 6-hour window for just two repos:

  • langfuse_Claude_Code_unknown-global - 5+ warnings with rotating stale job IDs
  • langfuse_Claude_Code_seba.battig_lightspeeddms.com-global - 10+ warnings with rotating stale job IDs
