fix: 'Failed to poll job status' error during duplicate cleanup operations #44

@rdfitted

Description

Users are seeing a "Failed to poll job status" error message in the UI during duplicate cleanup operations. The error appears to be transient: the duplicate detection operation continues and completes successfully despite the polling error.

Context

Type: bug
Scope: small
Complexity: medium
Priority: medium

User Report: The error occurs during a dry-run duplicate cleanup, but the operation continues and completes.

Relevant Files

| File | Lines | Relevance | Notes |
| --- | --- | --- | --- |
| app/hooks/useCleanup.ts | 175-244 | HIGH | Contains pollJobStatus function that displays the error |
| app/api/migration/cleanup/[jobId]/route.ts | 16-58 | HIGH | API endpoint being polled, returns 500 on errors |
| lib/migration/cleanup/service.ts | 107-145 | HIGH | getCleanupJobStatus function that may throw errors |
| lib/migration/state-store.ts | 504-540 | MEDIUM | getMigrationJob with retry logic and file I/O |

Analysis

Polling Logic (app/hooks/useCleanup.ts)

The pollJobStatus function already has correct AbortError handling (lines 234-237), so this is a real error, not a mishandled abort.

Error Flow:

  1. Frontend polls /api/migration/cleanup/{jobId} every few seconds
  2. API returns non-200 response (not 404)
  3. Error caught at line 241: setError(err instanceof Error ? err.message : 'Failed to poll job status')
  4. Polling stops (setIsPolling(false))

Potential Root Causes

Hypothesis 1: Race Condition on Job Creation

  • Polling may start before job file is fully written to disk
  • createCleanupJob saves job asynchronously
  • First poll attempt might happen before saveMigrationJob completes
  • State store has retry logic (5 attempts) but may not be sufficient for the timing

Hypothesis 2: File I/O Contention

  • Multiple rapid writes/reads to the same job file
  • State store uses a write queue, but race conditions are still possible
  • getMigrationJob waits for pending writes, but this is timing-sensitive

Hypothesis 3: Transient File System Issues

  • JSON.parse failures if file is partially written
  • Permission errors (less likely given operation succeeds)
  • Network file system delays (if running on network drive)
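
If partially written files are the cause, the read path could report a parse failure as a distinct, retryable outcome instead of letting it surface as a generic 500. The following is a minimal sketch under that assumption; `tryParseJobFile` and `ParseResult` are hypothetical names, not functions from lib/migration/state-store.ts.

```typescript
// Hypothetical helper: separates "file is empty or half-written" from other errors,
// so the caller can decide whether a short retry is worthwhile.
type ParseResult<T> =
  | { ok: true; value: T }
  | { ok: false; reason: "empty" | "malformed" };

function tryParseJobFile<T>(raw: string): ParseResult<T> {
  // An empty read can happen if the file was created but not yet written.
  if (raw.trim() === "") {
    return { ok: false, reason: "empty" };
  }
  try {
    return { ok: true, value: JSON.parse(raw) as T };
  } catch (err) {
    // JSON.parse throws SyntaxError on truncated or invalid input.
    if (err instanceof SyntaxError) {
      return { ok: false, reason: "malformed" };
    }
    throw err;
  }
}
```

A caller could retry briefly on `empty` or `malformed` and only fail once the retries are exhausted. On the write side, writing to a temporary file and renaming it into place is a common way to avoid exposing partial content; the issue does not say whether the state store already does this.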

Hypothesis 4: API Route Handler Issues

  • Next.js 15 dynamic route param handling: await params pattern
  • Potential timing issues with async param resolution
  • Error thrown before job file is accessible

Why Operation Continues Successfully

The cleanup operation itself runs independently of the polling mechanism. The polling is only for UI progress updates, so when polling fails, the operation continues in the background and eventually completes.

Acceptance Criteria

  • Identify root cause of polling failure (add detailed logging)
  • Add retry logic to frontend polling (with exponential backoff)
  • Add more detailed error messages (distinguish between different failure types)
  • Ensure polling doesn't start until job is confirmed created
  • Consider debouncing rapid poll requests
  • Add health check endpoint to verify job exists before polling starts
  • Log response status and body when polling fails (for debugging)
  • Test with rapid job creation/polling scenarios
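
To make the "distinguish between different failure types" and "log response status and body" criteria concrete, here is a rough sketch of a classifier the hook could call on a non-OK response. The category names, messages, and the `classifyPollFailure` function are illustrative assumptions, not existing code.

```typescript
// Hypothetical failure categories; names are illustrative only.
type PollFailureKind = "network" | "not_found" | "server_error" | "client_error";

interface PollFailure {
  kind: PollFailureKind;
  retryable: boolean;
  message: string;
}

// Call only for failed polls. Pass status = null when fetch itself rejected
// (no HTTP response was received).
function classifyPollFailure(status: number | null, body = ""): PollFailure {
  // Keep a short excerpt of the body so logs stay readable.
  const detail = body ? ` (${body.slice(0, 200)})` : "";
  if (status === null) {
    return { kind: "network", retryable: true, message: "Could not reach the server while checking job status" };
  }
  if (status === 404) {
    // The existing hook treats 404 separately; it is not retried here.
    return { kind: "not_found", retryable: false, message: "Cleanup job not found" };
  }
  if (status >= 500) {
    return { kind: "server_error", retryable: true, message: `Server error ${status} while checking job status${detail}` };
  }
  return { kind: "client_error", retryable: false, message: `Request rejected with status ${status}${detail}` };
}
```

The `message` field can be logged as-is for debugging, while `retryable` tells the polling loop whether another attempt makes sense before showing an error to the user.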

Testing Requirements

Unit Tests

  • Test pollJobStatus with various error responses (400, 500, etc.)
  • Test job creation followed by immediate polling
  • Test getMigrationJob retry logic timing

Integration Tests

  • Create cleanup job and immediately start polling
  • Simulate file I/O delays during job creation
  • Test polling behavior with concurrent job operations
  • Verify error messages are user-friendly

Manual Tests

  • Test duplicate cleanup dry run (reproduce user's scenario)
  • Check browser Network tab for API response details
  • Monitor state-store.log during job creation/polling
  • Test on different file systems (local vs network)

Additional Considerations

Debugging Steps

  1. Add detailed API logging:

    • Log exact timing of job creation vs first poll
    • Log file system operations in getMigrationJob
    • Log API response status codes and bodies
  2. Check Network tab:

    • What HTTP status is returned when error occurs?
    • What's in the response body?
    • Timing of request (how soon after job creation?)
  3. Check logs:

    • logs/migration.log for state store operations
    • Console logs for API errors

Immediate Workarounds

Option 1: Delay initial poll

  • Add 500ms delay before starting polling
  • Gives job file time to be fully written

Option 2: Retry on frontend

  • Catch 500 errors and retry 2-3 times before showing error to user
  • The hook already handles AbortError; add general retry logic alongside it
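
One possible shape for the frontend retry is a small wrapper with exponential backoff that never retries aborts. This is a sketch only: `retryWithBackoff`, `RetryOptions`, and the default delays are assumptions for illustration and do not exist in app/hooks/useCleanup.ts.

```typescript
interface RetryOptions {
  retries: number;      // extra attempts allowed after the first failure
  baseDelayMs: number;  // delay before the first retry; doubles each time
  sleep?: (ms: number) => Promise<void>; // injectable so tests need no real timers
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  { retries, baseDelayMs, sleep = defaultSleep }: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // An abort is an intentional cancellation, so rethrow it immediately.
      if (typeof err === "object" && err !== null && (err as { name?: unknown }).name === "AbortError") {
        throw err;
      }
      lastError = err;
      if (attempt < retries) {
        // Delays grow as base, 2x base, 4x base, ...
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}
```

Wrapping a single poll request in `retryWithBackoff(() => pollOnce(), { retries: 3, baseDelayMs: 250 })` (where `pollOnce` is a placeholder for the existing fetch call) would surface the error to the user only after all attempts fail.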

Option 3: Health check before polling

  • Add lightweight /api/migration/cleanup/{jobId}/exists endpoint
  • Poll this first to confirm job is ready
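
On the client, the health check could be wrapped in a helper that waits a bounded time for the job to become visible before regular polling begins. The sketch below assumes a caller-supplied `checkExists` function (for example, one that requests the proposed `/exists` endpoint); `waitForJob` and its defaults are hypothetical.

```typescript
interface WaitOptions {
  maxAttempts?: number; // how many existence checks to make in total
  intervalMs?: number;  // fixed pause between checks
  sleep?: (ms: number) => Promise<void>; // injectable so tests need no real timers
}

// Resolves true as soon as the job is visible, false if it never appears.
async function waitForJob(
  checkExists: () => Promise<boolean>,
  options: WaitOptions = {},
): Promise<boolean> {
  const {
    maxAttempts = 5,
    intervalMs = 200,
    sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
  } = options;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (await checkExists()) {
      return true;
    }
    // No need to wait after the final check.
    if (attempt < maxAttempts) {
      await sleep(intervalMs);
    }
  }
  return false;
}
```

Polling would start only when `waitForJob` resolves to `true`; a `false` result could be reported as "job was not created" rather than as a generic polling failure.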

Performance

  • Polling interval is appropriate (no performance concern)
  • File I/O in state store uses write queue (good)
  • Consider using Redis or database for high-concurrency scenarios

Investigation Summary:

  • Files analyzed: 4
  • Confidence: MEDIUM (need to reproduce with logging to confirm root cause)

Issue created with Claude Code (direct investigation)

Metadata

Assignees: No one assigned
Labels: bug (Something isn't working)
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests