fix: 'Failed to poll job status' error during duplicate cleanup operations #44
Description
Users are experiencing a "Failed to poll job status" error message in the UI during duplicate cleanup operations. The error appears to be transient - the duplicate detection operation continues and completes successfully despite the polling error.
Context
Type: bug
Scope: small
Complexity: medium
Priority: medium
User Report: The error occurs during a dry-run duplicate cleanup, but the operation continues and completes successfully.
Relevant Files
| File | Lines | Relevance | Notes |
|---|---|---|---|
| app/hooks/useCleanup.ts | 175-244 | HIGH | Contains pollJobStatus function that displays the error |
| app/api/migration/cleanup/[jobId]/route.ts | 16-58 | HIGH | API endpoint being polled, returns 500 on errors |
| lib/migration/cleanup/service.ts | 107-145 | HIGH | getCleanupJobStatus function that may throw errors |
| lib/migration/state-store.ts | 504-540 | MEDIUM | getMigrationJob with retry logic and file I/O |
Analysis
Polling Logic (app/hooks/useCleanup.ts)
The pollJobStatus function already has correct AbortError handling (lines 234-237), so this is a real error, not a mishandled abort.
Error Flow:
- Frontend polls `/api/migration/cleanup/{jobId}` every few seconds
- API returns a non-200 response (not a 404)
- Error is caught at line 241: `setError(err instanceof Error ? err.message : 'Failed to poll job status')`
- Polling stops (`setIsPolling(false)`)
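The flow above can be sketched as a minimal, self-contained model; `pollOnce` and its injected fetch parameter are hypothetical stand-ins for the real hook in app/hooks/useCleanup.ts:

```typescript
type PollResult = { job?: { status: string }; error?: string };

// One poll iteration: any non-200 response becomes a thrown Error, and the
// catch block surfaces the generic message seen by the user.
async function pollOnce(
  fetchJob: () => Promise<{ ok: boolean; status: number; json: () => Promise<{ status: string }> }>,
): Promise<PollResult> {
  try {
    const res = await fetchJob();
    if (!res.ok) throw new Error(`Failed to poll job status (HTTP ${res.status})`);
    return { job: await res.json() };
  } catch (err) {
    // Mirrors the catch at line 241 of the hook.
    return { error: err instanceof Error ? err.message : 'Failed to poll job status' };
  }
}
```

Note that a single failed response is enough to stop polling and show the error, even though the background job keeps running.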
Potential Root Causes
Hypothesis 1: Race Condition on Job Creation
- Polling may start before the job file is fully written to disk
- `createCleanupJob` saves the job asynchronously
- The first poll attempt might happen before `saveMigrationJob` completes
- The state store has retry logic (5 attempts), but it may not cover this timing window
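A toy reproduction of this suspected race (all names hypothetical, using an in-memory map in place of the job file): the save is asynchronous, so a poll that fires immediately can miss the job.

```typescript
const store = new Map<string, { id: string }>();

// Simulates createCleanupJob persisting the job after a small I/O delay.
function saveJobAsync(id: string, delayMs: number): Promise<void> {
  return new Promise((resolve) =>
    setTimeout(() => { store.set(id, { id }); resolve(); }, delayMs),
  );
}

async function demo(): Promise<[boolean, boolean]> {
  const pending = saveJobAsync('job-1', 50); // write still in flight
  const firstPoll = store.has('job-1');      // polls before the write lands
  await pending;
  const secondPoll = store.has('job-1');     // polls after the write completes
  return [firstPoll, secondPoll];
}
```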
Hypothesis 2: File I/O Contention
- Multiple rapid writes/reads to the same job file
- State store uses write queue but race conditions possible
- `getMigrationJob` waits for pending writes, but the logic is timing-sensitive
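A toy model of that wait-for-pending-writes behaviour (names hypothetical, not the actual state-store implementation): chaining writes per key and having reads await the chain narrows, but does not eliminate, the race window.

```typescript
const data = new Map<string, string>();
const pendingWrites = new Map<string, Promise<void>>();

// Queue a write behind any earlier writes for the same key.
function queueWrite(key: string, value: string, delayMs: number): void {
  const prev = pendingWrites.get(key) ?? Promise.resolve();
  const next = prev.then(
    () => new Promise<void>((resolve) =>
      setTimeout(() => { data.set(key, value); resolve(); }, delayMs),
    ),
  );
  pendingWrites.set(key, next);
}

// Wait for in-flight writes before reading, as getMigrationJob is described
// as doing for the job file.
async function readAfterWrites(key: string): Promise<string | undefined> {
  await pendingWrites.get(key);
  return data.get(key);
}
```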
Hypothesis 3: Transient File System Issues
- JSON.parse failures if file is partially written
- Permission errors (less likely given operation succeeds)
- Network file system delays (if running on network drive)
Hypothesis 4: API Route Handler Issues
- Next.js 15 dynamic route param handling uses the `await params` pattern
- Potential timing issues with async param resolution
- Error may be thrown before the job file is accessible
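A toy illustration of the `await params` pattern (no Next.js imports; `getJobId` is a hypothetical helper): in Next.js 15, dynamic route params arrive as a Promise, so the handler must resolve them before use.

```typescript
// In a real route handler this Promise is supplied by Next.js 15.
type RouteParams = Promise<{ jobId: string }>;

async function getJobId(params: RouteParams): Promise<string> {
  const { jobId } = await params; // the `await params` pattern named above
  return jobId;
}
```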
Why Operation Continues Successfully
The cleanup operation itself runs independently of the polling mechanism. The polling is only for UI progress updates, so when polling fails, the operation continues in the background and eventually completes.
Acceptance Criteria
- Identify root cause of polling failure (add detailed logging)
- Add retry logic to frontend polling (with exponential backoff)
- Add more detailed error messages (distinguish between different failure types)
- Ensure polling doesn't start until job is confirmed created
- Consider debouncing rapid poll requests
- Add health check endpoint to verify job exists before polling starts
- Log response status and body when polling fails (for debugging)
- Test with rapid job creation/polling scenarios
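The "retry with exponential backoff" criterion above could look roughly like this sketch; attempt counts and delays are illustrative defaults, not decided values:

```typescript
// Retry a failing poll attempt with exponentially increasing delays before
// surfacing the error to the user.
async function pollWithBackoff<T>(
  attempt: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 250,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err;
      if (i < maxAttempts - 1) {
        // Exponential backoff: baseDelayMs, then 2x, 4x, ...
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```

Wrapping the existing fetch call in such a helper would absorb the transient 500s described in this report without changing the hook's public behaviour.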
Testing Requirements
Unit Tests
- Test pollJobStatus with various error responses (400, 500, etc.)
- Test job creation followed by immediate polling
- Test getMigrationJob retry logic timing
Integration Tests
- Create cleanup job and immediately start polling
- Simulate file I/O delays during job creation
- Test polling behavior with concurrent job operations
- Verify error messages are user-friendly
Manual Tests
- Test duplicate cleanup dry run (reproduce user's scenario)
- Check browser Network tab for API response details
- Monitor state-store.log during job creation/polling
- Test on different file systems (local vs network)
Additional Considerations
Debugging Steps
- Add detailed API logging:
  - Log exact timing of job creation vs. first poll
  - Log file system operations in `getMigrationJob`
  - Log API response status codes and bodies
- Check the Network tab:
  - What HTTP status is returned when the error occurs?
  - What's in the response body?
  - Timing of the request (how soon after job creation?)
- Check logs:
  - `logs/migration.log` for state store operations
  - Console logs for API errors
Immediate Workarounds
Option 1: Delay initial poll
- Add 500ms delay before starting polling
- Gives job file time to be fully written
Option 2: Retry on frontend
- Catch 500 errors and retry 2-3 times before showing error to user
- Already has AbortError handling, add general retry logic
Option 3: Health check before polling
- Add a lightweight `/api/migration/cleanup/{jobId}/exists` endpoint
- Poll this endpoint first to confirm the job is ready
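The pre-poll health check could be sketched as below. The `/exists` endpoint is a proposal from this issue, not an existing route, so the check function is injected to keep the sketch self-contained:

```typescript
// Repeatedly check an existence probe until the job is confirmed created,
// then (in the real hook) status polling would begin.
async function waitUntilJobExists(
  checkExists: () => Promise<boolean>,
  maxChecks = 5,
  intervalMs = 100,
): Promise<boolean> {
  for (let i = 0; i < maxChecks; i++) {
    if (await checkExists()) return true;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false; // job never appeared; surface a clearer error than a poll 500
}
```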
Performance
- Polling interval is appropriate (no performance concern)
- File I/O in state store uses write queue (good)
- Consider using Redis or database for high-concurrency scenarios
Investigation Summary:
- Files analyzed: 4
- Confidence: MEDIUM (need to reproduce with logging to confirm root cause)
Issue created with Claude Code (direct investigation)