fix: 'Failed to poll job status' error during duplicate cleanup operations #44

## Description

Users see a "Failed to poll job status" error in the UI during duplicate cleanup operations. The error appears to be transient: duplicate detection continues and completes successfully despite the polling error.

## Context

**Type**: bug  
**Scope**: small  
**Complexity**: medium  
**Priority**: medium

**User Report**: The error occurs during a dry-run duplicate cleanup, but the operation continues and completes.

## Relevant Files

| File | Lines | Relevance | Notes |
|------|-------|-----------|-------|
| `app/hooks/useCleanup.ts` | 175-244 | HIGH | Contains pollJobStatus function that displays the error |
| `app/api/migration/cleanup/[jobId]/route.ts` | 16-58 | HIGH | API endpoint being polled, returns 500 on errors |
| `lib/migration/cleanup/service.ts` | 107-145 | HIGH | getCleanupJobStatus function that may throw errors |
| `lib/migration/state-store.ts` | 504-540 | MEDIUM | getMigrationJob with retry logic and file I/O |

## Analysis

### Polling Logic (app/hooks/useCleanup.ts)

The `pollJobStatus` function already has **correct AbortError handling** (lines 234-237), so this is a **real error**, not a mishandled abort.

**Error Flow**:
1. The frontend polls `/api/migration/cleanup/{jobId}` every few seconds.
2. The API returns a non-200 response other than 404.
3. The error is caught at line 241: `setError(err instanceof Error ? err.message : 'Failed to poll job status')`.
4. Polling stops (`setIsPolling(false)`), so the UI receives no further progress updates even though the job keeps running.

### Potential Root Causes

**Hypothesis 1: Race Condition on Job Creation**
- Polling may start before job file is fully written to disk
- `createCleanupJob` saves job asynchronously
- First poll attempt might happen before `saveMigrationJob` completes
- State store has retry logic (5 attempts) but may not be sufficient for the timing

**Hypothesis 2: File I/O Contention**
- Multiple rapid writes and reads hit the same job file
- The state store uses a write queue, but race conditions are still possible
- `getMigrationJob` waits for pending writes, but this is timing-sensitive

**Hypothesis 3: Transient File System Issues**
- JSON.parse failures if file is partially written
- Permission errors (less likely given operation succeeds)
- Network file system delays (if running on network drive)

**Hypothesis 4: API Route Handler Issues**
- Next.js 15 dynamic route param handling: `await params` pattern
- Potential timing issues with async param resolution
- Error thrown before job file is accessible

### Why Operation Continues Successfully

The cleanup operation itself runs independently of the polling mechanism. The polling is only for UI progress updates, so when polling fails, the operation continues in the background and eventually completes.

## Acceptance Criteria

- [ ] Identify root cause of polling failure (add detailed logging)
- [ ] Add retry logic to frontend polling (with exponential backoff)
- [ ] Add more detailed error messages (distinguish between different failure types)
- [ ] Ensure polling doesn't start until job is confirmed created
- [ ] Consider debouncing rapid poll requests
- [ ] Add health check endpoint to verify job exists before polling starts
- [ ] Log response status and body when polling fails (for debugging)
- [ ] Test with rapid job creation/polling scenarios
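
One way to approach the "more detailed error messages" criterion is to map the HTTP status of a failed poll to a failure category and a user-facing message. The sketch below is illustrative only: `describePollFailure`, the category names, and the message wording are hypothetical and do not come from `useCleanup.ts`.

```typescript
// Hypothetical helper: categories and messages are assumptions for illustration.
type PollFailureKind = "not_found" | "client_error" | "server_error" | "unexpected";

interface PollFailure {
  kind: PollFailureKind;
  retryable: boolean; // whether a later poll could plausibly succeed
  message: string;
}

function describePollFailure(status: number, jobId: string): PollFailure {
  if (status === 404) {
    // The issue notes that 404 is handled separately from the generic error path.
    return { kind: "not_found", retryable: false, message: `Cleanup job ${jobId} was not found.` };
  }
  if (status >= 500) {
    // Possibly transient (for example, the job file is mid-write), so worth retrying.
    return {
      kind: "server_error",
      retryable: true,
      message: `Server error (${status}) while checking job ${jobId}; the cleanup may still be running.`,
    };
  }
  if (status >= 400) {
    return { kind: "client_error", retryable: false, message: `Request for job ${jobId} was rejected (${status}).` };
  }
  return { kind: "unexpected", retryable: false, message: `Unexpected response (${status}) for job ${jobId}.` };
}
```

Keeping the `retryable` flag next to the message lets the polling loop decide whether to retry without parsing the message text.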

## Testing Requirements

### Unit Tests
- [ ] Test pollJobStatus with various error responses (400, 500, etc.)
- [ ] Test job creation followed by immediate polling
- [ ] Test getMigrationJob retry logic timing

### Integration Tests
- [ ] Create cleanup job and immediately start polling
- [ ] Simulate file I/O delays during job creation
- [ ] Test polling behavior with concurrent job operations
- [ ] Verify error messages are user-friendly

### Manual Tests
- [ ] Test duplicate cleanup dry run (reproduce user's scenario)
- [ ] Check browser Network tab for API response details
- [ ] Monitor state-store.log during job creation/polling
- [ ] Test on different file systems (local vs network)

## Additional Considerations

### Debugging Steps

1. **Add detailed API logging**:
   - Log exact timing of job creation vs first poll
   - Log file system operations in getMigrationJob
   - Log API response status codes and bodies

2. **Check Network tab**:
   - What HTTP status is returned when error occurs?
   - What's in the response body?
   - Timing of request (how soon after job creation?)

3. **Check logs**:
   - `logs/migration.log` for state store operations
   - Console logs for API errors
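
To make the "timing of job creation vs first poll" measurable, a small piece of temporary instrumentation could record when a job is created and log how long afterwards the first poll arrives. This is a minimal sketch for local debugging in a single server process; `markJobCreated`, `recordPoll`, and the log prefix are hypothetical names, and an in-memory map like this would not be shared across multiple server instances.

```typescript
// Hypothetical debugging aid; not part of the existing code.
const jobCreatedAt = new Map<string, number>();
const firstPollSeen = new Set<string>();

// Call right after the job has been saved.
function markJobCreated(jobId: string, now: number = Date.now()): void {
  jobCreatedAt.set(jobId, now);
}

// Call from the polled route handler. Returns the milliseconds between creation
// and the first poll, or null for later polls or when creation was never recorded.
function recordPoll(jobId: string, status: number, now: number = Date.now()): number | null {
  if (firstPollSeen.has(jobId)) return null;
  firstPollSeen.add(jobId);

  const createdAt = jobCreatedAt.get(jobId);
  if (createdAt === undefined) {
    console.warn(`[cleanup-poll] first poll for ${jobId} (status ${status}) arrived before creation was recorded`);
    return null;
  }

  const elapsedMs = now - createdAt;
  console.log(`[cleanup-poll] first poll for ${jobId}: status ${status}, ${elapsedMs}ms after creation`);
  return elapsedMs;
}
```

A consistently small gap on failing polls would support the race-condition hypothesis; failures long after creation would point elsewhere.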

### Immediate Workarounds

**Option 1: Delay initial poll**
- Add 500ms delay before starting polling
- Gives job file time to be fully written
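
A minimal sketch of Option 1, assuming the hook exposes some function that starts polling (called `startPolling` here as a placeholder). The 500 ms default matches the value suggested above; `sleep` and `startPollingAfterDelay` are hypothetical names.

```typescript
// Resolves after `ms` milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// `startPolling` stands in for whatever kicks off pollJobStatus in the hook.
async function startPollingAfterDelay(
  startPolling: () => void,
  initialDelayMs = 500,
): Promise<void> {
  await sleep(initialDelayMs);
  startPolling();
}
```

A fixed delay only narrows the window for a creation race; it does not close it, so it is best treated as a stopgap until the root cause is confirmed.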

**Option 2: Retry on frontend**
- Catch 500 errors and retry 2-3 times before showing the error to the user
- The hook already handles AbortError; add general retry logic alongside it
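
Option 2 could be sketched as a generic wrapper that retries a failing status request with exponential backoff and never retries an abort. The function and option names (`fetchStatusWithRetry`, `maxRetries`, `baseDelayMs`) are assumptions for illustration; `fetchStatus` stands in for the request that `pollJobStatus` makes and is expected to throw on a failed response.

```typescript
interface RetryOptions {
  maxRetries: number;  // retries allowed after the first attempt
  baseDelayMs: number; // delay before the first retry; doubles on each further retry
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

function defaultSleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

function isAbortError(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { name?: unknown }).name === "AbortError";
}

async function fetchStatusWithRetry<T>(
  fetchStatus: () => Promise<T>,
  { maxRetries, baseDelayMs, sleep = defaultSleep }: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fetchStatus();
    } catch (err) {
      // An abort is intentional (for example, the component unmounted), so surface it immediately.
      if (isAbortError(err)) throw err;
      lastError = err;
      if (attempt < maxRetries) {
        await sleep(baseDelayMs * 2 ** attempt); // baseDelayMs, then 2x, 4x, ...
      }
    }
  }
  // All attempts failed: rethrow the most recent error for the caller to display.
  throw lastError;
}
```

With `maxRetries: 3` and `baseDelayMs: 100`, the waits between attempts are 100 ms, 200 ms, and 400 ms, and the user only sees an error if every attempt fails.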

**Option 3: Health check before polling**
- Add lightweight `/api/migration/cleanup/{jobId}/exists` endpoint
- Poll this first to confirm job is ready
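
On the client side, Option 3 might look like the sketch below: check for the job a bounded number of times, and start regular polling only once the check succeeds. `waitForJob` and its parameters are hypothetical; `checkExists` would wrap a request to the proposed `/exists` endpoint, whose response shape is not defined in this issue.

```typescript
// Resolves to true as soon as `checkExists` reports the job, or false after
// `maxAttempts` unsuccessful checks. `wait` is injectable so tests can skip real delays.
async function waitForJob(
  checkExists: () => Promise<boolean>,
  maxAttempts = 5,
  intervalMs = 200,
  wait: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (await checkExists()) return true;
    if (attempt < maxAttempts) await wait(intervalMs);
  }
  return false;
}
```

If `waitForJob` resolves to `false`, the UI can report that the job was never confirmed instead of showing the generic polling error.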

### Performance
- Polling interval is appropriate (no performance concern)
- File I/O in state store uses write queue (good)
- Consider using Redis or database for high-concurrency scenarios

---

**Investigation Summary**:
- Files analyzed: 4
- Confidence: MEDIUM (need to reproduce with logging to confirm root cause)

*Issue created with Claude Code (direct investigation)*
