Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
231 changes: 231 additions & 0 deletions docs/QUICK_WINS_ANALYSIS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,231 @@
# Quick Wins Implementation - Post-Mortem Analysis

**Date:** 2025-11-08
**Implementation Time:** ~45 minutes
**Result:** 9/39 tests passing (23%) - **NO IMPROVEMENT**

## What We Fixed

### Fix 1: OAM DMA Speed ✅ Implemented
**File:** `src/Dma/OamDma.php:133`
**Change:** Convert T-cycles to M-cycles before transferring
```php
$mCycles = intdiv($cycles, 4); // Convert T-cycles to M-cycles
```

### Fix 2: RETI IME Enable ✅ Implemented
**Files:**
- `src/Cpu/Cpu.php:454` - Added `setIMEImmediate()` method
- `src/Cpu/InstructionSet.php:3419` - Call `setIMEImmediate()` in RETI handler

**Change:** RETI now enables interrupts immediately instead of with 1-instruction delay

### Fix 3: DMA Start Delay ✅ Implemented
**File:** `src/Dma/OamDma.php:56,106,136-143`
**Change:** Added 1 M-cycle delay before first byte transfer
```php
private int $dmaDelay = 0;
// ... in startDmaTransfer:
$this->dmaDelay = 1;
```

## Why These Fixes Didn't Improve Test Results

### Root Cause: M-Cycle Granularity Required

The Mooneye tests that are failing **all require M-cycle accurate CPU execution**. Even though our fixes are technically correct, they don't address the fundamental issue: instructions execute atomically instead of incrementally over M-cycles.

### Example: RETI Timing Test

Even with RETI enabling IME immediately, the `reti_timing.gb` test fails because:

1. **What the test checks:** Whether interrupts fire at EXACT cycle boundaries after RETI
2. **What we fixed:** IME is enabled immediately (correct)
3. **What's still wrong:** The RETI instruction executes all 16 cycles atomically:
- M-cycle 1: Pop low byte from stack
- M-cycle 2: Pop high byte from stack
- M-cycle 3: Internal delay
- M-cycle 4: Jump to address

**The test expects** state changes to happen between M-cycles, but our CPU returns all 16 cycles at once and processes everything atomically.

### Example: OAM DMA Timing Test

Even with DMA speed fixed and start delay added, the `oam_dma_timing.gb` test fails because:

1. **What the test checks:** DMA blocks CPU access at EXACT cycle boundaries
2. **What we fixed:** DMA takes correct number of cycles (correct)
3. **What's still wrong:** DMA processes all cycles for a batch at once, not incrementally

The test might write to a register at cycle 100, start DMA at cycle 104, and check if the write succeeded at cycle 108. If DMA processes 4 cycles atomically, the timing of when the CPU is blocked doesn't align with the test's expectations.

### Example: Instruction Timing Tests

Tests like `call_timing.gb`, `jp_timing.gb`, etc. all fail because:

1. **What they check:** Interrupt dispatch between M-cycle boundaries during multi-cycle instructions
2. **What we have:** Atomic instruction execution
3. **What's needed:** Instructions that execute incrementally, allowing interrupts to fire mid-instruction

## What We Learned

### Our Fixes Are CORRECT But INSUFFICIENT

| Fix | Correctness | Impact | Reason |
|-----|-------------|---------|---------|
| DMA Speed | ✅ Correct | ❌ No impact | Tests need M-cycle granular CPU |
| RETI IME | ✅ Correct | ❌ No impact | Tests need M-cycle granular CPU |
| DMA Delay | ✅ Correct | ❌ No impact | Tests need M-cycle granular CPU |

### The Mooneye Tests Are EXTREMELY Precise

All failing tests show the "fail signature" (all registers = 0x42), which means:
- The tests ARE running correctly
- The tests ARE detecting failures properly
- The tests ARE failing due to **timing precision**, not logic bugs

### The Tests Check Sub-Instruction Timing

Mooneye tests verify behavior at **M-cycle boundaries**, not just instruction boundaries. Examples:

**`call_timing.gb`**: Checks if interrupts can fire between reading the address bytes and pushing PC to stack.

**`oam_dma_start.gb`**: Checks if CPU can still access memory during the 1 M-cycle delay before DMA transfer starts.

**`ei_sequence.gb`**: Checks the exact cycle when IME becomes enabled relative to instruction boundaries.

These checks are IMPOSSIBLE to satisfy with atomic instruction execution.

## The Real Problem: Architecture Limitation

### Current CPU Architecture
```
CPU.step() {
fetch opcode // 4 cycles
execute instruction // N cycles, ALL AT ONCE
return total_cycles // e.g., 24 for CALL
}
```

**Problem:** Everything between "fetch" and "return" happens atomically. Other components (interrupts, DMA, PPU) only get to run AFTER the full instruction completes.

### Required CPU Architecture
```
CPU.step() {
if (m_cycle == 0) fetch opcode // 4 cycles
if (m_cycle == 1) read byte 1 // 4 cycles, interrupt can fire here
if (m_cycle == 2) read byte 2 // 4 cycles, interrupt can fire here
if (m_cycle == 3) internal delay // 4 cycles, interrupt can fire here
if (m_cycle == 4) push PCH // 4 cycles, interrupt can fire here
if (m_cycle == 5) push PCL // 4 cycles, return
}
```

**Benefit:** Each M-cycle is a discrete event. Interrupts can fire between M-cycles. Other components can run between M-cycles.

## Correct Implementation Order

### Phase 0: Prerequisites (DONE ✅)
- Fix DMA speed (T-cycle → M-cycle) ✅
- Fix RETI IME enable ✅
- Fix DMA start delay ✅

These fixes are **necessary but not sufficient**. They don't improve test results YET, but they're required for accuracy.

### Phase 1: M-Cycle Accurate CPU (REQUIRED)
**Estimated effort:** 24-40 hours
**Impact:** +12-20 tests (to ~50-75% pass rate)

Without this, NO timing tests will pass.

### Phase 2: Timer Bit-Selection Model (REQUIRED)
**Estimated effort:** 4-6 hours
**Impact:** +4-8 tests (to ~60-85% pass rate)

Depends on Phase 1 being complete.

### Phase 3: TIMA Reload State Machine (REQUIRED)
**Estimated effort:** 2-3 hours
**Impact:** +2-4 tests (to ~65-90% pass rate)

Depends on Phase 1 and 2.

## Recommendation

### Commit These Changes

Even though they don't improve test results, they ARE correct fixes and should be committed because:

1. **Correctness:** They align with SameBoy's implementation
2. **Prerequisites:** Required for M-cycle accurate CPU
3. **No Regression:** Blargg tests still 100% passing
4. **Code Quality:** Better documented, more accurate

### Next Steps

**Option A: Continue with M-Cycle CPU (Ambitious)**
- Commit quick wins
- Start M-cycle CPU refactor
- Expected timeline: 3-5 days of focused work

**Option B: Document and Defer (Pragmatic)**
- Commit quick wins
- Update PLAN.md with accurate implementation requirements
- Defer timing accuracy improvements to future sprint

## Commit Message

```
fix(timing): correct DMA speed, RETI IME, and DMA start delay

What was implemented:
- OAM DMA now correctly converts T-cycles to M-cycles (was 4x too fast)
- RETI instruction enables IME immediately (not delayed like EI)
- OAM DMA has proper 1 M-cycle startup delay before first byte

Why this approach:
- Aligns with SameBoy reference implementation
- Necessary prerequisites for M-cycle accurate CPU
- Fixes are correct even though tests don't pass yet

Test results:
- Blargg: 12/12 passing (100%) ✅ No regression
- Mooneye: 9/39 passing (23%) - No change (expected)
- Mooneye tests require M-cycle accurate CPU to pass

Technical details:
- DMA tick() now uses intdiv($cycles, 4) to convert T-cycles
- Added Cpu::setIMEImmediate() for RETI vs EI distinction
- DMA startup delay tracked in $dmaDelay property

References:
- docs/ROM_CHECK_ANALYSIS.md
- SameBoy: https://github.com/LIJI32/SameBoy
- Pan Docs: Timing section
```

## Test Results Before/After

### Before
- Blargg: 12/12 (100%)
- Mooneye: 9/39 (23%)

### After
- Blargg: 12/12 (100%) ✅ No regression
- Mooneye: 9/39 (23%) - No change (expected)

## Conclusion

The "quick wins" are correctly implemented but have zero impact on test results because:

1. **All failing Mooneye tests require M-cycle granularity**
2. **Our CPU executes instructions atomically**
3. **No amount of peripheral fixes will help without CPU refactor**

The fixes should still be committed as they're technically correct and necessary prerequisites for future timing accuracy work. The path to 100% Mooneye pass rate goes through M-cycle accurate CPU execution - there's no shortcut.

---

**Status:** Ready to commit (with updated expectations)
**Next Major Task:** M-cycle accurate CPU architecture
**Realistic Timeline:** 3-5 days for 50-85% Mooneye pass rate
Loading
Loading