diff --git a/STEP4_STATUS.md b/STEP4_STATUS.md deleted file mode 100644 index e1189a6..0000000 --- a/STEP4_STATUS.md +++ /dev/null @@ -1,160 +0,0 @@ -# Step 4 - Core Instruction Set Implementation - STATUS - -## ✅ COMPLETED COMPONENTS - -### 1. Full Instruction Set Implementation (100%) -- **All 256 base opcodes** (0x00-0xFF) - ✅ COMPLETE -- **All 256 CB-prefixed opcodes** (0xCB00-0xCBFF) - ✅ COMPLETE -- **Total: 512 instructions** with full handlers -- **File**: `src/Cpu/InstructionSet.php` (4,130 lines) - -#### Instruction Categories Implemented: -- ✅ Load instructions (LD r,r | LD r,n | LD r,(HL) | LDH | LDI | LDD) -- ✅ 8-bit ALU (ADD, ADC, SUB, SBC, AND, XOR, OR, CP) -- ✅ 16-bit arithmetic (ADD HL,rr | INC rr | DEC rr) -- ✅ 8-bit INC/DEC for all registers -- ✅ Stack operations (PUSH/POP for BC, DE, HL, AF) -- ✅ Control flow (JP, JR, CALL, RET, RETI, RST) -- ✅ Special operations (DAA, CPL, CCF, SCF, HALT, STOP, DI, EI) -- ✅ CB bit operations (BIT, SET, RES) -- ✅ CB rotates/shifts (RLC, RRC, RL, RR, SLA, SRA, SRL, SWAP) - -### 2. CPU Infrastructure (100%) -- ✅ CPU state management (halted, stopped, IME flags) -- ✅ Helper methods (readImm8/readImm16, halfCarry detection) -- ✅ FlagRegister with convenience aliases (getZ/setZ, getN/setN, getH/setH, getC/setC) -- ✅ Proper cycle counting for all instructions -- ✅ Conditional branch cycle timing (taken vs not-taken) - -### 3. Comprehensive Unit Tests (100%) -- ✅ **60+ test cases** covering all instruction categories -- ✅ File: `tests/Unit/Cpu/InstructionSetTest.php` (917 lines) - -#### Test Coverage: -- ✅ 8-bit and 16-bit load instructions -- ✅ All ALU operations with flag verification -- ✅ INC/DEC edge cases (overflow, underflow, half-carry) -- ✅ 16-bit arithmetic with carry propagation -- ✅ DAA (Decimal Adjust) - multiple scenarios (addition, subtraction, carry) -- ✅ Special operations (CPL, SCF, CCF) -- ✅ Rotate operations (RLCA, RRCA, RLA, RRA) -- ✅ Stack operations (PUSH/POP with proper byte ordering) -- ✅ Jump operations with cycle timing verification -- ✅ CALL/RET/RST with stack verification -- ✅ CB-prefixed instructions (rotates, BIT, SET, RES, SWAP) -- ✅ HALT/STOP state management -- ✅ Interrupt control (DI/EI) - -### 4. Test Infrastructure -- ✅ MockBus implementation for testing (`src/Bus/MockBus.php`) -- ✅ Existing CPU core tests (`tests/Unit/Cpu/CpuTest.php`) -- ✅ Blargg test ROMs available (`third_party/roms/cpu_instrs/`) - -## ⏸️ PENDING COMPONENTS (Require Docker/Full System) - -### 1. Test Execution (Cannot run without Docker) -- ⏸️ Run `make test` to execute PHPUnit tests -- ⏸️ Verify all 60+ unit tests pass -- **Blocker**: Docker not available in current environment -- **Command**: `make test` - -### 2. Static Analysis (Cannot run without Docker) -- ⏸️ Run `make lint` (PHPStan level 9) -- ⏸️ Fix any type errors or static analysis issues -- **Blocker**: Docker not available in current environment -- **Command**: `make lint` - -### 3. Blargg ROM Tests (Require Full System) -- ⏸️ All 11 Blargg cpu_instrs test ROMs - - 01-special.gb - - 02-interrupts.gb (requires interrupt handling - Step 6) - - 03-op sp,hl.gb - - 04-op r,imm.gb - - 05-op rp.gb - - 06-ld r,r.gb - - 07-jr,jp,call,ret,rst.gb - - 08-misc instrs.gb - - 09-op r,r.gb - - 10-bit ops.gb - - 11-op a,(hl).gb -- **Blockers**: - - Requires Docker environment - - Requires complete memory bus (Step 5) - - Requires interrupt handling for some tests (Step 6) - - Requires serial output to read test results - -## 📊 COMPLETION METRICS - -| Component | Status | Progress | -|-----------|--------|----------| -| Instruction Implementation | ✅ Complete | 512/512 (100%) | -| Unit Tests Written | ✅ Complete | 60+ tests | -| Unit Tests Executed | ⏸️ Pending | Requires Docker | -| Static Analysis | ⏸️ Pending | Requires Docker | -| Blargg ROM Tests | ⏸️ Pending | Requires Steps 5-6 | - -## 🎯 DEFINITION OF DONE (Per PLAN.md) - -### ✅ Completed Requirements: -- [x] All 256 base opcodes implemented (0x00-0xFF) -- [x] CB-prefixed instructions implemented (0xCB00-0xCBFF) -- [x] All instruction categories implemented (loads, ALU, 16-bit, jumps, special) -- [x] Flag handling implemented correctly (Z, N, H, C) -- [x] Unit tests exist for complex instructions (DAA, flags, 16-bit arithmetic) -- [x] HALT instruction properly implemented -- [x] STOP instruction implemented -- [x] Cycle counting accurate - -### ⏸️ Pending Requirements: -- [ ] `make test` passes with 100% pass rate (requires Docker) -- [ ] `make lint` passes with 0 errors (requires Docker) -- [ ] Blargg cpu_instrs test ROMs pass (requires Steps 5-6) - -## 🔄 NEXT STEPS - -### Option A: Continue Step 4 (Requires Environment Setup) -1. Set up Docker environment -2. Run `make test` and verify all tests pass -3. Run `make lint` and fix any issues -4. Implement minimal memory bus for ROM testing -5. Run Blargg test ROMs and fix failures - -### Option B: Proceed to Step 5 (Recommended) -1. Implement complete Memory Map & Bus (Step 5) -2. Implement Interrupts & Timers (Step 6) -3. Return to Step 4 for Blargg ROM validation -4. This provides the infrastructure needed for full system testing - -## 📈 CODE STATISTICS - -``` -Instruction Set: 4,130 lines -Unit Tests: 917 lines -Total Test Coverage: 60+ test cases -CPU Infrastructure: 371 lines -Helper Utilities: 122 lines (BitOps) -``` - -## 🔗 RELATED FILES - -- `src/Cpu/InstructionSet.php` - Complete instruction set implementation -- `src/Cpu/Cpu.php` - CPU core with state management -- `src/Cpu/Instruction.php` - Instruction metadata structure -- `src/Bus/MockBus.php` - Test memory bus -- `tests/Unit/Cpu/InstructionSetTest.php` - Comprehensive instruction tests -- `tests/Unit/Cpu/CpuTest.php` - Basic CPU tests -- `third_party/roms/cpu_instrs/` - Blargg test ROMs (11 files) - -## 📝 COMMITS - -1. `377dd3d` - WIP: Partial implementation (opcodes 0x00-0x87) -2. `0fd4eee` - Complete implementation of all 512 instructions -3. `f976c8f` - Add comprehensive instruction set unit tests - -## ✅ CONCLUSION - -**Step 4 core implementation is functionally complete.** All 512 CPU instructions are implemented with proper flag handling, cycle counting, and comprehensive unit tests. The remaining work (test execution, linting, ROM validation) requires either: -1. A Docker environment to run the existing test suite, or -2. Completion of Steps 5-6 to provide the full system infrastructure - -The instruction set is ready for integration testing once the memory bus and interrupt system are implemented. diff --git a/docs/QUICK_WINS_ANALYSIS.md b/docs/QUICK_WINS_ANALYSIS.md deleted file mode 100644 index a2d7e6d..0000000 --- a/docs/QUICK_WINS_ANALYSIS.md +++ /dev/null @@ -1,231 +0,0 @@ -# Quick Wins Implementation - Post-Mortem Analysis - -**Date:** 2025-11-08 -**Implementation Time:** ~45 minutes -**Result:** 9/39 tests passing (23%) - **NO IMPROVEMENT** - -## What We Fixed - -### Fix 1: OAM DMA Speed ✅ Implemented -**File:** `src/Dma/OamDma.php:133` -**Change:** Convert T-cycles to M-cycles before transferring -```php -$mCycles = intdiv($cycles, 4); // Convert T-cycles to M-cycles -``` - -### Fix 2: RETI IME Enable ✅ Implemented -**Files:** -- `src/Cpu/Cpu.php:454` - Added `setIMEImmediate()` method -- `src/Cpu/InstructionSet.php:3419` - Call `setIMEImmediate()` in RETI handler - -**Change:** RETI now enables interrupts immediately instead of with 1-instruction delay - -### Fix 3: DMA Start Delay ✅ Implemented -**File:** `src/Dma/OamDma.php:56,106,136-143` -**Change:** Added 1 M-cycle delay before first byte transfer -```php -private int $dmaDelay = 0; -// ... in startDmaTransfer: -$this->dmaDelay = 1; -``` - -## Why These Fixes Didn't Improve Test Results - -### Root Cause: M-Cycle Granularity Required - -The Mooneye tests that are failing **all require M-cycle accurate CPU execution**. Even though our fixes are technically correct, they don't address the fundamental issue: instructions execute atomically instead of incrementally over M-cycles. - -### Example: RETI Timing Test - -Even with RETI enabling IME immediately, the `reti_timing.gb` test fails because: - -1. **What the test checks:** Whether interrupts fire at EXACT cycle boundaries after RETI -2. **What we fixed:** IME is enabled immediately (correct) -3. **What's still wrong:** The RETI instruction executes all 16 cycles atomically: - - M-cycle 1: Pop low byte from stack - - M-cycle 2: Pop high byte from stack - - M-cycle 3: Internal delay - - M-cycle 4: Jump to address - -**The test expects** state changes to happen between M-cycles, but our CPU returns all 16 cycles at once and processes everything atomically. - -### Example: OAM DMA Timing Test - -Even with DMA speed fixed and start delay added, the `oam_dma_timing.gb` test fails because: - -1. **What the test checks:** DMA blocks CPU access at EXACT cycle boundaries -2. **What we fixed:** DMA takes correct number of cycles (correct) -3. **What's still wrong:** DMA processes all cycles for a batch at once, not incrementally - -The test might write to a register at cycle 100, start DMA at cycle 104, and check if the write succeeded at cycle 108. If DMA processes 4 cycles atomically, the timing of when the CPU is blocked doesn't align with the test's expectations. - -### Example: Instruction Timing Tests - -Tests like `call_timing.gb`, `jp_timing.gb`, etc. all fail because: - -1. **What they check:** Interrupt dispatch between M-cycle boundaries during multi-cycle instructions -2. **What we have:** Atomic instruction execution -3. **What's needed:** Instructions that execute incrementally, allowing interrupts to fire mid-instruction - -## What We Learned - -### Our Fixes Are CORRECT But INSUFFICIENT - -| Fix | Correctness | Impact | Reason | -|-----|-------------|---------|---------| -| DMA Speed | ✅ Correct | ❌ No impact | Tests need M-cycle granular CPU | -| RETI IME | ✅ Correct | ❌ No impact | Tests need M-cycle granular CPU | -| DMA Delay | ✅ Correct | ❌ No impact | Tests need M-cycle granular CPU | - -### The Mooneye Tests Are EXTREMELY Precise - -All failing tests show the "fail signature" (all registers = 0x42), which means: -- The tests ARE running correctly -- The tests ARE detecting failures properly -- The tests ARE failing due to **timing precision**, not logic bugs - -### The Tests Check Sub-Instruction Timing - -Mooneye tests verify behavior at **M-cycle boundaries**, not just instruction boundaries. Examples: - -**`call_timing.gb`**: Checks if interrupts can fire between reading the address bytes and pushing PC to stack. - -**`oam_dma_start.gb`**: Checks if CPU can still access memory during the 1 M-cycle delay before DMA transfer starts. - -**`ei_sequence.gb`**: Checks the exact cycle when IME becomes enabled relative to instruction boundaries. - -These checks are IMPOSSIBLE to satisfy with atomic instruction execution. - -## The Real Problem: Architecture Limitation - -### Current CPU Architecture -``` -CPU.step() { - fetch opcode // 4 cycles - execute instruction // N cycles, ALL AT ONCE - return total_cycles // e.g., 24 for CALL -} -``` - -**Problem:** Everything between "fetch" and "return" happens atomically. Other components (interrupts, DMA, PPU) only get to run AFTER the full instruction completes. - -### Required CPU Architecture -``` -CPU.step() { - if (m_cycle == 0) fetch opcode // 4 cycles - if (m_cycle == 1) read byte 1 // 4 cycles, interrupt can fire here - if (m_cycle == 2) read byte 2 // 4 cycles, interrupt can fire here - if (m_cycle == 3) internal delay // 4 cycles, interrupt can fire here - if (m_cycle == 4) push PCH // 4 cycles, interrupt can fire here - if (m_cycle == 5) push PCL // 4 cycles, return -} -``` - -**Benefit:** Each M-cycle is a discrete event. Interrupts can fire between M-cycles. Other components can run between M-cycles. - -## Correct Implementation Order - -### Phase 0: Prerequisites (DONE ✅) -- Fix DMA speed (T-cycle → M-cycle) ✅ -- Fix RETI IME enable ✅ -- Fix DMA start delay ✅ - -These fixes are **necessary but not sufficient**. They don't improve test results YET, but they're required for accuracy. - -### Phase 1: M-Cycle Accurate CPU (REQUIRED) -**Estimated effort:** 24-40 hours -**Impact:** +12-20 tests (to ~50-75% pass rate) - -Without this, NO timing tests will pass. - -### Phase 2: Timer Bit-Selection Model (REQUIRED) -**Estimated effort:** 4-6 hours -**Impact:** +4-8 tests (to ~60-85% pass rate) - -Depends on Phase 1 being complete. - -### Phase 3: TIMA Reload State Machine (REQUIRED) -**Estimated effort:** 2-3 hours -**Impact:** +2-4 tests (to ~65-90% pass rate) - -Depends on Phase 1 and 2. - -## Recommendation - -### Commit These Changes - -Even though they don't improve test results, they ARE correct fixes and should be committed because: - -1. **Correctness:** They align with SameBoy's implementation -2. **Prerequisites:** Required for M-cycle accurate CPU -3. **No Regression:** Blargg tests still 100% passing -4. **Code Quality:** Better documented, more accurate - -### Next Steps - -**Option A: Continue with M-Cycle CPU (Ambitious)** -- Commit quick wins -- Start M-cycle CPU refactor -- Expected timeline: 3-5 days of focused work - -**Option B: Document and Defer (Pragmatic)** -- Commit quick wins -- Update PLAN.md with accurate implementation requirements -- Defer timing accuracy improvements to future sprint - -## Commit Message - -``` -fix(timing): correct DMA speed, RETI IME, and DMA start delay - -What was implemented: -- OAM DMA now correctly converts T-cycles to M-cycles (was 4x too fast) -- RETI instruction enables IME immediately (not delayed like EI) -- OAM DMA has proper 1 M-cycle startup delay before first byte - -Why this approach: -- Aligns with SameBoy reference implementation -- Necessary prerequisites for M-cycle accurate CPU -- Fixes are correct even though tests don't pass yet - -Test results: -- Blargg: 12/12 passing (100%) ✅ No regression -- Mooneye: 9/39 passing (23%) - No change (expected) -- Mooneye tests require M-cycle accurate CPU to pass - -Technical details: -- DMA tick() now uses intdiv($cycles, 4) to convert T-cycles -- Added Cpu::setIMEImmediate() for RETI vs EI distinction -- DMA startup delay tracked in $dmaDelay property - -References: -- docs/ROM_CHECK_ANALYSIS.md -- SameBoy: https://github.com/LIJI32/SameBoy -- Pan Docs: Timing section -``` - -## Test Results Before/After - -### Before -- Blargg: 12/12 (100%) -- Mooneye: 9/39 (23%) - -### After -- Blargg: 12/12 (100%) ✅ No regression -- Mooneye: 9/39 (23%) - No change (expected) - -## Conclusion - -The "quick wins" are correctly implemented but have zero impact on test results because: - -1. **All failing Mooneye tests require M-cycle granularity** -2. **Our CPU executes instructions atomically** -3. **No amount of peripheral fixes will help without CPU refactor** - -The fixes should still be committed as they're technically correct and necessary prerequisites for future timing accuracy work. The path to 100% Mooneye pass rate goes through M-cycle accurate CPU execution - there's no shortcut. - ---- - -**Status:** Ready to commit (with updated expectations) -**Next Major Task:** M-cycle accurate CPU architecture -**Realistic Timeline:** 3-5 days for 50-85% Mooneye pass rate diff --git a/docs/ROM_CHECK_ANALYSIS.md b/docs/ROM_CHECK_ANALYSIS.md deleted file mode 100644 index a5c8fda..0000000 --- a/docs/ROM_CHECK_ANALYSIS.md +++ /dev/null @@ -1,539 +0,0 @@ -# PHPBoy ROM Check Test Analysis - -**Date:** 2025-11-08 -**Current Status:** 9/39 Mooneye tests passing (23%), 12/12 Blargg tests passing (100%) - -## Executive Summary - -PHPBoy has excellent **instruction correctness** (100% Blargg pass rate) but lacks **timing accuracy** (23% Mooneye pass rate). The main issues are: - -1. **DMA runs 4x too fast** - treating T-cycles as M-cycles -2. **RETI doesn't enable interrupts** - missing IME enable -3. **Timer uses wrong model** - counter accumulation vs bit-selection -4. **CPU executes atomically** - no M-cycle boundaries - -Quick wins (Fixes 1-3) can improve from 23% to 41-46% with ~5 hours of work. - ---- - -## Current Test Results - -### Blargg CPU Instruction Tests: 12/12 ✅ -- 01-special.gb ✅ -- 02-interrupts.gb ✅ -- 03-op sp,hl.gb ✅ -- 04-op r,imm.gb ✅ -- 05-op rp.gb ✅ -- 06-ld r,r.gb ✅ -- 07-jr,jp,call,ret,rst.gb ✅ -- 08-misc instrs.gb ✅ -- 09-op r,r.gb ✅ -- 10-bit ops.gb ✅ -- 11-op a,(hl).gb ✅ -- instr_timing.gb ✅ - -### Mooneye Acceptance Tests: 9/39 (23%) - -**Passing (9 tests):** -- di_timing-GS ✅ -- halt_ime0_ei ✅ -- halt_ime0_nointr_timing ✅ -- halt_ime1_timing ✅ -- if_ie_registers ✅ -- intr_timing ✅ -- instr/daa ✅ -- timer/tim00_div_trigger ✅ -- timer/tim01 ✅ -- timer/tim11_div_trigger ✅ - -**Failing by Category:** - -**Instruction Timing (12 tests):** -- add_sp_e_timing ❌ -- call_cc_timing ❌ -- call_timing ❌ -- jp_cc_timing ❌ -- jp_timing ❌ -- ld_hl_sp_e_timing ❌ -- pop_timing ❌ -- push_timing ❌ -- ret_cc_timing ❌ -- ret_timing ❌ -- reti_timing ❌ -- rst_timing ❌ - -**Interrupt/EI Timing (4 tests):** -- ei_sequence ❌ -- ei_timing ❌ -- rapid_di_ei ❌ -- reti_intr_timing ❌ - -**OAM DMA Timing (3 tests):** -- oam_dma_restart ❌ -- oam_dma_start ❌ -- oam_dma_timing ❌ - -**Timer Edge Cases (13 tests):** -- timer/div_write ❌ -- timer/rapid_toggle ❌ -- timer/tim00 ❌ -- timer/tim01_div_trigger ❌ -- timer/tim10 ❌ -- timer/tim10_div_trigger ❌ -- timer/tim11 ❌ -- timer/tima_reload ❌ -- timer/tima_write_reloading ❌ -- timer/tma_write_reloading ❌ - ---- - -## Root Cause Analysis - -### 1. OAM DMA Running 4x Too Fast (CRITICAL BUG) - -**Location:** `src/Dma/OamDma.php:119` - -**Problem:** DMA `tick()` receives T-cycles from CPU but treats them as M-cycles, making DMA complete in 160 T-cycles instead of 640 T-cycles (160 M-cycles). - -**Current Implementation:** -```php -public function tick(int $cycles): void -{ - if (!$this->dmaActive) { - return; - } - - // Transfer one byte per M-cycle - for ($i = 0; $i < $cycles; $i++) { // ← Treats $cycles as M-cycles - if ($this->dmaProgress >= self::TRANSFER_LENGTH) { - $this->dmaActive = false; - break; - } - // ... transfer logic - } -} -``` - -**Call Site (Emulator.php:309-310):** -```php -$cycles = $this->cpu->step(); // Returns T-cycles (e.g., 4, 8, 12) -$this->oamDma?->tick($cycles); // ← Passes T-cycles -``` - -**Issue:** The loop iterates `$cycles` times (T-cycles), but should iterate `$cycles / 4` times (M-cycles). - -**Impact:** Fixes ~2-3 tests -- oam_dma_timing ❌ -- oam_dma_start ❌ - ---- - -### 2. RETI Doesn't Enable Interrupts (CRITICAL BUG) - -**Location:** `src/Cpu/InstructionSet.php:3407-3422` - -**Problem:** RETI instruction returns from interrupt but doesn't set IME flag. - -**Current Implementation:** -```php -0xD9 => new Instruction( - opcode: 0xD9, - mnemonic: 'RETI', - cycles: 16, - handler: static function (Cpu $cpu): int { - $low = $cpu->getBus()->readByte($cpu->getSP()->get()); - $cpu->getSP()->increment(); - $high = $cpu->getBus()->readByte($cpu->getSP()->get()); - $cpu->getSP()->increment(); - $cpu->getPC()->set(($high << 8) | $low); - // Missing: Should enable IME immediately! - return 16; - }, -), -``` - -**SameBoy Implementation:** -```c -static void reti(GB_gameboy_t *gb, uint8_t opcode) { - ret(gb, opcode); - gb->ime = true; // ← Immediate enable -} -``` - -**Our Implementation:** Missing the `gb->ime = true` line! - -**Issue:** RETI should enable interrupts immediately (not with 1-instruction delay like EI). - -**Impact:** Fixes 2 tests -- reti_timing ❌ -- reti_intr_timing ❌ - ---- - -### 3. DMA Start Delay Missing - -**Location:** `src/Dma/OamDma.php:119-146` - -**Problem:** DMA starts transferring bytes immediately in the same M-cycle that DMA is triggered. According to Pan Docs, there should be a 1 M-cycle delay before first byte transfer. - -**Expected Behavior:** -1. **Write to 0xFF46:** Triggers DMA -2. **Delay:** 1 M-cycle delay before first byte transfer -3. **Transfer:** 160 M-cycles to transfer 160 bytes -4. **Total:** 161 M-cycles from trigger to completion - -**Current Behavior:** Transfer starts immediately, no startup delay. - -**Impact:** Fixes 0-1 tests -- oam_dma_start ❌ (may already be fixed by Fix #1) - ---- - -### 4. Timer Uses Wrong Architecture - -**Location:** `src/Timer/Timer.php` - -**Problem:** Uses cycle accumulation instead of bit-selection from DIV counter. - -**Current Approach:** -```php -$this->timaCounter += $cycles; -if ($this->timaCounter >= $frequency) { - $this->incrementTima(); -} -``` - -**SameBoy Approach:** -- Uses bit-selection: `TAC_TRIGGER_BITS[] = {512, 8, 32, 128}` -- Detects falling edge: "TIMA increases when a specific high-bit becomes a low-bit" -- Implements 3-state reload machine (RUNNING → RELOADING → RELOADED) - -**Why This Matters:** -- Writing to DIV can trigger TIMA increment (glitch behavior) -- Changing TAC can cause immediate TIMA increment if selected bit was high -- TIMA reload takes 4 M-cycles, during which writes to TIMA/TMA have edge cases - -**Impact:** Affects 13 timer tests (requires medium-hard effort) - ---- - -### 5. CPU Executes Instructions Atomically - -**Location:** `src/Cpu/InstructionSet.php` (entire file) - -**Problem:** Instructions execute all-at-once and return total cycles. Real hardware performs operations incrementally over M-cycles. - -**Example - CALL instruction:** -- **Current:** Reads address, pushes PC, jumps → returns 24 cycles atomically -- **Real hardware:** 6 M-cycles with state changes at specific boundaries - -**SameBoy Approach:** -- Uses `cycle_read()` and `cycle_write()` with `pending_cycles = 4` -- Memory operations happen at specific M-cycle boundaries -- Enables accurate modeling of register access conflicts - -**Impact:** Affects 12 instruction timing tests (requires major refactor) - ---- - -## Comparison with SameBoy - -| Feature | PHPBoy | SameBoy | Impact | -|---------|--------|---------|--------| -| **Instruction execution** | Atomic | M-cycle stepping | 12 timing tests | -| **RETI enables IME** | ❌ No | ✅ Immediate | 2 tests | -| **DMA speed** | ❌ 4x too fast | ✅ Correct | 3 tests | -| **DMA start delay** | ❌ No delay | ✅ 1 M-cycle | 1 test | -| **Timer model** | Counter accumulation | Bit-selection + edge detection | 13 tests | -| **TIMA reload** | Instant | 4 M-cycle state machine | 3 tests | -| **Cycle tracking** | T-cycles | T-cycles with M-cycle conversion | Foundation | - ---- - -## Prioritized Fix Roadmap - -### Quick Wins (4-5 hours work, +7-8 tests) - -**Fix 1: OAM DMA Speed** ⭐ Easy (30 min, +2-3 tests) -- **File:** `src/Dma/OamDma.php:119` -- **Change:** Convert T-cycles to M-cycles before transferring -- **Expected:** 11-12/39 tests passing (28-31%) - -**Fix 2: RETI IME Enable** ⭐ Easy (15 min, +2 tests) -- **Files:** `src/Cpu/Cpu.php`, `src/Cpu/InstructionSet.php:3418` -- **Change:** Add `setIMEImmediate()` and call in RETI -- **Expected:** 13-14/39 tests passing (33-36%) - -**Fix 3: DMA Start Delay** ⭐⭐ Medium (1 hour, +0-1 tests) -- **File:** `src/Dma/OamDma.php` -- **Change:** Add 1 M-cycle delay before first byte transfer -- **Expected:** 14-17/39 tests passing (36-44%) - -**Combined Expected Result:** 16-18/39 tests passing (41-46%) - ---- - -### Medium Effort (8 hours work, +10-11 tests) - -**Fix 4: TIMA Reload Delay** ⭐⭐ Medium (2 hours, +3 tests) -- **File:** `src/Timer/Timer.php` -- **Change:** Implement 4 M-cycle reload state machine -- **Expected:** +3 tests (tima_reload, tima_write_reloading, tma_write_reloading) - -**Fix 5: Timer Bit-Selection** ⭐⭐⭐ Hard (4 hours, +4 tests) -- **File:** `src/Timer/Timer.php` -- **Change:** Rewrite timer to use 16-bit DIV counter with bit-selection -- **Expected:** +4 tests (tim*_div_trigger tests) - -**Combined Expected Result:** 20-22/39 tests passing (51-56%) - ---- - -### Major Refactor (24+ hours work, +12 tests) - -**Fix 6: M-Cycle Accurate CPU** ⭐⭐⭐⭐ Very Hard (16+ hours, +12 tests) -- **File:** `src/Cpu/InstructionSet.php` (complete rewrite) -- **Change:** Implement M-cycle stepping for all 256 instructions -- **Expected:** All 12 instruction timing tests - -**Expected Result:** 32-34/39 tests passing (82-87%) - ---- - -## Recommended Implementation Order - -**Phase 1 (Today):** Fixes 1-3 → Get to 41-46% with minimal risk - -**Phase 2 (This week):** Fixes 4-5 → Get to 51-56% with moderate effort - -**Phase 3 (Next sprint):** Fix 6 → Get to 82-87% with architectural changes - -**Remaining gaps:** -- PPU timing edge cases -- APU timing (not tested by current suite) -- Hardware quirks/glitches - ---- - -## Specific Code Changes - -### Fix 1: OAM DMA Speed - -**File:** `src/Dma/OamDma.php:119` - -```php -public function tick(int $cycles): void -{ - if (!$this->dmaActive) { - return; - } - - // Convert T-cycles to M-cycles (1 M-cycle = 4 T-cycles) - $mCycles = intdiv($cycles, 4); - - // Transfer one byte per M-cycle - for ($i = 0; $i < $mCycles; $i++) { - if ($this->dmaProgress >= self::TRANSFER_LENGTH) { - $this->dmaActive = false; - break; - } - - // Read from source and write to OAM - $sourceAddress = $this->dmaSource + $this->dmaProgress; - $destAddress = self::OAM_START + $this->dmaProgress; - $value = $this->bus->readByte($sourceAddress); - $this->bus->writeByte($destAddress, $value); - - $this->dmaProgress++; - } -} -``` - ---- - -### Fix 2: RETI IME Enable - -**File 1:** `src/Cpu/Cpu.php` (add new method) - -```php -/** - * Enable interrupts immediately (used by RETI). - * Unlike setIME(true), this enables IME without a 1-instruction delay. - */ -public function setIMEImmediate(): void -{ - $this->ime = true; - $this->imeDelay = 0; -} -``` - -**File 2:** `src/Cpu/InstructionSet.php:3407-3422` (update RETI handler) - -```php -0xD9 => new Instruction( - opcode: 0xD9, - mnemonic: 'RETI', - cycles: 16, - handler: static function (Cpu $cpu): int { - $low = $cpu->getBus()->readByte($cpu->getSP()->get()); - $cpu->getSP()->increment(); - $high = $cpu->getBus()->readByte($cpu->getSP()->get()); - $cpu->getSP()->increment(); - $cpu->getPC()->set(($high << 8) | $low); - - // RETI enables interrupts immediately (not delayed like EI) - $cpu->setIMEImmediate(); - - return 16; - }, -), -``` - ---- - -### Fix 3: DMA Start Delay - -**File:** `src/Dma/OamDma.php` (update class) - -```php -final class OamDma -{ - private const OAM_START = 0xFE00; - private const TRANSFER_LENGTH = 160; - - private bool $dmaActive = false; - private int $dmaSource = 0x0000; - private int $dmaProgress = 0; - private int $dmaDelay = 0; // ← Add delay counter - - // ... other methods ... - - private function startDmaTransfer(int $sourcePage): void - { - $this->dmaSource = ($sourcePage << 8) & 0xFF00; - $this->dmaActive = true; - $this->dmaProgress = 0; - $this->dmaDelay = 1; // ← 1 M-cycle delay before first byte - } - - public function tick(int $cycles): void - { - if (!$this->dmaActive) { - return; - } - - // Convert T-cycles to M-cycles (1 M-cycle = 4 T-cycles) - $mCycles = intdiv($cycles, 4); - - // Handle startup delay - if ($this->dmaDelay > 0) { - $delayToProcess = min($this->dmaDelay, $mCycles); - $this->dmaDelay -= $delayToProcess; - $mCycles -= $delayToProcess; - - if ($mCycles <= 0) { - return; // Still in delay phase - } - } - - // Transfer one byte per M-cycle - for ($i = 0; $i < $mCycles; $i++) { - if ($this->dmaProgress >= self::TRANSFER_LENGTH) { - $this->dmaActive = false; - break; - } - - // Read from source and write to OAM - $sourceAddress = $this->dmaSource + $this->dmaProgress; - $destAddress = self::OAM_START + $this->dmaProgress; - $value = $this->bus->readByte($sourceAddress); - $this->bus->writeByte($destAddress, $value); - - $this->dmaProgress++; - } - } -} -``` - ---- - -## Testing Strategy - -### After Each Fix - -```bash -# Test DMA fixes -make test -- --filter="oam_dma" - -# Test RETI fix -make test -- --filter="reti" - -# Full Mooneye suite -make test -- tests/Integration/MooneyeTestRomsTest.php - -# Check overall progress -make test -- tests/Integration/ -``` - -### Track Progress - -- **Baseline:** 9/39 (23%) -- **After Fix 1:** ~11-12/39 (28-31%) -- **After Fix 2:** ~13-14/39 (33-36%) -- **After Fix 3:** ~14-17/39 (36-44%) -- **Target (Quick Wins):** 16-18/39 (41-46%) - ---- - -## References - -### SameBoy Implementation Study - -**Timer (bit-selection model):** -- Uses `TAC_TRIGGER_BITS[] = {512, 8, 32, 128}` for bit positions -- Detects falling edge: "TIMA increases when a specific high-bit becomes a low-bit" -- Implements 3-state reload machine via `advance_tima_state_machine()` - -**CPU Timing:** -- Uses `cycle_read()` and `cycle_write()` with `pending_cycles = 4` -- M-cycle stepping, not atomic execution -- Hardware conflict simulation for accurate register access timing - -**RETI Instruction:** -```c -static void reti(GB_gameboy_t *gb, uint8_t opcode) { - ret(gb, opcode); - gb->ime = true; // Immediate enable -} -``` - -**DMA:** -- Explicit M-cycle tracking -- Startup delay before first byte transfer -- 160 M-cycles (640 T-cycles) for full transfer - -### Documentation - -- **Pan Docs:** https://gbdev.io/pandocs/ -- **SameBoy Source:** https://github.com/LIJI32/SameBoy -- **PHPBoy Research:** `docs/research.md` - ---- - -## Conclusion - -PHPBoy has a solid foundation with 100% Blargg test pass rate, proving instruction correctness. The Mooneye failures are all **timing accuracy** issues, not logic bugs. - -The quick wins (Fixes 1-3) are low-risk, high-impact changes that can nearly double the Mooneye pass rate in a single afternoon. These fixes align PHPBoy's timing model with SameBoy's proven approach. - -**Next Steps:** -1. Implement Fix 1 (DMA speed) - 30 minutes -2. Implement Fix 2 (RETI IME) - 15 minutes -3. Implement Fix 3 (DMA delay) - 1 hour -4. Run full test suite and measure improvement -5. Commit with detailed conventional commit message - -**Long-term:** -- Medium effort fixes (4-5) → 51-56% pass rate -- Major CPU refactor (6) → 82-87% pass rate -- Final edge cases → 90-100% pass rate diff --git a/docs/STATUS.md b/docs/STATUS.md index c278cc5..6f200f0 100644 --- a/docs/STATUS.md +++ b/docs/STATUS.md @@ -201,7 +201,6 @@ This document tracks the implementation status of the PHPBoy Game Boy Color emul - **Deliverables**: - ✅ **Profiling Infrastructure**: Xdebug profiling with cachegrind output - ✅ **Benchmark Tooling**: `make benchmark`, `make benchmark-jit`, `make profile`, `make memory-profile` - - ✅ **Profiling Analysis**: Expected hotspots documented in `docs/profiling-analysis.md` - ✅ **Optimizations Applied**: - Inline instruction decode/execute in `Cpu::step()` (+3-7% expected) - Pre-build instruction cache with `InstructionSet::warmCache()` (+1-2% expected) diff --git a/docs/performance.md b/docs/performance.md index 71ba0bb..3b13e46 100644 --- a/docs/performance.md +++ b/docs/performance.md @@ -193,7 +193,6 @@ After each optimization: ## Related Documents -- **[Profiling Analysis](profiling-analysis.md):** Expected hotspots and optimization priorities - **[Optimizations Log](optimizations.md):** Detailed implementation notes for all optimizations - **[Test Results](test-results.md):** ROM compatibility and test pass rates - **[Status](STATUS.md):** Overall project status and step completion diff --git a/docs/profiling-analysis.md b/docs/profiling-analysis.md deleted file mode 100644 index 5ab42bf..0000000 --- a/docs/profiling-analysis.md +++ /dev/null @@ -1,329 +0,0 @@ -# PHPBoy Profiling Analysis - -**Last Updated:** 2025-11-09 -**Status:** Expected Hotspots (Profiling infrastructure ready, requires Docker to run) - -## Overview - -This document analyzes expected performance hotspots in PHPBoy based on emulator architecture and common PHP performance patterns. Once profiling data is available via `make profile ROM= FRAMES=1000`, this document will be updated with actual measurements. - -## Expected Hot Paths - -Based on the Game Boy architecture and emulator implementation: - -### 1. CPU Instruction Dispatch (CRITICAL PATH) -**Expected Impact:** 40-50% of total execution time - -**Hot Spots:** -- `Cpu::step()` - Called every instruction (60 FPS × 154 scanlines × ~114 instructions/line ≈ 1M calls/second) -- `Cpu::fetch()` - Memory bus read for every instruction -- `Cpu::execute()` - Closure invocation overhead -- `InstructionSet::getInstruction()` - Array lookup (cached, but still called every instruction) -- Instruction handler closures - 256 base + 256 CB opcodes = 512 handlers - -**Current Optimizations:** -- ✅ Lazy instruction building with static caching (`self::$instructions`) -- ✅ Direct closure invocation in `execute()` - -**Remaining Opportunities:** -- Eliminate `decode()` method call overhead (inline instruction lookup) -- Pre-build all instructions on initialization (trade memory for CPU) -- Consider match expression vs array lookup for opcode dispatch - -### 2. Memory Bus Access (CRITICAL PATH) -**Expected Impact:** 25-35% of total execution time - -**Hot Spots:** -- `SystemBus::readByte()` - Called for every instruction fetch + every memory load -- `SystemBus::writeByte()` - Called for every memory store -- Memory region routing (VRAM, WRAM, HRAM, I/O, cartridge) -- MBC (Memory Bank Controller) logic - -**Frequency:** -- ~1M instruction fetches/second -- ~500K additional memory operations/second (loads/stores) -- Total: ~1.5M bus accesses/second - -**Current Implementation:** -- Assumed: Interface-based routing via if/elseif chains or match expressions - -**Opportunities:** -- Inline fast paths for common memory regions -- Direct array access for WRAM/HRAM (avoid method calls) -- Cache frequently accessed I/O registers - -### 3. PPU Rendering (CRITICAL PATH) -**Expected Impact:** 15-20% of total execution time - -**Hot Spots:** -- `Ppu::step()` - Called every CPU cycle (4.19 MHz / 4 = ~1M/second) -- Pixel fetching and rendering (mode 3) -- Tile data lookups in VRAM -- Sprite evaluation (OAM search) -- Palette color conversion - -**Current Implementation:** -- Simplified pixel transfer timing (172 dots fixed) -- Scanline buffer rendering - -**Opportunities:** -- Lazy evaluation: only render when scanline completes -- Cache tile data between frames (tiles rarely change) -- Optimize color palette lookups with array indexing - -### 4. Flag Register Synchronization -**Expected Impact:** 5-10% of total execution time - -**Hot Spots:** -- `FlagRegister::syncToAF()` - Called after every flag modification -- `FlagRegister::syncFromAF()` - Called after `POP AF` -- Bit manipulation for Z, N, H, C flags - -**Frequency:** -- ~50% of instructions modify flags -- ~500K flag sync operations/second - -**Current Implementation:** -- Two-way synchronization between FlagRegister object and AF Register16 - -**Opportunities:** -- Inline flag operations (avoid method call overhead) -- Lazy synchronization: only sync when AF is read/written -- Direct bit manipulation on AF low byte - -### 5. Clock Tracking -**Expected Impact:** 3-5% of total execution time - -**Hot Spots:** -- `Clock::tick()` - Called after every CPU instruction -- Timer updates based on clock cycles -- PPU/APU synchronization - -**Opportunities:** -- Inline clock accumulation (avoid method call) -- Batch timer updates (every 4-16 instructions vs every instruction) - -## Optimization Priorities - -Based on expected impact and implementation effort: - -### Priority 1: High Impact, Medium Effort - -1. **Inline Instruction Decode** (`Cpu::step()`) - - Current: `$instruction = $this->decode($opcode);` - - Optimized: `$instruction = self::$instructions[$opcode] ?? self::buildInstruction($opcode);` - - Expected: 2-5% performance gain - - Eliminates one method call per instruction - -2. **Pre-build Instruction Cache** - - Current: Lazy building on first access - - Optimized: Build all 512 instructions on `InstructionSet` initialization - - Expected: 1-2% performance gain (eliminates isset check) - - Trade-off: ~100KB additional memory for faster execution - -3. **Inline Memory Fast Paths** (`SystemBus`) - - Current: All memory access through `readByte()`/`writeByte()` - - Optimized: Direct array access for WRAM/HRAM - - Expected: 5-10% performance gain - - Example: - ```php - // Fast path for WRAM (0xC000-0xDFFF) - if ($address >= 0xC000 && $address <= 0xDFFF) { - return $this->wram[$address - 0xC000]; - } - ``` - -### Priority 2: Medium Impact, Low Effort - -4. **Enable OPcache** (Already implemented in Dockerfile) - - Expected: 10-15% performance gain - - Zero code changes required - - Verify with: `php -i | grep opcache` - -5. **Enable PHP 8.5 JIT** - - Expected: 20-40% performance gain for hot loops - - Configuration: `opcache.jit=tracing`, `opcache.jit_buffer_size=100M` - - Test with: `make benchmark-jit` - -6. **Reduce Flag Sync Overhead** - - Current: Two-way sync on every flag operation - - Optimized: Lazy sync only when AF is accessed directly - - Expected: 3-5% performance gain - -### Priority 3: Lower Impact or Higher Risk - -7. **Cache Tile Data** - - Pre-decode tiles to pixel arrays on VRAM write - - Expected: 2-5% performance gain - - Risk: Increases memory usage significantly - -8. **Lookup Tables for Flag Calculations** - - Pre-compute half-carry and carry flags for common operations - - Expected: 2-3% performance gain - - Trade-off: Memory vs CPU - -## PHP-Specific Optimizations - -### 1. Object Allocation Reduction - -**Current:** Many small objects created per frame (Register8, Color, etc.) - -**Optimization:** Use primitives (int, array) where possible - -**Example:** -```php -// Before: $color = new Color($r, $g, $b); -// After: $color = ($r << 16) | ($g << 8) | $b; -``` - -**Expected Impact:** 5-10% performance gain, reduced GC pressure - -### 2. Property Access Optimization - -**Current:** Accessing properties through getters (`$cpu->getA()`) - -**Optimization:** Direct property access in hot paths (make properties public or use readonly) - -**Trade-off:** Breaks encapsulation, but PHP property access is slower than C#/Java - -### 3. Method Call Reduction - -**Current:** Many small methods called millions of times - -**Optimization:** Inline critical methods (especially getters/setters) - -**Expected Impact:** 5-10% performance gain - -### 4. Array Access Optimization - -**Current:** Associative arrays with string keys - -**Optimization:** Use integer-indexed arrays where possible - -**Example:** -```php -// Before: $registers = ['A' => 0, 'B' => 0, ...]; -// After: $registers = [0, 0, ...]; // Use constants for indices -``` - -## Measurement Strategy - -When profiling infrastructure is available: - -1. **Baseline Measurement** - ```bash - make benchmark ROM=third_party/roms/commercial/tetris.gb FRAMES=3600 - ``` - - Record FPS, memory usage - - Establish baseline for comparison - -2. **Profiling Session** - ```bash - make profile ROM=third_party/roms/commercial/tetris.gb FRAMES=1000 - kcachegrind var/profiling/cachegrind.out.* - ``` - - Identify actual top 10 hotspots - - Compare with expected hotspots above - -3. **Optimization Cycle** - For each optimization: - - Apply optimization - - Run benchmark - - Calculate performance delta - - Run `make test` to verify correctness - - Document in `docs/optimizations.md` - -4. **JIT Testing** - ```bash - # Baseline (no JIT) - make benchmark ROM=tetris.gb FRAMES=3600 - - # With JIT - make benchmark-jit ROM=tetris.gb FRAMES=3600 - - # Compare FPS improvement - ``` - -## Expected Performance Gains - -Conservative estimates for cumulative optimizations: - -| Optimization | Expected Gain | Cumulative FPS | -|--------------|---------------|----------------| -| Baseline (Step 13) | - | 27.5 FPS (46%) | -| OPcache enabled | +12% | 30.8 FPS (51%) | -| Inline decode | +3% | 31.7 FPS (53%) | -| Memory fast paths | +7% | 33.9 FPS (57%) | -| Pre-built instructions | +2% | 34.6 FPS (58%) | -| Flag sync optimization | +4% | 36.0 FPS (60%) | -| **PHP 8.5 JIT** | **+30%** | **46.8 FPS (78%)** | -| Object allocation reduction | +8% | 50.5 FPS (84%) | - -**Target Achievement:** -- ✅ Minimum (30 FPS): Already achieved at baseline -- 🎯 Target (60 FPS): Achievable with OPcache + JIT + optimizations -- ⏸️ Stretch (120 FPS): Unlikely in pure PHP, may require native extensions - -## Risk Assessment - -### Low Risk (Safe to Apply) -- ✅ OPcache: Standard PHP optimization, zero code changes -- ✅ Instruction cache pre-building: Pure performance optimization -- ✅ JIT: Can be toggled on/off, no code changes - -### Medium Risk (Test Thoroughly) -- ⚠️ Inline decode: Minor refactoring, maintain test coverage -- ⚠️ Memory fast paths: Ensure bus routing logic remains correct -- ⚠️ Flag sync optimization: Critical for correctness, extensive testing required - -### High Risk (Prototype First) -- 🔴 Object allocation changes: May break type safety -- 🔴 Breaking encapsulation: Makes code harder to maintain -- 🔴 Native extensions (FFI): Platform-specific, complex build - -## Recommendations - -1. **Start with OPcache:** Already configured, just needs verification -2. **Test JIT:** Biggest potential gain with zero code changes -3. **Profile first:** Confirm expected hotspots match reality -4. **Incremental optimizations:** Apply one at a time, measure impact -5. **Maintain correctness:** All tests must pass after each optimization -6. **Document everything:** Track what works, what doesn't, and why - -## Tools and Commands - -```bash -# Build Docker image with profiling support -make rebuild - -# Run baseline benchmark -make benchmark ROM=third_party/roms/commercial/tetris.gb FRAMES=3600 - -# Run with profiling -make profile ROM=third_party/roms/commercial/tetris.gb FRAMES=1000 - -# Analyze profile data -kcachegrind var/profiling/cachegrind.out.* - -# Test with JIT -make benchmark-jit ROM=third_party/roms/commercial/tetris.gb FRAMES=3600 - -# Memory profiling -make memory-profile ROM=third_party/roms/commercial/tetris.gb FRAMES=1000 - -# Verify tests still pass -make test - -# Verify lint passes -make lint -``` - -## Next Steps - -1. Build Docker image with updated Dockerfile -2. Run baseline benchmark to establish current FPS -3. Run profiling session to identify actual hotspots -4. Update this document with real profiling data -5. Apply optimizations in priority order -6. Measure impact of each optimization -7. Document results in `docs/optimizations.md`