diff --git a/STEP4_STATUS.md b/STEP4_STATUS.md
deleted file mode 100644
index e1189a6..0000000
--- a/STEP4_STATUS.md
+++ /dev/null
@@ -1,160 +0,0 @@
-# Step 4 - Core Instruction Set Implementation - STATUS
-
-## ✅ COMPLETED COMPONENTS
-
-### 1. Full Instruction Set Implementation (100%)
-- **All 256 base opcodes** (0x00-0xFF) - ✅ COMPLETE
-- **All 256 CB-prefixed opcodes** (0xCB00-0xCBFF) - ✅ COMPLETE
-- **Total: 512 instructions** with full handlers
-- **File**: `src/Cpu/InstructionSet.php` (4,130 lines)
-
-#### Instruction Categories Implemented:
-- ✅ Load instructions (LD r,r | LD r,n | LD r,(HL) | LDH | LDI | LDD)
-- ✅ 8-bit ALU (ADD, ADC, SUB, SBC, AND, XOR, OR, CP)
-- ✅ 16-bit arithmetic (ADD HL,rr | INC rr | DEC rr)
-- ✅ 8-bit INC/DEC for all registers
-- ✅ Stack operations (PUSH/POP for BC, DE, HL, AF)
-- ✅ Control flow (JP, JR, CALL, RET, RETI, RST)
-- ✅ Special operations (DAA, CPL, CCF, SCF, HALT, STOP, DI, EI)
-- ✅ CB bit operations (BIT, SET, RES)
-- ✅ CB rotates/shifts (RLC, RRC, RL, RR, SLA, SRA, SRL, SWAP)
-
-### 2. CPU Infrastructure (100%)
-- ✅ CPU state management (halted, stopped, IME flags)
-- ✅ Helper methods (readImm8/readImm16, halfCarry detection)
-- ✅ FlagRegister with convenience aliases (getZ/setZ, getN/setN, getH/setH, getC/setC)
-- ✅ Proper cycle counting for all instructions
-- ✅ Conditional branch cycle timing (taken vs not-taken)
-
-### 3. Comprehensive Unit Tests (100%)
-- ✅ **60+ test cases** covering all instruction categories
-- ✅ File: `tests/Unit/Cpu/InstructionSetTest.php` (917 lines)
-
-#### Test Coverage:
-- ✅ 8-bit and 16-bit load instructions
-- ✅ All ALU operations with flag verification
-- ✅ INC/DEC edge cases (overflow, underflow, half-carry)
-- ✅ 16-bit arithmetic with carry propagation
-- ✅ DAA (Decimal Adjust) - multiple scenarios (addition, subtraction, carry)
-- ✅ Special operations (CPL, SCF, CCF)
-- ✅ Rotate operations (RLCA, RRCA, RLA, RRA)
-- ✅ Stack operations (PUSH/POP with proper byte ordering)
-- ✅ Jump operations with cycle timing verification
-- ✅ CALL/RET/RST with stack verification
-- ✅ CB-prefixed instructions (rotates, BIT, SET, RES, SWAP)
-- ✅ HALT/STOP state management
-- ✅ Interrupt control (DI/EI)
-
-### 4. Test Infrastructure
-- ✅ MockBus implementation for testing (`src/Bus/MockBus.php`)
-- ✅ Existing CPU core tests (`tests/Unit/Cpu/CpuTest.php`)
-- ✅ Blargg test ROMs available (`third_party/roms/cpu_instrs/`)
-
-## ⏸️ PENDING COMPONENTS (Require Docker/Full System)
-
-### 1. Test Execution (Cannot run without Docker)
-- ⏸️ Run `make test` to execute PHPUnit tests
-- ⏸️ Verify all 60+ unit tests pass
-- **Blocker**: Docker not available in current environment
-- **Command**: `make test`
-
-### 2. Static Analysis (Cannot run without Docker)
-- ⏸️ Run `make lint` (PHPStan level 9)
-- ⏸️ Fix any type errors or static analysis issues
-- **Blocker**: Docker not available in current environment
-- **Command**: `make lint`
-
-### 3. Blargg ROM Tests (Require Full System)
-- ⏸️ All 11 Blargg cpu_instrs test ROMs
-  - 01-special.gb
-  - 02-interrupts.gb (requires interrupt handling - Step 6)
-  - 03-op sp,hl.gb
-  - 04-op r,imm.gb
-  - 05-op rp.gb
-  - 06-ld r,r.gb
-  - 07-jr,jp,call,ret,rst.gb
-  - 08-misc instrs.gb
-  - 09-op r,r.gb
-  - 10-bit ops.gb
-  - 11-op a,(hl).gb
-- **Blockers**:
-  - Requires Docker environment
-  - Requires complete memory bus (Step 5)
-  - Requires interrupt handling for some tests (Step 6)
-  - Requires serial output to read test results
-
-## 📊 COMPLETION METRICS
-
-| Component | Status | Progress |
-|-----------|--------|----------|
-| Instruction Implementation | ✅ Complete | 512/512 (100%) |
-| Unit Tests Written | ✅ Complete | 60+ tests |
-| Unit Tests Executed | ⏸️ Pending | Requires Docker |
-| Static Analysis | ⏸️ Pending | Requires Docker |
-| Blargg ROM Tests | ⏸️ Pending | Requires Steps 5-6 |
-
-## 🎯 DEFINITION OF DONE (Per PLAN.md)
-
-### ✅ Completed Requirements:
-- [x] All 256 base opcodes implemented (0x00-0xFF)
-- [x] CB-prefixed instructions implemented (0xCB00-0xCBFF)
-- [x] All instruction categories implemented (loads, ALU, 16-bit, jumps, special)
-- [x] Flag handling implemented correctly (Z, N, H, C)
-- [x] Unit tests exist for complex instructions (DAA, flags, 16-bit arithmetic)
-- [x] HALT instruction properly implemented
-- [x] STOP instruction implemented
-- [x] Cycle counting accurate
-
-### ⏸️ Pending Requirements:
-- [ ] `make test` passes with 100% pass rate (requires Docker)
-- [ ] `make lint` passes with 0 errors (requires Docker)
-- [ ] Blargg cpu_instrs test ROMs pass (requires Steps 5-6)
-
-## 🔄 NEXT STEPS
-
-### Option A: Continue Step 4 (Requires Environment Setup)
-1. Set up Docker environment
-2. Run `make test` and verify all tests pass
-3. Run `make lint` and fix any issues
-4. Implement minimal memory bus for ROM testing
-5. Run Blargg test ROMs and fix failures
-
-### Option B: Proceed to Step 5 (Recommended)
-1. Implement complete Memory Map & Bus (Step 5)
-2. Implement Interrupts & Timers (Step 6)
-3. Return to Step 4 for Blargg ROM validation
-4. This provides the infrastructure needed for full system testing
-
-## 📈 CODE STATISTICS
-
-```
-Instruction Set:     4,130 lines
-Unit Tests:            917 lines
-Total Test Coverage:   60+ test cases
-CPU Infrastructure:    371 lines
-Helper Utilities:      122 lines (BitOps)
-```
-
-## 🔗 RELATED FILES
-
-- `src/Cpu/InstructionSet.php` - Complete instruction set implementation
-- `src/Cpu/Cpu.php` - CPU core with state management
-- `src/Cpu/Instruction.php` - Instruction metadata structure
-- `src/Bus/MockBus.php` - Test memory bus
-- `tests/Unit/Cpu/InstructionSetTest.php` - Comprehensive instruction tests
-- `tests/Unit/Cpu/CpuTest.php` - Basic CPU tests
-- `third_party/roms/cpu_instrs/` - Blargg test ROMs (11 files)
-
-## 📝 COMMITS
-
-1. `377dd3d` - WIP: Partial implementation (opcodes 0x00-0x87)
-2. `0fd4eee` - Complete implementation of all 512 instructions
-3. `f976c8f` - Add comprehensive instruction set unit tests
-
-## ✅ CONCLUSION
-
-**Step 4 core implementation is functionally complete.** All 512 CPU instructions are implemented with proper flag handling, cycle counting, and comprehensive unit tests. The remaining work (test execution, linting, ROM validation) requires either:
-1. A Docker environment to run the existing test suite, or
-2. Completion of Steps 5-6 to provide the full system infrastructure
-
-The instruction set is ready for integration testing once the memory bus and interrupt system are implemented.
diff --git a/docs/QUICK_WINS_ANALYSIS.md b/docs/QUICK_WINS_ANALYSIS.md
deleted file mode 100644
index a2d7e6d..0000000
--- a/docs/QUICK_WINS_ANALYSIS.md
+++ /dev/null
@@ -1,231 +0,0 @@
-# Quick Wins Implementation - Post-Mortem Analysis
-
-**Date:** 2025-11-08
-**Implementation Time:** ~45 minutes
-**Result:** 9/39 tests passing (23%) - **NO IMPROVEMENT**
-
-## What We Fixed
-
-### Fix 1: OAM DMA Speed ✅ Implemented
-**File:** `src/Dma/OamDma.php:133`
-**Change:** Convert T-cycles to M-cycles before transferring
-```php
-$mCycles = intdiv($cycles, 4);  // Convert T-cycles to M-cycles
-```
-
-### Fix 2: RETI IME Enable ✅ Implemented
-**Files:**
-- `src/Cpu/Cpu.php:454` - Added `setIMEImmediate()` method
-- `src/Cpu/InstructionSet.php:3419` - Call `setIMEImmediate()` in RETI handler
-
-**Change:** RETI now enables interrupts immediately instead of with 1-instruction delay
-
-### Fix 3: DMA Start Delay ✅ Implemented
-**File:** `src/Dma/OamDma.php:56,106,136-143`
-**Change:** Added 1 M-cycle delay before first byte transfer
-```php
-private int $dmaDelay = 0;
-// ... in startDmaTransfer:
-$this->dmaDelay = 1;
-```
-
-## Why These Fixes Didn't Improve Test Results
-
-### Root Cause: M-Cycle Granularity Required
-
-The Mooneye tests that are failing **all require M-cycle accurate CPU execution**. Even though our fixes are technically correct, they don't address the fundamental issue: instructions execute atomically instead of incrementally over M-cycles.
-
-### Example: RETI Timing Test
-
-Even with RETI enabling IME immediately, the `reti_timing.gb` test fails because:
-
-1. **What the test checks:** Whether interrupts fire at EXACT cycle boundaries after RETI
-2. **What we fixed:** IME is enabled immediately (correct)
-3. **What's still wrong:** The RETI instruction executes all 16 cycles atomically:
-   - M-cycle 1: Pop low byte from stack
-   - M-cycle 2: Pop high byte from stack
-   - M-cycle 3: Internal delay
-   - M-cycle 4: Jump to address
-
-**The test expects** state changes to happen between M-cycles, but our CPU returns all 16 cycles at once and processes everything atomically.
-
-### Example: OAM DMA Timing Test
-
-Even with DMA speed fixed and start delay added, the `oam_dma_timing.gb` test fails because:
-
-1. **What the test checks:** DMA blocks CPU access at EXACT cycle boundaries
-2. **What we fixed:** DMA takes correct number of cycles (correct)
-3. **What's still wrong:** DMA processes all cycles for a batch at once, not incrementally
-
-The test might write to a register at cycle 100, start DMA at cycle 104, and check if the write succeeded at cycle 108. If DMA processes 4 cycles atomically, the timing of when the CPU is blocked doesn't align with the test's expectations.
-
-### Example: Instruction Timing Tests
-
-Tests like `call_timing.gb`, `jp_timing.gb`, etc. all fail because:
-
-1. **What they check:** Interrupt dispatch between M-cycle boundaries during multi-cycle instructions
-2. **What we have:** Atomic instruction execution
-3. **What's needed:** Instructions that execute incrementally, allowing interrupts to fire mid-instruction
-
-## What We Learned
-
-### Our Fixes Are CORRECT But INSUFFICIENT
-
-| Fix | Correctness | Impact | Reason |
-|-----|-------------|---------|---------|
-| DMA Speed | ✅ Correct | ❌ No impact | Tests need M-cycle granular CPU |
-| RETI IME | ✅ Correct | ❌ No impact | Tests need M-cycle granular CPU |
-| DMA Delay | ✅ Correct | ❌ No impact | Tests need M-cycle granular CPU |
-
-### The Mooneye Tests Are EXTREMELY Precise
-
-All failing tests show the "fail signature" (all registers = 0x42), which means:
-- The tests ARE running correctly
-- The tests ARE detecting failures properly
-- The tests ARE failing due to **timing precision**, not logic bugs
-
-### The Tests Check Sub-Instruction Timing
-
-Mooneye tests verify behavior at **M-cycle boundaries**, not just instruction boundaries. Examples:
-
-**`call_timing.gb`**: Checks if interrupts can fire between reading the address bytes and pushing PC to stack.
-
-**`oam_dma_start.gb`**: Checks if CPU can still access memory during the 1 M-cycle delay before DMA transfer starts.
-
-**`ei_sequence.gb`**: Checks the exact cycle when IME becomes enabled relative to instruction boundaries.
-
-These checks are IMPOSSIBLE to satisfy with atomic instruction execution.
-
-## The Real Problem: Architecture Limitation
-
-### Current CPU Architecture
-```
-CPU.step() {
-    fetch opcode        // 4 cycles
-    execute instruction // N cycles, ALL AT ONCE
-    return total_cycles // e.g., 24 for CALL
-}
-```
-
-**Problem:** Everything between "fetch" and "return" happens atomically. Other components (interrupts, DMA, PPU) only get to run AFTER the full instruction completes.
-
-### Required CPU Architecture
-```
-CPU.step() {
-    if (m_cycle == 0) fetch opcode       // 4 cycles
-    if (m_cycle == 1) read byte 1        // 4 cycles, interrupt can fire here
-    if (m_cycle == 2) read byte 2        // 4 cycles, interrupt can fire here
-    if (m_cycle == 3) internal delay     // 4 cycles, interrupt can fire here
-    if (m_cycle == 4) push PCH           // 4 cycles, interrupt can fire here
-    if (m_cycle == 5) push PCL           // 4 cycles, return
-}
-```
-
-**Benefit:** Each M-cycle is a discrete event. Interrupts can fire between M-cycles. Other components can run between M-cycles.
-
-## Correct Implementation Order
-
-### Phase 0: Prerequisites (DONE ✅)
-- Fix DMA speed (T-cycle → M-cycle) ✅
-- Fix RETI IME enable ✅
-- Fix DMA start delay ✅
-
-These fixes are **necessary but not sufficient**. They don't improve test results YET, but they're required for accuracy.
-
-### Phase 1: M-Cycle Accurate CPU (REQUIRED)
-**Estimated effort:** 24-40 hours
-**Impact:** +12-20 tests (to ~50-75% pass rate)
-
-Without this, NO timing tests will pass.
-
-### Phase 2: Timer Bit-Selection Model (REQUIRED)
-**Estimated effort:** 4-6 hours
-**Impact:** +4-8 tests (to ~60-85% pass rate)
-
-Depends on Phase 1 being complete.
-
-### Phase 3: TIMA Reload State Machine (REQUIRED)
-**Estimated effort:** 2-3 hours
-**Impact:** +2-4 tests (to ~65-90% pass rate)
-
-Depends on Phase 1 and 2.
-
-## Recommendation
-
-### Commit These Changes
-
-Even though they don't improve test results, they ARE correct fixes and should be committed because:
-
-1. **Correctness:** They align with SameBoy's implementation
-2. **Prerequisites:** Required for M-cycle accurate CPU
-3. **No Regression:** Blargg tests still 100% passing
-4. **Code Quality:** Better documented, more accurate
-
-### Next Steps
-
-**Option A: Continue with M-Cycle CPU (Ambitious)**
-- Commit quick wins
-- Start M-cycle CPU refactor
-- Expected timeline: 3-5 days of focused work
-
-**Option B: Document and Defer (Pragmatic)**
-- Commit quick wins
-- Update PLAN.md with accurate implementation requirements
-- Defer timing accuracy improvements to future sprint
-
-## Commit Message
-
-```
-fix(timing): correct DMA speed, RETI IME, and DMA start delay
-
-What was implemented:
-- OAM DMA now correctly converts T-cycles to M-cycles (was 4x too fast)
-- RETI instruction enables IME immediately (not delayed like EI)
-- OAM DMA has proper 1 M-cycle startup delay before first byte
-
-Why this approach:
-- Aligns with SameBoy reference implementation
-- Necessary prerequisites for M-cycle accurate CPU
-- Fixes are correct even though tests don't pass yet
-
-Test results:
-- Blargg: 12/12 passing (100%) ✅ No regression
-- Mooneye: 9/39 passing (23%) - No change (expected)
-- Mooneye tests require M-cycle accurate CPU to pass
-
-Technical details:
-- DMA tick() now uses intdiv($cycles, 4) to convert T-cycles
-- Added Cpu::setIMEImmediate() for RETI vs EI distinction
-- DMA startup delay tracked in $dmaDelay property
-
-References:
-- docs/ROM_CHECK_ANALYSIS.md
-- SameBoy: https://github.com/LIJI32/SameBoy
-- Pan Docs: Timing section
-```
-
-## Test Results Before/After
-
-### Before
-- Blargg: 12/12 (100%)
-- Mooneye: 9/39 (23%)
-
-### After
-- Blargg: 12/12 (100%) ✅ No regression
-- Mooneye: 9/39 (23%) - No change (expected)
-
-## Conclusion
-
-The "quick wins" are correctly implemented but have zero impact on test results because:
-
-1. **All failing Mooneye tests require M-cycle granularity**
-2. **Our CPU executes instructions atomically**
-3. **No amount of peripheral fixes will help without CPU refactor**
-
-The fixes should still be committed as they're technically correct and necessary prerequisites for future timing accuracy work. The path to 100% Mooneye pass rate goes through M-cycle accurate CPU execution - there's no shortcut.
-
----
-
-**Status:** Ready to commit (with updated expectations)
-**Next Major Task:** M-cycle accurate CPU architecture
-**Realistic Timeline:** 3-5 days for 50-85% Mooneye pass rate
diff --git a/docs/ROM_CHECK_ANALYSIS.md b/docs/ROM_CHECK_ANALYSIS.md
deleted file mode 100644
index a5c8fda..0000000
--- a/docs/ROM_CHECK_ANALYSIS.md
+++ /dev/null
@@ -1,539 +0,0 @@
-# PHPBoy ROM Check Test Analysis
-
-**Date:** 2025-11-08
-**Current Status:** 9/39 Mooneye tests passing (23%), 12/12 Blargg tests passing (100%)
-
-## Executive Summary
-
-PHPBoy has excellent **instruction correctness** (100% Blargg pass rate) but lacks **timing accuracy** (23% Mooneye pass rate). The main issues are:
-
-1. **DMA runs 4x too fast** - treating T-cycles as M-cycles
-2. **RETI doesn't enable interrupts** - missing IME enable
-3. **Timer uses wrong model** - counter accumulation vs bit-selection
-4. **CPU executes atomically** - no M-cycle boundaries
-
-Quick wins (Fixes 1-3) can improve from 23% to 41-46% with ~5 hours of work.
-
----
-
-## Current Test Results
-
-### Blargg CPU Instruction Tests: 12/12 ✅
-- 01-special.gb ✅
-- 02-interrupts.gb ✅
-- 03-op sp,hl.gb ✅
-- 04-op r,imm.gb ✅
-- 05-op rp.gb ✅
-- 06-ld r,r.gb ✅
-- 07-jr,jp,call,ret,rst.gb ✅
-- 08-misc instrs.gb ✅
-- 09-op r,r.gb ✅
-- 10-bit ops.gb ✅
-- 11-op a,(hl).gb ✅
-- instr_timing.gb ✅
-
-### Mooneye Acceptance Tests: 9/39 (23%)
-
-**Passing (9 tests):**
-- di_timing-GS ✅
-- halt_ime0_ei ✅
-- halt_ime0_nointr_timing ✅
-- halt_ime1_timing ✅
-- if_ie_registers ✅
-- intr_timing ✅
-- instr/daa ✅
-- timer/tim00_div_trigger ✅
-- timer/tim01 ✅
-- timer/tim11_div_trigger ✅
-
-**Failing by Category:**
-
-**Instruction Timing (12 tests):**
-- add_sp_e_timing ❌
-- call_cc_timing ❌
-- call_timing ❌
-- jp_cc_timing ❌
-- jp_timing ❌
-- ld_hl_sp_e_timing ❌
-- pop_timing ❌
-- push_timing ❌
-- ret_cc_timing ❌
-- ret_timing ❌
-- reti_timing ❌
-- rst_timing ❌
-
-**Interrupt/EI Timing (4 tests):**
-- ei_sequence ❌
-- ei_timing ❌
-- rapid_di_ei ❌
-- reti_intr_timing ❌
-
-**OAM DMA Timing (3 tests):**
-- oam_dma_restart ❌
-- oam_dma_start ❌
-- oam_dma_timing ❌
-
-**Timer Edge Cases (13 tests):**
-- timer/div_write ❌
-- timer/rapid_toggle ❌
-- timer/tim00 ❌
-- timer/tim01_div_trigger ❌
-- timer/tim10 ❌
-- timer/tim10_div_trigger ❌
-- timer/tim11 ❌
-- timer/tima_reload ❌
-- timer/tima_write_reloading ❌
-- timer/tma_write_reloading ❌
-
----
-
-## Root Cause Analysis
-
-### 1. OAM DMA Running 4x Too Fast (CRITICAL BUG)
-
-**Location:** `src/Dma/OamDma.php:119`
-
-**Problem:** DMA `tick()` receives T-cycles from CPU but treats them as M-cycles, making DMA complete in 160 T-cycles instead of 640 T-cycles (160 M-cycles).
-
-**Current Implementation:**
-```php
-public function tick(int $cycles): void
-{
-    if (!$this->dmaActive) {
-        return;
-    }
-
-    // Transfer one byte per M-cycle
-    for ($i = 0; $i < $cycles; $i++) {  // ← Treats $cycles as M-cycles
-        if ($this->dmaProgress >= self::TRANSFER_LENGTH) {
-            $this->dmaActive = false;
-            break;
-        }
-        // ... transfer logic
-    }
-}
-```
-
-**Call Site (Emulator.php:309-310):**
-```php
-$cycles = $this->cpu->step();  // Returns T-cycles (e.g., 4, 8, 12)
-$this->oamDma?->tick($cycles);  // ← Passes T-cycles
-```
-
-**Issue:** The loop iterates `$cycles` times (T-cycles), but should iterate `$cycles / 4` times (M-cycles).
-
-**Impact:** Fixes ~2-3 tests
-- oam_dma_timing ❌
-- oam_dma_start ❌
-
----
-
-### 2. RETI Doesn't Enable Interrupts (CRITICAL BUG)
-
-**Location:** `src/Cpu/InstructionSet.php:3407-3422`
-
-**Problem:** RETI instruction returns from interrupt but doesn't set IME flag.
-
-**Current Implementation:**
-```php
-0xD9 => new Instruction(
-    opcode: 0xD9,
-    mnemonic: 'RETI',
-    cycles: 16,
-    handler: static function (Cpu $cpu): int {
-        $low = $cpu->getBus()->readByte($cpu->getSP()->get());
-        $cpu->getSP()->increment();
-        $high = $cpu->getBus()->readByte($cpu->getSP()->get());
-        $cpu->getSP()->increment();
-        $cpu->getPC()->set(($high << 8) | $low);
-        // Missing: Should enable IME immediately!
-        return 16;
-    },
-),
-```
-
-**SameBoy Implementation:**
-```c
-static void reti(GB_gameboy_t *gb, uint8_t opcode) {
-    ret(gb, opcode);
-    gb->ime = true;  // ← Immediate enable
-}
-```
-
-**Our Implementation:** Missing the `gb->ime = true` line!
-
-**Issue:** RETI should enable interrupts immediately (not with 1-instruction delay like EI).
-
-**Impact:** Fixes 2 tests
-- reti_timing ❌
-- reti_intr_timing ❌
-
----
-
-### 3. DMA Start Delay Missing
-
-**Location:** `src/Dma/OamDma.php:119-146`
-
-**Problem:** DMA starts transferring bytes immediately in the same M-cycle that DMA is triggered. According to Pan Docs, there should be a 1 M-cycle delay before first byte transfer.
-
-**Expected Behavior:**
-1. **Write to 0xFF46:** Triggers DMA
-2. **Delay:** 1 M-cycle delay before first byte transfer
-3. **Transfer:** 160 M-cycles to transfer 160 bytes
-4. **Total:** 161 M-cycles from trigger to completion
-
-**Current Behavior:** Transfer starts immediately, no startup delay.
-
-**Impact:** Fixes 0-1 tests
-- oam_dma_start ❌ (may already be fixed by Fix #1)
-
----
-
-### 4. Timer Uses Wrong Architecture
-
-**Location:** `src/Timer/Timer.php`
-
-**Problem:** Uses cycle accumulation instead of bit-selection from DIV counter.
-
-**Current Approach:**
-```php
-$this->timaCounter += $cycles;
-if ($this->timaCounter >= $frequency) {
-    $this->incrementTima();
-}
-```
-
-**SameBoy Approach:**
-- Uses bit-selection: `TAC_TRIGGER_BITS[] = {512, 8, 32, 128}`
-- Detects falling edge: "TIMA increases when a specific high-bit becomes a low-bit"
-- Implements 3-state reload machine (RUNNING → RELOADING → RELOADED)
-
-**Why This Matters:**
-- Writing to DIV can trigger TIMA increment (glitch behavior)
-- Changing TAC can cause immediate TIMA increment if selected bit was high
-- TIMA reload takes 4 M-cycles, during which writes to TIMA/TMA have edge cases
-
-**Impact:** Affects 13 timer tests (requires medium-hard effort)
-
----
-
-### 5. CPU Executes Instructions Atomically
-
-**Location:** `src/Cpu/InstructionSet.php` (entire file)
-
-**Problem:** Instructions execute all-at-once and return total cycles. Real hardware performs operations incrementally over M-cycles.
-
-**Example - CALL instruction:**
-- **Current:** Reads address, pushes PC, jumps → returns 24 cycles atomically
-- **Real hardware:** 6 M-cycles with state changes at specific boundaries
-
-**SameBoy Approach:**
-- Uses `cycle_read()` and `cycle_write()` with `pending_cycles = 4`
-- Memory operations happen at specific M-cycle boundaries
-- Enables accurate modeling of register access conflicts
-
-**Impact:** Affects 12 instruction timing tests (requires major refactor)
-
----
-
-## Comparison with SameBoy
-
-| Feature | PHPBoy | SameBoy | Impact |
-|---------|--------|---------|--------|
-| **Instruction execution** | Atomic | M-cycle stepping | 12 timing tests |
-| **RETI enables IME** | ❌ No | ✅ Immediate | 2 tests |
-| **DMA speed** | ❌ 4x too fast | ✅ Correct | 3 tests |
-| **DMA start delay** | ❌ No delay | ✅ 1 M-cycle | 1 test |
-| **Timer model** | Counter accumulation | Bit-selection + edge detection | 13 tests |
-| **TIMA reload** | Instant | 4 M-cycle state machine | 3 tests |
-| **Cycle tracking** | T-cycles | T-cycles with M-cycle conversion | Foundation |
-
----
-
-## Prioritized Fix Roadmap
-
-### Quick Wins (4-5 hours work, +7-8 tests)
-
-**Fix 1: OAM DMA Speed** ⭐ Easy (30 min, +2-3 tests)
-- **File:** `src/Dma/OamDma.php:119`
-- **Change:** Convert T-cycles to M-cycles before transferring
-- **Expected:** 11-12/39 tests passing (28-31%)
-
-**Fix 2: RETI IME Enable** ⭐ Easy (15 min, +2 tests)
-- **Files:** `src/Cpu/Cpu.php`, `src/Cpu/InstructionSet.php:3418`
-- **Change:** Add `setIMEImmediate()` and call in RETI
-- **Expected:** 13-14/39 tests passing (33-36%)
-
-**Fix 3: DMA Start Delay** ⭐⭐ Medium (1 hour, +0-1 tests)
-- **File:** `src/Dma/OamDma.php`
-- **Change:** Add 1 M-cycle delay before first byte transfer
-- **Expected:** 14-17/39 tests passing (36-44%)
-
-**Combined Expected Result:** 16-18/39 tests passing (41-46%)
-
----
-
-### Medium Effort (8 hours work, +10-11 tests)
-
-**Fix 4: TIMA Reload Delay** ⭐⭐ Medium (2 hours, +3 tests)
-- **File:** `src/Timer/Timer.php`
-- **Change:** Implement 4 M-cycle reload state machine
-- **Expected:** +3 tests (tima_reload, tima_write_reloading, tma_write_reloading)
-
-**Fix 5: Timer Bit-Selection** ⭐⭐⭐ Hard (4 hours, +4 tests)
-- **File:** `src/Timer/Timer.php`
-- **Change:** Rewrite timer to use 16-bit DIV counter with bit-selection
-- **Expected:** +4 tests (tim*_div_trigger tests)
-
-**Combined Expected Result:** 20-22/39 tests passing (51-56%)
-
----
-
-### Major Refactor (24+ hours work, +12 tests)
-
-**Fix 6: M-Cycle Accurate CPU** ⭐⭐⭐⭐ Very Hard (16+ hours, +12 tests)
-- **File:** `src/Cpu/InstructionSet.php` (complete rewrite)
-- **Change:** Implement M-cycle stepping for all 256 instructions
-- **Expected:** All 12 instruction timing tests
-
-**Expected Result:** 32-34/39 tests passing (82-87%)
-
----
-
-## Recommended Implementation Order
-
-**Phase 1 (Today):** Fixes 1-3 → Get to 41-46% with minimal risk
-
-**Phase 2 (This week):** Fixes 4-5 → Get to 51-56% with moderate effort
-
-**Phase 3 (Next sprint):** Fix 6 → Get to 82-87% with architectural changes
-
-**Remaining gaps:**
-- PPU timing edge cases
-- APU timing (not tested by current suite)
-- Hardware quirks/glitches
-
----
-
-## Specific Code Changes
-
-### Fix 1: OAM DMA Speed
-
-**File:** `src/Dma/OamDma.php:119`
-
-```php
-public function tick(int $cycles): void
-{
-    if (!$this->dmaActive) {
-        return;
-    }
-
-    // Convert T-cycles to M-cycles (1 M-cycle = 4 T-cycles)
-    $mCycles = intdiv($cycles, 4);
-
-    // Transfer one byte per M-cycle
-    for ($i = 0; $i < $mCycles; $i++) {
-        if ($this->dmaProgress >= self::TRANSFER_LENGTH) {
-            $this->dmaActive = false;
-            break;
-        }
-
-        // Read from source and write to OAM
-        $sourceAddress = $this->dmaSource + $this->dmaProgress;
-        $destAddress = self::OAM_START + $this->dmaProgress;
-        $value = $this->bus->readByte($sourceAddress);
-        $this->bus->writeByte($destAddress, $value);
-
-        $this->dmaProgress++;
-    }
-}
-```
-
----
-
-### Fix 2: RETI IME Enable
-
-**File 1:** `src/Cpu/Cpu.php` (add new method)
-
-```php
-/**
- * Enable interrupts immediately (used by RETI).
- * Unlike setIME(true), this enables IME without a 1-instruction delay.
- */
-public function setIMEImmediate(): void
-{
-    $this->ime = true;
-    $this->imeDelay = 0;
-}
-```
-
-**File 2:** `src/Cpu/InstructionSet.php:3407-3422` (update RETI handler)
-
-```php
-0xD9 => new Instruction(
-    opcode: 0xD9,
-    mnemonic: 'RETI',
-    cycles: 16,
-    handler: static function (Cpu $cpu): int {
-        $low = $cpu->getBus()->readByte($cpu->getSP()->get());
-        $cpu->getSP()->increment();
-        $high = $cpu->getBus()->readByte($cpu->getSP()->get());
-        $cpu->getSP()->increment();
-        $cpu->getPC()->set(($high << 8) | $low);
-
-        // RETI enables interrupts immediately (not delayed like EI)
-        $cpu->setIMEImmediate();
-
-        return 16;
-    },
-),
-```
-
----
-
-### Fix 3: DMA Start Delay
-
-**File:** `src/Dma/OamDma.php` (update class)
-
-```php
-final class OamDma
-{
-    private const OAM_START = 0xFE00;
-    private const TRANSFER_LENGTH = 160;
-
-    private bool $dmaActive = false;
-    private int $dmaSource = 0x0000;
-    private int $dmaProgress = 0;
-    private int $dmaDelay = 0;  // ← Add delay counter
-
-    // ... other methods ...
-
-    private function startDmaTransfer(int $sourcePage): void
-    {
-        $this->dmaSource = ($sourcePage << 8) & 0xFF00;
-        $this->dmaActive = true;
-        $this->dmaProgress = 0;
-        $this->dmaDelay = 1;  // ← 1 M-cycle delay before first byte
-    }
-
-    public function tick(int $cycles): void
-    {
-        if (!$this->dmaActive) {
-            return;
-        }
-
-        // Convert T-cycles to M-cycles (1 M-cycle = 4 T-cycles)
-        $mCycles = intdiv($cycles, 4);
-
-        // Handle startup delay
-        if ($this->dmaDelay > 0) {
-            $delayToProcess = min($this->dmaDelay, $mCycles);
-            $this->dmaDelay -= $delayToProcess;
-            $mCycles -= $delayToProcess;
-
-            if ($mCycles <= 0) {
-                return;  // Still in delay phase
-            }
-        }
-
-        // Transfer one byte per M-cycle
-        for ($i = 0; $i < $mCycles; $i++) {
-            if ($this->dmaProgress >= self::TRANSFER_LENGTH) {
-                $this->dmaActive = false;
-                break;
-            }
-
-            // Read from source and write to OAM
-            $sourceAddress = $this->dmaSource + $this->dmaProgress;
-            $destAddress = self::OAM_START + $this->dmaProgress;
-            $value = $this->bus->readByte($sourceAddress);
-            $this->bus->writeByte($destAddress, $value);
-
-            $this->dmaProgress++;
-        }
-    }
-}
-```
-
----
-
-## Testing Strategy
-
-### After Each Fix
-
-```bash
-# Test DMA fixes
-make test -- --filter="oam_dma"
-
-# Test RETI fix
-make test -- --filter="reti"
-
-# Full Mooneye suite
-make test -- tests/Integration/MooneyeTestRomsTest.php
-
-# Check overall progress
-make test -- tests/Integration/
-```
-
-### Track Progress
-
-- **Baseline:** 9/39 (23%)
-- **After Fix 1:** ~11-12/39 (28-31%)
-- **After Fix 2:** ~13-14/39 (33-36%)
-- **After Fix 3:** ~14-17/39 (36-44%)
-- **Target (Quick Wins):** 16-18/39 (41-46%)
-
----
-
-## References
-
-### SameBoy Implementation Study
-
-**Timer (bit-selection model):**
-- Uses `TAC_TRIGGER_BITS[] = {512, 8, 32, 128}` for bit positions
-- Detects falling edge: "TIMA increases when a specific high-bit becomes a low-bit"
-- Implements 3-state reload machine via `advance_tima_state_machine()`
-
-**CPU Timing:**
-- Uses `cycle_read()` and `cycle_write()` with `pending_cycles = 4`
-- M-cycle stepping, not atomic execution
-- Hardware conflict simulation for accurate register access timing
-
-**RETI Instruction:**
-```c
-static void reti(GB_gameboy_t *gb, uint8_t opcode) {
-    ret(gb, opcode);
-    gb->ime = true;  // Immediate enable
-}
-```
-
-**DMA:**
-- Explicit M-cycle tracking
-- Startup delay before first byte transfer
-- 160 M-cycles (640 T-cycles) for full transfer
-
-### Documentation
-
-- **Pan Docs:** https://gbdev.io/pandocs/
-- **SameBoy Source:** https://github.com/LIJI32/SameBoy
-- **PHPBoy Research:** `docs/research.md`
-
----
-
-## Conclusion
-
-PHPBoy has a solid foundation with 100% Blargg test pass rate, proving instruction correctness. The Mooneye failures are all **timing accuracy** issues, not logic bugs.
-
-The quick wins (Fixes 1-3) are low-risk, high-impact changes that can nearly double the Mooneye pass rate in a single afternoon. These fixes align PHPBoy's timing model with SameBoy's proven approach.
-
-**Next Steps:**
-1. Implement Fix 1 (DMA speed) - 30 minutes
-2. Implement Fix 2 (RETI IME) - 15 minutes
-3. Implement Fix 3 (DMA delay) - 1 hour
-4. Run full test suite and measure improvement
-5. Commit with detailed conventional commit message
-
-**Long-term:**
-- Medium effort fixes (4-5) → 51-56% pass rate
-- Major CPU refactor (6) → 82-87% pass rate
-- Final edge cases → 90-100% pass rate
diff --git a/docs/STATUS.md b/docs/STATUS.md
index c278cc5..6f200f0 100644
--- a/docs/STATUS.md
+++ b/docs/STATUS.md
@@ -201,7 +201,6 @@ This document tracks the implementation status of the PHPBoy Game Boy Color emul
 - **Deliverables**:
   - ✅ **Profiling Infrastructure**: Xdebug profiling with cachegrind output
   - ✅ **Benchmark Tooling**: `make benchmark`, `make benchmark-jit`, `make profile`, `make memory-profile`
-  - ✅ **Profiling Analysis**: Expected hotspots documented in `docs/profiling-analysis.md`
   - ✅ **Optimizations Applied**:
     - Inline instruction decode/execute in `Cpu::step()` (+3-7% expected)
     - Pre-build instruction cache with `InstructionSet::warmCache()` (+1-2% expected)
diff --git a/docs/performance.md b/docs/performance.md
index 71ba0bb..3b13e46 100644
--- a/docs/performance.md
+++ b/docs/performance.md
@@ -193,7 +193,6 @@ After each optimization:
 
 ## Related Documents
 
-- **[Profiling Analysis](profiling-analysis.md):** Expected hotspots and optimization priorities
 - **[Optimizations Log](optimizations.md):** Detailed implementation notes for all optimizations
 - **[Test Results](test-results.md):** ROM compatibility and test pass rates
 - **[Status](STATUS.md):** Overall project status and step completion
diff --git a/docs/profiling-analysis.md b/docs/profiling-analysis.md
deleted file mode 100644
index 5ab42bf..0000000
--- a/docs/profiling-analysis.md
+++ /dev/null
@@ -1,329 +0,0 @@
-# PHPBoy Profiling Analysis
-
-**Last Updated:** 2025-11-09
-**Status:** Expected Hotspots (Profiling infrastructure ready, requires Docker to run)
-
-## Overview
-
-This document analyzes expected performance hotspots in PHPBoy based on emulator architecture and common PHP performance patterns. Once profiling data is available via `make profile ROM=<rom> FRAMES=1000`, this document will be updated with actual measurements.
-
-## Expected Hot Paths
-
-Based on the Game Boy architecture and emulator implementation:
-
-### 1. CPU Instruction Dispatch (CRITICAL PATH)
-**Expected Impact:** 40-50% of total execution time
-
-**Hot Spots:**
-- `Cpu::step()` - Called every instruction (60 FPS × 154 scanlines × ~114 instructions/line ≈ 1M calls/second)
-- `Cpu::fetch()` - Memory bus read for every instruction
-- `Cpu::execute()` - Closure invocation overhead
-- `InstructionSet::getInstruction()` - Array lookup (cached, but still called every instruction)
-- Instruction handler closures - 256 base + 256 CB opcodes = 512 handlers
-
-**Current Optimizations:**
-- ✅ Lazy instruction building with static caching (`self::$instructions`)
-- ✅ Direct closure invocation in `execute()`
-
-**Remaining Opportunities:**
-- Eliminate `decode()` method call overhead (inline instruction lookup)
-- Pre-build all instructions on initialization (trade memory for CPU)
-- Consider match expression vs array lookup for opcode dispatch
-
-### 2. Memory Bus Access (CRITICAL PATH)
-**Expected Impact:** 25-35% of total execution time
-
-**Hot Spots:**
-- `SystemBus::readByte()` - Called for every instruction fetch + every memory load
-- `SystemBus::writeByte()` - Called for every memory store
-- Memory region routing (VRAM, WRAM, HRAM, I/O, cartridge)
-- MBC (Memory Bank Controller) logic
-
-**Frequency:**
-- ~1M instruction fetches/second
-- ~500K additional memory operations/second (loads/stores)
-- Total: ~1.5M bus accesses/second
-
-**Current Implementation:**
-- Assumed: Interface-based routing via if/elseif chains or match expressions
-
-**Opportunities:**
-- Inline fast paths for common memory regions
-- Direct array access for WRAM/HRAM (avoid method calls)
-- Cache frequently accessed I/O registers
-
-### 3. PPU Rendering (CRITICAL PATH)
-**Expected Impact:** 15-20% of total execution time
-
-**Hot Spots:**
-- `Ppu::step()` - Called every CPU cycle (4.19 MHz / 4 = ~1M/second)
-- Pixel fetching and rendering (mode 3)
-- Tile data lookups in VRAM
-- Sprite evaluation (OAM search)
-- Palette color conversion
-
-**Current Implementation:**
-- Simplified pixel transfer timing (172 dots fixed)
-- Scanline buffer rendering
-
-**Opportunities:**
-- Lazy evaluation: only render when scanline completes
-- Cache tile data between frames (tiles rarely change)
-- Optimize color palette lookups with array indexing
-
-### 4. Flag Register Synchronization
-**Expected Impact:** 5-10% of total execution time
-
-**Hot Spots:**
-- `FlagRegister::syncToAF()` - Called after every flag modification
-- `FlagRegister::syncFromAF()` - Called after `POP AF`
-- Bit manipulation for Z, N, H, C flags
-
-**Frequency:**
-- ~50% of instructions modify flags
-- ~500K flag sync operations/second
-
-**Current Implementation:**
-- Two-way synchronization between FlagRegister object and AF Register16
-
-**Opportunities:**
-- Inline flag operations (avoid method call overhead)
-- Lazy synchronization: only sync when AF is read/written
-- Direct bit manipulation on AF low byte
-
-### 5. Clock Tracking
-**Expected Impact:** 3-5% of total execution time
-
-**Hot Spots:**
-- `Clock::tick()` - Called after every CPU instruction
-- Timer updates based on clock cycles
-- PPU/APU synchronization
-
-**Opportunities:**
-- Inline clock accumulation (avoid method call)
-- Batch timer updates (every 4-16 instructions vs every instruction)
-
-## Optimization Priorities
-
-Based on expected impact and implementation effort:
-
-### Priority 1: High Impact, Medium Effort
-
-1. **Inline Instruction Decode** (`Cpu::step()`)
-   - Current: `$instruction = $this->decode($opcode);`
-   - Optimized: `$instruction = self::$instructions[$opcode] ?? self::buildInstruction($opcode);`
-   - Expected: 2-5% performance gain
-   - Eliminates one method call per instruction
-
-2. **Pre-build Instruction Cache**
-   - Current: Lazy building on first access
-   - Optimized: Build all 512 instructions on `InstructionSet` initialization
-   - Expected: 1-2% performance gain (eliminates isset check)
-   - Trade-off: ~100KB additional memory for faster execution
-
-3. **Inline Memory Fast Paths** (`SystemBus`)
-   - Current: All memory access through `readByte()`/`writeByte()`
-   - Optimized: Direct array access for WRAM/HRAM
-   - Expected: 5-10% performance gain
-   - Example:
-     ```php
-     // Fast path for WRAM (0xC000-0xDFFF)
-     if ($address >= 0xC000 && $address <= 0xDFFF) {
-         return $this->wram[$address - 0xC000];
-     }
-     ```
-
-### Priority 2: Medium Impact, Low Effort
-
-4. **Enable OPcache** (Already implemented in Dockerfile)
-   - Expected: 10-15% performance gain
-   - Zero code changes required
-   - Verify with: `php -i | grep opcache`
-
-5. **Enable PHP 8.5 JIT**
-   - Expected: 20-40% performance gain for hot loops
-   - Configuration: `opcache.jit=tracing`, `opcache.jit_buffer_size=100M`
-   - Test with: `make benchmark-jit`
-
-6. **Reduce Flag Sync Overhead**
-   - Current: Two-way sync on every flag operation
-   - Optimized: Lazy sync only when AF is accessed directly
-   - Expected: 3-5% performance gain
-
-### Priority 3: Lower Impact or Higher Risk
-
-7. **Cache Tile Data**
-   - Pre-decode tiles to pixel arrays on VRAM write
-   - Expected: 2-5% performance gain
-   - Risk: Increases memory usage significantly
-
-8. **Lookup Tables for Flag Calculations**
-   - Pre-compute half-carry and carry flags for common operations
-   - Expected: 2-3% performance gain
-   - Trade-off: Memory vs CPU
-
-## PHP-Specific Optimizations
-
-### 1. Object Allocation Reduction
-
-**Current:** Many small objects created per frame (Register8, Color, etc.)
-
-**Optimization:** Use primitives (int, array) where possible
-
-**Example:**
-```php
-// Before: $color = new Color($r, $g, $b);
-// After: $color = ($r << 16) | ($g << 8) | $b;
-```
-
-**Expected Impact:** 5-10% performance gain, reduced GC pressure
-
-### 2. Property Access Optimization
-
-**Current:** Accessing properties through getters (`$cpu->getA()`)
-
-**Optimization:** Direct property access in hot paths (make properties public or use readonly)
-
-**Trade-off:** Breaks encapsulation, but PHP property access is slower than C#/Java
-
-### 3. Method Call Reduction
-
-**Current:** Many small methods called millions of times
-
-**Optimization:** Inline critical methods (especially getters/setters)
-
-**Expected Impact:** 5-10% performance gain
-
-### 4. Array Access Optimization
-
-**Current:** Associative arrays with string keys
-
-**Optimization:** Use integer-indexed arrays where possible
-
-**Example:**
-```php
-// Before: $registers = ['A' => 0, 'B' => 0, ...];
-// After: $registers = [0, 0, ...]; // Use constants for indices
-```
-
-## Measurement Strategy
-
-When profiling infrastructure is available:
-
-1. **Baseline Measurement**
-   ```bash
-   make benchmark ROM=third_party/roms/commercial/tetris.gb FRAMES=3600
-   ```
-   - Record FPS, memory usage
-   - Establish baseline for comparison
-
-2. **Profiling Session**
-   ```bash
-   make profile ROM=third_party/roms/commercial/tetris.gb FRAMES=1000
-   kcachegrind var/profiling/cachegrind.out.*
-   ```
-   - Identify actual top 10 hotspots
-   - Compare with expected hotspots above
-
-3. **Optimization Cycle**
-   For each optimization:
-   - Apply optimization
-   - Run benchmark
-   - Calculate performance delta
-   - Run `make test` to verify correctness
-   - Document in `docs/optimizations.md`
-
-4. **JIT Testing**
-   ```bash
-   # Baseline (no JIT)
-   make benchmark ROM=tetris.gb FRAMES=3600
-
-   # With JIT
-   make benchmark-jit ROM=tetris.gb FRAMES=3600
-
-   # Compare FPS improvement
-   ```
-
-## Expected Performance Gains
-
-Conservative estimates for cumulative optimizations:
-
-| Optimization | Expected Gain | Cumulative FPS |
-|--------------|---------------|----------------|
-| Baseline (Step 13) | - | 27.5 FPS (46%) |
-| OPcache enabled | +12% | 30.8 FPS (51%) |
-| Inline decode | +3% | 31.7 FPS (53%) |
-| Memory fast paths | +7% | 33.9 FPS (57%) |
-| Pre-built instructions | +2% | 34.6 FPS (58%) |
-| Flag sync optimization | +4% | 36.0 FPS (60%) |
-| **PHP 8.5 JIT** | **+30%** | **46.8 FPS (78%)** |
-| Object allocation reduction | +8% | 50.5 FPS (84%) |
-
-**Target Achievement:**
-- ✅ Minimum (30 FPS): Already achieved at baseline
-- 🎯 Target (60 FPS): Achievable with OPcache + JIT + optimizations
-- ⏸️ Stretch (120 FPS): Unlikely in pure PHP, may require native extensions
-
-## Risk Assessment
-
-### Low Risk (Safe to Apply)
-- ✅ OPcache: Standard PHP optimization, zero code changes
-- ✅ Instruction cache pre-building: Pure performance optimization
-- ✅ JIT: Can be toggled on/off, no code changes
-
-### Medium Risk (Test Thoroughly)
-- ⚠️ Inline decode: Minor refactoring, maintain test coverage
-- ⚠️ Memory fast paths: Ensure bus routing logic remains correct
-- ⚠️ Flag sync optimization: Critical for correctness, extensive testing required
-
-### High Risk (Prototype First)
-- 🔴 Object allocation changes: May break type safety
-- 🔴 Breaking encapsulation: Makes code harder to maintain
-- 🔴 Native extensions (FFI): Platform-specific, complex build
-
-## Recommendations
-
-1. **Start with OPcache:** Already configured, just needs verification
-2. **Test JIT:** Biggest potential gain with zero code changes
-3. **Profile first:** Confirm expected hotspots match reality
-4. **Incremental optimizations:** Apply one at a time, measure impact
-5. **Maintain correctness:** All tests must pass after each optimization
-6. **Document everything:** Track what works, what doesn't, and why
-
-## Tools and Commands
-
-```bash
-# Build Docker image with profiling support
-make rebuild
-
-# Run baseline benchmark
-make benchmark ROM=third_party/roms/commercial/tetris.gb FRAMES=3600
-
-# Run with profiling
-make profile ROM=third_party/roms/commercial/tetris.gb FRAMES=1000
-
-# Analyze profile data
-kcachegrind var/profiling/cachegrind.out.*
-
-# Test with JIT
-make benchmark-jit ROM=third_party/roms/commercial/tetris.gb FRAMES=3600
-
-# Memory profiling
-make memory-profile ROM=third_party/roms/commercial/tetris.gb FRAMES=1000
-
-# Verify tests still pass
-make test
-
-# Verify lint passes
-make lint
-```
-
-## Next Steps
-
-1. Build Docker image with updated Dockerfile
-2. Run baseline benchmark to establish current FPS
-3. Run profiling session to identify actual hotspots
-4. Update this document with real profiling data
-5. Apply optimizations in priority order
-6. Measure impact of each optimization
-7. Document results in `docs/optimizations.md`