
perf: fully tail-recursive CPS executor with line tracking off heap #156

Merged
davydog187 merged 2 commits into main from perf/cps-executor
Feb 27, 2026



@davydog187 commented Feb 27, 2026

Summary

  • Target A: Removes current_line and current_source from %State{}. These are now threaded as a plain integer line parameter through do_execute/8. A :source_line instruction no longer allocates a new State struct — it updates a local variable only.

  • Target B: Converts do_execute to a fully tail-recursive CPS loop with two new parameters — cont (continuation stack) and frames (call frame stack):

    • :call for Lua closures pushes a frame onto frames and tail-calls the callee. Erlang stack depth is O(1) regardless of Lua recursion depth.
    • :test / :test_and / :test_or push rest onto cont instead of recursing into the body. This also eliminates the O(N) list concatenation (++) in test_and/test_or.
    • All loop instructions use synthetic CPS continuation entries ({:cps_while_test, ...}, {:cps_while_body, ...}, etc.) so break and return work correctly at any nesting depth.
    • :break scans cont for a {:loop_exit, _} marker — no more {:break, regs, state} sentinel tuple.
    • New do_frame_return/6 restores caller context (registers, upvalues, proto, cont) from a saved frame on function return.
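The continuation-stack idea in the bullets above can be illustrated with a minimal sketch. This is not the PR's executor: the module name `CpsSketch`, the toy instructions (`:add`, `:test_gt`, `:while_lt`), and the payloads carried by `{:cps_while_test, ...}` and `{:loop_exit, ...}` are all assumptions made for illustration. Only the shape follows the description: a taken test pushes `rest` onto `cont`, loops run through synthetic entries, and `:break` scans `cont` for the nearest `{:loop_exit, _}` marker.

```elixir
defmodule CpsSketch do
  # Hypothetical toy interpreter. `acc` is the only piece of state and
  # `cont` is a stack of "what to run next" entries. Every call to exec/3
  # and resume/2 below is in tail position.

  def run(instructions, acc \\ 0), do: exec(instructions, acc, [])

  # Current instruction list is exhausted: resume from the continuation stack.
  defp exec([], acc, cont), do: resume(acc, cont)

  defp exec([{:add, k} | rest], acc, cont), do: exec(rest, acc + k, cont)

  # Taken branch: push `rest` onto `cont` and run the body. Nothing is
  # concatenated (no body ++ rest) and nothing waits on the Erlang stack.
  defp exec([{:test_gt, n, body} | rest], acc, cont) when acc > n,
    do: exec(body, acc, [rest | cont])

  defp exec([{:test_gt, _n, _body} | rest], acc, cont), do: exec(rest, acc, cont)

  # Entering a loop pushes a synthetic test entry plus an exit marker.
  defp exec([{:while_lt, limit, body} | rest], acc, cont),
    do: resume(acc, [{:cps_while_test, limit, body}, {:loop_exit, rest} | cont])

  # :break drops everything up to and including the nearest loop-exit marker.
  defp exec([:break | _rest], acc, cont) do
    {after_loop, outer_cont} = unwind_to_loop_exit(cont)
    exec(after_loop, acc, outer_cont)
  end

  defp resume(acc, []), do: acc

  # Condition holds: run the body again, leaving the test entry on the stack.
  defp resume(acc, [{:cps_while_test, limit, body} | _] = cont) when acc < limit,
    do: exec(body, acc, cont)

  # Condition fails: pop the test entry; the exit marker is next.
  defp resume(acc, [{:cps_while_test, _limit, _body} | cont]), do: resume(acc, cont)

  defp resume(acc, [{:loop_exit, rest} | cont]), do: exec(rest, acc, cont)

  defp resume(acc, [rest | cont]) when is_list(rest), do: exec(rest, acc, cont)

  # A :break outside any loop has no marker to find and raises here.
  defp unwind_to_loop_exit([{:loop_exit, rest} | cont]), do: {rest, cont}
  defp unwind_to_loop_exit([_entry | cont]), do: unwind_to_loop_exit(cont)
end

# break inside an if inside the inner while must exit only the inner loop.
nested = [
  {:while_lt, 20,
   [
     {:while_lt, 100, [{:add, 10}, {:test_gt, 0, [:break]}]},
     {:add, 1}
   ]}
]

CpsSketch.run(nested)
# returns 22 (it would be 10 if :break had exited the outer loop)
```

Because the marker search walks `cont` from the top, the innermost enclosing loop is always the one that exits, regardless of how many test bodies sit between the `:break` and the loop.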

Test plan

  • All 1,273 existing tests pass with 0 failures
  • break inside if inside while exits the correct loop (via {:loop_exit, _} marker in cont)
  • Recursive functions (factorial, fibonacci) return correct results
  • return f() tail-call position (result_count == -1) chains through do_frame_return
  • Multi-return, vararg, closures, pcall all covered by existing test suite
  • Run mix run benchmarks/fibonacci.exs to confirm memory reduction vs baseline (8.07 GB → expected < 2.5 GB)

🤖 Generated with Claude Code

Target A: Remove current_line/current_source from %State{} and thread
line as the 8th parameter to do_execute. A :source_line instruction now
updates a local variable only — no State struct allocation on the heap.
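A minimal sketch of the Target A idea, under stated assumptions: the module `LineThreadingSketch` and its toy instructions are hypothetical and far smaller than the real do_execute. It only shows the current line travelling as a plain integer argument, so handling `:source_line` rebinds an argument instead of building a new struct.

```elixir
defmodule LineThreadingSketch do
  # Hypothetical toy interpreter, not the PR's do_execute.
  #
  # Before (conceptually), each :source_line rebuilt a state struct:
  #   run(rest, %{state | current_line: n})
  # Here the line is a loop argument, so :source_line allocates nothing.

  def run(instructions, acc \\ 0, line \\ 0)

  def run([], acc, line), do: {:ok, acc, line}

  # Only the `line` argument changes; no struct is copied.
  def run([{:source_line, n} | rest], acc, _line), do: run(rest, acc, n)

  def run([{:add, k} | rest], acc, line), do: run(rest, acc + k, line)

  # The current line is still available for error reporting.
  def run([:fail | _rest], _acc, line), do: {:error, {:runtime_error, line}}
end

LineThreadingSketch.run([{:source_line, 1}, {:add, 2}, {:source_line, 2}, :fail])
# returns {:error, {:runtime_error, 2}}
```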

Target B: Convert do_execute to a fully tail-recursive CPS loop with
two new parameters: cont (continuation stack) and frames (call frame
stack). Key changes:

- :call for Lua closures pushes a frame onto `frames` and tail-calls
  the callee — Erlang stack depth is now O(1) regardless of Lua
  recursion depth.
- :test/:test_and/:test_or push `rest` onto `cont` instead of
  recursing. Eliminates non-tail calls for every if/else branch, and
  removes the O(N) list concat (++) in test_and/test_or.
- All loop instructions (while/repeat/numeric_for/generic_for) use
  synthetic CPS continuation entries so break and return work correctly
  at any nesting depth without Erlang stack growth.
- :break scans `cont` for a {:loop_exit, _} marker instead of
  returning a {:break, regs, state} sentinel tuple.
- :return/:return_vararg delegate to new do_frame_return/6 which
  restores caller context from a frame entry.
- Native function calls handled inline via continue_after_call/11.
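The frame-stack part of the list above can be sketched as follows. Everything here is an assumption for illustration: the module `FrameSketch`, the toy instruction set, and the frame layout `{dst, regs, rest}` are hypothetical and much simpler than what do_frame_return/6 restores (registers, upvalues, proto, cont). The point is only that a call pushes a frame and tail-calls the callee, and a return pops a frame and tail-calls back into the caller.

```elixir
defmodule FrameSketch do
  # Hypothetical toy VM. A frame holds what the caller needs to resume:
  # {destination register, caller registers, caller's remaining instructions}.

  def call(protos, name, arg) do
    exec(Map.fetch!(protos, name), %{n: arg}, [], protos)
  end

  defp exec([{:return_if_zero, reg, value} | rest], regs, frames, protos) do
    if Map.fetch!(regs, reg) == 0 do
      frame_return(value, frames, protos)
    else
      exec(rest, regs, frames, protos)
    end
  end

  defp exec([{:sub, dst, src, k} | rest], regs, frames, protos) do
    exec(rest, Map.put(regs, dst, Map.fetch!(regs, src) - k), frames, protos)
  end

  defp exec([{:add, dst, src, k} | rest], regs, frames, protos) do
    exec(rest, Map.put(regs, dst, Map.fetch!(regs, src) + k), frames, protos)
  end

  # A call pushes a frame onto `frames` and tail-calls the callee body, so
  # recursion depth lives in a heap list rather than on the Erlang stack.
  defp exec([{:call, dst, name, arg_reg} | rest], regs, frames, protos) do
    callee_regs = %{n: Map.fetch!(regs, arg_reg)}
    exec(Map.fetch!(protos, name), callee_regs, [{dst, regs, rest} | frames], protos)
  end

  defp exec([{:return, reg} | _rest], regs, frames, protos) do
    frame_return(Map.fetch!(regs, reg), frames, protos)
  end

  # No saved frame left: the value is the final result.
  defp frame_return(value, [], _protos), do: value

  # Restore the caller's registers, store the result, and continue the caller.
  defp frame_return(value, [{dst, regs, rest} | frames], protos) do
    exec(rest, Map.put(regs, dst, value), frames, protos)
  end
end

# depth(n) = 0 if n == 0, else depth(n - 1) + 1
protos = %{
  depth: [
    {:return_if_zero, :n, 0},
    {:sub, :m, :n, 1},
    {:call, :r, :depth, :m},
    {:add, :r, :r, 1},
    {:return, :r}
  ]
}

FrameSketch.call(protos, :depth, 100_000)
# returns 100000
```

The recursive call in `depth` is not a tail call in the guest program (an add follows it), yet the host-side loop never recurses non-tail: the pending add is captured in the frame's `rest`.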

Expected outcome: memory/iter ~2-3x lower (eliminating ~3-4 State
allocations per :source_line and all intermediate register tuples held
by Erlang frames), Erlang stack depth O(1) instead of O(call depth).

All 1,273 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@davydog187

 MIX_ENV=benchmark mix run benchmarks/fibonacci.exs                       
  Compiling 3 files (.ex)                                                  
  Compiling 1 file (.ex)                                                   
  Generated lua app                                                    
  Operating System: macOS                                              
  CPU Information: Apple M1 Pro                                        
  Number of Available Cores: 8                                         
  Available memory: 16 GB                                              
  Elixir 1.19.4                                                        
  Erlang 27.3.4.6                                                      
  JIT enabled: true                                                    
                                                                       
  Benchmark suite executing with the following configuration:          
  warmup: 2 s                                                          
  time: 10 s                                                           
  memory time: 1 s                                                     
  reduction time: 0 ns                                                 
  parallel: 1                                                          
  inputs: none specified                                               
  Estimated total run time: 52 s                                       
  Excluding outliers: false                                            
                                                                       
  Benchmarking C Lua (luaport) ...                                     
  Benchmarking lua (chunk) ...                                         
  Benchmarking lua (eval) ...                                          
  Benchmarking luerl ...                                               
  Calculating statistics...                                            
  Formatting results...                                                
                                                                       
  Name                      ips        average  deviation         median         99th %
  C Lua (luaport)        147.30      0.00679 s    ±49.92%      0.00651 s       0.0107 s
  luerl                    0.86         1.16 s    ±12.05%         1.07 s         1.40 s
  lua (eval)               0.68         1.47 s    ±10.79%         1.40 s         1.81 s
  lua (chunk)              0.65         1.54 s    ±23.37%         1.36 s         2.35 s
                                                                       
  Comparison:                                                          
  C Lua (luaport)        147.30                                        
  luerl                    0.86 - 170.61x slower +1.15 s               
  lua (eval)               0.68 - 216.51x slower +1.46 s               
  lua (chunk)              0.65 - 226.91x slower +1.53 s               
                                                                       
  Memory usage statistics:                                             
                                                                       
  Name               Memory usage                                      
  C Lua (luaport)      0.00000 GB                                      
  luerl                   2.45 GB - 15689106.52x memory usage +2.45 GB 
  lua (eval)              7.78 GB - 49713723.62x memory usage +7.78 GB 
  lua (chunk)             7.78 GB - 49712813.33x memory usage +7.78 GB 
                                                                       
  **All measurements for memory usage were the same**   

@davydog187 davydog187 merged commit 13b2964 into main Feb 27, 2026
2 checks passed
@davydog187 davydog187 deleted the perf/cps-executor branch February 27, 2026 18:33