The Day We Beat LuaJIT

In Post #6, we hit a wall. mandelbrot was 4.2x behind LuaJIT, fib was 3x behind, function calls were 9x behind. We had traced the mandelbrot gap to the register allocator — a frequency-based scheme that could not handle flat distributions — and outlined a plan for live-range-based linear scan. The fib gap was in boxing overhead: every recursive call packed arguments into 32-byte Values and unpacked them on the other side. The function call gap was worse: each call to a trivial function like add(x, 1) went through the full call-exit-reenter cycle.

The post ended with a technical roadmap. Register allocation for mandelbrot, type specialization for fib, inlining for function calls.

What actually happened was more interesting. We did not touch the register allocator. We did not implement general type specialization. Instead, we chased the function call benchmark, found a trick that was worth 5.4x, and watched it accidentally solve half of the fib problem too. Then we pushed fib the rest of the way with three targeted optimizations. At the end of the day, the terminal printed a number we had been chasing for weeks:

fib(20) warm:  24us

The Function Inlining Story

func add(a, b) {
    return a + b
}

func callMany() {
    var x = 0
    for i := 0; i < 10000; i++ {
        x = add(x, 1)
    }
    return x
}

The cost is not in the addition. The cost is in what happens around the addition. Each iteration calls add(x, 1), which means:

That is dozens of instructions of overhead wrapping a single ADD. The architecture at the time could not avoid it: the method JIT compiled functions independently, and calls between them always went through the interpreter’s call machinery.

Part 1: Accumulator Pinning

The first insight was about the loop accumulator. In callMany, x is the accumulator — it is read at the start of each iteration (add(x, 1)), modified by the call’s return value (x = add(...)), and read again at the next iteration. The variable x lives in a VM register slot. Before this optimization, every use of x loaded it from the Value array in memory, and every write stored it back.

Accumulator pinning assigns x to a physical ARM64 register — X24 — for the entire lifetime of the compiled loop. The value stays in X24 across iterations. No loads, no stores, no round-trips through memory.

But pinning the accumulator only helps if the call to add can read from X24 and write to X24 directly. If the call still goes through the full call-exit machinery, the pinned register gets spilled before the call and reloaded after. No net benefit.

Part 2: Argument Source Tracing

The method JIT already knows, at compile time, which VM registers hold the arguments to each CALL instruction. For add(x, 1), it knows argument 0 is x (which is now pinned to X24) and argument 1 is the constant 1.

Argument source tracing extends this: instead of emitting code that copies x into the call frame’s argument slot, the JIT recognizes that x is already in X24 and passes it directly. The constant 1 is handled as an immediate — no memory load, no register copy.

Part 3: Result Destination Tracing

The same logic applies in reverse for the return value. The JIT knows that the result of add(x, 1) gets assigned back to x. Since x is pinned to X24, the JIT emits code that puts the return value directly into X24 instead of routing it through a temporary Value slot.

With all three pieces in place, the entire call to add(x, 1) collapses. The JIT sees:

One instruction for the function call. One instruction for the loop counter. One comparison. One branch. The entire add(x, 1) call — the function lookup, the frame push, the argument copy, the body execution, the return value copy, the frame pop — is gone. Replaced by ADD X24, X24, #1.

The result: 28us to 5.1us. A 5.4x improvement. The LuaJIT gap on function calls went from 9x to 1.7x in one optimization.

The Serendipity

Before starting the function inlining work, fib(20) ran in 35us. LuaJIT does it in 26us — a 1.34x gap. fib was not the target of the function inlining optimization. The callMany benchmark was.

func fib(n) {
    if n < 2 { return n }
    return fib(n - 1) + fib(n - 2)
}

Each recursive call goes through the same call machinery that callMany was suffering from. And the accumulator pinning optimization did not just apply to simple loop accumulators — it applied to any register that the JIT could prove was read and written in a predictable pattern.

The self-call path in the method JIT already recognized fib calling fib. It compiled the recursive calls as direct jumps rather than full interpreter-mediated calls. But before accumulator pinning, the intermediate results still bounced through memory between calls. The pinning framework improved the general register management for self-call functions, reducing unnecessary spills at call boundaries.

After the function inlining changes landed — before any fib-specific optimization — I ran the benchmark out of curiosity:

fib(20) warm:  28us    (was 35us)

A 20% improvement on a benchmark we were not even targeting. The gap to LuaJIT shrank from 1.34x to 1.07x. Almost there. Just by cleaning up register management in the self-call path as a side effect of function inlining.

This is one of the things that makes compiler optimization work unpredictable. You pull on one thread and the whole fabric shifts. The accumulator pinning framework was designed for add(x, 1) in a loop. It turned out to be exactly what fib(n - 1) + fib(n - 2) needed too.

The Final Push

1.07x behind. Two microseconds. The gap was small enough to see the finish line but large enough that noise could not explain it away. fib(20) at 28us, LuaJIT at 26us. Three more optimizations closed it.

Optimization 1: Pin R(0) to X19

In GScript’s calling convention, R(0) holds the first parameter of the current function. For fib(n), R(0) is n. Every entry to the function loads n from the Value array in memory:

X19 is a callee-saved register on ARM64 — it survives across function calls without needing explicit save/restore. By pinning R(0) to X19, the parameter n lives in a physical register from the moment the function is entered. No memory load at function entry. No memory store at call boundaries.

For fib(20), which makes 13,529 function entries, eliminating the memory load at each entry saves roughly 13,000 load instructions. At Apple Silicon’s ~4-cycle load latency, that is ~54,000 cycles — about 1.5us at 3.5 GHz.

Optimization 2: Skip Spills at Nested Returns

When a compiled function returns, the method JIT used to spill all pinned registers back to the Value array. This ensures the caller can find the return value in the expected memory location. But for self-recursive functions, the “caller” is the same function — and it knows the registers are pinned.

Consider fib(n - 1) returning to the fib that called it. The return value goes into a register. Then fib(n - 2) is called. The method JIT was spilling all pinned registers before this second call, even though the only register the second call needs is the parameter n - 2 — which is a freshly computed value, not a pinned register.

The optimization: skip spillPinnedRegs at return sites within self-recursive calls. The caller already knows where the pinned registers are (they are in the same physical registers, because it is the same function). No spill, no reload.

Optimization 3: LOADINT Constant Propagation

LOADINT  R(3), 2       // load constant 2 into R(3)
LT       R(0), R(3)    // compare n < 2

Before this optimization, the JIT compiled LOADINT R(3), 2 as a memory store (write the integer 2 to the Value array at slot 3) and then the LT instruction loaded it back from memory. Two memory operations for a constant.

The fix: when the JIT sees LT R(0), R(3) and R(3) was defined by a LOADINT with a known constant, it emits an immediate comparison:

One instruction instead of three (store + load + compare). And the LOADINT R(3), 2 instruction becomes dead code — nothing else reads R(3), so the JIT eliminates the store entirely. Dead store elimination, for free.

The Combined Effect

Each optimization is small. Pin a register here, skip a spill there, propagate a constant. But in a function that executes 13,529 times in 24 microseconds, every instruction matters. The three optimizations together:

fib(20) timeline:
  35us  → start of day (1.34x behind LuaJIT)
  28us  → after fn-inline accumulator pinning (1.07x behind)
  24us  → after R(0) pinning + const propagation + dead store elimination

The Moment

The benchmark script runs each test 100 times and reports the median. I had been staring at numbers in the 28-35us range for days. The first run after the three optimizations:

=== Method JIT Benchmarks (warm, after compilation) ===
fib(20):       24us
ackermann:     17us
callMany:      5.1us

$ luajit -e "
local function fib(n)
    if n < 2 then return n end
    return fib(n-1) + fib(n-2)
end
-- warmup + measure
for i=1,100 do fib(20) end
local t = os.clock()
for i=1,100 do fib(20) end
print(string.format('fib(20): %.0fus', (os.clock()-t)/100*1e6))
"
fib(20): 26us

Not on a synthetic microbenchmark we designed to win. Not with a warm cache advantage. The same algorithm, the same recursion depth, the same computation. GScript’s method JIT, running on a language implemented in Go, compiling to ARM64 on the fly, producing code that outperforms Mike Pall’s hand-tuned trace compiler on the canonical recursive benchmark.

It is one benchmark. It is the benchmark that plays to our method JIT’s strengths (pure recursion, no loops, no table ops). But it is real.

The Scoreboard

Benchmark	GScript	LuaJIT	Ratio	Status
fib(20)	24us	26us	0.92x	GScript wins by 9%
ackermann(3,11)	17us	~17us	~1.0x	Tied
callMany (fn calls)	5.1us	3us	1.7x	LuaJIT leads
mandelbrot(1000)	0.23s	0.056s	4.0x	LuaJIT leads
table ops (nbody)	268us	36us	7.5x	LuaJIT leads

The pattern in the results reveals the architecture. GScript wins on pure computation with recursion — fib and ackermann are function-call-heavy, integer-only, no table access. The method JIT handles these well because it compiles functions as units, recognizes self-recursion, and (now) pins registers across calls.

GScript loses on loop-heavy computation (mandelbrot) and table-heavy computation (nbody). mandelbrot’s gap is in the trace JIT’s register allocator and sub-trace call overhead, as analyzed in Post #6. nbody’s gap is in the 32-byte Value representation — every table read moves 4x more data than LuaJIT’s 8-byte NaN-boxed TValues.

What This Means

It means the method JIT works. For compute-heavy, self-recursive functions, GScript’s code quality is now at or above LuaJIT’s level. The combination of accumulator pinning, R(0) register pinning, constant propagation, and dead store elimination produces tight ARM64 that wastes very few instructions.

fib and ackermann are the benchmarks most amenable to method-JIT optimization — they are pure recursion, they fit in registers, they have no polymorphism. The remaining benchmarks expose deeper architectural gaps:

Function calls (1.7x gap): callMany is 5.1us vs LuaJIT’s 3us. The remaining gap comes from call patterns that the JIT cannot fully inline — when the callee is not a simple return a + b, when arguments are not directly traceable to pinned registers, when the return value flows through a complex expression. Closing this gap requires more general inlining: recording through CALL instructions and emitting the callee’s body inline, the way LuaJIT’s trace JIT does.

mandelbrot (4.0x gap): The trace JIT’s inner loop is 26 instructions per iteration vs a theoretical minimum of ~15. The register allocator is better than it was (SSA-ref-level linear scan replaced the frequency-based allocator from Post #6), but the 1 million sub-trace calls per mandelbrot(1000) each carry a 61-instruction prologue/epilogue. The fix is code inlining — copying the inner trace’s machine code into the outer trace, eliminating the call boundary entirely.

Table operations (7.5x gap): This is the representation problem. GScript’s 32-byte Values vs LuaJIT’s 8-byte NaN-boxed TValues. Every table read, every table write, every array access moves 4x more memory. The GC sees 4x more pointers. The caches hold 4x fewer values. NaN-boxing is the fix, and it touches every file in the runtime. It is a multi-week redesign, not an optimization pass.

What Changed Today

The fib(20) timeline tells the story of how JIT optimization actually works. Not a single breakthrough, but a cascade:

53us  →  type seeding + tag elimination + compact frames
35us  →  (improvements from previous cycle)
28us  →  fn-inline accumulator pinning (targeting callMany, not fib)
24us  →  R(0) pinning + const propagation + dead store elimination

The biggest single drop — 35us to 28us — came from an optimization aimed at a different benchmark. The targeted fib optimizations (R(0) pinning, const propagation) contributed the final 4us. Both were necessary. Neither alone would have crossed the line.

The lesson is that JIT compiler optimizations are not independent. Improving the general call machinery (for callMany) rippled into recursive calls (for fib). Pinning an accumulator register (for loop accumulators) turned out to pin self-call intermediate values too. The boundaries between “function inlining” and “recursive call optimization” and “register allocation” blur when the same physical registers carry values across all three domains.

This is what Mike Pall understood when he designed LuaJIT as a unified trace compiler rather than a collection of independent optimization passes. Each optimization sees the full picture. We are arriving at the same insight from the opposite direction — building independent optimizations that keep accidentally helping each other.

The Road Ahead

The scoreboard has two green cells and three red ones. The next targets, in order of expected impact:

Where We Left Off