Beyond LuaJIT
Notes on building a JIT compiler for GScript.
Posts
March 2026
The Box/Unbox Toll
NaN-boxing regressed compute benchmarks by 16-48%. Seven optimizations across three JIT tiers and the VM eliminate the overhead, including a pinned tag register, UBFX pointer extraction, trace FORLOOP register pinning, and branchless sign extension. FibRecursive recovers from 47.9µs to 23.9µs.
March 2026
Eight Bytes That Change Everything
Season 2 begins. NaN-boxing shrinks Value from 24 bytes to 8. IEEE 754 quiet NaN payload, pointer sub-types, gcRoots safety net, and 210+ JIT codegen changes. The foundation for closing the table-heavy gap.
March 2026
The Last Thirty Percent
Slot reuse guard fix, type-specialized arrays, FMADD/FMSUB fusion. Mandelbrot drops 33%. binary_trees runs for the first time. And the Value 24B wall becomes undeniable.
March 2026
The Cold Code Revolution
BOLT-style cold code splitting turns fn_calls from 70% behind LuaJIT to 5% behind. Seven optimizations, six bug fixes, one failed experiment, and the day we nearly caught LuaJIT on function calls.
March 2026
What the Academics Know That We Don't
Five cutting-edge techniques surveyed: Copy-and-Patch, Deegen, MLGO, BOLT, and NaN-boxing. Which ones actually help close the LuaJIT gap?
March 2026
The Day We Beat LuaJIT
fib(20): GScript 24µs, LuaJIT 26µs. Function inlining, register pinning, and the moment we saw the number.
March 2026
The 4.2× Wall
We hit a wall at 4.2× behind LuaJIT: 50 instructions per iteration versus a theoretical 15. The path forward: live-range register allocation.
March 2026
The Blacklist That Changed Everything
One simple blacklist turned mandelbrot from 1.53x to 6.09x. How a million wasted recording attempts were killing performance.
March 2026
The Profiler Told Us We Were Wrong
Our trace JIT makes most programs slower. Four research agents, one bombshell, and a new plan.
March 2026
The Day I Wasted Chasing a Fake 88x
The ×88 was wrong. Eight bugs, one humbling day, and the real path to 10x.
March 2026
Introducing SSA IR
Why every successful JIT uses SSA. Our IR design, type specialization, and integer unboxing.
March 2026
From Interpreter to Tracing JIT
18 optimizations, 2.4x result. What worked, what didn't, and why we need SSA.
vs LuaJIT (warm benchmarks) — post-NaN-boxing
NaN-boxing landed (Value 24B → 8B). JIT codegen not yet adapted — regressions expected. Recovery phase ahead.
| Benchmark | GScript | LuaJIT | Result |
| --- | --- | --- | --- |
| fib(20) | 47.9µs | 32.0µs | 1.5× gap |
| fn calls (10K) | 4.14µs | 3.1µs | 1.3× gap |
| ackermann(3,4) | 40.2µs | 15.2µs | 2.6× gap |
Full Suite (15 benchmarks)
| Program | VM | JIT | Trace | Best | LuaJIT |
| --- | --- | --- | --- | --- | --- |
| fib(35) | 1.187s | 0.036s | 0.037s | 0.036s | 0.032s |
| sieve(1M) | 0.284s | 0.025s | 0.026s | 0.025s | 0.014s |
| mandelbrot | 1.509s | 1.500s | timeout | 1.500s | 0.072s |
| ackermann | 0.201s | 0.011s | 0.011s | 0.011s | 0.008s |
| matmul | 1.163s | 1.161s | 1.400s | 1.161s | 0.029s |
| spectral_norm | 0.871s | 0.762s | 0.775s | 0.762s | 0.009s |
| nbody | 2.473s | 2.565s | 2.469s | 2.469s | 0.043s |
| fannkuch(9) | 0.662s | 0.665s | timeout | 0.662s | 0.025s |
| sort(50K) | 0.207s | 0.207s | timeout | 0.207s | 0.016s |
| sum_primes | 0.029s | 0.027s | 0.910s | 0.027s | 0.002s |
| mutual_recursion | 0.150s | 0.245s | 0.259s | 0.150s | 0.005s |
| method_dispatch | 0.093s | 0.123s | 0.127s | 0.093s | 0.000s |
| closure_bench | 0.071s | 0.073s | 0.075s | 0.071s | 0.012s |
| string_bench | 0.051s | 0.052s | 0.187s | 0.051s | 0.010s |
| binary_trees | 2.385s | crash | timeout | 2.385s | 0.17s |
Warm Micro-benchmarks (JIT vs VM)
| Benchmark | JIT | VM | Speedup |
| --- | --- | --- | --- |
| FunctionCalls(10K) | 4.14µs | 586.3µs | ×141.6 |
| HeavyLoop | 38.1µs | 2201.0µs | ×57.8 |
| FibRecursive(20) | 47.9µs | 1669.2µs | ×34.8 |
| Ackermann(3,4) | 40.2µs | 734.7µs | ×18.3 |
| FibIterative(30) | 283.5ns | 1110ns | ×3.9 |
NaN-boxing Impact (24B → 8B)
| Benchmark | Before (24B) | After (8B) | Change |
| --- | --- | --- | --- |
| JIT FibRecursive(20) | 19.4µs | 47.9µs | -147% |
| JIT FunctionCalls(10K) | 2.66µs | 4.14µs | -56% |
| JIT HeavyLoop | 25.8µs | 38.1µs | -48% |
| sieve(1M x3) best | 0.080s | 0.025s | +69% |
| mandelbrot best | 0.155s | 1.500s | trace broken |
| binary_trees best | 1.255s | 2.385s | -90% |