Beyond LuaJIT

Notes on building a JIT compiler for GScript.

Posts

March 2026
The Box/Unbox Toll

NaN-boxing regressed compute benchmarks by 16-48%. Seven optimizations across three JIT tiers and the VM eliminate the overhead, including a pinned tag register, UBFX pointer extraction, trace FORLOOP register pinning, and branchless sign extension. FibRecursive recovers from 47.9µs to 23.9µs.
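
As a rough illustration of two of the tricks named in this teaser, the sketch below mimics in Python what an ARM64 UBFX (unsigned bit-field extract) and an XOR-and-subtract sign extension compute. The 48-bit payload width, the `0xFFFE` tag, and the function names are assumptions made for the example; they are not taken from GScript's actual value layout.

```python
PAYLOAD_BITS = 48  # assumed payload width, for illustration only


def ubfx(value: int, lsb: int, width: int) -> int:
    """Unsigned bit-field extract: `width` bits of `value` starting at bit `lsb`."""
    return (value >> lsb) & ((1 << width) - 1)


def sign_extend(payload: int, width: int) -> int:
    """Sign-extend a `width`-bit two's-complement payload without a branch.

    XOR flips the sign bit, and subtracting it back leaves non-negative
    payloads unchanged while mapping payloads with the sign bit set
    to their negative value.
    """
    sign_bit = 1 << (width - 1)
    return (payload ^ sign_bit) - sign_bit


# Hypothetical boxed integer: a 16-bit tag on top of a 48-bit payload holding -7.
HYPOTHETICAL_INT_TAG = 0xFFFE
boxed = (HYPOTHETICAL_INT_TAG << PAYLOAD_BITS) | (-7 & ((1 << PAYLOAD_BITS) - 1))

tag = ubfx(boxed, PAYLOAD_BITS, 16)                           # top 16 bits
value = sign_extend(ubfx(boxed, 0, PAYLOAD_BITS), PAYLOAD_BITS)  # low 48 bits, signed
```

In generated machine code each helper would typically collapse to one or two instructions; the point of the sketch is only that neither step needs a conditional branch.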

March 2026
Eight Bytes That Change Everything

Season 2 begins. NaN-boxing shrinks Value from 24 bytes to 8. IEEE 754 quiet NaN payload, pointer sub-types, gcRoots safety net, and 210+ JIT codegen changes. The foundation for closing the table-heavy gap.
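
To make the idea concrete, here is a minimal NaN-boxing sketch in Python. It stores either an IEEE 754 double or a 48-bit pointer in one 64-bit word by hiding the pointer in the payload of a quiet NaN. The `0xFFFC` pointer tag, the 48-bit address width, and all function names are assumptions for illustration and are not GScript's real encoding.

```python
import struct

TAG_MASK = 0xFFFF_0000_0000_0000
PAYLOAD_MASK = 0x0000_FFFF_FFFF_FFFF
# Hypothetical tag: sign bit set, exponent all ones, quiet bit set.
# Arithmetic normally yields the canonical NaN, so this prefix is free to reuse.
POINTER_TAG = 0xFFFC_0000_0000_0000


def box_float(x: float) -> int:
    """Reinterpret a double's 8 bytes as an unsigned 64-bit word."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def unbox_float(bits: int) -> float:
    """Reinterpret a 64-bit word as a double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def box_pointer(addr: int) -> int:
    """Pack a 48-bit address into the NaN payload under the pointer tag."""
    if addr & ~PAYLOAD_MASK:
        raise ValueError("address does not fit in 48 bits")
    return POINTER_TAG | addr


def is_pointer(bits: int) -> bool:
    """A word is a boxed pointer when its top 16 bits equal the tag."""
    return (bits & TAG_MASK) == POINTER_TAG


def unbox_pointer(bits: int) -> int:
    """Recover the address from a boxed pointer."""
    if not is_pointer(bits):
        raise TypeError("not a boxed pointer")
    return bits & PAYLOAD_MASK
```

Every value, float or pointer, now occupies exactly 8 bytes. A real implementation would also need to canonicalize NaNs that arrive from outside so they cannot collide with a tag.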

March 2026
The Last Thirty Percent

Slot reuse guard fix, type-specialized arrays, FMADD/FMSUB fusion. Mandelbrot drops 33%. binary_trees runs for the first time. And the Value 24B wall becomes undeniable.

March 2026
The Cold Code Revolution

BOLT-style cold code splitting turns fn_calls from 70% behind LuaJIT to 5% behind. Seven optimizations, six bug fixes, one failed experiment, and the day we nearly caught LuaJIT on function calls.

March 2026
What the Academics Know That We Don't

Five cutting-edge techniques surveyed: Copy-and-Patch, Deegen, MLGO, BOLT, and NaN-boxing. Which ones actually help close the LuaJIT gap?

March 2026
The Day We Beat LuaJIT

fib(20): GScript 24µs, LuaJIT 26µs. Function inlining, register pinning, and the moment we saw the number.

March 2026
The 4.2× Wall

We hit a wall at 4.2× behind LuaJIT: 50 instructions per iteration versus a theoretical 15. The path forward is live-range register allocation.

March 2026
The Blacklist That Changed Everything

One simple blacklist turned mandelbrot from 1.53x to 6.09x. How a million wasted recording attempts were killing performance.
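
The general shape of such a blacklist can be sketched as follows. This is not the code from the post: the threshold of three failed attempts, the class and method names, and the loop address are all hypothetical. The sketch only shows why capping retries matters when a hot loop can never be recorded successfully.

```python
from dataclasses import dataclass, field

MAX_FAILED_ATTEMPTS = 3  # hypothetical threshold


@dataclass
class RecordingPolicy:
    """Tracks aborted trace recordings per loop header and blacklists repeat offenders."""
    failures: dict = field(default_factory=dict)
    blacklist: set = field(default_factory=set)

    def should_record(self, loop_pc: int) -> bool:
        return loop_pc not in self.blacklist

    def on_abort(self, loop_pc: int) -> None:
        count = self.failures.get(loop_pc, 0) + 1
        self.failures[loop_pc] = count
        if count >= MAX_FAILED_ATTEMPTS:
            self.blacklist.add(loop_pc)


def count_recording_attempts(loop_hits: int, policy: RecordingPolicy, loop_pc: int = 0x40) -> int:
    """Simulate a hot loop whose recording always aborts; return how often recording was tried."""
    attempts = 0
    for _ in range(loop_hits):
        if policy.should_record(loop_pc):
            attempts += 1
            policy.on_abort(loop_pc)  # assume every attempt fails
    return attempts
```

With the policy in place, a loop header that is hit a million times triggers only `MAX_FAILED_ATTEMPTS` recording attempts; without the blacklist check, every hit would start another doomed recording.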

March 2026
The Profiler Told Us We Were Wrong

Our trace JIT makes most programs slower. Four research agents, one bombshell, and a new plan.

March 2026
The Day I Wasted Chasing a Fake 88x

The ×88 was wrong. Eight bugs, one humbling day, and the real path to 10x.

March 2026
Introducing SSA IR

Why every successful JIT uses SSA. Our IR design, type specialization, and integer unboxing.

March 2026
From Interpreter to Tracing JIT

18 optimizations, 2.4x result. What worked, what didn't, and why we need SSA.

vs LuaJIT (warm benchmarks) — post-NaN-boxing

NaN-boxing landed (Value 24B → 8B). JIT codegen not yet adapted — regressions expected. Recovery phase ahead.

| Benchmark | GScript | LuaJIT | Result |
|---|---|---|---|
| fib(20) | 47.9µs | 32.0µs | 1.5× gap |
| fn calls (10K) | 4.14µs | 3.1µs | 1.3× gap |
| ackermann(3,4) | 40.2µs | 15.2µs | 2.6× gap |

Full Suite (15 benchmarks)

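| Program | VM | JIT | Trace | Best | LuaJIT |
|---|---|---|---|---|---|
| fib(35) | 1.187s | 0.036s | 0.037s | 0.036s | 0.032s |
| sieve(1M) | 0.284s | 0.025s | 0.026s | 0.025s | 0.014s |
| mandelbrot | 1.509s | 1.500s | timeout | 1.500s | 0.072s |
| ackermann | 0.201s | 0.011s | 0.011s | 0.011s | 0.008s |
| matmul | 1.163s | 1.161s | 1.400s | 1.161s | 0.029s |
| spectral_norm | 0.871s | 0.762s | 0.775s | 0.762s | 0.009s |
| nbody | 2.473s | 2.565s | 2.469s | 2.469s | 0.043s |
| fannkuch(9) | 0.662s | 0.665s | timeout | 0.662s | 0.025s |
| sort(50K) | 0.207s | 0.207s | timeout | 0.207s | 0.016s |
| sum_primes | 0.029s | 0.027s | 0.910s | 0.027s | 0.002s |
| mutual_recursion | 0.150s | 0.245s | 0.259s | 0.150s | 0.005s |
| method_dispatch | 0.093s | 0.123s | 0.127s | 0.093s | 0.000s |
| closure_bench | 0.071s | 0.073s | 0.075s | 0.071s | 0.012s |
| string_bench | 0.051s | 0.052s | 0.187s | 0.051s | 0.010s |
| binary_trees | 2.385s | crash | timeout | 2.385s | 0.17s |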

Warm Micro-benchmarks (JIT vs VM)

| Benchmark | JIT | VM | Speedup |
|---|---|---|---|
| FunctionCalls(10K) | 4.14µs | 586.3µs | ×141.6 |
| HeavyLoop | 38.1µs | 2201.0µs | ×57.8 |
| FibRecursive(20) | 47.9µs | 1669.2µs | ×34.8 |
| Ackermann(3,4) | 40.2µs | 734.7µs | ×18.3 |
| FibIterative(30) | 283.5ns | 1110ns | ×3.9 |

NaN-boxing Impact (24B → 8B)

| Benchmark | Before (24B) | After (8B) | Change |
|---|---|---|---|
| JIT FibRecursive(20) | 19.4µs | 47.9µs | -147% |
| JIT FunctionCalls(10K) | 2.66µs | 4.14µs | -56% |
| JIT HeavyLoop | 25.8µs | 38.1µs | -48% |
| sieve(1M x3) best | 0.080s | 0.025s | +69% |
| mandelbrot best | 0.155s | 1.500s | trace broken |
| binary_trees best | 1.255s | 2.385s | -90% |
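
The Change column is not defined on the page, but its values are consistent with time saved relative to the pre-NaN-boxing time, so that a negative number means the benchmark got slower. The sketch below reproduces the column under that inferred formula; the function name is made up for the example.

```python
def percent_change(before: float, after: float) -> int:
    """Time saved relative to `before`, in percent (negative means slower)."""
    return round((before - after) / before * 100)


# (benchmark, before, after) with both times in the same unit per row
rows = [
    ("JIT FibRecursive(20)", 19.4, 47.9),
    ("JIT FunctionCalls(10K)", 2.66, 4.14),
    ("JIT HeavyLoop", 25.8, 38.1),
    ("sieve(1M x3) best", 0.080, 0.025),
    ("binary_trees best", 1.255, 2.385),
]

changes = {name: percent_change(before, after) for name, before, after in rows}
```

Under this convention the -147% for FibRecursive means the run takes about 2.47 times as long as before, not that time went negative.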