Beyond LuaJIT

Notes on building a JIT compiler for GScript.

Posts

March 2026
The Box/Unbox Toll

NaN-boxing regressed compute benchmarks by 16-48%. Seven optimizations across three JIT tiers and the VM eliminate the overhead, including a pinned tag register, UBFX pointer extraction, trace FORLOOP register pinning, and branchless sign extension. FibRecursive recovers from 47.9µs to 23.9µs.
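For readers unfamiliar with two of the tricks named above, here is a rough Python sketch of what they compute. It is not the JIT's actual code: the 48-bit payload width, the sample word, and the function names are assumptions made for illustration. `extract_bits` mirrors the semantics of the ARM64 UBFX (unsigned bitfield extract) instruction, and `sign_extend_48` uses the xor-then-subtract identity so no conditional is needed.

```python
PAYLOAD_BITS = 48  # assumed payload width; GScript's real layout may differ


def extract_bits(word: int, lsb: int, width: int) -> int:
    """Unsigned bitfield extract: take `width` bits of `word` starting at bit `lsb`."""
    return (word >> lsb) & ((1 << width) - 1)


def sign_extend_48(payload: int) -> int:
    """Sign-extend a 48-bit two's-complement payload without branching."""
    sign_bit = 1 << (PAYLOAD_BITS - 1)
    # Flipping the sign bit and subtracting it back maps the upper half
    # of the 48-bit range onto the negative integers.
    return (payload ^ sign_bit) - sign_bit


# Hypothetical boxed word: tag bits in the top 16 bits, payload -2 in the low 48.
boxed = 0x7FFA_FFFF_FFFF_FFFE
payload = extract_bits(boxed, 0, PAYLOAD_BITS)
value = sign_extend_48(payload)
```

In native code the extraction is a single instruction and the sign extension is two ALU operations, which is why both are attractive on a hot unboxing path.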

March 2026
Eight Bytes That Change Everything

Season 2 begins. NaN-boxing shrinks Value from 24 bytes to 8. IEEE 754 quiet NaN payload, pointer sub-types, gcRoots safety net, and 210+ JIT codegen changes. The foundation for closing the table-heavy gap.
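As a minimal sketch of the idea, assuming a made-up layout rather than GScript's real one: an ordinary double is stored as its own 64-bit pattern, while every other type hides a small tag and a 48-bit payload inside the quiet-NaN bit space. The tag values, field widths, and function names below are hypothetical.

```python
import struct

# Hypothetical layout, for illustration only:
#   bits 51..62 all set -> quiet-NaN space
#   bits 48..50         -> 3-bit type tag (0 is left for a genuine float NaN)
#   bits 0..47          -> payload (integer or pointer bits)
QNAN = 0x7FF8_0000_0000_0000
TAG_SHIFT = 48
PAYLOAD_MASK = (1 << 48) - 1
TAG_INT, TAG_PTR = 1, 2


def box_float(x: float) -> int:
    """Reinterpret a double as its 64-bit pattern; floats box as themselves."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def unbox_float(bits: int) -> float:
    """Reinterpret a 64-bit pattern as a double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def box_tagged(tag: int, payload: int) -> int:
    """Pack a tag and a 48-bit payload into the quiet-NaN bit pattern."""
    return QNAN | (tag << TAG_SHIFT) | (payload & PAYLOAD_MASK)


def tag_of(bits: int) -> int:
    """Return 0 for plain floats, otherwise the 3-bit tag."""
    if (bits & QNAN) != QNAN:
        return 0
    return (bits >> TAG_SHIFT) & 0x7


def payload_of(bits: int) -> int:
    """Return the low 48 bits of a boxed word."""
    return bits & PAYLOAD_MASK


ptr = box_tagged(TAG_PTR, 0x1234)
```

Every value fits in one 8-byte word, and a type check becomes a mask and compare on that word. This sketch assumes float NaNs are canonical, so a NaN carrying bits in the tag field would be misread as a tagged value.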

March 2026
The Last Thirty Percent

Slot reuse guard fix, type-specialized arrays, FMADD/FMSUB fusion. Mandelbrot drops 33%. binary_trees runs for the first time. And the Value 24B wall becomes undeniable.

March 2026
The Cold Code Revolution

BOLT-style cold code splitting turns fn_calls from 70% behind LuaJIT to 5% behind. Seven optimizations, six bug fixes, one failed experiment, and the day we nearly caught LuaJIT on function calls.

March 2026
What the Academics Know That We Don't

Five cutting-edge techniques surveyed: Copy-and-Patch, Deegen, MLGO, BOLT, and NaN-boxing. Which ones actually help close the LuaJIT gap?

March 2026
The Day We Beat LuaJIT

fib(20): GScript 24µs, LuaJIT 26µs. Function inlining, register pinning, and the moment we saw the number.

March 2026
The 4.2× Wall

We hit a wall at 4.2× behind LuaJIT: 50 instructions per iteration versus a theoretical 15. The path forward: live-range register allocation.

March 2026
The Blacklist That Changed Everything

One simple blacklist turned mandelbrot from 1.53x to 6.09x. How a million wasted recording attempts were killing performance.
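A toy model of why a blacklist matters, not the GScript recorder itself: if a hot loop's trace recording always aborts and nothing remembers that, the recorder retries on nearly every iteration. The thresholds, the class, and the `loop_pc` key below are all assumptions for illustration.

```python
from collections import defaultdict

HOT_THRESHOLD = 3  # assumed: loop hits before recording is attempted
MAX_FAILURES = 2   # assumed: aborted recordings tolerated before blacklisting


class TraceSelector:
    """Decides whether to start recording a trace at a loop header."""

    def __init__(self, use_blacklist: bool):
        self.use_blacklist = use_blacklist
        self.hits = defaultdict(int)
        self.failures = defaultdict(int)
        self.blacklist = set()

    def should_record(self, loop_pc: int) -> bool:
        if loop_pc in self.blacklist:
            return False
        self.hits[loop_pc] += 1
        return self.hits[loop_pc] >= HOT_THRESHOLD

    def recording_aborted(self, loop_pc: int) -> None:
        if not self.use_blacklist:
            return
        self.failures[loop_pc] += 1
        if self.failures[loop_pc] >= MAX_FAILURES:
            self.blacklist.add(loop_pc)


def simulate(iterations: int, use_blacklist: bool, loop_pc: int = 0x40) -> int:
    """Count recording attempts for a loop whose recording always aborts."""
    selector = TraceSelector(use_blacklist)
    attempts = 0
    for _ in range(iterations):
        if selector.should_record(loop_pc):
            attempts += 1
            selector.recording_aborted(loop_pc)
    return attempts
```

Without the blacklist, `simulate` attempts a recording on every iteration once the loop is hot. With it, the attempts stop after `MAX_FAILURES` aborts and the loop simply runs in the interpreter.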

March 2026
The Profiler Told Us We Were Wrong

Our trace JIT makes most programs slower. Four research agents, one bombshell, and a new plan.

March 2026
The Day I Wasted Chasing a Fake 88x

The 88x was wrong. Eight bugs, one humbling day, and the real path to 10x.

March 2026
Introducing SSA IR

Why every successful JIT uses SSA. Our IR design, type specialization, and integer unboxing.

March 2026
From Interpreter to Tracing JIT

18 optimizations, 2.4x result. What worked, what didn't, and why we need SSA.

vs LuaJIT (warm benchmarks) — post-NaN-boxing

NaN-boxing landed (Value 24B → 8B). JIT codegen is not yet adapted, so regressions are expected. Recovery phase ahead.

Benchmark	GScript	LuaJIT	Result
fib(20)	47.9µs	32.0µs	1.5× gap
fn calls (10K)	4.14µs	3.1µs	1.3× gap
ackermann(3,4)	40.2µs	15.2µs	2.6× gap

Full Suite (15 benchmarks)

Program	VM	JIT	Trace	Best	LuaJIT
fib(35)	1.187s	0.036s	0.037s	0.036s	0.032s
sieve(1M)	0.284s	0.025s	0.026s	0.025s	0.014s
mandelbrot	1.509s	1.500s	timeout	1.500s	0.072s
ackermann	0.201s	0.011s	0.011s	0.011s	0.008s
matmul	1.163s	1.161s	1.400s	1.161s	0.029s
spectral_norm	0.871s	0.762s	0.775s	0.762s	0.009s
nbody	2.473s	2.565s	2.469s	2.469s	0.043s
fannkuch(9)	0.662s	0.665s	timeout	0.662s	0.025s
sort(50K)	0.207s	0.207s	timeout	0.207s	0.016s
sum_primes	0.029s	0.027s	0.910s	0.027s	0.002s
mutual_recursion	0.150s	0.245s	0.259s	0.150s	0.005s
method_dispatch	0.093s	0.123s	0.127s	0.093s	0.000s
closure_bench	0.071s	0.073s	0.075s	0.071s	0.012s
string_bench	0.051s	0.052s	0.187s	0.051s	0.010s
binary_trees	2.385s	crash	timeout	2.385s	0.17s

Warm Micro-benchmarks (JIT vs VM)

Benchmark	JIT	VM	Speedup
FunctionCalls(10K)	4.14µs	586.3µs	×141.6
HeavyLoop	38.1µs	2201.0µs	×57.8
FibRecursive(20)	47.9µs	1669.2µs	×34.8
Ackermann(3,4)	40.2µs	734.7µs	×18.3
FibIterative(30)	283.5ns	1110ns	×3.9

NaN-boxing Impact (24B → 8B)

Benchmark	Before (24B)	After (8B)	Change
JIT FibRecursive(20)	19.4µs	47.9µs	-147%
JIT FunctionCalls(10K)	2.66µs	4.14µs	-56%
JIT HeavyLoop	25.8µs	38.1µs	-48%
sieve(1M x3) best	0.080s	0.025s	+69%
mandelbrot best	0.155s	1.500s	trace broken
binary_trees best	1.255s	2.385s	-90%
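The Change column appears to follow one sign convention: the reduction in run time relative to the 24B figure, so negative means slower after NaN-boxing and positive means faster. A short check under that assumption (the function name is ours, and the row values are copied from the table above):

```python
def change_percent(before: float, after: float) -> float:
    """Run-time change relative to `before`; positive means faster."""
    return (before - after) / before * 100


# (before, after) pairs taken from the table; units cancel out.
rows = {
    "JIT FibRecursive(20)": (19.4, 47.9),
    "JIT FunctionCalls(10K)": (2.66, 4.14),
    "JIT HeavyLoop": (25.8, 38.1),
    "sieve(1M x3) best": (0.080, 0.025),
    "binary_trees best": (1.255, 2.385),
}

changes = {name: round(change_percent(b, a)) for name, (b, a) in rows.items()}
```

Under this convention the rounded values match the table, including the lone improvement on sieve. The mandelbrot row is left out because the table reports it as "trace broken" rather than as a percentage.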