March 2026 — Beyond LuaJIT, Post #12
Post #11 ended on a high note: NaN-boxing shrunk every Value from 24 bytes to 8. Sieve got 3.2x faster. The table-heavy future looked bright.
But the compute benchmarks told a different story. FibRecursive regressed 41%. Ackermann regressed 16%. HeavyLoop regressed 48%. The 8-byte Value was paying a tax on every integer operation — a tax that didn’t exist in the old 24-byte layout.
The tax is box/unbox.
In the old 24-byte layout, an integer lived at a fixed offset:
LDR X0, [regRegs, slot*24 + 8] // load .data field: 1 instruction
STR X0, [regRegs, slot*24 + 8] // store: 1 instruction, type tag already set
In NaN-boxing, the integer is encoded in the value itself. Reading it requires sign-extending the 48-bit payload. Writing it requires masking and OR-ing the tag:
// Unbox (read): extract 48-bit signed int
LDR X0, [regRegs, slot*8]
SBFX X0, X0, #0, #48 // sign-extend bits 0-47
// Box (write): create NaN-boxed int
MOVZ X1, #0xFFFE, LSL #48 // load tag constant
UBFX X0, X0, #0, #48 // mask to 48 bits
ORR X0, X0, X1 // set tag
STR X0, [regRegs, slot*8] // store
Three extra instructions per write. One extra per read. For a benchmark like fib(35) with 14.9 million recursive calls, each touching 2-3 integer values per call, that’s ~100 million wasted instructions.
We attacked the problem from every angle. The audit identified 7 optimization targets across 3 tiers (method JIT, trace JIT, VM interpreter).
The int tag constant 0xFFFE000000000000 was being reloaded on every box operation. We pinned it in X24 at function entry, saving one MOVZ instruction per box. Applied across all three JIT tiers.
Before (3 instructions):
MOVZ scratch, #0xFFFE, LSL #48 // load tag
UBFX dst, src, #0, #48 // mask
ORR dst, dst, scratch // set tag
After (2 instructions):
UBFX dst, src, #0, #48 // mask
ORR dst, dst, X24 // set tag (X24 = pinned constant)
Extracting a 44-bit pointer from a NaN-boxed value required loading a 64-bit mask constant:
Before (4 instructions):
MOVZ scratch, #0xFFFF // 3 instructions for 0x00000FFFFFFFFFFF
MOVK scratch, #0xFFFF, LSL #16
MOVK scratch, #0x0FFF, LSL #32
AND dst, src, scratch
After (1 instruction):
UBFX dst, src, #0, #44 // extract bits 0-43
One instruction. Applied at 19 call sites across method JIT, trace JIT, and SSA codegen.
The trace JIT’s FORLOOP was loading, unboxing, computing, boxing, and storing every iteration — 14 instructions for what should be 3:
Before:
LDR X0, [regs, idxOff] // load idx
SBFX X0, X0, #0, #48 // unbox
LDR X1, [regs, stepOff] // load step
SBFX X1, X1, #0, #48 // unbox
ADD X0, X0, X1 // compute
UBFX X2, X0, #0, #48 // box: mask
ORR X2, X2, X24 // box: tag
STR X2, [regs, idxOff] // store idx
STR X2, [regs, loopVarOff] // store loop var
LDR X1, [regs, limitOff] // load limit
SBFX X1, X1, #0, #48 // unbox
CMP X0, X1 // compare
B.GT loop_done // exit
After (when idx/step/limit are register-allocated):
ADD idxReg, idxReg, stepReg // idx += step
CMP idxReg, limitReg // compare
B.GT loop_done // exit
Plus a memory writeback for correctness (other ops still read from memory), but the core loop drops from 14 to ~6 instructions.
When both operands of an ADD/SUB/MUL are register-allocated, the trace now computes directly in registers instead of loading from memory.
The Int() and RawInt() methods used a branch for sign extension:
raw := uint64(v) & payloadMask
if raw & (1<<47) != 0 {
return int64(raw | 0xFFFF000000000000)
}
return int64(raw)
Replaced with branchless arithmetic shift:
return int64(uint64(v) << 16) >> 16
One line. The Go compiler turns this into LSL + ASR — two instructions, zero branches. For the VM dispatch loop, which calls Int() on every arithmetic operation, eliminating the branch prediction overhead matters.
The VM’s FORLOOP counter is guaranteed to be in the 48-bit range (it’s initialized from a valid integer and incremented by a valid integer). The range check in SetInt is wasted work:
// Before: checks every iteration
func (v *Value) SetInt(i int64) {
if i > maxInt48 || i < minInt48 { ... }
*v = Value(tagInt | (uint64(i) & payloadMask))
}
// After: FORLOOP uses unchecked version
func (v *Value) SetIntUnchecked(i int64) {
*v = Value(tagInt | (uint64(i) & payloadMask))
}
The arithmetic helper functions were calling v.Int() (value receiver, copies) and IntValue() (range check). Changed to v.RawInt() (pointer receiver, branchless) and v.SetInt() (in-place mutation):
// Before
*dst = IntValue(a.Int() + b.Int())
// After
dst.SetInt(a.RawInt() + b.RawInt())
During this work, we discovered that the NaN-boxing migration had introduced a subtle bug: Go’s zero value for uint64 is 0, which corresponds to float64(0.0) in IEEE 754 — not nil (0xFFFC000000000000).
Every make([]Value, n) in the codebase created arrays where uninitialized slots appeared as 0.0 instead of nil. The Chess AI benchmark crashed because board[key] returned 0.0 for empty squares, passed the != nil check, and then attempted to index a number.
Fix: MakeNilSlice(n) helper that pre-fills with NilValue(). Applied to table expansion and VM register initialization.
All measurements on Apple M4 Max (under moderate system load):
| Benchmark | Before | After | Improvement |
|---|---|---|---|
| JIT FibRecursive(20) | 47.9us | 23.9us | 50% faster |
| JIT Ackermann(3,4) | 40.2us | 21.9us | 46% faster |
| JIT FibIterative(30) | 284ns | 172ns | 39% faster |
| JIT FunctionCalls(10K) | 4.1us | 2.6us | 38% faster |
| JIT HeavyLoop | 38.1us | 25.0us | 34% faster |
| VM FibRecursive(20) | 1669us | 821us | 51% faster |
| VM HeavyLoop | 2201us | 1000us | 55% faster |
The NaN-boxing regression is fully recovered and then some. FibRecursive is now faster than pre-NaN-boxing (23.9us vs 19.4us pre-NaN, vs 47.9us post-NaN), back in striking distance of LuaJIT (25us).
The box/unbox toll is paid. Season 2.1 — the NaN-boxing foundation — is complete. The runtime moves 8 bytes per Value, the JIT generates tight box/unbox sequences, and the VM dispatch loop is branchless on the hot path.
Season 2.2 is the custom heap: mmap-based arena allocation for Tables, Strings, and Closures. The gcRoots safety net (which keeps every allocated pointer alive forever) must be replaced with a proper mark-sweep collector. This is where the table-heavy benchmarks — matmul at 40x, spectral_norm at 85x, nbody at 57x — will finally start closing the gap.
Beyond LuaJIT — a series about building a JIT compiler from scratch.