Deegen: A JIT-Capable VM Generator for Dynamic Languages

Rochus · on Nov 19, 2024

> We implement LuaJIT Remake (LJR)[...] using Deegen. Across 44 benchmarks, LJR's interpreter is on average 179% faster than the official PUC Lua interpreter, and 31% faster than LuaJIT's interpreter.

Well, LuaJIT in JIT mode is about factor 3 faster on average than LuaJIT in interpreter mode (depending on the benchmark up to ten times). And LuaJIT in JIT mode is e.g. factor 8 faster on average than PUC Lua 5.1 (see e.g. http://software.rochus-keller.ch/are-we-fast-yet_Lua_results... for more information). So if Degen is factor 2 faster than PUC Lua or factor 1.3 faster than the LuaJIT interpreter, this is not very impressive. But since the LuaJIT interpreter is written in assembler, we might conclude that the speed-up of a manual assembler implementation compared to a generated interpreter is about 30%. Therefore it's no longer worth the effort to implement an interpreter in assembler (even less if we consider cross-platform migration costs). But on the other hand the Degen generated VM is significantly slower than e.g. the Mono VM or CoreCLR in JIT mode (see e.g. https://github.com/rochus-keller/Oberon/blob/master/testcase...).

mananaysiempre · on Nov 24, 2024

> we might conclude that the speed-up of a manual assembler implementation compared to a generated interpreter is about 30%[, t]herefore it's no longer worth the effort to implement an interpreter in assembler

You got that backwards. The paper reports Deegen’s generated interpreter is faster than LuaJIT’s handwritten one by 30%. That’s actually pretty impressive—and pretty impressively straightforwardly achieved[1], TL;DR: instruction dispatch via tail calls avoids the pessimized register allocation that you get for a huge monolithic interpreter loop.

[1] https://sillycross.github.io/2022/11/22/2022-11-22/

CodeArtisan · on Nov 24, 2024

    # decode next bytecode opcode
    movzwl      8(%r12), %eax
    # advance bytecode pointer to next bytecode
    addq        $8, %r12
    # load the interpreter function for next bytecode
    movq        __deegen_interpreter_dispatch_table(,%rax,8), %rax
    # dispatch to next bytecode
    jmpq        *%rax

You may reduce that even further by pre-decoding the bytecode: you replace a bytecode by the address of the its implementation and then do (with GCC extended goto)

  goto *program_bytecodes[counter]

dkersten · on Nov 24, 2024

I've been playing around with this and its worth noting that pre-decoding the bytecode because it means every instruction (without operands) is the width of a pointer (8 bytes on x86) which means you fit far fewer instructions into cache, eg my opcodes are a byte, so that's 8x more instructions. I haven't had time to compare it in benchmarks to see what the real world difference is, but its worth keeping in mind.

Somewhat off topic, looking at that assembly... mine compiles to (for one of the opcodes):

    movzx  eax,BYTE PTR [rdi]
    lea    r9,[rip+0x1d6fd]        # 2ae30 <instructions_table>
    mov    rax,QWORD PTR [r9+rax*8]
    inc    rdi
    jmp    rax

(also compiled from C++ with clang's musttail annotation)

versteegen · on Nov 24, 2024

I have wondered whether it's worth storing instruction offsets (from the first instruction) rather than raw instruction pointers to increase cache efficiency, then they could be encoded in just 2 (or at worst 3) bytes. At the cost of an extra register.

dkersten · on Nov 24, 2024

That sounds like a good middle ground. Ahh I wish I had some more time, I’d love to benchmark all three and see how they compare.

In my own you VM, I do actually have a spare general purpose registers available for use. Now I just need to find the time to try it out…

fweimer · on Nov 24, 2024

They still have register allocation issues:

> Register shuffling to fulfill C calling convention when making a runtime call.

Not sure how common is that in their benchmarks because it's tempting to handle everything frequently used as bytecode.

Rochus · on Nov 24, 2024

> Deegen’s generated interpreter is faster than LuaJIT’s handwritten one by 30%

That's exactly what I've written. But apparently I got that wrong with their baseline JIT.

mananaysiempre · on Nov 24, 2024

> That's exactly what I've written.

In that case I don’t get the logic. “It’s no longer worth the effort to handcode an interpreter because that’d only be 30% faster” is a sentiment I could understand. “It’s no longer worth the effort to handcode an interpreter because that’d be 30% slower” I can’t. It’s that it’d be worth or not worth the effort—it’s that it’s actively detrimental! (For this particular application anyway.)

> Also note that the Deegen "interpreter" uses a "baseline JIT".

What? No it doesn’t? Unless the paper is deliberately misleading, they are completely different modules (utilizing the same set of bytecode definitions). The paper explicitly describes them as implementing the first two tiers of a three-tier architecture—two different tiers. Not once does the description of the interpreter in section 6 mention JITting anything. Figures 26–27 show e.g. array3d on “LJR (interpreter only)” is at 3× PUC Lua speed (same as “LuaJIT (interpreter only)”), while on “LJR (baseline JIT)” it’s at 7× PUC Lua speed (compared to 30× on “LuaJIT”).

Rochus · on Nov 24, 2024

English is not my native language; I probably should have written "that the speed-up *from* a manual assembler implementation compared to a generated interpreter is about 30%"; the point is that the speedup is small but at least demonstrates that assembler programming apparently isn't worth it any longer.

> Unless the paper is deliberately misleading,

Apparently I misinterpreted their paper concerning the JIT; as pointed out by others they indeed run separate measurements with baseline JIT on and off; so apparently it was off for the measurement I referred to. All in all it confirms that even for the JIT case assembler programming isn't worth it.

igouy · on Nov 19, 2024

"the disassembly of the Deegen-generated interpreter, baseline JIT, and the generated JIT code rivals the assembly code hand-written by assembly experts in state-of-the-art VMs."

Rochus · on Nov 19, 2024

Apparently they compare their JIT with the LuaJIT interpreter. I would be impressed if their JIT was 30% faster on average than LuaJIT in JIT mode. The Graal/Truffle generated VMs are much faster (see e.g. http://software.rochus-keller.ch/awfy-bun-summary.ods).

mananaysiempre · on Nov 24, 2024

> Graal/Truffle generated VMs are much faster (see e.g. [link]).

Faster than what? I don’t see any mention of any kind of Lua in that table or in the page it mentions. It’d be awesome[1] if Graal could outdo LuaJIT on Lua, and I was initially excited to learn that it did, but I don’t see anything about that there.

[1] Or as awesome as it’s possible to be for something that Oracle evidently intends to patent to the gills, anyway.

Rochus · on Nov 24, 2024

You can e.g. use Node.js as a reference and compare with these results: https://github.com/rochus-keller/Oberon/blob/master/testcase.... LuaJIT with JIT on is usually factor two slower than Node.js and factor five slower than C/C++ in my measurements.

tekknolagi · on Nov 24, 2024

No, they also compare JIT to JIT and theirs is 30% slower. But it's only a baseline JIT and they'll have optimizing soon (tm).

mananaysiempre · on Nov 24, 2024

It’s also a completely different class of JIT: a method-at-a-time one, not a tracing one. As I’ve mentioned in a thread some time ago, this is a very impressive project that is a JIT for Lua, but it has so little to do with LuaJIT’s architecture otherwise that calling it LuaJIT Remake feels actively misleading. It’s SpiderMonkey for Lua, if anything.

Rochus · on Nov 24, 2024

Ok, apparently I misinterpreted the text; it would be easier to unserstand if they used factors instead of percentages. Actually it's pretty close to LuaJIT in JIT performance, which calls into question both the use of manual assembler programming and the huge effort and complexity of the handmade tracing JIT. However, it is not yet clear to me whether there are other factors, so that, for example, the results on 32-bit architectures would be better for LuaJIT. Or maybe they profit of the fact that there are still a lot of operations in LuaJIT not supported by the JIT (such as FNEW, which is detrimental for all applications depending on closures), and their baseline JIT supports them in contrast.