> This is called branch prediction, it has been the source of many fun security issues...
No, that's speculative execution you just described. Branch prediction was implemented long before out-of-order CPUs were a thing, as you need branch prediction to make the most of pipelining (e.g. fetching and decoding a new instruction while you're still executing the previous one); if you predict branches correctly, you're more likely to keep the pipeline full.
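To make the distinction concrete, here's a rough, self-contained demo of branch prediction itself (illustrative only; the 128 threshold and array size are arbitrary, and at higher optimization levels the compiler may turn the branch into a conditional move and hide the effect). The same branchy loop gets much faster once the data is sorted, because the branch becomes predictable:

    /* branch prediction demo: same work, unpredictable vs predictable branch */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 20)

    static long sum_big(const int *a, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++)
            if (a[i] >= 128)            /* this is the branch the predictor sees */
                sum += a[i];
        return sum;
    }

    static int cmp(const void *x, const void *y) {
        return *(const int *)x - *(const int *)y;
    }

    int main(void) {
        int *a = malloc(N * sizeof *a);
        for (int i = 0; i < N; i++) a[i] = rand() % 256;

        clock_t t0 = clock();
        long s1 = sum_big(a, N);        /* random data: branch mispredicts often */
        clock_t t1 = clock();

        qsort(a, N, sizeof *a, cmp);
        clock_t t2 = clock();
        long s2 = sum_big(a, N);        /* sorted data: branch is predictable */
        clock_t t3 = clock();

        printf("unsorted: %ld in %.3fs, sorted: %ld in %.3fs\n",
               s1, (double)(t1 - t0) / CLOCKS_PER_SEC,
               s2, (double)(t3 - t2) / CLOCKS_PER_SEC);
        free(a);
        return 0;
    }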
A shame operating systems like iOS/iPadOS do not allow JIT. iPad Pros have such fast CPUs that you can't even use them fully because of decisions like this.
Those operating systems allow it, but Apple does not. Agree that it is a total waste.
Good read. But a word of caution: the "JIT vs interpreter" comparisons often favor the interpreter when the JIT is implemented as more-or-less simple inlining of the interpreter code (here called "copy-and-patch", but it's a decades-old approach). I've had fairly senior engineers try to convince me that this is true even for Java VMs. It's not in general, at least not with the right kind of JIT compiler design.
I just recently upgraded[1] a JIT that essentially compiled each bytecode instruction separately to one that shares registers within the same basic block. Easy 40 percent improvement to runtime, as expected.
But something I hadn't expected was it also improved compilation time by 40 percent too (fewer virtual registers made for much faster register allocation).
[1] https://github.com/ZQuestClassic/ZQuestClassic/commit/68087d...
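In case it helps anyone picture what "sharing registers within the same basic block" buys: here's a hypothetical illustration (not the linked commit's actual output, just the C equivalent of the machine code the two strategies tend to emit) for a block of two register-VM ops. An optimizing C compiler would of course clean up the first version itself; the point is that a per-instruction JIT emits code shaped like it:

    #include <stdint.h>

    /* bytecode block:  ADD r0, r1, r2
     *                  MUL r0, r0, r3
     * vm[i] holds VM register r<i> in memory */

    /* per-instruction JIT: every VM register access goes through memory,
     * so r0 is stored after the ADD and immediately reloaded for the MUL */
    void block_per_op(int32_t *vm) {
        vm[0] = vm[1] + vm[2];   /* ADD r0, r1, r2 */
        vm[0] = vm[0] * vm[3];   /* MUL r0, r0, r3 */
    }

    /* whole-basic-block JIT: the register allocator keeps r0 in a machine
     * register across both ops; r0 only touches memory once, at block exit */
    void block_whole(int32_t *vm) {
        int32_t r0 = vm[1] + vm[2];   /* ADD r0, r1, r2 */
        r0 = r0 * vm[3];              /* MUL r0, r0, r3 */
        vm[0] = r0;
    }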
Yeah, I expect the real advantage of a JIT is that you can perform proper register allocation and avoid a lot of stack and/or virtual register manipulation.
I wrote a toy copy-and-patch JIT before and I don't remember being impressed with the performance, even compared to a naive dispatch loop, even on my ~11-year-old processor.
That was a pretty interesting read.
My take is that you can get pretty far these days with a simple bytecode interpreter. Food for thought if your side project could benefit from a DSL!
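For anyone on the fence, "simple" really can be this small: a hypothetical stack machine (opcodes and encoding made up for illustration) with a plain switch dispatch loop, and a parser in front of it gets you a DSL:

    #include <stdint.h>
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_MUL, OP_PRINT, OP_HALT };

    static void run(const int32_t *code) {
        int32_t stack[64];
        int sp = 0;      /* stack pointer: one past the top of stack */
        int ip = 0;      /* instruction pointer into the bytecode */

        for (;;) {
            switch (code[ip++]) {
            case OP_PUSH:  stack[sp++] = code[ip++];          break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp];  break;
            case OP_MUL:   sp--; stack[sp - 1] *= stack[sp];  break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]);     break;
            case OP_HALT:  return;
            }
        }
    }

    int main(void) {
        /* (2 + 3) * 4 */
        int32_t prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD,
                           OP_PUSH, 4, OP_MUL, OP_PRINT, OP_HALT };
        run(prog);    /* prints 20 */
        return 0;
    }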
I'm not really interested in building an interpreter, but the part about scalar out-of-order execution got me thinking. The opcode sequencing logic of an interpreter is inherently serial and an obvious bottleneck (step++; goto *step->label; requires an add, then a fetch, and then a jump, which is pretty ugly).
Why not do the same thing the CPU does and fetch N jump addresses at once?
Now the overhead is gone and you just need to figure out how to let the CPU fetch the chain of instructions that implement the opcodes.
You simply copy the interpreter N times, store N opcode jump addresses in N registers, and each interpreter copy is hardcoded to access its own register during the computed goto.
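The simplest form of this (just fetching one handler address ahead, so the indirect jump's target is already resolved by the time you branch) looks roughly like the sketch below; the N-copy version generalizes it. This assumes GCC/Clang computed gotos, and the opcodes are made up:

    #include <stdint.h>
    #include <stdio.h>

    enum { OP_INC, OP_DEC, OP_HALT };

    static int run(const uint8_t *code) {
        /* GCC/Clang extension: a jump table of label addresses */
        static void *labels[] = { &&op_inc, &&op_dec, &&op_halt };
        int acc = 0;
        const uint8_t *ip = code;
        void *next = labels[*ip];   /* address of the first handler */
        goto *next;

    op_inc:
        next = labels[ip[1]];       /* fetch the NEXT target before doing the work */
        acc++;
        ip++;
        goto *next;

    op_dec:
        next = labels[ip[1]];
        acc--;
        ip++;
        goto *next;

    op_halt:
        return acc;
    }

    int main(void) {
        uint8_t prog[] = { OP_INC, OP_INC, OP_DEC, OP_INC, OP_HALT };
        printf("%d\n", run(prog));  /* prints 2 */
        return 0;
    }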