Can someone with domain expertise comment specifically on the criticisms regarding the mistakes in the core spec, like wasted 32-bit encoding space and missing important instructions, detailed in [1] and [2]?
In my opinion those complaints are generally correct. I work on low-level OS and toolchain components (static and dynamic linkers) and interact with CPU architects. When I looked at RISC-V's addressing modes I was dumbstruck: they are completely inadequate for modern high performance desktop or mobile cores. Just compare the design of PC-relative branches on the two architectures:
ARM64 (b/bl): These instructions reserve 26 bits of the 32-bit instruction for a branch target (with an implicit 2-bit shift, since instructions are 4-byte aligned), resulting in the ability to directly jump ±128MB (AKA ±2^25 instructions).
RISC-V (jal): This instruction reserves 20 bits of the 32-bit instruction for displacement (there is an implicit shift of 1 bit, since all RISC-V instructions are 2-byte aligned). This results in the ability to jump ±1MB (AKA ±2^19 instructions).
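The back-of-envelope arithmetic behind those two reaches can be sketched like this (the immediate widths and alignment shifts are the ones quoted above; this is just the arithmetic, not anything from either spec document):

```python
# Compute single-instruction PC-relative branch reach from the immediate
# field width and the implicit alignment shift. Signed immediate covers
# +/- 2**(imm_bits - 1) units, each unit being 2**shift bytes.
def reach_bytes(imm_bits, shift):
    return 2 ** (imm_bits - 1 + shift)

arm64_bl = reach_bytes(26, 2)   # 26-bit immediate, 4-byte aligned targets
riscv_jal = reach_bytes(20, 1)  # 20-bit immediate, 2-byte aligned targets

print(arm64_bl // 2**20, "MB")   # 128 MB either direction
print(riscv_jal // 2**20, "MB")  # 1 MB either direction
```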
This is often a non-issue for small embedded cores because the code running on them is fairly compact and can be tuned for specific cores. It is a nightmare for large desktop and UI stacks (or web browsers), which often have many linked images and are much larger than 2MB. You can make it all work, but you need to add extra address calculation instructions or branch islands to do it, and those waste a bunch of space (what arm64 can do in a single instruction requires 2 or 3 on RISC-V). Now you have all those extra instructions in your I-cache, extra jumps to branch islands wasting predictor slots, etc. You can try to solve this in hardware by adding special predictors to recognize branch islands, or by using macro-op fusion to recognize idiomatic jump calculations, but that makes the chips more complex and still does not solve the code density issue (you can try to overcome some of that with trace caches, but that is again more complexity).
There is no simple way to fix these issues in RISC-V, because all the prime encoding space is gone. The best you can do is add better addressing modes in the 48-bit opcode space, but that introduces significant code bloat (if you just make every unresolved target in a .o file use 48-bit jump instructions), buys you very little (if you continue using 32-bit instructions by default and only use the newer instructions in linker-generated branch islands), or requires complex software, binary format, and tooling changes that are never likely to happen in order to dynamically relax function bodies and have the linker choose the size of the instructions (and the real kicker is that any improvements made to toolchains to accomplish this would still not overcome arm64's better instruction design BUT would provide some improvements to arm64 binaries).
I could go into a similar analysis of PC-relative load instructions and how adrp is much better than auipc for large codebases. RISC-V just wastes tons of bits in prime places in the encoding space. JAL blows 5 bits on encoding a return register. Technically that is more generic and orthogonal than having an architecturally specified return register that the instruction uses implicitly, but those 5 bits are incredibly valuable: they would have increased the displacement from ±1MB to ±32MB. Yes, specifying the register lets them play fun tricks in their calling conventions to simplify their prologue and epilogue code, but that really cannot justify the loss of branch reach. What is so infuriating is that they had an instruction like that (J), but they removed it because they did not want any instructions to use implicit registers (and I believe it cannot be added back because the encoding space has been reused). I understand the desire for architectural purity, but by doing so they doomed every high performance implementation to micro-architectural chaos.
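To put a number on those 5 bits: a hypothetical J with no destination register field would have a 25-bit immediate instead of 20, and each extra bit doubles the reach. A quick sketch (the widened J is hypothetical, per the argument above; only the arithmetic is shown):

```python
# Reach from immediate width + alignment shift, as before.
def reach_bytes(imm_bits, shift):
    return 2 ** (imm_bits - 1 + shift)

print(reach_bytes(20, 1) // 2**20, "MB")  # JAL today: 1 (i.e. +/-1MB)
print(reach_bytes(25, 1) // 2**20, "MB")  # with rd's 5 bits reclaimed: 32
```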
While it might be tempting to think I have just rabbit-holed on a single issue, it really is a big deal. Something like 5-10% of generated instructions are PC-relative jumps, so getting them wrong has significant impacts... I would estimate this one issue alone will result in a ~5-10% code size increase (but only once you start having large binaries; it does not have an impact for anything less than ~1-2MB in size). It might not matter for small embedded controllers or in-order cores, but it makes implementing high performance out-of-order cores much more complex. It is certainly possible to overcome these issues, but it means that for out-of-order RISC-V cores to achieve similar performance to ARM64 cores on large codebases they will need larger, more complex branch predictors, larger caches, extra decode logic, and potentially trace caches. This is not an isolated issue, it is just one where I have domain expertise; I have heard similar criticisms of other parts of the instruction set from people who work in other parts of the stack.
Just to be clear, I don't think RISC-V is terrible. I think it will be great for people doing custom cores with custom toolchains to ship bespoke silicon in small devices where the general purpose compute requirements are low. IOW, it is great if you just need some sort of CPU core but that is not what is really special about your silicon. On the other hand, I simply do not see it ever being a competitor to arm64 in high end mobile devices, desktops, workstations, or servers. In order to fix the issues with it they would need to reclaim a bunch of allocated instruction encodings (maybe RISC6?).
What I'm hearing in this post is "Where RISC-V uses a smaller field size for something than ARM64 they have under-provisioned and will need extra instructions, and where RISC-V uses a larger field size for something than ARM64 it will never be used and is wasted".
In other words, ARM's architects chose every parameter correctly, and RISC-V's chose every parameter badly.
It might be true, but you kind of have to prove it, not just assert it.
Take the J/JAL vs B/BL range for example. It's not just embedded. Look through your Linux distro's binaries and you'll find very few with TEXT size over 2 MB.
One of the few on my (x86_64, but it doesn't really matter) system is /opt/google/chrome/chrome with 159,885,517 bytes.
That exceeds ARM64's single-instruction BL range, as well as RISC-V's.
A quick analysis shows that of the 373500 callq instructions in the binary, statically 55.53% fall within the RISC-V single-instruction range, while 100% fall within the ARM64 range. Dynamically I don't know, but I suspect the percentage falling in the RISC-V range would be a lot higher.
Anyway ... code size. The extra AUIPC instructions needed in the RISC-V program will make it around 648 KB, or 0.4%, larger than the ARM64 program.
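Where that ~648 KB / 0.4% comes from, for anyone following along (plugging in the call count and in-range fraction from the analysis above; one extra 4-byte AUIPC per out-of-range call):

```python
# Back-of-envelope size penalty from out-of-range calls.
calls = 373_500             # callq instructions found in the binary
in_range = 0.5553           # fraction reachable by a single RISC-V JAL
binary_bytes = 159_885_517  # /opt/google/chrome/chrome

extra_calls = round(calls * (1 - in_range))  # calls needing an extra AUIPC
extra_bytes = extra_calls * 4                # 4 bytes per extra instruction

print(extra_bytes // 1024, "KB")                        # ~648 KB
print(round(100 * extra_bytes / binary_bytes, 1), "%")  # ~0.4 %
```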
That's a pretty far cry from the "~5-10% code size increase" you estimate. I mean -- that's a factor of 12x to 25x different from what you state.
But wait there's more.
One reason that RISC-V JAL offsets are limited is that 2 bits out of 32 are taken up by indicating whether the current instruction is 4 bytes or 2 bytes in size.
So that's a waste right?
It would be if you didn't use it. But RISC-V does use it. In a typical RISC-V program, around 50% to 60% of all instructions use a 2 byte opcode, giving a 25% to 30% reduction in code size.
On the same 152 MB program (Chrome) where not having long JAL offsets costs 0.6 MB of code size, the C extension will probably save around 40 to 45 MB of code size.
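A rough sketch of that saving, assuming (as stated above) that 50% to 60% of instructions shrink from 4 bytes to 2, so the overall saving is half the compressed fraction:

```python
# Estimate C-extension code-size savings on a ~152 MB text segment.
text_mb = 152  # approximate size of the Chrome code discussed above

for frac_compressed in (0.50, 0.60):
    saving = frac_compressed / 2  # each halved instruction saves 2 of 4 bytes
    print(round(text_mb * saving), "MB saved")  # 38 and 46
```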
That seems like a pretty good trade to me.
What will the speed effect of those extra AUIPC instructions be? I don't know. I'd have to instrument Chrome and run it at a fraction of normal speed to find out.
That's definitely something that should be done before making a pronouncement that one ISA is definitely better and the other one made all the wrong trade-offs.
However, my experience of analyzing smaller programs is that the dynamic penalty (execution speed) is typically much less than the static penalty (code size). At a wild guess, I'd go with four times less, or 0.1%.
That's in the noise.
Might a RISC-V core be enough simpler than a comparable ARM64 core to clock 0.1% faster? Could well be. Might it be enough simpler to be 0.1% smaller and thus cost 0.1% less in die space -- or allow you to put 0.1% more cores on the same chip? Could well be.
Even the detractors don't argue that RISC-V isn't simpler. "It's too simple" they say, takes purity and orthogonality too far.
Maybe, but you need to prove it, not just assert it.
You are correct, I am just making an assertion, but I don't have to prove it; I will be satisfied to wait and watch things play out. There is a lot of money and many industry players working on RISC-V, so eventually the market should provide evidence to prove or invalidate my thesis.
I don't think the ARM architects did everything correctly and the RISC-V architects did everything wrong. I just chose an example I felt was an issue. On the other hand, I think that RISC-V supporting variable length instructions encoded such that they can be easily decoded in parallel was a very good use of encoding space. What frustrates me about RISC-V is that it feels like it ignored the last 20 years of industry experience and made a lot of unforced errors.
I think you are correct that I misestimated the branch density in normal Linux binaries (the system I work on is a bit different), so I will take back my claim about the code size increase, but I also think large binaries like Chrome are more significant than you seem to imply, especially once you start looking at desktop and mobile platforms. We can argue about people writing bloated code, but the fact is that apps like Twitter and Facebook ship mobile apps that are over 100MB of executable code. And these things are not getting smaller over time. As code sizes increase, that 2MB jump window is going to look very small.
It is going to be interesting to see how this plays out over the next few years.
[1] https://gist.github.com/erincandescent/8a10eeeea1918ee4f9d99...
[2] https://lobste.rs/s/icegvf/will_risc_v_revolutionize_computi...