LZMA compression (used in 7-zip, XZ, LZMA2) is known as one of the best, but it has a noticeable drawback - it's slow. I tried to improve the decompression speed by removing excessive branching at decoding of every bit from the compressed data.
Decompression speedup from this patch largely depends on the compression ratio, more ratio - less speedup. Compressed text, such as source code, gives the least speedup.
That's the result from my Skylake, compiled with GCC. Please help me with testing on different x86 CPUs.
x86 (32-bit) - should work, but haven't tested yet.
Compiled with Clang should work as well.
I find it sad that you need to use inline asm for this.
Why isn't GCC providing a built-in to reliably use cmov? Every time I use ?: it's a lottery whether I get it or not. I've had to resort to using the slower bitwise logic to get branch-free code reliably.
I think I’ve observed reliably-emitted CMOVs when using equivalent (but contrived) arithmetic branchless idioms, e.g. instead of (a ? b : c) write (-!!a & b | -!a & c) (yes, this is rather repulsive, but less so if you think about the negation as turning C-style booleans into Forth/BASIC/mask-style booleans).
It’s not, it turns out, I misremembered :( Clang 13 does compile this down to a CMOV in my tests, but GCC 11.1 just translates it as is even with -mtune=native (on Haswell).
Cmov is, and was, efficient whenever branch prediction is unreliable.
Clang will turn eligible ?: expressions into cmov. Gcc will do no more than one of those per basic block, subject to fragile conditions. Gcc got its fingers badly burned by overuse of cmov. [Edit: I am wrong! Gcc would not do it when I was trying. More research needed.]
Generally, branches break dependency chains, enabling more implicit speculative parallelism. Older cores stuck cmov ops onto dependency chains, where more recently they are treated more like branches. That gives you less speculative evaluation, but consumes less state that would be discarded on a mis-predicted branch.
Are dependency chains complex functions or a few instructions? For example, I see people using ternary expressions “a > b : c ? d”. I’m trying to understand if this is an algorithmic problem or a microarchitecture problem.
Are “c” and “d” simple? If both are heavy-duty and approximately the same complexity, evaluating both would be 2x slower than evaluating one, but then you’d have to account for probability of misprediction. Meanwhile, if they are simple, I could imagine the CPU tracking both branch register-level states simultaneously with minimal impact on performance (using register-renaming, for example).
In other words, how much better would CMOV be if the hardware was given more micro architecture resources?
When I've seen CMOVcc, my mind jumped to the callcc name and now I can't stop thinking in what absurd way could you create an actual CMOV-with-current-continuation. Conditional-MOV-value-or-EIP-otherwise seems too trivial.
Decent speedup on my 3970x. Including this patch reduced the number of instructions in lzma_decoder.s (using gcc 8.3.1) by about 8% (4417 lines of asm vs 4824). With perf stat, an astonishing branches-missed reduction from 409K to 104K.
Hm, should the same be done for aarch64? Generally compilers (well, clang) seem to use CSEL all the time anyway, would be interesting to investigate this.
Can be done, ARM has better support for predicated instructions. May be done without using inline assembly.
But I guess ARM CPUs can also have a smaller branch penalty, which means less speedup from such a patch.
You can try it by replacing the inline assembly with the commented code above it (also don't forget to remove i386 and x86_64 from #if). (Although the code could be rewritten a bit to help the compiler make better binary code for ARM.)
Maybe on more complex ARM processors the results will be better.
Update: It looks like I need to use inline assembly for AArch64 or GCC where it can make two CSEL instructions from the same condition - replaces them with if-else branch.
And I got better results (below) than when tried to avoid this compiler behavior but didn't use inline assembly (results above).
I tried it and realized that predicated instructions are removed from aarch64, which I was not aware of (because I rarely work with ARM). That sucks. But I'll try to do it without them.
Well, ideally, compilers should be compiling into CMOV vs Branching decisions much better.
I'm shocked, but not that shocked (https://tenor.com/view/shocked-gif-5787388), that compilers today still are weaker than a dedicated raw-assembly programmer in these kinds of microarchitectural decisions. Even with what should be normal 64-bit code with compilers that have very good modeling of throughputs / latencies per instruction.
---------
I'm now more curious as to which compilers can turn the raw C-code into a cmov and which compilers turn the code into the less efficient form (I assume branching??)
Correctly-predicted branching is better; but if the branch would have been mis-predicted, cmov is better. The compiler has no idea which will happen at runtime.
Gcc will never under any circumstances generate two cmov instructions in a basic block, even when you definitely want that. Clang will. [Edit: I am wrong. Gcc can do more than one cmov in a basic block. Just not when I was trying it!]
In principle, profile-guided optimization could help, if the instrumented run is representative of production behavior. In my tests it did not affect the cmov/branch choice. Neither did the "% expected" intrinsic. Either of those could change in any release.
The Zstd and Lz4 decoders make very effective use of cmov. They should be your model.
I'm not a programmer by profession and barely one by hobby. I know nothing about compiler internals but looking at the definition of a basic block from Wikipedia, "In compiler construction, a basic block is a straight-line code sequence with no branches in except to the entry and no branches out except at the exit." it seems that the ternary operators in "return (x ? a : b) + (y ? c : d);" although part of the same programming block would be distinct compiler basic blocks.
Basic blocks are how compilers "conceptualize" our code and start thinking about optimization.
Frankly, my recommendation to you is to ignore that stuff. You need like 3 or 4 years of school to reach that level. You should have studied easier graph optimization problems (ex: Finite Automatia) first, before dealing with code-optimization.
The school curriculum is typically Programming 101 -> Data Structures -> Algorithms -> Finite Automata (aka: Regex) -> Pushdown Automata (aka: context-free grammars) -> Turing Machines (aka: general purpose code) -> Compilers (which use Finite Automata, Pushdown Automata, and Turing Machines simultaneously).
Basic Blocks are the graph structure that compilers use to think about code. Its a lot of graph-theory built up on a lot of language theory.
------
That being said: if you're actually interested in this stuff, then feel free to explore. But just beware, you've stumbled upon a very complex subject that is easily 4th year undergrad or even graduate-school level.
With enough effort, you'll understand things. But this is no subject for beginners to go around exploring by themselves.
-------
That being said, I want to encourage you to explore the subject anyway. Just explore the subject with awareness that there's a lot to understand here.
If you want to skip the 3ish years of elementary data-structures / comp. sci theory... I suggest starting with Static Single Assignment and working forward from here.
Thank you. It appears Gcc won't put two cmov writes to memory in a block. Thus,
void swap_if(bool c, int& a, int& b) {
int ta = a, tb = b;
a = c ? tb : ta;
b = c ? ta : tb;
}
is very slow, under Gcc, when c is poorly predicted, as is typical when e.g. partitioning for quicksort. But how well it will be predicted depends on input data.
To me it looks like something related to some other optimization pass (I don't know much about gcc passes). But not related to writes to memory. Here are two writes both using cmov (on different code): https://godbolt.org/z/n3dTrPo6e
In my experience, the key is that you've triggered two decisions off of a single boolean condition. A single ternary will be a cmov, but two ternaries become a branch.
Using CMOV vs. branching seems like a strategic decision which does make sense to fall under the programmer's decision making, as there is a trade-off of always calculating a value which may be unused vs. the cost of the conditional branch.
> as there is a trade-off of always calculating a value which may be unused vs. the cost of the conditional branch.
I don’t really understand the point you’re trying to make.
Figuring out if a value is unused is most definitely the purview of an optimizer.
Also, the “calculating a value” isn’t really the trade off being made between cmov and branching.
> I don’t really understand the point you’re trying to make. Figuring out if a value is unused is most definitely the purview of an optimizer.
If the conditional move doesn't happen, then the source (insofar as the move is concerned) is unused.
Consider this pseudocode:
int value = some_nontrivial_function_with_no_side_effects();
if (condition)
*target = value;
Note that the function can be as simple as a memory read.
The compiler could compile this in two ways:
1. Observing that the function's result is used only if condition is true, move the function call inside the if block.
2. Always call the function, as in the source code, but compile the if block to a conditional move.
In such situations, it would make sense to allow programmers to indicate the desired strategy to the compiler.
I suppose CPUs might elide calculating the value even with a conditional move if they can predict the condition is [likely to be] false; I don't know how true that is in practice.
> Also, the “calculating a value” isn’t really the trade off being made between cmov and branching.
Depending on the situation and interpretation of terms, I also agree.
> In such situations, it would make sense to allow programmers to indicate the desired strategy to the compiler.
If the function truly has no side effects then why would the programmer care which strategy was used other than cost? This is just an optimization that can be mechanistically applied. And if it does have side effects then the two constructs are not equivalent - one would just put the function call inside the conditional.
This is the only use of assembly in this entire diff. That "tmp = rc.code < rc_bound ?..." statement looks like it'd probably be a cmov, at least I'd expect it to compile to cmov without much issue.
However, this is some advanced "carry-flag manipulation" stuff going on here. I'm not sure if its the "cmov" per se that was advanced, as much as the sbbl statement (subtract borrow, the subtraction-analog to adc). sub %0, %0 is obviously "zero", but sbbl %0, %0 is "0xFFFFFFFF" if carry is 1.
Its certainly a very well thought out bit of manual assembly language. The sbb is probably more important than the cmov (in that the compiler probably emits the cmov, but may not see the sbb????)
I don't know what its doing, but the cmov stuff is very, very different entirely. There doesn't seem to be a branch involved at all in this "rc_direct" inner-loop.
Its not very clear to me how they saw this sequence of C, and the decided upon a cmov / sbb approach to do this equivalent work. Its clearly some kind of advanced thinking that no compiler would have gotten.
So it's merging the two sides of rc_update and cmov/sbb handle the difference. action0/action1 are generally blank, but rc_bit_matched makes the common action branchless as well.
Realizing that the two sides of the "if" statement are effectively stored as a 1-bit of information, and then using instructions to store that 1-bit into the carry flag, and then using cmovcc / sbbl instructions to manipulate that 1-bit of information is pretty advanced.
I could imagine a compiler making use of all that information, but only maybe with some programmer assistance (IE: code in the right format for easier optimizing). Or you know... an assembly programmer who just outright explicitly writes that kind of assembly language.
Yeah, this sort of conversion is pretty common for the core of arithmetic/range decoders, and I definitely don't expect a compiler to be capable of the sort of transformations needed to convert the branchy form to branchless.
Though I'd expect compilers to be at least 90% as good with the ternary equivalent to the asm, e.g.:
compiles tmp to xor -> setl -> neg (clang) or setge -> movzx -> sub (gcc) instead of sbb, but on arm64 they use just one csetm (clang) or csinv (gcc) which is basically optimal, and in all cases the comparison is reused for the cmov.
If you're going for decoding speed wouldn't you be using brotli anyway? At a given compression ratio brotli is ~4x faster on decode path, and they are similar on the encode path.
Sure. In theory. But every other response to this comment are clearly talking about web stuff, and in practice brotli is really only used for web stuff.
Brotli has a static dictionary heavily geared towards web formats, and there really is no reason to use brotli outside of web stuff: the point of brotli is that it has browser support and does 15~20% better than gzip.
It's not bad per-se, but if you're not constrained to the web zstd will have a better compression ratio at every throughput (and better throughput at every compression ratio), and if you want maximum compression period it's not close to competing with LZMA-based formats (xz, lzip).
I think he mean is less popular (like zstandard and lz4) compared to zlib and LZMA and BZip2.
this is unfortunate because Brotli and Zstandard compress much better than zLib and LZ4 and decompress much faster than zlib lzma and LZ4
I’m using the phrase “in theory” colloquially, which means something like “sure, you could technically use it that way, but nobody really does/it’s very rare”.
Jpeg-XL makes use of Brotli, and once it hits 1.0 it will most likely be turned on by default in browsers and thus you will probably see a lot of jxl in the wild, particularly since it provides lossless recompression to existing jpeg files with ~20% smaller size as a result.
I don't know about web use, but xz's major advantage for me is the seekable format. We build disk image pipelines based on remote xz-compressed images where only the parts of the file being accessed need to be downloaded and uncompressed. https://libguestfs.org/nbdkit-xz-filter.1.html
Decompression speedup from this patch largely depends on the compression ratio, more ratio - less speedup. Compressed text, such as source code, gives the least speedup.
That's the result from my Skylake, compiled with GCC. Please help me with testing on different x86 CPUs.
x86 (32-bit) - should work, but haven't tested yet. Compiled with Clang should work as well.