Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Show HN: 10-40% faster LZMA decoder using x86 CMOVcc (gist.github.com)
205 points by jpegqs on Dec 13, 2021 | hide | past | favorite | 68 comments


LZMA compression (used in 7-zip, XZ, LZMA2) is known as one of the best, but it has a noticeable drawback - it's slow. I tried to improve the decompression speed by removing excessive branching at decoding of every bit from the compressed data.

Decompression speedup from this patch largely depends on the compression ratio, more ratio - less speedup. Compressed text, such as source code, gives the least speedup.

That's the result from my Skylake, compiled with GCC. Please help me with testing on different x86 CPUs.

x86 (32-bit) - should work, but haven't tested yet. Compiled with Clang should work as well.


I've recently watched this talk that may help you profile better : https://youtu.be/r-TLSBdHe1A


For the impatient, the profiler is available at https://github.com/plasma-umass/coz


Someone would really be doing themselves a disservice to not watch that talk. Impatient or not.


Coz is very cool although it was an absolute bitch to get it work on my fairly standard Ubuntu box maybe 2 months ago.


I find it sad that you need to use inline asm for this.

Why isn't GCC providing a built-in to reliably use cmov? Every time I use ?: it's a lottery whether I get it or not. I've had to resort to using the slower bitwise logic to get branch-free code reliably.


I think I’ve observed reliably-emitted CMOVs when using equivalent (but contrived) arithmetic branchless idioms, e.g. instead of (a ? b : c) write (-!!a & b | -!a & c) (yes, this is rather repulsive, but less so if you think about the negation as turning C-style booleans into Forth/BASIC/mask-style booleans).

It should just be a builtin.


that is what I meant by "the slower bitwise logic".

I didn't know GCC was able to optimize that down to cmov.


It’s not, it turns out, I misremembered :( Clang 13 does compile this down to a CMOV in my tests, but GCC 11.1 just translates it as is even with -mtune=native (on Haswell).


It would be great to have builtin for CMOV. But I also want GCC to utilize the hack with the SBB instruction when I write something like:

(unsigned)a < (unsigned)b ? ~0 : 0

Instead of three instructions what it generates.


It does use cmp + sbb? https://godbolt.org/z/hnvqrcTcW


Cool, but it doesn't generate SBB in real code, but instead gives me something like this:

    xorl %r11d, %r11d
    cmpl %r12d, %eax
    seta %r11b
And if I do both of these, side by side:

    tmp = rc.code < rc_bound ? ~0 : 0;
    if (rc.code < rc_bound) rc.range = rc_bound;
Then GCC generates an if-else branch.

Thus, inline assembly is the only option so far to prevent the compiler from doing something dumb.


When did CMOVs become efficient again? The x86 optimization guide used to warn people not to use them, around 2008.


Cmov is, and was, efficient whenever branch prediction is unreliable.

Clang will turn eligible ?: expressions into cmov. Gcc will do no more than one of those per basic block, subject to fragile conditions. Gcc got its fingers badly burned by overuse of cmov. [Edit: I am wrong! Gcc would not do it when I was trying. More research needed.]

Generally, branches break dependency chains, enabling more implicit speculative parallelism. Older cores stuck cmov ops onto dependency chains, where more recently they are treated more like branches. That gives you less speculative evaluation, but consumes less state that would be discarded on a mis-predicted branch.


CMOVs were terrible on Pentium 4 but ok on PPro, Pentium M, Core, Core 2, Sandybridge, Skylake, etc.

Also ok on Atom and on AMD chips.


At that point, we might just simplify this to "Pentium 4 was terrible"


https://www.agner.org/optimize/optimizing_assembly.pdf#page=...

TL;DR rule of thumb is that conditional jump is better than CMOV if the code is part of a dependency chain and the prediction rate is better than 75%.


Are dependency chains complex functions or a few instructions? For example, I see people using ternary expressions “a > b : c ? d”. I’m trying to understand if this is an algorithmic problem or a microarchitecture problem.

Are “c” and “d” simple? If both are heavy-duty and approximately the same complexity, evaluating both would be 2x slower than evaluating one, but then you’d have to account for probability of misprediction. Meanwhile, if they are simple, I could imagine the CPU tracking both branch register-level states simultaneously with minimal impact on performance (using register-renaming, for example).

In other words, how much better would CMOV be if the hardware was given more micro architecture resources?


When I've seen CMOVcc, my mind jumped to the callcc name and now I can't stop thinking in what absurd way could you create an actual CMOV-with-current-continuation. Conditional-MOV-value-or-EIP-otherwise seems too trivial.


Decent speedup on my 3970x. Including this patch reduced the number of instructions in lzma_decoder.s (using gcc 8.3.1) by about 8% (4417 lines of asm vs 4824). With perf stat, an astonishing branches-missed reduction from 409K to 104K.

Using the firefox example from the gist:

  $ tar -cJf lib.tar.xz /usr/lib64/firefox
The xz shipped from the system:

  $ perf stat xz -c -d lib.tar.xz > /dev/null                                                                               

  Performance counter stats for 'xz -c -d lib.tar.xz':

          4,650.32 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
               591      page-faults:u             #    0.127 K/sec                  
    19,849,912,300      cycles:u                  #    4.269 GHz                      (83.33%)
       425,290,878      stalled-cycles-frontend:u #    2.14% frontend cycles idle     (83.33%)
     1,831,640,390      stalled-cycles-backend:u  #    9.23% backend cycles idle      (83.34%)
    23,973,036,103      instructions:u            #    1.21  insn per cycle         
                                                  #    0.08  stalled cycles per insn  (83.33%)
     2,939,144,233      branches:u                #  632.031 M/sec                    (83.34%)
       409,371,860      branch-misses:u           #   13.93% of all branches          (83.33%)

       4.650679926 seconds time elapsed

       4.611657000 seconds user
       0.011931000 seconds sys
The xz patched.

  $ git clone http://git.tukaani.org/xz.git
  $ cd xz/src
  $ patch -l -p1 < ../faster_lxma_decoder_x86.patch
  $ cd .. ; autogen.sh && configure && make
  $ LD_PRELOAD=./liblzma/.libs/liblzma.so
  $ perf stat ./xz/.libs/xz -c -d ../../lib.tar.xz > /dev/null                                                              

  Performance counter stats for './xz/.libs/xz -c -d ../../lib.tar.xz':

          3,578.54 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
               593      page-faults:u             #    0.166 K/sec                  
    15,186,685,715      cycles:u                  #    4.244 GHz                      (83.32%)
       108,663,507      stalled-cycles-frontend:u #    0.72% frontend cycles idle     (83.32%)
     8,753,057,119      stalled-cycles-backend:u  #   57.64% backend cycles idle      (83.34%)
    27,322,182,837      instructions:u            #    1.80  insn per cycle         
                                                  #    0.32  stalled cycles per insn  (83.35%)
     1,979,944,734      branches:u                #  553.282 M/sec                    (83.34%)
       104,752,154      branch-misses:u           #    5.29% of all branches          (83.34%)

       3.578973194 seconds time elapsed

       3.549329000 seconds user
       0.011942000 seconds sys


Hm, should the same be done for aarch64? Generally compilers (well, clang) seem to use CSEL all the time anyway, would be interesting to investigate this.


Can be done, ARM has better support for predicated instructions. May be done without using inline assembly.

But I guess ARM CPUs can also have a smaller branch penalty, which means less speedup from such a patch.

You can try it by replacing the inline assembly with the commented code above it (also don't forget to remove i386 and x86_64 from #if). (Although the code could be rewritten a bit to help the compiler make better binary code for ARM.)


So far I've got these results:

   Allwinner H616 (Cortex-A53) 64-bit mode
   
   linux-5.15.7.tar.xz : 25.12 --> 24.43 (+3%)
   linux-firmware-20211027.tar.xz : 22.80 --> 21.63 (+5%)
Maybe on more complex ARM processors the results will be better.

Update: It looks like I need to use inline assembly for AArch64 or GCC where it can make two CSEL instructions from the same condition - replaces them with if-else branch.

And I got better results (below) than when tried to avoid this compiler behavior but didn't use inline assembly (results above).

    linux-5.15.7.tar.xz : 25.12 --> 23.85 (+5%)
    linux-firmware-20211027.tar.xz : 22.80 --> 21.10 (+8%)


I tried it and realized that predicated instructions are removed from aarch64, which I was not aware of (because I rarely work with ARM). That sucks. But I'll try to do it without them.


Would using compiler intrinsics instead of raw assembler have any benefits here, such as being applicable to more architectures?


Well, ideally, compilers should be compiling into CMOV vs Branching decisions much better.

I'm shocked, but not that shocked (https://tenor.com/view/shocked-gif-5787388), that compilers today still are weaker than a dedicated raw-assembly programmer in these kinds of microarchitectural decisions. Even with what should be normal 64-bit code with compilers that have very good modeling of throughputs / latencies per instruction.

---------

I'm now more curious as to which compilers can turn the raw C-code into a cmov and which compilers turn the code into the less efficient form (I assume branching??)


Correctly-predicted branching is better; but if the branch would have been mis-predicted, cmov is better. The compiler has no idea which will happen at runtime.

Gcc will never under any circumstances generate two cmov instructions in a basic block, even when you definitely want that. Clang will. [Edit: I am wrong. Gcc can do more than one cmov in a basic block. Just not when I was trying it!]

In principle, profile-guided optimization could help, if the instrumented run is representative of production behavior. In my tests it did not affect the cmov/branch choice. Neither did the "% expected" intrinsic. Either of those could change in any release.

The Zstd and Lz4 decoders make very effective use of cmov. They should be your model.


Here are two cmov's with gcc: https://godbolt.org/z/j9bcv639c


I'm not a programmer by profession and barely one by hobby. I know nothing about compiler internals but looking at the definition of a basic block from Wikipedia, "In compiler construction, a basic block is a straight-line code sequence with no branches in except to the entry and no branches out except at the exit." it seems that the ternary operators in "return (x ? a : b) + (y ? c : d);" although part of the same programming block would be distinct compiler basic blocks.


Indeed, it seems contradictory to be talking about converting multiple branches within a basic block. Or even a single branch, for that matter.

Perhaps they're talking about what ends up in a single basic block after the conversion to cmov? Still an odd way to describe it.


Basic blocks are how compilers "conceptualize" our code and start thinking about optimization.

Frankly, my recommendation to you is to ignore that stuff. You need like 3 or 4 years of school to reach that level. You should have studied easier graph optimization problems (ex: Finite Automatia) first, before dealing with code-optimization.

The school curriculum is typically Programming 101 -> Data Structures -> Algorithms -> Finite Automata (aka: Regex) -> Pushdown Automata (aka: context-free grammars) -> Turing Machines (aka: general purpose code) -> Compilers (which use Finite Automata, Pushdown Automata, and Turing Machines simultaneously).

Basic Blocks are the graph structure that compilers use to think about code. Its a lot of graph-theory built up on a lot of language theory.

------

That being said: if you're actually interested in this stuff, then feel free to explore. But just beware, you've stumbled upon a very complex subject that is easily 4th year undergrad or even graduate-school level.

With enough effort, you'll understand things. But this is no subject for beginners to go around exploring by themselves.

-------

That being said, I want to encourage you to explore the subject anyway. Just explore the subject with awareness that there's a lot to understand here.

If you want to skip the 3ish years of elementary data-structures / comp. sci theory... I suggest starting with Static Single Assignment and working forward from here.

https://en.wikipedia.org/wiki/Static_single_assignment_form


Thank you. It appears Gcc won't put two cmov writes to memory in a block. Thus,

  void swap_if(bool c, int& a, int& b) {
    int ta = a, tb = b;
    a = c ? tb : ta;
    b = c ? ta : tb;
  }
is very slow, under Gcc, when c is poorly predicted, as is typical when e.g. partitioning for quicksort. But how well it will be predicted depends on input data.

[0] https://godbolt.org/z/j5W9dMjYE


To me it looks like something related to some other optimization pass (I don't know much about gcc passes). But not related to writes to memory. Here are two writes both using cmov (on different code): https://godbolt.org/z/n3dTrPo6e

Edit: compiling your code without modifications, but with `-Os` also gives two cmov's: https://godbolt.org/z/r86azb7be


In my experience, the key is that you've triggered two decisions off of a single boolean condition. A single ternary will be a cmov, but two ternaries become a branch.


> but two ternaries become a branch

I need to use inline assembly because of this.


Using CMOV vs. branching seems like a strategic decision which does make sense to fall under the programmer's decision making, as there is a trade-off of always calculating a value which may be unused vs. the cost of the conditional branch.


> as there is a trade-off of always calculating a value which may be unused vs. the cost of the conditional branch.

I don’t really understand the point you’re trying to make. Figuring out if a value is unused is most definitely the purview of an optimizer. Also, the “calculating a value” isn’t really the trade off being made between cmov and branching.


> I don’t really understand the point you’re trying to make. Figuring out if a value is unused is most definitely the purview of an optimizer.

If the conditional move doesn't happen, then the source (insofar as the move is concerned) is unused.

Consider this pseudocode:

    int value = some_nontrivial_function_with_no_side_effects();
    if (condition)
        *target = value;
Note that the function can be as simple as a memory read.

The compiler could compile this in two ways:

1. Observing that the function's result is used only if condition is true, move the function call inside the if block.

2. Always call the function, as in the source code, but compile the if block to a conditional move.

In such situations, it would make sense to allow programmers to indicate the desired strategy to the compiler.

I suppose CPUs might elide calculating the value even with a conditional move if they can predict the condition is [likely to be] false; I don't know how true that is in practice.

> Also, the “calculating a value” isn’t really the trade off being made between cmov and branching.

Depending on the situation and interpretation of terms, I also agree.


> In such situations, it would make sense to allow programmers to indicate the desired strategy to the compiler.

If the function truly has no side effects then why would the programmer care which strategy was used other than cost? This is just an optimization that can be mechanistically applied. And if it does have side effects then the two constructs are not equivalent - one would just put the function call inside the conditional.


the problem is that without pgo, the compiler can't know the odds that a condition will be true.


gcc and clang see the use of ternaries as a hint that they should use cmov, but it's obviously not guaranteed.


I'm trying to grok the performance improvement here.

The diff points to:

    + /* tmp = rc.code < rc_bound ? rc.range = rc_bound, ~0 : 0; */ \
    + __asm__ ( \
    +  "cmpl %3, %2\n\t" \
    +  "cmovbl %3, %1\n\t" \
    +  "sbbl %0, %0" \
    +  : "=&r"(tmp), "+&r"(rc.range) \
    +  : "r"(rc.code), "r"(rc_bound) \
    + ); \
This is the only use of assembly in this entire diff. That "tmp = rc.code < rc_bound ?..." statement looks like it'd probably be a cmov, at least I'd expect it to compile to cmov without much issue.

However, this is some advanced "carry-flag manipulation" stuff going on here. I'm not sure if its the "cmov" per se that was advanced, as much as the sbbl statement (subtract borrow, the subtraction-analog to adc). sub %0, %0 is obviously "zero", but sbbl %0, %0 is "0xFFFFFFFF" if carry is 1.

Its certainly a very well thought out bit of manual assembly language. The sbb is probably more important than the cmov (in that the compiler probably emits the cmov, but may not see the sbb????)

----------

The original code seems to be: https://github.com/COMBINE-lab/xz/blob/master/src/liblzma/ra...

    #define rc_direct(dest, seq) \
    do { \
     rc_normalize(seq); \
     rc.range >>= 1; \
     rc.code -= rc.range; \
     rc_bound = UINT32_C(0) - (rc.code >> 31); \
     rc.code += rc.range & rc_bound; \
     dest = (dest << 1) + (rc_bound + 1); \
    } while (0)
I don't know what its doing, but the cmov stuff is very, very different entirely. There doesn't seem to be a branch involved at all in this "rc_direct" inner-loop.

Its not very clear to me how they saw this sequence of C, and the decided upon a cmov / sbb approach to do this equivalent work. Its clearly some kind of advanced thinking that no compiler would have gotten.


It's replacing rc_bit not rc_direct (rc_direct is a fixed .5 probability), which just uses rc_bit_last, so original code is:

    #define rc_bit_last(prob, action0, action1, seq) \
    do { \
     rc_if_0(prob, seq) { \
      rc_update_0(prob); \
      action0; \
     } else { \
      rc_update_1(prob); \
      action1; \
     } \
    } while (0)
So it's merging the two sides of rc_update and cmov/sbb handle the difference. action0/action1 are generally blank, but rc_bit_matched makes the common action branchless as well.


Thanks for the info.

Realizing that the two sides of the "if" statement are effectively stored as a 1-bit of information, and then using instructions to store that 1-bit into the carry flag, and then using cmovcc / sbbl instructions to manipulate that 1-bit of information is pretty advanced.

I could imagine a compiler making use of all that information, but only maybe with some programmer assistance (IE: code in the right format for easier optimizing). Or you know... an assembly programmer who just outright explicitly writes that kind of assembly language.


Yeah, this sort of conversion is pretty common for the core of arithmetic/range decoders, and I definitely don't expect a compiler to be capable of the sort of transformations needed to convert the branchy form to branchless.

Though I'd expect compilers to be at least 90% as good with the ternary equivalent to the asm, e.g.:

    tmp = code < bound ? ~0 : 0;
    range = code < bound ? bound : range;
compiles tmp to xor -> setl -> neg (clang) or setge -> movzx -> sub (gcc) instead of sbb, but on arm64 they use just one csetm (clang) or csinv (gcc) which is basically optimal, and in all cases the comparison is reused for the cmov.

https://godbolt.org/z/jPbo3qc1Y


If you're going for decoding speed wouldn't you be using brotli anyway? At a given compression ratio brotli is ~4x faster on decode path, and they are similar on the encode path.

Your corpus may vary etc etc.


Contrary to popular belief (sadly, even among the tech crowd here), people use computers for more than just web stuff.


IIRC, in a recent thread about zstd, someone claimed brotli was faster. Perhaps one needs to look at whether the sample being compressed is web stuff.


If you’re too busy to check yourself, the “squash compression benchmark” page has results for dozens of codecs and different inputs.


Brotli is not just for Web stuff.


Brotli is heavily tuned for "web stuff". It's literally got a built-in dictionary full of commonly used HTML fragments:

https://gist.github.com/klauspost/2900d5ba6f9b65d69c8e


Shoot me, works well for the average desktop (=Electron) application then.


Electron is just web stuff.


Sure but it actually benchmarks well broadly maybe despite its heavy web optimizations.


Sure. In theory. But every other response to this comment are clearly talking about web stuff, and in practice brotli is really only used for web stuff.


I don't know what's theoretical about it. There are Debian packages for brotli and it just behaves like a stdio codec, same as the `xz` command.


Brotli has a static dictionary heavily geared towards web formats, and there really is no reason to use brotli outside of web stuff: the point of brotli is that it has browser support and does 15~20% better than gzip.

It's not bad per-se, but if you're not constrained to the web zstd will have a better compression ratio at every throughput (and better throughput at every compression ratio), and if you want maximum compression period it's not close to competing with LZMA-based formats (xz, lzip).


I think he mean is less popular (like zstandard and lz4) compared to zlib and LZMA and BZip2. this is unfortunate because Brotli and Zstandard compress much better than zLib and LZ4 and decompress much faster than zlib lzma and LZ4


I’m using the phrase “in theory” colloquially, which means something like “sure, you could technically use it that way, but nobody really does/it’s very rare”.


Linux distros often ship xz out of the box. I’ve never seen brotli in the wild. YMMV of course - I am sure there are exceptions.

I’ve used brotli myself, but only as it’s included in some parquet read/write libs.


Probably 10-40% of what your browser is served nowadays is brotli, it's in Chrome so all the major CDNs support it by default


I’m sure you’re right, but files-on-disk is another story. I was referring to the latter.


Jpeg-XL makes use of Brotli, and once it hits 1.0 it will most likely be turned on by default in browsers and thus you will probably see a lot of jxl in the wild, particularly since it provides lossless recompression to existing jpeg files with ~20% smaller size as a result.


It’s definitely in use by Meta and Google.


I'm not the OP, but I took their comment to mean "on the command line".


cloudflare uses brotli (not sure if optional or by default)


I don't know about web use, but xz's major advantage for me is the seekable format. We build disk image pipelines based on remote xz-compressed images where only the parts of the file being accessed need to be downloaded and uncompressed. https://libguestfs.org/nbdkit-xz-filter.1.html


But this seems like a patch given to the LZMA project, with an interesting low-level result to boot.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: