> tl;dr: the LISP syntax is just syntax sugar. The textual format is as "stack-like" as the binary format.
Not that you're technically wrong, but I think you're begging the question.
Stack-based languages/encodings, in a colloquial sense, are equated to postfix notation, e.g. `a b +` instead of the infix `a + b`. Both LISP and textual Wasm use prefix notation, e.g. `(+ a b)`. Neither of the three is any more foundational than the other -- all notations can encode all expression trees, and postfix and prefix notations in particular have the same coding efficiency.
So sure, the LISP syntax is sugar, but for what? It's not sugar for a stack program, because prefix notation in general can't represent an arbitrary stack program; it's sugar for a mathematical expression. Which is encoded in postfix notation in binary, sure, but that's just an implementation detail, and prefix notation could've been selected when Wasm was born with little adversarial consequences.
My brain does not understand how one can see `(+ a b)` in text and read it as `a b +`. I suppose you can argue that moving operators to the right and removing all parentheses converts `(+ a b)` to `a b +`, sure, but that applies to pretty much any language -- next thing you'll tell me is the JavaScript expression `f(1, g(2), h(3))` actually uses stack and is just syntactic sugar for `1 2 g 3 h f`.
> It is explicity sugar for the stack operations, per my reading of the spec.
The entire Wasm text format is syntax sugar for binary Wasm, so this is kind of a vacuous truth -- if Wasm's spec says it's a stack machine, of course everything is sugar for a stack machine. But if you weren't aware of Wasm and just saw a program in textual Wasm for the first time, I don't think the idea that it's stack-based would cross your mind.
I have reread this several times but might be missing so I am begging the question, what exactly makes the LISP syntax sugar for something that isn't a stack machine? Or did I misread that?
If not, I think the OP is making the same point we all are, any program can be translated for execution on any machine - so bringing it up in the blog seems weak, which I agree with.
There's just so many "but"s to this that I can't in good faith recommend people to treat floats as deterministic, even though I'd very much love to do so (and I make such assumptions myself, caveat emptor):
- NaN bits are non-deterministic. x86 and ARM generate different sign bits for NaNs. Wasm says NaN payloads are completely unpredictable.
- GPUs don't give a shit about IEEE-754 and apply optimizations raging from DAZ to -ffast-math.
- sin, rsqrt, etc. behave differently when implemented by different libraries. If you're linking libm for sin, you can get different implementations depending on the libc in use. Or you can get different results on different hardware.
- C compilers are allowed to "optimize" a * b + c to FMA when they wish to. The standard only technically allows this merge within one expression, but GCC enables this in all cases by default on some `-std`s.
You're technically correct that floats can be used right, but it's just impossible to explain to a layman that, yes, floats are fine on CPUs, but not on GPUs; fine if you're doing normal arithmetic and sqrt, but not sin or rsqrt; fine on modern compilers, but not old ones; fine on x86, but not i686; fine if you're writing code yourself, but not if you're relying on linear algebra libraries, unless of course you write `a * b + c` and compile with the wrong options; fine if you rely on float equality, but not bitwise equality; etc. Everything is broken and the entire thing is a mess.
Yes, there are a large number of ways to fall into traps that cause you to do a different series of operations when you didn't realise that you did. But that's still ultimately what all your examples are. (Except the NaN thing!)
I still think it's important to fight the misinformation.
Programmers have been conditioned to be so afraid of floats that many believe that doing a + b has an essentially random outcome when it doesn't work that way at all. It leads people to spend a bunch of effort on things that they don't need to be doing.
> Very tangential, but I could swear QBasic included an on-disk documentation system accessible from the editor. Maybe only later versions?
Perhaps my installation didn't include it, or maybe you're confusing it with QuickBASIC, a more feature-complete IDE with a compiler (instead of just an interpreter). I don't exactly remember.
I don't think it's possible to apply this trick to 64-bit floats on 64-bit architecture, which OP mentions in the last sentence. You need a 52 x 52 -> 104 product. Modular 64 x 64 -> 64 multiplication gives you the 64 bottom bits exactly, widening 32 x 32 -> 64 multiplication approximately gives you the top 32 bits. That leaves 104 - 64 - 32 = 8 bits that are not accounted for at all. Compare with the 32-bit case, where the same arithmetic gives 46 - 32 - 16 = -2, i.e. a 2-bit overlap the method relies on.
Yeah, it feels intuitively like it should work, but even with full 64-bit modular multiply the bounds don't overlap. I should probably delete that sentence, but on the other hand maybe the wild goose chase will lead to someone finding a better trick. For now my double-precision multiply is just direct expansion.
I think it fails because it seems like the difference between 32-bit and 64-bit floats is 2x, but in reality we should look at the mantissa, and the increase from 23 bits to 52 bits is much greater.
Although I managed to tweak this method to work with 3 multiplications.
ETA: I just realized you wanted to use 32x32 -> 64 products, while my approach assumes the existence of 64x64 -> 64 products; basically it's just a scaled-up version of the original question and likely not what you're looking for. Hopefully it's still useful though.
First, remove the bottom 8 bits of the two inputs and compute the 44x44->88 product. This can be done with the approach in the post. Then apply the algorithm again, combining that product together with the product of the bottom half of the input to get the full 52x52->104 output. The bounds are a bit tight, but it should work. Here's a numeric example:
a = 98a67ee86f8cf
b = da19d2c9dfe71
(a >> 20) * (b >> 20) = 820d2e04637bf428
(a >> 8) * (b >> 8) % 2**64 = 0547f8cdb2100210
->
(a >> 8) * (b >> 8) = 820d2e0547f8cdb2100210
(a >> 8) * (b >> 8) = 820d2e0547f8cdb2100210
(a * b) % 2**64 = 080978075f64355f
->
a * b = 820d2e0548080978075f64355f
You can still write code without LLMs, much like you can write code without modern IDEs, or use C and assembly instead of higher-level languages. But there are significant differences between the skills you learn in the process, which I believe inhibits upward mobility.
I must admit I'm surprised to see this -- Lemire offhandedly mentioned in the famous remainder blog post (https://lemire.me/blog/2019/02/08/faster-remainders-when-the...) that 64-bit constants can be used for 32-bit division, and even provided a short example to compute the remainder that way (though not the quotient). Looking a bit more, it seems like libdivide didn't integrate this optimization either.
I guess everyone just assumed that this is so well-known now, that compilers have certainly integrated it, but no one actually bothered to submit a patch until now, when it was reinvented?
The paper doesn't require a bitshift after multiplication -- it directly uses the high half of the product as the quotient, so it saves at least one tick over the solution you mentioned. And on x86, saturating addition can't be done in a tick and 32->64 zero-extension is implicit, so the distinction is even wider.
Unfortunately, that's only vector, and ≤16-bit ints at that, no 32-bit ints; and as the other reply says, nearly non-existent multiply-high which generally makes vectorized div-by-const its own mini-hell (but doing a 2x-width multiply with fixups is still better than the OP 4x-width method).
(...though, x86 does have (v)pmulhw for 16-bit input, so for 16-bit div-by-const the saturating option works out quite well.)
(And, for what it's worth, the lack of 8-bit multiplies on x86 means that the OP method of high-half-of-4x-width-multiply works out nicely for vectorizing dividing 8-bit ints too)
On x86, there is no vector instruction to get the upper half of integer product (64-bits x 64-bits). ARM SVE2 and RISC-V RVV have one, x86 unfortunately does not (and probably wont for a long time as AVX10 does not add it, either).
There is one for the f64 FMA recycling IFMA from AVX512 they have for bignum libraries;it's a 52 bit unsigned multiply and accumulates either the low or the high output halves into a 64bit accumulator.
It's surely no 64 bit but it's much more than 32 bit.
And it's giving you access to the high halves so you can use it to compute 32x32->64 on vector even if only half as packed as that could be.
Honorary mention: byte swapping instructions (originally added to CPUs for endianness conversion) can also be used to redistribute entropy, but they're slightly slower than rotations on Intel, which is why I think they aren't utilized much.