Author of Highway here :) PSHUFB is TableLookupBytes, or TableLookupBytesOr0 if you want its 0x80 -> 0 semantics.
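For anyone unfamiliar with the 0x80 -> 0 behavior being referenced: PSHUFB treats each index byte's high bit as a "write zero" flag. A scalar model of the per-byte semantics (my sketch for illustration, not the Highway implementation):

```cpp
#include <array>
#include <cstdint>

// Scalar model of PSHUFB / TableLookupBytesOr0 on one 16-byte lane:
// each output byte is table[index & 0x0F], except that an index with
// the high bit set (>= 0x80) yields 0 instead of a table lookup.
std::array<uint8_t, 16> TableLookupBytesOr0Model(
    const std::array<uint8_t, 16>& table,
    const std::array<uint8_t, 16>& indices) {
  std::array<uint8_t, 16> out{};
  for (int i = 0; i < 16; ++i) {
    out[i] = (indices[i] & 0x80) ? 0 : table[indices[i] & 0x0F];
  }
  return out;
}
```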
Can you help me understand why Highway might be more difficult to use? That would be very surprising for a largish application (a few thousand lines of vector code), where the alternative is rewriting the code for 3+ different sets of intrinsics.
If I have some time I'll try porting a small parser I wrote the other day to Highway. I expect to gain a lot of bitcasts to change the element type, a lot of angle brackets, and a lot of compile time. It looks like there isn't a way to ask for permute4x64, so I may have to replace my permute4x64 with SwapAdjacentBlocks, shuffle, and blendv, which will probably cost a register or two and some cycles.
I don't think there are necessarily problems with the design of Highway given its constraints and goals. I would just personally find it easier for most of the SIMD code I've written (this tends to be mostly parsers and serializers) to write several implementations targeting AVX2, AVX-512, NEON, and maybe SSE3.
Thanks for your feedback! I'd be happy to advise on the port, feel free to open an issue if you'd like to discuss.
I suppose there is a tradeoff between zero type safety + less typing, vs more BitCast verbosity but catching bugs such as arithmetic vs logical shifts, or zero-extension when we wanted sign-extension.
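As a concrete scalar illustration of the kind of bug an element type catches: the same bit pattern shifted as signed is an arithmetic shift (sign bit replicated), while shifted as unsigned it is a logical shift (zero fill). A typed vector API forces the choice via the element type; raw intrinsics on an anonymous __m256i do not. A minimal sketch:

```cpp
#include <cstdint>

// Same 0x80 bit pattern, two element types, two different results.
int8_t arithmetic_shift(int8_t x) {
  return int8_t(x >> 1);  // sign bit replicated: -128 -> -64
}
uint8_t logical_shift(uint8_t x) {
  return uint8_t(x >> 1);  // zero fill: 0x80 -> 0x40
}
```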
Would be interesting if you see a difference in compile time. On x86 much of the cost is likely parsing immintrin.h (*mmintrin are about 500 KiB on MSVC). Re-#including your own source code doesn't seem like it would be worse than parsing actually distinct source files.
Many of the Highway swizzle ops are fixed pattern; several use permute4x64 internally. Can you share the indices you'd like? If there is a use case, we are happy to add an op for that.
It is impressive you are willing (and able) to write and maintain code for 4 ISAs. Still, wouldn't it be nice to have also SVE for Graviton3, SVE2 on new Armv9 mobile/server CPUs, and perhaps RISC-V V in future?
> Can you share the indices you'd like? If there is a use case, we are happy to add an op for that.
I'm using 2031 and 3120 (indices and the rest of this comment in memory order assuming an eventual store, so that's imm8 = (2 << 0) | (0 << 2) | (3 << 4) | (1 << 6) etc.). This is pretty niche stuff! It's also not the only way to accomplish this thing, since I am starting with these two vectors of u64:
v1 = a a b b
v2 = c c d d
then as an unimportant intermediate step
v3 = unpacklo(v1 and v2 in some order) = a c b d
and I ultimately want
v4 = permute4x64(v3, 3120) = c d b a
v5 = permute4x64(v3, 2031) = b a c d
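For context on what the immediate encodes: each output lane of _mm256_permute4x64_epi64 takes its 2-bit source-lane index from the corresponding bit pair of imm8. A scalar model of the mechanics (my sketch, independent of the specific indices above):

```cpp
#include <array>
#include <cstdint>

// Scalar model of _mm256_permute4x64_epi64: output lane i takes
// input lane ((imm8 >> (2 * i)) & 3).
std::array<uint64_t, 4> Permute4x64Model(const std::array<uint64_t, 4>& v,
                                         int imm8) {
  std::array<uint64_t, 4> out{};
  for (int i = 0; i < 4; ++i) {
    out[i] = v[(imm8 >> (2 * i)) & 3];
  }
  return out;
}
```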
> It is impressive you are willing (and able) to write and maintain code for 4 ISAs.
This is mostly hypothetical, right now I'm writing software targeting specific servers and only using AVX2 and AVX-512. I may get to port stuff to NEON soon, but I've only skimmed the docs and said to myself "looks like the same stuff without worrying about lanes," not written any code.
> Still, wouldn't it be nice to have also SVE for Graviton3, SVE2 on new Arm9 mobile/server CPUs, and perhaps RISC-V V in future?
I don't know much about variable-width vector extensions like SVE and SVE2, but my impression is that they are difficult to use for the sorts of things I usually write. These things are often a little bit like chess engines and a little bit less like linear algebra systems, so the implementation is designed around a specific width. This isn't really a rebuttal, though; it just means I am signing up for even more work, writing code on a per-extension, per-width basis instead of a per-extension basis, if I ever move on to these targets, which I certainly should if they have large performance benefits compared to NEON.
Thanks for the example. This seems like a reasonable thing to want, and as you say it's not ideal to SwapAdjacentBlocks and then shuffle again.
It's not yet clear to me how to define an op for this that's still useful when vectors are only 128 bits.
Until then, it seems you could use TableLookupLanes (general permutation) at the same latency as permute4x64, plus loading an index vector constant?
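The difference between the two ops is where the indices live: permute4x64 bakes them into an immediate, while TableLookupLanes reads them from a vector, which is why the workaround costs one register holding an index constant. A scalar model of the variable-index form (a sketch, not Highway's implementation):

```cpp
#include <array>
#include <cstdint>

// Scalar model of a full-width lane permutation driven by an index
// vector (as in TableLookupLanes): out[i] = v[idx[i]].
std::array<uint64_t, 4> TableLookupLanesModel(
    const std::array<uint64_t, 4>& v, const std::array<uint64_t, 4>& idx) {
  std::array<uint64_t, 4> out{};
  for (int i = 0; i < 4; ++i) {
    out[i] = v[idx[i] & 3];  // indices come from a register, not an imm8
  }
  return out;
}
```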
> These things are often a little bit like chess engines and a little bit less like linear algebra systems, so the implementation is designed around a specific width.
hm. For something like a bitboard or AES state, typically wider vectors mean you can do multiple independent instances at once.
Likely that's already happening for your AVX-512 version? If you can define your problem in terms of a minimum block size of 128 bits, it should be feasible.
> I ever move on to these targets, which I certainly should if they have large performance benefits compared to NEON.
SVE is the only way to access wider vectors (currently 256 or 512 bits) on Arm.