It seems to me that a table of function pointers is all that's required. Highway is a little fancier in that the first entry is a trampoline that first detects CPU capabilities and then calls your pointer; subsequent calls go straight to the appropriate function.
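For concreteness, here is a minimal sketch of that table-plus-trampoline idea. It is not Highway's actual code: every name (`CopyFn`, `kTable`, `Trampoline`, `CpuHasAvx2`) is made up for illustration, and the two "implementations" are just stand-ins that call `memcpy`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical signature shared by all implementations.
using CopyFn = void (*)(void* dst, const void* src, std::size_t n);

// Stand-ins: real code would compile these for different instruction sets.
void CopyBaseline(void* dst, const void* src, std::size_t n) { std::memcpy(dst, src, n); }
void CopyAvx2(void* dst, const void* src, std::size_t n) { std::memcpy(dst, src, n); }

// Assumed detection helper; falls back to "no" on non-x86 or other compilers.
bool CpuHasAvx2() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
  return __builtin_cpu_supports("avx2");
#else
  return false;
#endif
}

void Trampoline(void* dst, const void* src, std::size_t n);

// Entry 0 is the trampoline; the rest are the real implementations.
constexpr CopyFn kTable[] = {&Trampoline, &CopyAvx2, &CopyBaseline};

// Index into kTable; starts at the trampoline.
std::atomic<std::size_t> g_index{0};

// First call lands here: detect once, remember the choice, then forward.
void Trampoline(void* dst, const void* src, std::size_t n) {
  const std::size_t chosen = CpuHasAvx2() ? 1 : 2;
  assert(chosen < sizeof(kTable) / sizeof(kTable[0]));
  g_index.store(chosen, std::memory_order_relaxed);
  kTable[chosen](dst, src, n);
}

// Public entry point: one indirect call through the table.
inline void Copy(void* dst, const void* src, std::size_t n) {
  kTable[g_index.load(std::memory_order_relaxed)](dst, src, n);
}
```

After the first `Copy`, `g_index` no longer points at entry 0, so later calls skip detection and pay only for the table load and the indirect call.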
Do the (experimental/non-portable) compiler versions contribute any additional value?
I gather from the linked-to video that binary-load-time selection has better run-time performance than init-at-first-call run-time dispatch, and doesn't have the tradeoff between performance and security.
Thanks for the pointer. I read the video transcript and agree with their premise that indirect calls are slow.
There are several ways to proceed from there. One could simply inline FastMemcpy into a larger block of code, and basically hoist the dispatch up until its overhead is low enough.
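A rough sketch of what hoisting could look like, under my own assumptions: the `kTarget` template parameter stands in for "compiled for a given instruction set", and `Kernel`, `ProcessAll` and `ChooseProcess` are hypothetical names. The small kernel is inlined into each per-target loop, so the indirect call is paid once per batch rather than once per element.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Small per-element function; cheap enough that a call per element would hurt.
template <int kTarget>
inline std::uint32_t Kernel(std::uint32_t x) {
  return x * 2u + 1u;  // Same math for every target in this sketch.
}

// The whole loop is instantiated per target, with Kernel inlined into it.
template <int kTarget>
void ProcessAll(const std::uint32_t* in, std::uint32_t* out, std::size_t n) {
  assert(n == 0 || (in != nullptr && out != nullptr));
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = Kernel<kTarget>(in[i]);
  }
}

using ProcessFn = void (*)(const std::uint32_t*, std::uint32_t*, std::size_t);

// Dispatch happens here, once, outside the hot loop.
ProcessFn ChooseProcess(bool has_wide_simd) {
  return has_wide_simd ? &ProcessAll<1> : &ProcessAll<0>;
}
```

The caller does `ChooseProcess(...)` once and then invokes the returned pointer on whole arrays; how far up to hoist depends on how much work sits under each call.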
Instead, what they end up doing is pessimizing memcpy: it is no longer inlined, it goes through yet another thunk call, and the cost of patching is deferred until your code is paged in (which could be in a performance- or latency-sensitive area). Indeed, their microbenchmark does not prove a real-world benefit, i.e. that the thunks and patching actually cost less than what is saved on dispatch. It falls into the usual trap of repeating something 100K times, which implies perfect branch prediction, and that would not be the case in normal runs.
Also, the detection logic is limited to rules known to the OS; certainly sufficient for detecting AVX-512, probably harder to do something like "is it an AVX-512 where compressstoreu or vpconflict are super slow". And certainly impossible to do something reasonable like "just measure how my code performs for several codepaths and pick the best", or "specialize my code for SVE-256 in Graviton3".
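To illustrate the "just measure and pick the best" option, here is a minimal sketch assuming you control the dispatch yourself. The names (`SumFn`, `SumScalar`, `SumUnrolled4`, `PickFastest`) are hypothetical, the candidates are toy functions, and the timing is deliberately naive (best of a few repetitions with `steady_clock`).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

using SumFn = std::uint64_t (*)(const std::uint32_t*, std::size_t);

// Candidate 1: plain loop.
std::uint64_t SumScalar(const std::uint32_t* p, std::size_t n) {
  std::uint64_t s = 0;
  for (std::size_t i = 0; i < n; ++i) s += p[i];
  return s;
}

// Candidate 2: four independent accumulators, then a scalar tail.
std::uint64_t SumUnrolled4(const std::uint32_t* p, std::size_t n) {
  std::uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    s0 += p[i];
    s1 += p[i + 1];
    s2 += p[i + 2];
    s3 += p[i + 3];
  }
  for (; i < n; ++i) s0 += p[i];
  return s0 + s1 + s2 + s3;
}

// Times every candidate on the given data and returns the fastest one.
SumFn PickFastest(const std::vector<SumFn>& candidates,
                  const std::uint32_t* data, std::size_t n, int reps = 5) {
  assert(!candidates.empty());
  using Clock = std::chrono::steady_clock;
  SumFn best = nullptr;
  Clock::duration best_time = Clock::duration::max();
  volatile std::uint64_t sink = 0;  // Keeps the calls from being optimized out.
  for (SumFn fn : candidates) {
    Clock::duration fastest = Clock::duration::max();
    for (int r = 0; r < reps; ++r) {
      const Clock::time_point t0 = Clock::now();
      sink = fn(data, n);
      const Clock::time_point t1 = Clock::now();
      if (t1 - t0 < fastest) fastest = t1 - t0;  // Keep the best repetition.
    }
    if (fastest < best_time) {
      best_time = fastest;
      best = fn;
    }
  }
  return best;
}
```

The result of `PickFastest` could then be stored in the same kind of function-pointer table as any other dispatch decision; a load-time mechanism driven by fixed OS rules has no hook for this.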
So, besides the portability issue, and actually pessimizing short functions (instead of just inlining them), this prevents you from doing several interesting kinds of dispatch.
Caveat emptor.
And in Clang: https://releases.llvm.org/7.1.0/tools/clang/docs/AttributeRe...
I've never used them.
Don't know about other compilers.