> The argument is entirely contrived and has no root in facts. Compiler built-ins appeared in GNU C/C++ compilers as an attempt to replace the non-portable inline assembly with portable primitives
This is missing the point entirely.
GCC needs to emit e.g. memory copies. Before, this was inline assembly replicated over and over. Now, it's a call to __builtin_memcpy.
The point missed is that GCC has always considered the idea of actually calling memcpy entirely unacceptable, because the performance would be horrible compared to an inline implementation.
The proof of this intent lies in later optimizations: not only would GCC never want to emit such slow calls, it replaces your explicit libc calls with builtins, because obviously you wouldn't want to do something as slow as a dynamic linkage call.
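A minimal sketch of what that replacement buys (assuming gcc -O2 on x86-64; exact codegen varies by target and GCC version):

    /* gcc -O2 typically expands this fixed-size memcpy inline via
       __builtin_memcpy - a single 8-byte load, no call at all.
       Compile with -fno-builtin-memcpy to force the real libc call
       and compare the disassembly. */
    #include <stdint.h>
    #include <string.h>

    uint64_t load_u64(const void *p) {
        uint64_t v;
        memcpy(&v, p, sizeof v);  /* no memcpy in the emitted code */
        return v;
    }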
With static linking and LTO, the libc implementation becomes as good as the builtin, rendering the latter pointless. GCC just cannot assume this to be the case.
> CPU caches work at addresses being accessed level, not at the process level.
No, CPU caches do not work on addresses, they work on tags, to be pedantic. Either way, I never said that caches are process-level. I said that they do not survive across multiple processes - not because of flushing, but because of trashing. I.e., if you have three processes A, B and C, where A and C run shared code while B runs something else, and you switch A -> kernel -> B -> kernel -> C, then by the time you made it from A to C your cache has been trashed by both B and the kernel.
Now, instead of 3 processes and one routine, make it thousands of threads and gigabytes of shared libraries.
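A rough, assumption-heavy sketch of the effect (Linux/POSIX; the buffer sizes are made-up stand-ins for "A's working set" and "everything B touches", and on a multi-core machine the two processes may only be fighting over L3):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    enum { WS = 4 << 20, POLLUTE = 64 << 20, ROUNDS = 8 };

    static double now_ms(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
    }

    /* Touch one byte per cache line so every line gets pulled in. */
    static unsigned long touch(volatile unsigned char *buf, size_t n) {
        unsigned long sum = 0;
        for (size_t i = 0; i < n; i += 64) sum += buf[i];
        return sum;
    }

    int main(void) {
        int to_b[2], to_a[2];
        pipe(to_b); pipe(to_a);
        if (fork() == 0) {                     /* "process B" */
            close(to_b[1]); close(to_a[0]);
            volatile unsigned char *junk = malloc(POLLUTE);
            memset((void *)junk, 1, POLLUTE);
            char c;
            while (read(to_b[0], &c, 1) == 1) {
                touch(junk, POLLUTE);          /* evict whatever A had cached */
                write(to_a[1], &c, 1);
            }
            _exit(0);
        }
        close(to_b[0]); close(to_a[1]);
        volatile unsigned char *ws = malloc(WS);  /* A's working set */
        memset((void *)ws, 1, WS);
        touch(ws, WS);                         /* warm the cache */
        for (int i = 0; i < ROUNDS; i++) {
            char c = 0;
            write(to_b[1], &c, 1);             /* A -> kernel -> B */
            read(to_a[0], &c, 1);              /* B -> kernel -> A */
            double t = now_ms();
            touch(ws, WS);                     /* cold again after B ran */
            printf("re-walk after switch: %.2f ms\n", now_ms() - t);
        }
        close(to_b[1]);                        /* lets B's read() see EOF */
        return 0;
    }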
> One copy of «strlen» in a single memory page at a single physical memory address shared across all processes
Again, strlen is a terrible example: 10k copies of strlen, each a handful of bytes in the current instruction stream, prefetched and branch-predicted, will outperform that shared page to an outright ridiculous extent and might even be smaller in total: 10k copies of a handful of bytes vs. 10k calls and PLT indirections plus the un-inlined function. Because it is literally less memory, it also trashes the TLB less.
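To put "a handful of bytes" in perspective, a sketch (and gcc -O2 will even constant-fold the builtin strlen on a string literal, so there's often no loop left at all):

    /* The inlined form: a few instructions living in the caller's own
       instruction stream - no call, no PLT hop, no extra code page. */
    #include <stddef.h>

    static inline size_t inline_strlen(const char *s) {
        const char *p = s;
        while (*p) p++;
        return (size_t)(p - s);
    }

    /* Meanwhile gcc -O2 folds strlen("hello") to the constant 5
       outright via __builtin_strlen - zero bytes of loop, let
       alone a call through the PLT. */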
Even in more realistic cases, remember the TLB hit of the PLT in each application, not to mention the many more pages consumed by the bulkier implementation. In fact, let's focus a bit on the TLB. The most basic Gtk app links at least 80 libraries worth over 90 megabytes on my system. An L1 TLB has about 64 entries, the L2 a couple thousand or so - with 4KB pages that's roughly 8MB of reach. In other words, even the L2 TLB is an order of magnitude too small to keep the libraries of the simplest possible Gtk app covered.
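The back-of-envelope math, with entry counts that are ballpark assumptions rather than any specific CPU's spec sheet:

    /* TLB reach = entries * page size, assuming plain 4 KiB pages
       (huge pages change the picture, but library mappings are
       typically 4 KiB). Entry counts below are rough assumptions. */
    #include <stdio.h>

    int main(void) {
        const unsigned long page = 4096;
        const unsigned long l1_entries = 64;    /* assumed */
        const unsigned long l2_entries = 2048;  /* assumed */
        printf("L1 TLB reach: %lu KiB\n", l1_entries * page / 1024);
        printf("L2 TLB reach: %lu MiB\n", l2_entries * page / (1024 * 1024));
        return 0;
    }

That prints 256 KiB and 8 MiB respectively - against 90MB of mapped libraries.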
Heck, take just libicudata at 30MB. Of course, I wouldn't suggest statically linking that; I'm just pointing out that a single dependency of a Gtk app is enough to fill the TLB several times over, nullifying the idea of any cache benefit to using these libraries.
"Yes but at least they can have libicudata in L3!" - yeah, no - not only would it compete with other dynamic dependencies (for this and other processes), but more importantly the applications also need to process data. A single Gtk app on a 4k monitor will, for example, be managing at least two 32MB framebuffers (3840x2160x4, x2 for double buffering), so that's most of your cache gone during draw before you even consider the input to the draw or any actual functionality of the app!
The best case for dynamic linkage performance is where call cost is irrelevant, e.g. calling long-running compute routines. There is no point whatsoever in considering CPU caches outside the scope of the currently running process.