> slow not due to mem bandwidth, but due to PCIe bandwidth, which is the bottleneck.
> On server/workstation motherboards ... the memory throughput [to system RAM] achievable by the GPU becomes a very small fraction of the system memory bandwidth.
Yes, this is a critical point. It means this is only realistically useful for prefill, which is compute-bound rather than memory-bandwidth-bound.
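To put rough numbers on that, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: a dense 70B-parameter model in 16-bit weights, about 32 GB/s of usable PCIe bandwidth, 100 TFLOP/s of usable GPU compute, and the usual approximation of about 2 FLOPs per parameter per token (attention cost is ignored). It assumes the weights are streamed over the link once per forward pass.

```python
def step_time_s(n_tokens, params, bytes_per_param, link_bytes_per_s, flops_per_s):
    """Roofline-style lower bound for one forward pass over n_tokens."""
    # Weights cross the link once per pass, no matter how many tokens are batched.
    transfer = params * bytes_per_param / link_bytes_per_s
    # Approximation: ~2 FLOPs per parameter per token for a dense model.
    compute = 2 * params * n_tokens / flops_per_s
    return max(transfer, compute)

# Assumed, illustrative numbers (not measurements).
PARAMS = 70e9
BYTES_PER_PARAM = 2
PCIE_BYTES_PER_S = 32e9
GPU_FLOPS_PER_S = 100e12

decode_s = step_time_s(1, PARAMS, BYTES_PER_PARAM, PCIE_BYTES_PER_S, GPU_FLOPS_PER_S)
prefill_s = step_time_s(4096, PARAMS, BYTES_PER_PARAM, PCIE_BYTES_PER_S, GPU_FLOPS_PER_S)

# Number of tokens per pass at which compute time catches up with transfer time.
break_even_tokens = (BYTES_PER_PARAM / PCIE_BYTES_PER_S) * GPU_FLOPS_PER_S / 2

print(f"decode:  {1 / decode_s:.2f} tok/s")
print(f"prefill: {4096 / prefill_s:.0f} tok/s")
print(f"break-even batch: {break_even_tokens:.0f} tokens")
```

Under these assumptions a single-token decode step is dominated by the multi-second weight transfer, while a 4096-token prefill pass is long enough that compute becomes the limit and the transfer is amortized across all the tokens.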
Incidentally, prefill is also where caching, say, a system prompt saves you some $ on API usage with LLM providers: they only compute the KV cache for the new tokens after the system prompt.
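A minimal sketch of that idea, assuming prompts are plain lists of token IDs and the cache holds the KV entries for one previously seen prefix. The function name `uncached_suffix` is hypothetical, and real providers have their own matching and billing rules, so treat this as an illustration of the principle only.

```python
def uncached_suffix(prompt_tokens, cached_tokens):
    """Return the tokens of the prompt that still need prefill.

    Only the longest common prefix with the cached sequence can be reused,
    because each KV entry depends on every token before it.
    """
    shared = 0
    for new, old in zip(prompt_tokens, cached_tokens):
        if new != old:
            break
        shared += 1
    return prompt_tokens[shared:]

# Hypothetical token IDs, purely for illustration.
system_prompt = [101, 7, 7, 42, 9]
request = system_prompt + [55, 56, 57]

to_compute = uncached_suffix(request, system_prompt)
print(len(to_compute), "of", len(request), "tokens need prefill")
```

Note that the reuse only works from the start of the prompt: a change to an early token invalidates everything cached after it.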