That's pipelining, and it's good for throughput but it sacrifices latency. Audio is not a continuous bit stream but a series of small packets. To begin working on the next one on the CPU while the previous one is on the GPU requires two packets in flight, which necessarily means higher latency.
I don't see that. If the CPU part starts processing packet #2 while the GPU processes packet #1, rather than after it has done so, it will have the data that has to be sent to the GPU for packet #2 ready earlier, so it can send it earlier, potentially the moment the GPU has finished processing packet #1 (if the GPU is powerful enough, possibly even before that).
That's why I asked about the plug-in APIs. They may have to be async, with functions returning not when they're fully done processing a 'packet', but as soon as they can accept more data, which may be earlier.
But in general no, you can't begin processing a buffer before finishing the previous one, because the processing is stateful and you would introduce a data race. And you can't synchronize that state with something as simple as a lock, because blocking the real-time audio thread is forbidden.
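To make the state dependency concrete, here's a minimal sketch (assuming a made-up one-pole low-pass filter as the stateful effect; the names are illustrative, not any particular plug-in API). The state written while processing buffer #1 is the starting state for buffer #2, so the two buffers can't be processed concurrently without a race:

```cpp
#include <cstddef>

// Hypothetical stateful effect: a one-pole low-pass filter.
// The state `z` left behind by buffer #1 is the input state for buffer #2.
struct OnePoleLowPass {
    float a = 0.1f;  // smoothing coefficient
    float z = 0.0f;  // filter state carried across buffers

    void process(float* buf, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            z += a * (buf[i] - z);  // each output depends on the previous state
            buf[i] = z;
        }
    }
};
```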
You can buffer ahead of time, but that introduces latency. You can't do things ahead of time without introducing delay, because of causality: you can't start processing packet #2 while packet #1 is in flight, because packet #2 hasn't happened yet.
To make it a bit clearer why you can't do this without more latency:
Under the hood there is an audio device that reads/writes a buffer at a fixed interval of time, call that N (a number of samples; divide by the sample rate to get seconds). When that interval is up, the driver swaps the buffer for a new one of the same size. The OS now has exactly N / sample_rate seconds to fill the buffer before it's swapped back with the device driver.
The kernel maps or copies the buffer into virtual memory, wakes the user-space process, calls a function to fill the buffer, and returns to kernel space to commit it back to the driver. The buffer you read/write from your process is packet #1. Packet #2 doesn't arrive until the interval ticks again and the buffers are exchanged.
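Roughly like this, as a sketch (the driver interface here is invented for illustration; real APIs like ALSA or CoreAudio differ in the details, but the swap-and-deadline shape is the same):

```cpp
#include <cstddef>
#include <vector>

// Simplified double-buffering model: the device reads one buffer while the
// application fills the other, and they swap every N samples.
constexpr std::size_t kFramesPerBuffer = 256;  // N (assumed value)
constexpr double kSampleRate = 48000.0;        // assumed value
// Deadline to fill a buffer: N / sample_rate seconds (~5.3 ms here).

struct DoubleBuffer {
    std::vector<float> bufs[2] = {std::vector<float>(kFramesPerBuffer),
                                  std::vector<float>(kFramesPerBuffer)};
    int playing = 0;  // index of the buffer the device is currently reading

    // Called once per interval by the (hypothetical) driver tick.
    std::vector<float>& swap() {
        playing ^= 1;              // device moves on to the other buffer
        return bufs[playing ^ 1];  // the app must fill this one before the next tick
    }
};
```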
Now say that processing packet #1 takes longer than N samples' worth of time, or needs at least M samples of data to do its work, where M > N. What you do is copy your N samples of packet #1 into a temporary buffer, wait until M samples have been acquired to do your work, but concurrently read out of your internal buffer delayed by M - N samples. You've successfully done more work, but delayed the stream by the difference.
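A sketch of that accumulate-and-delay scheme (M is assumed to be a multiple of N to keep it short, and the "heavy" processing is a placeholder):

```cpp
#include <cstddef>
#include <deque>

// The algorithm needs M input samples per block, but each callback only
// delivers N (M > N). Output starts with M - N samples of silence, which is
// exactly the extra latency this buffering adds to the stream.
struct BlockAccumulator {
    std::size_t N, M;
    std::deque<float> pending;  // input waiting until M samples are available
    std::deque<float> ready;    // processed output waiting to be read out

    BlockAccumulator(std::size_t n, std::size_t m) : N(n), M(m) {
        ready.assign(M - N, 0.0f);  // the unavoidable M - N samples of delay
    }

    // Called once per audio callback with N samples in and N samples out.
    void process(const float* in, float* out) {
        pending.insert(pending.end(), in, in + N);
        if (pending.size() >= M) {
            for (std::size_t i = 0; i < M; ++i)
                ready.push_back(pending[i] * 0.5f);  // placeholder "heavy" work
            pending.erase(pending.begin(), pending.begin() + M);
        }
        for (std::size_t i = 0; i < N; ++i) {
            out[i] = ready.front();
            ready.pop_front();
        }
    }
};
```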
You're requiring that packet #2 be available before packet #1 has finished. That's higher latency than the goal, which is for packet #1 to be processed and sent to the output before packet #2 has arrived at all.
Or perhaps you're missing that there's an "in" event as part of this, like a MIDI instrument? It's an in -> effect -> out sequence. So minimizing latency means that the "effect" part must take as little time as possible, which means you want it to run faster than "in" can feed it data.
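For a rough sense of the budget (numbers assumed purely for illustration): at 48 kHz with 128-sample packets, each packet is about 2.67 ms, so the floor for in -> out is one input packet plus one output packet, and whatever time the effect takes is added directly on top of that.

```cpp
#include <cstdio>

// Back-of-the-envelope latency budget for the in -> effect -> out path.
// Sample rate and packet size are assumed values for illustration only.
int main() {
    const double sampleRate = 48000.0;  // Hz
    const double packetSize = 128.0;    // samples per packet

    const double packetMs = packetSize / sampleRate * 1000.0;  // ~2.67 ms

    std::printf("packet period:        %.2f ms\n", packetMs);
    std::printf("floor (1 in + 1 out): %.2f ms\n", 2.0 * packetMs);
    std::printf("effect processing time adds directly on top of the floor\n");
    return 0;
}
```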