Are there micro-optimizations that eke out small advancements? Yes, absolutely - the modern tokenizer is a good example of that.
Is the core of the technology that complex? No. You could get very far with a naive tokenizer that just tokenized by words and replaced unknown words with `<unk>`. This is extremely simple to implement, and I've trained transformers like this. It (of course) makes a perplexity difference, but the core of the technology is unchanged and quite simple. Most of the complexity is in the hardware, not the software innovations.
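For concreteness, a naive word-level tokenizer of the kind described is only a few lines. This is a minimal sketch under illustrative assumptions (whitespace splitting, a frequency-capped vocabulary, id 0 reserved for `<unk>`); real tokenizers differ in many details:

```python
from collections import Counter

def build_vocab(corpus, max_size=50000):
    """Keep the most frequent words; everything else will map to <unk>."""
    counts = Counter(word for line in corpus for word in line.split())
    vocab = {"<unk>": 0}  # id 0 reserved for out-of-vocabulary words
    for word, _ in counts.most_common(max_size - 1):
        vocab[word] = len(vocab)
    return vocab

def tokenize(text, vocab):
    """Map each whitespace-separated word to its id, or to <unk> if unseen."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]

# Toy usage: "barked" never appears in the corpus, so it becomes <unk> (id 0).
corpus = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocab(corpus)
print(tokenize("the cat barked", vocab))
```

That's the whole idea: no merges, no subwords, just a lookup table with a fallback.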
> And we need many more to get the technology to an actually usable level instead of the current word spaghetti that we get.
I think the current technology is usable.
> you shouldn’t just choose the attention hammer and hammer away
It's a good first choice of hammer, tbph.