>I personally don't really see the point in giving meaning to the Q, K, V parts. It doesn't actually matter what Q, K, V do, it's the training algorithms' job to assign it a role automatically.
I was under the impression that the names Q, K, and V were historical more than anything. There is still a definite sense of information flowing from the K side to the Q side, though, because the V that flows into the next layer at the query's position comes from the same index as the K that matched.
I agree that it's up to the training to assign roles to the components, but there is still value in investigating the roles that get assigned. The higher-level insights you gather can lead to entirely different mechanisms for performing those roles.
That's very much what most model architectures are: efficiency guides. A multi-layer perceptron with an input width of `context_window*token_size` would be capable of assigning roles better than anything else, but at the cost of being both unfeasibly slow and unfeasibly large.
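To make "unfeasibly large" concrete, here's a back-of-envelope count for such a fully connected layer, using GPT-2-small-ish numbers purely as an illustrative assumption (1024-token context, 768-wide embeddings, the usual 4x hidden expansion):

```python
context_window, token_size = 1024, 768   # illustrative GPT-2-small-ish numbers
in_width = context_window * token_size   # flattened input: 786,432 features
hidden = 4 * in_width                    # typical 4x MLP expansion

# Weights in just the first two dense layers (in->hidden, hidden->in):
params = in_width * hidden + hidden * in_width
print(f"{params:,}")   # 4,947,802,324,992 -- ~5 trillion, before any depth
```

And that's one block; a real model stacks dozens of them, which is exactly the cost that factored attention avoids.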
I'm a little surprised that there isn't a tiny model that generates a V on demand, at the point where it is accumulated with the attention weights: a little model that takes the Q and K values and the embeddings they were generated from. That way, when a partial match between the Q and K produces a decent attention weight, it can use the information about which parts of Q and K matched to decide what V information is appropriate to pass on. It would be slower, and caching seems to be going in the other direction, but it seems like there is information there that should be significant and just isn't being used.
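A minimal sketch of what I mean, with numpy and random (untrained) weights. Everything here is hypothetical: the `v_on_demand` MLP and its shape are my invention, not anything from an existing architecture. The point is just that each value becomes a function of the specific (query, key) pair rather than of the key position alone:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5                         # embedding width, sequence length
X = rng.standard_normal((n, d))     # token embeddings

# Standard Q/K projections (randomly initialized here; trained in practice).
Wq, Wk = rng.standard_normal((2, d, d)) / np.sqrt(d)
Q, K = X @ Wq, X @ Wk

scores = Q @ K.T / np.sqrt(d)
A = np.exp(scores - scores.max(-1, keepdims=True))
A /= A.sum(-1, keepdims=True)       # attention weights, rows sum to 1

# Hypothetical "V on demand": a tiny ReLU MLP mapping
# (q_i, k_j, x_i, x_j) -> v_ij, so the value can depend on *which parts*
# of q and k matched, not just on which position j was attended to.
W1 = rng.standard_normal((4 * d, 2 * d)) / np.sqrt(4 * d)
W2 = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)

def v_on_demand(qi, kj, xi, xj):
    h = np.concatenate([qi, kj, xi, xj])          # (4d,)
    return np.maximum(h @ W1, 0.0) @ W2           # (d,)

# Output for query i is the attention-weighted sum of per-pair values.
out = np.stack([
    sum(A[i, j] * v_on_demand(Q[i], K[j], X[i], X[j]) for j in range(n))
    for i in range(n)
])
print(out.shape)    # (5, 8): one value vector per query, as in ordinary attention
```

The cost problem is visible right in the loop: you pay for n² MLP evaluations per layer instead of n value projections, and none of the per-pair values can be cached across decoding steps, which is why I say caching is going in the other direction.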