In NLP, the simplest positional embeddings are just the sequence [0, 1, 2, 3, 4, ..., n], with n being the length of the sentence. Once you start processing text, you split it into individual embeddings for each word, getting a list of embeddings [E_1, E_2, E_3, ..., E_n]. Then you sum the token embeddings and the positional embeddings element-wise along the sequence axis, so each embedding E_i gets its own position's embedding added across its feature dimensions. This works for encoding position because attention on its own is permutation-invariant: it has no notion of word order, but the positional signal you just mixed in is something it can learn to pick up on. There are many ways to model positional embeddings, e.g. by generating sine/cosine curves at different frequencies, or by learning them. You can extend this approach to ViTs, where each image patch gets a positional embedding instead of each word.
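
Here is a minimal sketch of that addition, using the sine/cosine variant as the example; the function name, shapes, and toy dimensions are made up for illustration and don't correspond to any particular library's API.

```python
import numpy as np

def sinusoidal_positions(seq_len: int, dim: int) -> np.ndarray:
    """Classic sine/cosine positional embeddings, shape (seq_len, dim).
    Assumes dim is even for simplicity."""
    positions = np.arange(seq_len)[:, None]                         # (seq_len, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = positions * freqs                                      # (seq_len, dim/2)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles)   # even feature dims get sine
    pe[:, 1::2] = np.cos(angles)   # odd feature dims get cosine
    return pe

# Toy token embeddings: 5 "words", each an 8-dim vector.
seq_len, dim = 5, 8
token_embeddings = np.random.randn(seq_len, dim)

# Element-wise sum: row i of the positional matrix is added to embedding E_i.
inputs = token_embeddings + sinusoidal_positions(seq_len, dim)
print(inputs.shape)  # (5, 8)
```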