
I’ve only scanned the blog but have read OG Data2Vec paper.

Data2Vec presents an architecture that performs well across the main benchmarks for vision, speech and text. The architecture is a variation on the transformer network with slight tweaks for each learning modality. Data2Vec 2.0 seems to be a more efficient variant.

In terms of applications, data2vec gives a single reliable architecture across modalities, whereas before you may have used a CNN for vision and a transformer for text, etc.

Additionally, this research is building towards multi-modal learning, where an architecture could be trained on images, text and speech to learn about a topic. (But to my knowledge there isn’t anything groundbreaking in this space yet.)
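To make the training idea concrete: data2vec is a self-distillation setup where a student network sees a masked input and regresses the representations a teacher network produced from the unmasked input, and the teacher's weights are an exponential moving average (EMA) of the student's. Here is a minimal, purely illustrative sketch of that loop; the single linear map standing in for the transformer, and all names like `train_step` and `TAU`, are my own simplifications, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the deep transformer encoder: one linear map.
student_W = rng.normal(size=(8, 8))
teacher_W = student_W.copy()  # the teacher starts as a copy of the student

TAU = 0.999  # EMA decay for the teacher weights (illustrative value)

def encode(W, x):
    return x @ W

def train_step(x, mask, lr=0.01):
    """One data2vec-style step: mask is 1 for kept features, 0 for masked."""
    global student_W, teacher_W
    # Teacher sees the *unmasked* input; its output is the regression target.
    target = encode(teacher_W, x)
    # Student sees the masked input and predicts the teacher's representation.
    pred = encode(student_W, x * mask)
    err = (pred - target) * (1 - mask)   # loss only at masked positions
    grad = (x * mask).T @ err / len(x)   # gradient of 0.5 * MSE w.r.t. W
    student_W -= lr * grad
    # The teacher slowly tracks the student via an exponential moving average.
    teacher_W = TAU * teacher_W + (1 - TAU) * student_W
    return 0.5 * np.mean(err ** 2)
```

Running `train_step` repeatedly drives the masked-position loss down while the teacher drifts slowly behind the student, which is the core of the self-distillation objective.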



Not only does it perform better as a generalist, it also groups embeddings for the multiple modalities of the same concept close together (keeps bird sounds close to bird images in its internal representation, etc.). This way it can learn in different dimensions, just like humans do.

You could ask it to sing like a bird in text, and it could respond in image + sound, or just sound. This allows for better general understanding of things. Deepmind did the same with Gato: https://www.deepmind.com/publications/a-generalist-agent


I can’t see where in the paper Data2Vec is reported to have contextual embeddings that work across modalities. Can you reference the section in the paper?


Different models for now, but it's essentially doing this through the teacher model. If you look at the numbers, the results are very similar across modalities. They say they want to unify this later.

Gato does this in the same embeddings IIRC, but I skimmed the paper so I could be wrong.


Thanks! I should probably read the gato paper. If only there were enough hours in the day.


Thanks! Nice write up

I think it would help if I knew of a concrete task I could use this for.


Image or speech recognition.


They're using a CNN for the decoder, not a transformer.


They are using CNNs to encode the input. But…

Data2Vec uses a “standard” transformer architecture.[1] In Data2Vec2 the transformer architecture forms the “bulk of the model weights”.[2]

The tweaks I alluded to should have more clearly referenced that each modality uses a different encoder.

[1] Section 3.1, data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. [2] Section 3.2, Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language.
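For anyone confused by the CNN-vs-transformer point above: the layout is per-modality input encoders (a CNN feature extractor for speech/vision, an embedding lookup for text) feeding one shared transformer trunk. A rough structural sketch, assuming simplified stand-ins for each part (all class names and the pooling "convolution" here are illustrative, not from the released code):

```python
import numpy as np

class SharedTrunk:
    """Stand-in for the shared transformer: a single linear layer."""
    def __init__(self, dim):
        self.W = np.eye(dim)

    def __call__(self, tokens):
        return tokens @ self.W

class TextEncoder:
    """Text path: an embedding lookup, as in BERT-style models."""
    def __init__(self, vocab, dim):
        self.table = np.random.default_rng(0).normal(size=(vocab, dim))

    def __call__(self, ids):
        return self.table[ids]

class AudioEncoder:
    """Speech path: a strided 1-D feature extractor (toy mean-pooling
    standing in for the wav2vec 2.0-style CNN)."""
    def __init__(self, dim, stride=4):
        self.stride = stride
        self.dim = dim

    def __call__(self, wave):
        # Downsample the waveform into frame vectors of width `dim`.
        frames = wave[: len(wave) // self.stride * self.stride]
        frames = frames.reshape(-1, self.stride).mean(axis=1, keepdims=True)
        return np.repeat(frames, self.dim, axis=1)

# Different front ends, one trunk: both token streams land in the same space.
trunk = SharedTrunk(dim=16)
text_out = trunk(TextEncoder(vocab=100, dim=16)(np.array([1, 5, 9])))
audio_out = trunk(AudioEncoder(dim=16)(np.random.default_rng(1).normal(size=64)))
```

The point of the sketch is just the shape of the system: the modality-specific encoders differ, but everything downstream of them shares weights, which is where the "bulk of the model weights" live.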



