
I’ve only scanned the blog but have read OG Data2Vec paper.

Data2Vec presents an architecture that performs well across the main benchmarks for vision, speech and text. The architecture is a variation on the transformer network with slight tweaks for each learning modality. Data2Vec 2.0 seems to be a more efficient variant.

In terms of applications, data2vec gives a single reliable architecture across modalities, whereas before you may have used a CNN for vision and a transformer for text, etc.

Additionally, this research is building towards multi-modal learning, where an architecture could be trained on images, text and speech to learn about a topic. (But to my knowledge there isn’t anything groundbreaking in this space yet.)
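To make the training idea concrete: data2vec is a self-distillation setup where a student network sees a masked input and regresses the representations a teacher network produced from the unmasked input, and the teacher's weights are an exponential moving average (EMA) of the student's. Here is a minimal, purely illustrative sketch of that loop; the single linear map standing in for the transformer, and all names like `train_step` and `TAU`, are my own simplifications, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the deep transformer encoder: one linear map.
student_W = rng.normal(size=(8, 8))
teacher_W = student_W.copy()  # the teacher starts as a copy of the student

TAU = 0.999  # EMA decay for the teacher weights (illustrative value)

def encode(W, x):
    return x @ W

def train_step(x, mask, lr=0.01):
    """One data2vec-style step: mask is 1 for kept features, 0 for masked."""
    global student_W, teacher_W
    # Teacher sees the *unmasked* input; its output is the regression target.
    target = encode(teacher_W, x)
    # Student sees the masked input and predicts the teacher's representation.
    pred = encode(student_W, x * mask)
    err = (pred - target) * (1 - mask)   # loss only at masked positions
    grad = (x * mask).T @ err / len(x)   # gradient of 0.5 * MSE w.r.t. W
    student_W -= lr * grad
    # The teacher slowly tracks the student via an exponential moving average.
    teacher_W = TAU * teacher_W + (1 - TAU) * student_W
    return 0.5 * np.mean(err ** 2)
```

Running `train_step` repeatedly drives the masked-position loss down while the teacher drifts slowly behind the student, which is the core of the self-distillation objective.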



Not only does it perform better as a generalist, it also groups embeddings for the multiple modalities of the same concept close together (keeps bird sounds close to bird images in its internal representation, etc.). This way it can learn in different dimensions, just like humans do.

You could ask it to sing like a bird in text, and it could respond in image + sound, or just sound. This allows for better general understanding of things. Deepmind did the same with Gato: https://www.deepmind.com/publications/a-generalist-agent


I can’t see where in the paper Data2Vec is reported to have contextual embeddings that work across modalities. Can you reference the section in the paper?


Different models for now, but it's essentially doing this through the teacher model. If you look at the numbers, the results are very similar across modalities. They say they want to unify this later.

Gato does this in the same embeddings IIRC, but I skimmed the paper so I could be wrong.


Thanks! I should probably read the gato paper. If only there were enough hours in the day.


Thanks! Nice write up

I think it would help if I knew of a concrete task I could use this for.


Image or speech recognition.


They're using a CNN for the decoder, not a transformer.


They are using CNNs to encode the input. But…

Data2Vec uses a “standard” transformer architecture.[1] In Data2Vec2 the transformer architecture forms the “bulk of the model weights”.[2]

The tweaks I alluded to should have more clearly referenced that each modality uses a different encoder.

[1] Section 3.1, data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. [2] Section 3.2, Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language.
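For anyone confused by the CNN-vs-transformer point above: the layout is per-modality input encoders (a CNN feature extractor for speech/vision, an embedding lookup for text) feeding one shared transformer trunk. A rough structural sketch, assuming simplified stand-ins for each part (all class names and the pooling "convolution" here are illustrative, not from the released code):

```python
import numpy as np

class SharedTrunk:
    """Stand-in for the shared transformer: a single linear layer."""
    def __init__(self, dim):
        self.W = np.eye(dim)

    def __call__(self, tokens):
        return tokens @ self.W

class TextEncoder:
    """Text path: an embedding lookup, as in BERT-style models."""
    def __init__(self, vocab, dim):
        self.table = np.random.default_rng(0).normal(size=(vocab, dim))

    def __call__(self, ids):
        return self.table[ids]

class AudioEncoder:
    """Speech path: a strided 1-D feature extractor (toy mean-pooling
    standing in for the wav2vec 2.0-style CNN)."""
    def __init__(self, dim, stride=4):
        self.stride = stride
        self.dim = dim

    def __call__(self, wave):
        # Downsample the waveform into frame vectors of width `dim`.
        frames = wave[: len(wave) // self.stride * self.stride]
        frames = frames.reshape(-1, self.stride).mean(axis=1, keepdims=True)
        return np.repeat(frames, self.dim, axis=1)

# Different front ends, one trunk: both token streams land in the same space.
trunk = SharedTrunk(dim=16)
text_out = trunk(TextEncoder(vocab=100, dim=16)(np.array([1, 5, 9])))
audio_out = trunk(AudioEncoder(dim=16)(np.random.default_rng(1).normal(size=64)))
```

The point of the sketch is just the shape of the system: the modality-specific encoders differ, but everything downstream of them shares weights, which is where the "bulk of the model weights" live.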



