I wonder whether the output text vectors will be similar to, e.g., the output image vectors when they share the same "content". E.g. will the vector for the text "A dog on a green field" be close in the vector space to an image showing a dog on a green field? If so, it opens up a lot of opportunities in the information retrieval / recommender system space.
Edit: After reading the paper I see that they train an individual model for each modality, which means that they are not comparable. It would be interesting to see though if they could combine sound, text and images into the same training in the future.
Mapping corresponding text and images into the same vector space is exactly how CLIP and other contrastive learning setups work: have a text network and a computer vision network embed input data, and teach the networks to embed related inputs close together and unrelated ones far apart. You can train the models on data scraped from the internet (e.g. images and their captions) in a self-supervised way. As you say, it has huge applications for search, and it is also used by generative models (to guide a generated image towards your textual input).
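The contrastive objective described above can be sketched in a few lines. This is a toy illustration, not CLIP's actual code: the linear projections stand in for CLIP's image and text encoders, and the random vectors stand in for real features. The symmetric InfoNCE loss pushing the diagonal (matching image/caption pairs) to score highest is the real mechanism, though.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(x, W):
    """Project features into the shared space and L2-normalize."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

batch, d_img, d_txt, d_shared = 4, 32, 16, 8
W_img = rng.normal(size=(d_img, d_shared))   # stand-in for the image encoder
W_txt = rng.normal(size=(d_txt, d_shared))   # stand-in for the text encoder

images = rng.normal(size=(batch, d_img))     # toy image features
texts = rng.normal(size=(batch, d_txt))      # toy caption features

img_emb = embed(images, W_img)
txt_emb = embed(texts, W_txt)

# Pairwise cosine similarities; entry (i, i) is the matching pair.
logits = img_emb @ txt_emb.T / 0.07          # temperature-scaled

def cross_entropy(logits, targets):
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

# Symmetric loss: each image should pick its own caption, and vice versa.
targets = np.arange(batch)
loss = 0.5 * (cross_entropy(logits, targets) + cross_entropy(logits.T, targets))
```

Minimizing `loss` pulls each image embedding toward its caption's embedding and away from the other captions in the batch, which is what makes cross-modal retrieval by nearest-neighbor search work.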
BTW it seems that data2vec is at its core just one model; only the input and output parts differ depending on the modality (text, image, sound, etc.). I wouldn't expect the learned representations to be similar for similar content across modalities. The point of data2vec is to use a single model with a very general self-supervised learning setup across tasks, with the different tasks hopefully benefitting from each other during training.
It can do image recognition, speech recognition or text classification (e.g., is this positive or negative sentiment) using the same model architecture (trained on different data each time).
It's very competitive in each of those fields with the existing state of the art. This is interesting because usually the models for each of those fields are different.
> How does it work?
It's trained by passing sequences of data (pixels sequentially, words in order, speech as a wav file) with part of each sequence masked out. The model has to learn to correctly guess what is in that masked area.
The innovation is that instead of predicting the masked area directly, it tries to predict what the representation of that masked area is inside the neural network itself. That's extremely unusual (I don't think I've seen that before) and I'll need to study it more to completely understand why that explains the better performance.
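The masked-representation-prediction idea can be sketched as a toy training loop. This is an illustration under simplified assumptions, not data2vec's actual code: the "network" here is a single tanh layer, whereas real data2vec is a transformer and uses the average of the teacher's top-K layer outputs as the target. The key mechanism is the same, though: a teacher (an exponential moving average of the student) encodes the full input, and the student, seeing the masked input, regresses onto the teacher's representations at the masked positions rather than onto the raw data.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 10, 8

W_student = rng.normal(size=(dim, dim)) * 0.1
W_teacher = W_student.copy()               # teacher starts as a copy of the student

def encode(x, W):
    return np.tanh(x @ W)                  # stand-in for the transformer encoder

lr, ema = 0.01, 0.999
for step in range(100):
    x = rng.normal(size=(seq_len, dim))    # one input sequence
    mask = rng.random(seq_len) < 0.3       # mask ~30% of positions

    target = encode(x, W_teacher)          # teacher sees the *unmasked* input
    x_masked = x.copy()
    x_masked[mask] = 0.0                   # student sees the masked input
    pred = encode(x_masked, W_student)

    # Regression loss only at masked positions, against the teacher's
    # internal representations -- not against the raw masked values.
    err = pred[mask] - target[mask]
    # Gradient of 0.5*||tanh(xW) - t||^2 w.r.t. W (chain rule through tanh)
    grad = x_masked[mask].T @ (err * (1 - pred[mask] ** 2))
    W_student -= lr * grad

    # Teacher slowly tracks the student via an exponential moving average
    W_teacher = ema * W_teacher + (1 - ema) * W_student
```

Because the target is a contextualized representation (the teacher saw the whole sequence), the student is pushed to infer context, not just to inpaint pixels or tokens, which is what makes the same objective usable across modalities.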
HuBERT does something similar; it predicts masked spectrogram regions during its first phase, and then masked embeddings later in training. It's a bit of a pain to get the complicated training regimen to work well, though.
I’ve only scanned the blog post but have read the original Data2Vec paper.
Data2Vec presents an architecture that performs well across the main benchmarks for vision, speech and text. The architecture is a variation on the transformer network with slight tweaks for each learning modality. Data2Vec 2 seems to be a more efficient variant.
In terms of applications, data2vec gives a single reliable architecture across these modalities, whereas before you might have used a CNN for vision, a transformer for text, etc.
Additionally, this research is building towards multi-modal learning where an architecture could be trained on images, text and speech to learn about a topic. (But to my knowledge there isn’t anything ground breaking in this space yet).
Not only does it perform better as a generalist, it also groups embeddings for the multiple modalities of the same concept close together (keeps bird sounds close to bird images in its internal representation, etc.). This way it can learn along different dimensions, just like humans do.
You could ask it to sing like a bird in text, and it could respond in image + sound, or just sound. This allows for better general understanding of things. Deepmind did the same with Gato: https://www.deepmind.com/publications/a-generalist-agent
I can’t see where in the paper Data2Vec is reported to have contextual embeddings that work across modalities. Can you reference the section in the paper?
Different models for now, but it's essentially doing this through the teacher model. If you go down to the numbers, it's very similar. They're saying they want to do this later.
Gato does this in the same embeddings IIRC, but I skimmed the paper so I could be wrong.
Data2Vec uses a “standard” transformer architecture.[1]
In Data2Vec2 the transformer architecture forms the “bulk of the model weights”.[2]
The tweaks I alluded to should have more clearly referenced that each modality uses a different encoder.
[1] section 3.1 data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language.
[2] section 3.2 Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language.
The issue here is it is sort of like taking a technical blog post involving say details of the Haskell types vs Typescript types for a specific problem and explaining it to someone who doesn't know programming.
There's a bunch of context you kind of need for it to make sense.
I'm not convinced that each technical blog post should provide that context every time.