Data2vec 2.0: Highly efficient self-supervised learning for vision, speech, text

opdahl · on Dec 14, 2022

I wonder if the outputted text-vectors will be similar to e.g the outputted image vector if they have the same "content". E.g will the vector for the text "A dog on a green field" be similar in the vector space to an image showing a dog on a green field? If so it opens up a lot of opportunities in the information retrieval / recommender system space.

Edit: After reading the paper I see that they train an individual model for each modality, which means that they are not comparable. It would be interesting to see though if they could combine sound, text and images into the same training in the future.

cubacaban · on Dec 14, 2022

Mapping corresponding text and image into the same vector space is exactly how CLIP and other contrastive learning setups work: Have text and computer vision networks embed input data, and teach the networks to embed related inputs closely and unrelated ones far away. You can train the models on data scraped from the internet (e.g. images and corresponding captions) in a self-supervised way. Like you say it has huge applications for search and is also used by the generative models (to guide a generated image towards your textual input).

BTW it seems that the data2vec is at its core actually just one model, only the input and output parts are different depending on the modality (text, image sound etc.) I wouldn't expect the learned representations to be similar for similar content across modalities. The point of data2vec is to use just one model with a very general self-supervised learning setup for tasks, with the different tasks hopefully benefitting from each other during training.

tnzk · on Dec 14, 2022

It would be really nice if it was like that, and I guess they also hoped that they would get such a result "unexpectedly."

I wonder if there was a technique to regulate the training so their higher layer will be similar despite the input are fundamentally different.

ttul · on Dec 14, 2022

This is basically how stable diffusion works.

pineapple_sauce · on Dec 14, 2022

Not quite. Stable Diffusion is CLIP guided, but I wouldn't say it's "basically" how it works. It's one of the core

Also, just fyi, Google's alternative to Stable Diffusion/DALLE-2, Imagen is not CLIP guided. Thus being CLIP guided is not necessary for the problem.

bilsbie · on Dec 13, 2022

What would be some use cases for this?

How does it work?

nl · on Dec 13, 2022

> What would be some use cases for this?

It can do image recognition, speech recognition or text classification (eg, is this positive or negative sentiment) using the same model architecture (trained on different data each time)

It's very competitive in each of those fields with the existing state of the art. This is interesting because usually the models for each of those fields are different.

> How does it work?

It's trained by passing sequences of data (pixels sequentially, words in order, speech as a wav file) with part of each sequence masked out. The model has to learn to correctly guess what is in that masked area.

The innovation is that it instead of predicting the masked area directly, it tries to predict what the representation of that masked area is in the Neural Network itself. That's extremely unusual (I don't think I've seen that before) and I'll need to study it more to completely understand why that explains the better performance.

It's also very fast to train.

sdenton4 · on Dec 14, 2022

Hubert does something similar; it predicts masked spectrogram regions during its first phase, and then masked embeddings later in training. It's a bit of a pain to get the complicated training regimen to work well, though.

deskamess · on Dec 14, 2022

What is meant by "what the representation of that masked area" vs "masked area directly" ?

notsoprocoder · on Dec 13, 2022

I’ve only scanned the blog but have read OG Data2Vec paper.

Data2Vec presents an architecture that performs well across the main benchmarks for vision, speech and text. The architecture is a variation on the transformer network with slight tweaks for each learning modality. Data2Vec 2 seems to be more a more efficient variant.

In terms of applications, data2vec gives a single reliable architecture for each approach. Whereas before you may have used a CNN for vision and a transformer for text etc.

Additionally, this research is building towards multi-modal learning where an architecture could be trained on images, text and speech to learn about a topic. (But to my knowledge there isn’t anything ground breaking in this space yet).

m00x · on Dec 13, 2022

Not only does it perform better as a generalist, it also groups embeddings for the the multiple modalities of the same concept close together (keeps bird sounds close to bird image in its internal representation, etc). This way it can learn in different dimensions, just like humans do.

You could ask it to sing like a bird in text, and it could respond in image + sound, or just sound. This allows for better general understanding of things. Deepmind did the same with Gato: https://www.deepmind.com/publications/a-generalist-agent

notsoprocoder · on Dec 14, 2022

I can’t see where in the paper Data2Vec is reported to have contextual embeddings that work across modalities. Can you reference the section in the paper?

m00x · on Dec 15, 2022

Different models for now, but it's essentially doing this through the teacher model. If you go down to the numbers, it's very similar. They're saying they want to do this later.

Gato does this in the same embeddings IIRC, but I skimmed the paper so I could be wrong.

notsoprocoder · on Dec 15, 2022

Thanks! I should probably read the gato paper. If only there were enough hours in the day.

bilsbie · on Dec 13, 2022

Thanks! Nice write up

I think it would help if I knew of a concrete task I could use this for.

nl · on Dec 14, 2022

Image or speech recognition.

zone411 · on Dec 14, 2022

They're using a CNN for the decoder, not a transformer.

notsoprocoder · on Dec 14, 2022

They are using CNNs to encode the input. But…

Data2Vec uses a “standard” transformer architecture.[1] In Data2Vec2 the transformer architecture forms the “bulk of the model weights”.[2]

The tweaks I eluded to should have more clearly referenced that each modality uses a different encoder.

[1] section 3.1 data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. [2] section 3.2 Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language.

itake · on Dec 13, 2022

I wish they would provide an "Explain like I am 5" to everyone that doesn't know ML.

nl · on Dec 14, 2022

The issue here is it is sort of like taking a technical blog post involving say details of the Haskell types vs Typescript types for a specific problem and explaining it to someone who doesn't know programming.

There's a bunch of context you kind of need for it to make sense.

I'm not convinced that each technical blog post should provide that context every time.