This is what a truly revolutionary idea looks like. There are so many details in the paper. Also, we know that transformers can scale. Pretty sure this idea will be used by a lot of companies to train general 3D asset creation pipelines. This is just too great.
"We first learn a vocabulary of latent quantized embeddings, using graph convolutions, which inform these embeddings of the local mesh geometry and topology. These embeddings are sequenced and decoded into triangles by a decoder, ensuring that they can effectively reconstruct the mesh."
This idea is simply beautiful and so obvious in hindsight.
"To define the tokens to generate, we consider a practical approach to represent a mesh M for autoregressive generation: a sequence of triangles."
It's cool, but it's also par for the course in 3D reconstruction today. I wouldn't describe this paper as particularly innovative or exceptional.
What do I think is really compelling in this field (given that it's my profession)?
This has me star-struck lately -- 3D meshing from a single image, a very large 3D reconstruction model trained on millions of all kinds of 3D models... https://yiconghong.me/LRM/
Another thing to note here: this looks to be around seven total days of training on at most 4 A100s. Not all really cutting-edge work requires a data-center-sized cluster.
NNs are typically continuous/differentiable so you can do gradient-based learning on them. We often want to use some of the structure the NN has learned to represent data efficiently. E.g., we might take a pre-trained GPT-type model, and put a passage of text through it, and instead of getting the next-token prediction probability (which GPT was trained on), we just get a snapshot of some of the activations at some intermediate layer of the network. The idea is that these activations will encode semantically useful information about the input text. Then we might e.g. store a bunch of these activations and use them to do semantic search/lookup to find similar passages of text, or whatever.
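To make that concrete, here's a rough sketch of pulling intermediate activations out of a pretrained model and using them as passage embeddings (Hugging Face's GPT-2 is just a stand-in here; the layer index and mean-pooling are arbitrary choices, not anything canonical):

    import torch
    from transformers import AutoTokenizer, AutoModel

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

    def embed(text, layer=6):
        # run a forward pass and keep the activations at an intermediate layer
        inputs = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        # hidden_states[layer]: (1, seq_len, hidden_dim); mean-pool over tokens
        return out.hidden_states[layer].mean(dim=1).squeeze(0)

    # crude semantic lookup: cosine similarity against a few stored passages
    db = {p: embed(p) for p in ["the cat sat on the mat",
                                "stock prices fell sharply today"]}
    query = embed("a kitten resting on a rug")
    best = max(db, key=lambda p: torch.cosine_similarity(query, db[p], dim=0))
    print(best)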
Quantized embeddings are just that, but you introduce some discrete structure into the NN, such that the representations there are not continuous. A typical way to do this these days is to learn a codebook VQ-VAE style. Basically, we take some intermediate continuous representation learned in the normal way, and replace it in the forward pass with the nearest "quantized" code from our codebook. It biases the learning since we can't differentiate through it, and we just pretend like we didn't take the quantization step, but it seems to work well. There's a lot more that can be said about why one might want to do this, the value of discrete vs continuous representations, efficiency, modularity, etc...
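A minimal PyTorch sketch of that codebook lookup plus the straight-through trick (the names and the 0.25 commitment weight are just conventional VQ-VAE defaults, not anything from this paper):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VectorQuantizer(nn.Module):
        # snap each continuous vector to its nearest codebook entry
        def __init__(self, num_codes=512, dim=64, beta=0.25):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)
            self.beta = beta

        def forward(self, z):                                # z: (batch, dim) encoder output
            dists = torch.cdist(z, self.codebook.weight)     # (batch, num_codes)
            idx = dists.argmin(dim=1)
            z_q = self.codebook(idx)                         # quantized vectors
            # move the codes toward the encoder outputs, and keep the encoder committed
            loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
            # straight-through: forward uses z_q, backward pretends quantization didn't happen
            z_q = z + (z_q - z).detach()
            return z_q, idx, loss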
If you’re willing, I’d love your insight on the “why one might want to do this”.
Conceptually I understand embedding quantization, and I have some hint of why it works for things like WAV2VEC - human phonemes are (somewhat) finite, so forcing the representation to be finite makes sense - but I feel like there's a level of detail I'm missing about what's really going on and when quantization helps/harms that I haven't been able to glean from papers.
Quantization also works as regularization; it stops the neural network from being able to use arbitrarily complex internal rules.
But really it's only useful if you absolutely need a discrete embedding space for some sort of downstream usage. VQ-VAEs can be difficult to get to converge, and they have problems stemming from the gradient approximation, like codebook collapse.
Maybe it helps to point out that the first version of Dall-E (of 'baby daikon radish in a tutu walking a dog' fame) used the same trick, but they quantized the image patches.
I mean, I don't see a strong reason to turn away from attention either, but I also don't think anyone's thrown a billion-parameter MLP or conv model at a problem. We've put a lot of work into attention, transformers, and scaling them. Thousands of papers each year! We definitely don't see that for other architectures. The ResNet Strikes Back paper is great partly because it should remind us all not to get lost in the hype and that our advancements are coupled. We've learned a lot of training techniques since the original ResNet days, and applying those to ResNets also makes them a lot better and really closes the gap. At least in vision (where I do research). It's easy to get railroaded in research when we have publish-or-perish incentives and hype-driven reviewing.
No... a graph convolution is just a convolution (over a graph, like all convolutions).
The difference from a "normal" convolution is that you can consider arbitrary connectivity of the graph (rather than the usual connectivity induced by a regular Euclidean grid), but the underlying idea is the same: to calculate the result of the operation at any single place (i.e., node), you perform a linear operation over that place (i.e., node) and its neighbourhood (i.e., connected nodes), the same way that, e.g., in a convolutional neural network, you compute a pixel's output from its value and that of its neighbours when performing a convolution.
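In code, one common flavour of that (a mean-over-neighbours message pass with a dense adjacency matrix; real libraries like PyTorch Geometric use sparse ops and other normalisations) looks something like:

    import torch
    import torch.nn as nn

    class GraphConv(nn.Module):
        # one graph-convolution step: mix each node's feature with its neighbours'
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.lin_self = nn.Linear(in_dim, out_dim)
            self.lin_neigh = nn.Linear(in_dim, out_dim)

        def forward(self, x, adj):
            # x: (num_nodes, in_dim) node features, adj: (num_nodes, num_nodes) 0/1 adjacency
            deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
            neigh_mean = (adj @ x) / deg          # average over connected nodes
            return torch.relu(self.lin_self(x) + self.lin_neigh(neigh_mean))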
"We first learn a vocabulary of latent quantized embeddings, using graph convolutions, which inform these embeddings of the local mesh geometry and topology. These embeddings are sequenced and decoded into triangles by a decoder, ensuring that they can effectively reconstruct the mesh."
This idea is simply beautiful and so obvious in hindsight.
"To define the tokens to generate, we consider a practical approach to represent a mesh M for autoregressive generation: a sequence of triangles."
More from paper. Just so cool!