These conversations so routinely devolve into crowdsourced attempts to define notoriously tricky words like “language” and “intelligence”.
These absurdly big, semi-supervised transformers are predicting what the next pixel or word or Atari move is. They’re strikingly good at it. To accomplish this they build up a latent space where all the pictures of sunglasses and the word “shades” are cosine similar, and quite different to “dog” or a picture of a dog, and have an operator (in word2vec, addition, in DALL-E, something nonlinear) that can put sunglasses on a dog.
Is that latent space and all the embeddings into it a “language”? Who cares? It works and it’s fucking cool.
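The geometry described above can be sketched in a few lines of numpy. This is a toy: the 4-d vectors below are made-up stand-ins for a real learned latent space, and the additive "put sunglasses on a dog" operator is the word2vec-style version, not whatever nonlinear thing DALL-E actually does.

```python
import numpy as np

# Hypothetical toy embeddings -- stand-ins for a real learned latent space.
emb = {
    "sunglasses":        np.array([0.9, 0.1, 0.0, 0.2]),
    "shades":            np.array([0.8, 0.2, 0.1, 0.2]),
    "dog":               np.array([0.0, 0.9, 0.8, 0.1]),
    "dog_in_sunglasses": np.array([0.9, 1.0, 0.8, 0.3]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0 means orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "shades" and a picture of sunglasses land close together...
print(cosine(emb["shades"], emb["sunglasses"]))   # high, near 1.0

# ...while "dog" lands far from both.
print(cosine(emb["shades"], emb["dog"]))          # much lower

# word2vec-style additive operator: dog + sunglasses lands near the
# embedding of a dog wearing sunglasses (by construction, in this toy).
composed = emb["dog"] + emb["sunglasses"]
print(cosine(composed, emb["dog_in_sunglasses"]))
```

In a real model the vectors have hundreds or thousands of dimensions and the composition operator is learned, but the "cosine similar to its synonyms, far from everything else" structure is the same idea.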