This uses CLIP to optimize a GAN's input so that the generated output matches a text description. The optimization is very slow because it's basically the same process as training: each image takes many gradient steps. DALL-E instead uses a feedforward network to predict an image directly from text, but that model hasn't been published yet.
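To make the optimization loop concrete, here is a minimal sketch of the idea with toy stand-ins. Everything in it is an assumption for illustration: the "generator" and "image encoder" are small frozen random linear maps rather than a real GAN or CLIP, the text embedding is a random vector, and the names `optimize_latent` and `similarity` are hypothetical. What it does share with the real setup is the shape of the loop: the models stay fixed and only the latent input is updated by gradient ascent on cosine similarity.

```python
import math
import random


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def similarity(generator, encoder, text_emb, latent):
    """Cosine similarity between the encoded toy image and the text embedding."""
    return cosine(matvec(encoder, matvec(generator, latent)), text_emb)


def optimize_latent(generator, encoder, text_emb, latent, steps=500, lr=0.05):
    """Gradient ascent on the similarity score.

    Only the latent changes; the generator and encoder weights are frozen.
    """
    gen_t, enc_t = transpose(generator), transpose(encoder)
    t_norm = norm(text_emb)
    for _ in range(steps):
        # Forward pass: latent -> image -> embedding.
        emb = matvec(encoder, matvec(generator, latent))
        e_norm = norm(emb)
        sim = cosine(emb, text_emb)
        # Gradient of cosine similarity with respect to the embedding.
        grad_emb = [
            t / (e_norm * t_norm) - sim * e / (e_norm ** 2)
            for e, t in zip(emb, text_emb)
        ]
        # Backward pass through the two frozen linear maps.
        grad_latent = matvec(gen_t, matvec(enc_t, grad_emb))
        latent = [z + lr * g for z, g in zip(latent, grad_latent)]
    return latent


# Toy dimensions and random frozen "models" (assumptions, not real CLIP/GAN).
rng = random.Random(0)
LATENT_DIM, IMAGE_DIM, EMBED_DIM = 4, 8, 3
generator = [[rng.gauss(0, 1) for _ in range(LATENT_DIM)] for _ in range(IMAGE_DIM)]
encoder = [[rng.gauss(0, 1) for _ in range(IMAGE_DIM)] for _ in range(EMBED_DIM)]
text_emb = [rng.gauss(0, 1) for _ in range(EMBED_DIM)]
start = [rng.gauss(0, 1) for _ in range(LATENT_DIM)]

result = optimize_latent(generator, encoder, text_emb, start)
print("similarity before:", similarity(generator, encoder, text_emb, start))
print("similarity after: ", similarity(generator, encoder, text_emb, result))
```

Even in this toy version, producing one output takes hundreds of forward and backward passes. With a real GAN and CLIP each pass is expensive, which is why the approach is slow compared with a single feedforward prediction.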