This prompt is best served by Midjourney, Flux, or Stable Diffusion. It'll be far cheaper, and chances are it'll also look a lot better.
The place where gpt-image-1 shines is if you want to do a prompt like:
"a cute dog hugs a cute cat, they're both standing on top of an algebra equation (\(y = 2x^{2} - 3x - 2\)). Use the first reference image I uploaded as a source for the style of the dog. Same breed, same markings. The cat can contrast in fur color. Use the second reference image I uploaded as a guide for the background, but change the lighting to sunset. Also, solve the equation for x."
gpt-image-1 doesn't make the best images, and it isn't cheap, and it isn't fast, but it's incredibly -- almost insanely -- powerful. It feels like ComfyUI got packed up into an LLM and provided as a natural language service.
I wonder if we can use gpt-image-1 outputs, with some noise, as inputs to diffusion models, so GPT takes care of adherence and the diffusion model improves the quality. Does anyone know whether that's at all possible?
Sure. With API support landing 3 hours ago, I suppose someone made a Comfy node all of 2 hours ago. From there you can either just do a low denoise or use one of the many IP-Adapter type things out there.
Yes, that's what a lot of people have been doing with newer models that have better prompt adherence: passing their outputs through older models with better aesthetics.
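For anyone wondering what "low denoise" means in practice, here is a rough sketch, not a tested workflow. It assumes the Hugging Face diffusers library and its `StableDiffusionImg2ImgPipeline`. The function names, the `model_id` argument, and the default `strength=0.3` are placeholders of mine, not anything from the thread. The first helper only illustrates the usual img2img convention: the input image is noised partway along the schedule, and only the remaining fraction of steps is run, so a low strength keeps the composition and mostly reworks texture.

```python
from typing import Tuple


def img2img_schedule(num_inference_steps: int, strength: float) -> Tuple[int, int]:
    """Return (steps_skipped, steps_run) for an img2img pass.

    Follows the common img2img convention: roughly `strength` of the
    schedule is actually run, and the rest is skipped because the input
    image stands in for the early, high-noise part of the process.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_run = min(int(num_inference_steps * strength), num_inference_steps)
    steps_skipped = num_inference_steps - steps_run
    return steps_skipped, steps_run


def refine_with_diffusion(init_image_path, prompt, model_id, strength=0.3):
    """Run a saved image (e.g. a gpt-image-1 output) through an img2img pass.

    `model_id` is whatever Stable Diffusion checkpoint you prefer for
    aesthetics; no particular checkpoint is assumed here.
    """
    # Heavy imports stay local so the helper above works without them.
    import torch
    from diffusers import StableDiffusionImg2ImgPipeline
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        model_id, torch_dtype=dtype
    ).to(device)

    init_image = Image.open(init_image_path).convert("RGB")
    # Low strength: keep the layout from the input, let the model redo detail.
    result = pipe(
        prompt=prompt,
        image=init_image,
        strength=strength,
        num_inference_steps=40,
    )
    return result.images[0]
```

With 40 steps and a strength of 0.25, `img2img_schedule(40, 0.25)` gives `(30, 10)`: 30 steps skipped, 10 run. Pushing the strength higher hands more of the image over to the diffusion model, which would presumably cost some of the adherence you were using gpt-image-1 for in the first place.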
Prompt: “a cute dog hugs a cute cat”
https://x.com/terrylurie/status/1915161141489136095
I also showed a couple of DALL·E 3 images for comparison in a comment.