I looked at this, thought about it, waited an hour, looked at it again, and I still can't help but think this is useless.
We can already weigh parts of prompts, we can already specify colors or styles for parts of the images. And even if we could not, none of this needs rich text.
I even think their comparisons are dishonest to begin with. They compare "plaintext" prompts with "rich text" prompts, but the rich-text prompts contain more information. What? Like, seriously, who is surprised that the following two prompts give different images?
(1) "A girl with long hair sitting in a cafe, by a table with coffee on it, best quality, ultra detailed, dynamic pose."
(2) "A girl with long [Richtext:orange] hair sitting in a cafe, by a table with coffee on it, best quality, ultra detailed, dynamic pose. [Footnote:The ceramic coffee cup with intricate design, a dance of earthy browns and delicate gold accents. The dark, velvety latte is in it.]"
The worst part is "Font style indicates the styles of local regions". In the comparison-with-other-methods section they actually have to specify in parentheses what each font means style-wise, because nobody knows and (let's be frank) nobody wants to learn.
So why not just use these plaintext parentheses in the prompt?
I really stopped myself from immediately posting my (rather negative) opinion, but after over an hour, it hasn't changed. As far as I can see, this isn't useful; rich text prompts are a gimmick.
Thanks a lot for the comment! (one of the authors here)
RE: plaintext
- The "plain-text" result is just a baseline. We call the "plaintext parentheses in the prompt" full-text (i.e., expanding the rich text info into a long sentence). We show many "full-text" in the paper https://arxiv.org/pdf/2304.06720.pdf.
You can see in Fig 11 that full-text results cannot change the color, style, and do not respect the description. More examples in Figure 13, 14, and 15.
The main issue of using full-text is that it cannot preserve the original plain text image, thereby requiring many rounds of prompt tuning/engineering. We also compared with two other image editing methods, Prompt-to-prompt and InstructPix2Pix. But they could not handle localized editing well. You can see some example comparisons for Color (Figure 4), Style (Figure 5), Footnote (Figure 8), and Font Size (Figure 9). https://arxiv.org/pdf/2304.06720.pdf
RE: Style
- Yes, you can specify what styles you want by just describing it.
I get now that with the side-by-side "plain text"/"rich text" comparisons you're trying to highlight how similar they are, differing only in the regions annotated in the rich-text version. But my first impression was that you were comparing against a weak baseline, which doesn't look good.
The rich-text presentation is merely cute, but the underlying feature is very nice. Being able to focus details on a specific aspect of an image without worrying about it leaking into other aspects would be greatly appreciated.
How about a plain-text interface like this?
> A girl with [long hair](orange) sitting in a cafe, by a table with [coffee](^1) on it, best quality, ultra detailed, dynamic pose. [^1](Ceramic coffee cup with intricate design, a dance of earthy browns and delicate gold accents. The dark, velvety latte is in it.)
It feels like that is where the real value is. Imagine describing all the assets of a game, story, or something larger than just a single image mainly as "what" descriptions, referring to broad styles of things, and then a second body of text spelling out those styles in detail.
It could be a text description of a fighter or noble wearing coats or armour, and then substituting in different style descriptions of coats and armour depending on the family, class, race, or other attributes suitable for the world you're trying to generate.
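Purely as an illustration of that interface idea: here is a minimal sketch (the [span](attribute) / [^1](footnote) syntax and the parser are hypothetical, not anything from the paper) of how markup like the above could be reduced to a plain prompt plus per-span attributes, which is roughly the information a rich-text prompt carries:

```python
# Minimal sketch of a parser for the proposed inline markup.
# The [span](attribute) and [^1](footnote body) syntax is hypothetical,
# not the paper's actual interface.
import re

SPAN = re.compile(r"\[([^\]^][^\]]*)\]\(([^)]+)\)")  # [span](attribute)
FOOTNOTE = re.compile(r"\[\^(\w+)\]\(([^)]+)\)")      # [^1](footnote body)

def parse_prompt(text):
    # Collect footnote bodies, then drop their definitions from the text.
    footnotes = dict(FOOTNOTE.findall(text))
    text = FOOTNOTE.sub("", text)

    spans = []
    def strip_span(match):
        target, attr = match.group(1), match.group(2)
        # An attribute like "^1" refers to a footnote body.
        if attr.startswith("^"):
            attr = footnotes.get(attr[1:], attr)
        spans.append((target, attr))
        return target  # keep only the plain words in the prompt

    plain = SPAN.sub(strip_span, text).strip()
    return plain, spans

plain, spans = parse_prompt(
    "A girl with [long hair](orange) sitting in a cafe, by a table with "
    "[coffee](^1) on it. [^1](Ceramic coffee cup with intricate design, "
    "earthy browns and delicate gold accents.)"
)
print(plain)  # the plain prompt with the markup stripped
print(spans)  # [('long hair', 'orange'), ('coffee', 'Ceramic coffee cup ...')]
```

The plain prompt would keep the overall image stable, while the span/attribute pairs carry the localized color and footnote details.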
Yes, you can expand the rich text information into a long sentence. We call this full-text in the paper. The issue with using "full-text" is that it's hard to edit the image interactively. Every time you change the text, you get an entirely different image.
With the same seed, and an extremely similar prompt, why would you get an entirely different image?
If I take seed 9999999 (just an example) and my prompts are
(1) "very large gothic church at dusk, spooky, horror, red roses" and
(2) "very large gothic church at dusk, spooky, horror, white roses"
then with all models I tested over the last year or so, you get _very_ similar images, with differently colored roses and (at most) very minor changes elsewhere. This only seems to work if you keep in mind that the prompt is parsed left to right, so changes nearer the beginning of the prompt have larger effects. Again, of course, you need the same seed.
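To make that concrete, here is a minimal sketch of what I mean, assuming Stable Diffusion via the diffusers library; the checkpoint name and the seed are just placeholders:

```python
# Minimal sketch: same seed, two nearly identical prompts.
# Assumes the diffusers library and a Stable Diffusion checkpoint;
# the model id and seed are placeholders, not taken from the paper.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompts = [
    "very large gothic church at dusk, spooky, horror, red roses",
    "very large gothic church at dusk, spooky, horror, white roses",
]

for i, prompt in enumerate(prompts):
    # Re-seeding before each call keeps the initial latent noise identical,
    # so only the one-word prompt difference can change the image.
    generator = torch.Generator("cuda").manual_seed(9999999)
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"roses_{i}.png")
```

In my experience the two outputs differ in little more than the rose color.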
But, with this said, why would that be any different with plain/full/rich text? Apologies if I am somehow blinkered and asking something really obvious.
Yup, it could be similar, but it mostly only works for very simple prompts (e.g., one subject in the image).
For example, in Figure 11 of the paper (https://arxiv.org/pdf/2304.06720.pdf), you can see that full-text "rustic cabin -> rustic orange cabin" does not turn the cabin orange.
For coloring, the core benefit of our method is that it allows precise color control. For example, it can generate colors with rare names (e.g., Plum Purple or Dodger Blue) or even particular RGB triplets that we cannot describe well with text.
I had the same thought. The gothic church one, for example. Why wouldn't I just write "A pink gothic church in the sunset" instead of writing "A gothic church" and then having to do the extra steps to turn the word "church" pink?
Of course, I'm very ignorant of the uses of such tech, so there's probably some usefulness in this.
Because at least with current models, the pink-ness would spread to the rest of the image. You'd end up with not only a pink church but a pink sunset.
It's even worse with styles; Midjourney can't do a guitar in one style and the rest of the image in another style. You really only get one style per image.
The value I see is in constructing more complex prompts. Agree with your example but could see myself using this feature for prompts with multiple objects/aspects that require specific details. Probably not much different from inlining all details, just a nice separation of concerns: you can describe the high level requirement first, and then add and tweak individual details.
Exactly, that's the feature that interested me the most. Ideally, the UI for footnotes would be even richer: e.g., selecting a word would open a small popup to provide more context.
I think the confusion is you're reading this like it is meant to be a presentation of next-gen text-to-image models. It's more like a fancy UI iteration. And I think it can find use cases in different tools.
Very reasonable critique, and valuable here despite being negative, because it was well considered. It changed my own perspective. Thank you for sharing; I hope the authors respond.