I looked at this, thought about it, waited an hour, looked at it again, and I still can't help but think this is useless.
We can already weigh parts of prompts, we can already specify colors or styles for parts of the images. And even if we could not, none of this needs rich text.
I even think their comparisons are dishonest to begin with. They compare "plaintext" prompts with "rich text" prompts, but the rich-text prompts contain more information. What? Like, seriously, who is surprised that the following two prompts give different images?
(1) "A girl with long hair sitting in a cafe, by a table with coffee on it, best quality, ultra detailed, dynamic pose."
(2) "A girl with long [Richtext:orange] hair sitting in a cafe, by a table with coffee on it, best quality, ultra detailed, dynamic pose. [Footnote:The ceramic coffee cup with intricate design, a dance of earthy browns and delicate gold accents. The dark, velvety latte is in it.]"
The worst part is "Font style indicates the styles of local regions". In the comparison-with-other-methods section they actually have to specify in parentheses what each font means style-wise, because nobody knows and (let's be frank) nobody wants to learn.
So why not just use these plaintext parentheses in the prompt?
I really stopped myself from immediately posting my (rather negative) opinion, but after over an hour, it hasn't changed. As far as I can see, this isn't useful; rich text prompts are a gimmick.
Thanks a lot for the comment! (one of the authors here)
RE: plaintext
- The "plain-text" result is just a baseline. We call the "plaintext parentheses in the prompt" full-text (i.e., expanding the rich text info into a long sentence). We show many "full-text" in the paper https://arxiv.org/pdf/2304.06720.pdf.
You can see in Fig 11 that full-text results cannot change the color, style, and do not respect the description. More examples in Figure 13, 14, and 15.
The main issue of using full-text is that it cannot preserve the original plain text image, thereby requiring many rounds of prompt tuning/engineering. We also compared with two other image editing methods, Prompt-to-prompt and InstructPix2Pix. But they could not handle localized editing well. You can see some example comparisons for Color (Figure 4), Style (Figure 5), Footnote (Figure 8), and Font Size (Figure 9). https://arxiv.org/pdf/2304.06720.pdf
RE: Style
- Yes, you can specify what styles you want by just describing it.
I get now that with the side-by-side "plain text"/"rich text" comparisons you're trying to highlight how similar they are, differing only in the regions annotated in the rich-text version. But my first impression was that you were comparing against a weak baseline, which doesn't look good.
The rich-text presentation is merely cute, but the underlying feature is very nice. Being able to focus details on a specific aspect of an image without worrying about it leaking into other aspects would be greatly appreciated.
How about a plain-text interface like this?
> A girl with [long hair](orange) sitting in a cafe, by a table with [coffee](^1) on it, best quality, ultra detailed, dynamic pose. [^1](Ceramic coffee cup with intricate design, a dance of earthy browns and delicate gold accents. The dark, velvety latte is in it.)
It feels like that is where the real value is. Imagine describing all the assets of a game, story, or something larger than just a single image mainly as "what" descriptions, referring to broad styles of things, and then a second body of text spelling out those styles in detail.
It could be a text description of a fighter or noble wearing coats or armour, and then substituting in different style descriptions of coats and armour depending on the family, class, race, or other attributes suitable for the world you're trying to generate.
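Purely as an illustration of that interface idea: here is a minimal sketch (the [span](attribute) / [^1](footnote) syntax and the parser are hypothetical, not anything from the paper) of how markup like the above could be reduced to a plain prompt plus per-span attributes, which is roughly the information a rich-text prompt carries:

```python
# Minimal sketch of a parser for the proposed inline markup.
# The [span](attribute) and [^1](footnote body) syntax is hypothetical,
# not the paper's actual interface.
import re

SPAN = re.compile(r"\[([^\]^][^\]]*)\]\(([^)]+)\)")  # [span](attribute)
FOOTNOTE = re.compile(r"\[\^(\w+)\]\(([^)]+)\)")      # [^1](footnote body)

def parse_prompt(text):
    # Collect footnote bodies, then drop their definitions from the text.
    footnotes = dict(FOOTNOTE.findall(text))
    text = FOOTNOTE.sub("", text)

    spans = []
    def strip_span(match):
        target, attr = match.group(1), match.group(2)
        # An attribute like "^1" refers to a footnote body.
        if attr.startswith("^"):
            attr = footnotes.get(attr[1:], attr)
        spans.append((target, attr))
        return target  # keep only the plain words in the prompt

    plain = SPAN.sub(strip_span, text).strip()
    return plain, spans

plain, spans = parse_prompt(
    "A girl with [long hair](orange) sitting in a cafe, by a table with "
    "[coffee](^1) on it. [^1](Ceramic coffee cup with intricate design, "
    "earthy browns and delicate gold accents.)"
)
print(plain)  # the plain prompt with the markup stripped
print(spans)  # [('long hair', 'orange'), ('coffee', 'Ceramic coffee cup ...')]
```

The plain prompt would keep the overall image stable, while the span/attribute pairs carry the localized color and footnote details.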
Yes, you can expand the rich text information into a long sentence. We call this full-text in the paper. The issue with using "full-text" is that it's hard to edit the image interactively. Every time you change the text, you get an entirely different image.
With the same seed, and an extremely similar prompt, why would you get an entirely different image?
If I take seed 9999999 (just an example) and my prompts are
(1) "very large gothic church at dusk, spooky, horror, red roses" and
(2) "very large gothic church at dusk, spooky, horror, white roses"
then with all models I tested over the last year or so, you get _very_ similar images, with differently colored roses and (at most) very minor changes elsewhere. This only seems to work if you keep in mind that the prompt is parsed left to right, so changes nearer the beginning of the prompt have larger effects. Again, of course, you need the same seed.
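To make that concrete, here is a minimal sketch of what I mean, assuming Stable Diffusion via the diffusers library; the checkpoint name and the seed are just placeholders:

```python
# Minimal sketch: same seed, two nearly identical prompts.
# Assumes the diffusers library and a Stable Diffusion checkpoint;
# the model id and seed are placeholders, not taken from the paper.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompts = [
    "very large gothic church at dusk, spooky, horror, red roses",
    "very large gothic church at dusk, spooky, horror, white roses",
]

for i, prompt in enumerate(prompts):
    # Re-seeding before each call keeps the initial latent noise identical,
    # so only the one-word prompt difference can change the image.
    generator = torch.Generator("cuda").manual_seed(9999999)
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"roses_{i}.png")
```

In my experience the two outputs differ in little more than the rose color.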
But, with this said, why would that be any different with plain/full/rich text? Apologies if I am somehow blinkered and asking something really obvious.
Yup, it could be similar, but it mostly only works for very simple prompts (e.g., one subject in the image).
For example, in Figure 11 of the paper (https://arxiv.org/pdf/2304.06720.pdf), you can see that full-text "rustic cabin -> rustic orange cabin" does not turn the cabin orange.
For coloring, the core benefit of our method is that it allows precise color control. For example, it can generate colors with rare names (e.g., Plum Purple or Dodger Blue) or even particular RGB triplets that we cannot describe well with text.
I had the same thought. The gothic church one, for example. Why wouldn't I just write "A pink gothic church in the sunset" instead of writing "A gothic church" and then having to do the extra steps to turn the word "church" pink?
Of course, I'm very ignorant of the uses of such tech, so there's probably some usefulness in this.
Because at least with current models, the pink-ness would spread to the rest of the image. You'd end up with not only a pink church but a pink sunset.
It's even worse with styles; Midjourney can't do a guitar in one style and the rest of the image in another style. You really only get one style per image.
The value I see is in constructing more complex prompts. Agree with your example but could see myself using this feature for prompts with multiple objects/aspects that require specific details. Probably not much different from inlining all details, just a nice separation of concerns: you can describe the high level requirement first, and then add and tweak individual details.
Exactly, that's the feature that interested me the most. Ideally, the UI for footnotes would be even richer: e.g., selecting a word would open a small popup to provide more context.
I think the confusion is you're reading this like it is meant to be a presentation of next-gen text-to-image models. It's more like a fancy UI iteration. And I think it can find use cases in different tools.
Very reasonable critique, and valuable here despite being negative, because it was well considered. It changed my own perspective. Thank you for sharing; I hope the authors respond.