Image management in AI contexts

We had some internal questions about image management in AI contexts, so I wanted to cover some of the considerations in a public post.

The problem

LLMs today support multiple modalities. All major vendors now support images as input, and some (most notably Google) support images as output.

This leaves Discourse AI with a bit of a problem: how do we present “images” to the LLMs?

Specifically, if we have this example post:

Hello here is a picture of me: 

![image|531x401](upload://xd5Pv36uPIVKBqya8N5BzZGsJrN.png)

And here is another one

![Sam standing next to a window|531x401](upload://xd5Pv36uPIVKBqya8N5BzZGsJrN.png)

The end 

How do we present this to the LLM:

Option 1: Markdown Retained, Images Appended

Approach: Keep all text together, append images at the end.

[
  "Hello here is a picture of me: 

![image|531x401](upload://xd5Pv36uPIVKBqya8N5BzZGsJrN.png)

And here is another one

![Sam standing next to a window|531x401](upload://xd5Pv36uPIVKBqya8N5BzZGsJrN.png)

The end",
  image1,
  image2
]

Option 2: Markdown Retained, Images Embedded Inline

Approach: Interleave text and images to preserve context and order.

[
  "Hello here is a picture of me: 

![image|531x401](upload://xd5Pv36uPIVKBqya8N5BzZGsJrN.png)",
  image1,
  "And here is another one

![Sam standing next to a window|531x401](upload://xd5Pv36uPIVKBqya8N5BzZGsJrN.png)",
  image2,
  "The end"
]
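A minimal sketch of this interleaving in Python. The regex and the dict placeholder standing in for the actual image payload are illustrative assumptions, not the Discourse AI implementation:

```python
import re

# Illustrative pattern for Discourse upload markdown: ![alt|WxH](upload://hash.png)
IMAGE_MD = re.compile(r"!\[(?P<alt>[^\]|]*)(?:\|[^\]]*)?\]\((?P<url>upload://[^)]+)\)")

def interleave(raw):
    """Split raw post markdown into alternating text and image parts,
    keeping the image markdown inside each text chunk (Option 2)."""
    parts, last = [], 0
    for m in IMAGE_MD.finditer(raw):
        # Each text chunk runs up to and including the upload marker itself
        parts.append(raw[last:m.end()].strip())
        parts.append({"image": m.group("url")})  # placeholder for the real upload
        last = m.end()
    tail = raw[last:].strip()
    if tail:
        parts.append(tail)
    return parts
```

For the example post above this yields five parts: text (ending in the first marker), image, text (ending in the second marker), image, then “The end”.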

Option 3: Markdown Stripped, Images Appended

Approach: Remove image markdown syntax entirely, append actual images at the end.

[
  "Hello here is a picture of me: 

And here is another one

The end",
  image1,
  image2
]
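Option 3 is straightforward to sketch: strip every upload marker from the text and carry the images separately. Again, the regex and the list shape are illustrative assumptions:

```python
import re

# Illustrative pattern for Discourse upload markdown, capturing the upload URL
IMAGE_MD = re.compile(r"!\[[^\]]*\]\((upload://[^)]+)\)")

def strip_and_append(raw):
    """Remove image markdown from the text, then append the collected
    upload references (stand-ins for the real images) at the end."""
    uploads = IMAGE_MD.findall(raw)
    text = IMAGE_MD.sub("", raw)
    # Collapse the blank runs the removed markdown leaves behind
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    return [text] + [{"image": u} for u in uploads]
```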

Option 4: Descriptions Preserved, Markdown Simplified

Approach: Strip Discourse-specific formatting but retain image descriptions for context.

[
  "Hello here is a picture of me: 

And here is another one
Sam standing next to a window

The end",
  image1,
  image2
]

Option 5: Descriptions Inline, Images Embedded

Approach: Replace markdown with descriptions inline, then embed corresponding images.

[
  "Hello here is a picture of me:",
  image1,
  "And here is another one
Sam standing next to a window",
  image2,
  "The end"
]
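Option 5 can be sketched by replacing each marker with its alt text and embedding the image where the marker stood. Here I drop Discourse’s default “image” alt since it carries no information; the pattern and names are illustrative:

```python
import re

# Illustrative pattern capturing alt text and upload URL separately
IMAGE_MD = re.compile(r"!\[(?P<alt>[^\]|]*)(?:\|[^\]]*)?\]\((?P<url>upload://[^)]+)\)")

def inline_descriptions(raw):
    """Interleave text and images, substituting each upload marker
    with its human-written description (Option 5)."""
    parts, last = [], 0
    for m in IMAGE_MD.finditer(raw):
        alt = m.group("alt")
        # Keep descriptive alt text; drop the generic "image" default
        chunk = (raw[last:m.start()] + (alt if alt != "image" else "")).strip()
        if chunk:
            parts.append(chunk)
        parts.append({"image": m.group("url")})  # placeholder for the real upload
        last = m.end()
    tail = raw[last:].strip()
    if tail:
        parts.append(tail)
    return parts
```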

At the moment our implementation is (1). Part of the reason is legacy: old models did not allow us to position images at all. The other is that people often use Discourse AI to reformat a post; if we strip out the upload markers, the LLM will think we said something else and will not be able to reformat a post with images.

Additionally, LLM vendors like Anthropic recommend always placing images at the end; it keeps things simplest for the LLM to interpret.

This approach, though, is very problematic for an LLM like Nano Banana: Image editing in Google Gemini gets a major upgrade.

When I attempted this, the LLM started hallucinating upload markers instead of rendering images.

It makes sense in retrospect.

If we tell an LLM that it just said upload://xd5Pv36uPIVKBqya8N5BzZGsJrN.png, don’t be surprised if it says something weird like that again.

I am mixed on shifting us to (2), and it looks like (3) is the only sane way to echo what the LLM just said without triggering hallucination… so our solution to this gnarly problem is somewhat mixed.

While doing this work I explored whether I could create a uniform solution where output and input are treated the same, but I do not think this is practical. (I also tried preserving upload descriptions when they are long enough, and so on.)

For now, though:

(1) for inputs into the LLM
(3) for outputs from the LLM

Long term:

(2) for inputs is worth exploring

and stripping markdown while retaining contextual position for outputs is also worth exploring.


It is a shame that no LLM vendor currently allows you to supply additional metadata alongside an image.
