AI Image Captioning Feature in Discourse AI Plugin

We’ve introduced an AI Image Captioning feature to the Discourse AI plugin, enabling automatic caption generation for images in posts. This functionality aims to improve content accessibility and enrich visual elements within your community.

Features and Use

  • Automatic AI Captions: Upon uploading an image in the editor, you can generate a caption automatically using AI.
  • Editable Captions: The generated caption can be edited to better suit your content’s context and tone.
  • Enhanced Accessibility: The feature supports creating more accessible content for users relying on screen readers.

How to Use

  1. Upload an image in the Discourse editor.
  2. Click the “Caption with AI” button near the image.
  3. A generated caption will appear, which you can modify.
  4. Accept the caption to include it in your post.


Your feedback is crucial for refining this feature. It’s enabled here on Meta, so please share your experiences, issues, or suggestions here on this topic.

AI Model

This feature supports both the open-source model LLaVa 1.6 and the OpenAI API.
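For those curious what a caption request looks like when the OpenAI API is the backend, here is a minimal sketch. The model name, prompt wording, and token limit are my own illustrative assumptions, not the plugin's actual values:

```python
# Hedged sketch of a caption request to OpenAI's vision-capable chat API.
# Model name, prompt text, and max_tokens are assumptions for illustration,
# not what the Discourse AI plugin actually sends.
def build_caption_request(image_url, model="gpt-4o"):
    """Build the JSON payload for a chat completion with an image input."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Describe this image in one short sentence."},
                    {"type": "image_url",
                     "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": 100,
    }

payload = build_caption_request("https://example.com/uploads/photo.png")
print(payload["messages"][0]["content"][1]["image_url"]["url"])
# → https://example.com/uploads/photo.png
```

In practice you would pass this payload to the chat completions endpoint and read the caption out of the first choice's message content.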


Funny, I used it earlier in this post. I was very impressed: it could read the image and tell what it was about.


Noted this on the OpenAI forum


I don’t know how we’ll get mobile users to remember to use that, because they have to jump away from the editor.

Is that caption used as alt-text too?



We plan on adding JIT reminders in the near future if the reception is good.


2 posts were split to a new topic: Support for prompt customization in DiscourseAI

It can see the plaid shirt, but it can’t detect George Costanza. :rofl:

Jokes aside, this is great especially for accessibility. In previous A11Y reports, missing alt text on images is one of the main items raised, and previously we’ve written all that off since images are user-uploaded content. This now draws a path forward to much, much better accessibility.


In the case of error messages, is there any way to encourage it to caption the main part of the error so the search engine picks up on it?

Some other results

It identifies the third correctly as the IBM EWM tool, but does not recognise 2 as Rhapsody or 1 as Vector Davinci. Nonetheless, these captions are pretty reasonable.


This is an awesome feature!

But it’s very hard to find. The user needs to hover over the image to see the button and then click it (and most people won’t know about that).
Even though I knew about the feature and was looking for it, I had to check the video to realize I needed to hover.
IMO it should be “in your face” to begin with. I’d even make it create the captions by default, without the user having to click anything :drevil:


We will eventually make those prompts customizable, so this will then be possible.

As a new feature, our idea is to introduce it in a very unobtrusive way to gather feedback, and then make it easier to find and even automatic.


6 posts were split to a new topic: Issues configuring AI image captions

Will that send the (internet) image link to the AI service, upload the image content, or run some “hashing” locally in Discourse? And is it server-side or JavaScript (i.e., exposing the client IP to an external service)?


It sends a link to the image to the service you selected for the captioning. It happens server-side, as there are credentials involved.

If you want the feature but don’t want to involve third parties, you can always run LLaVa on your own server.
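As a rough sketch of the self-hosted route, assuming a hypothetical container image and port (the actual artifact names are not stated in this topic):

```shell
# Hypothetical sketch: the image name, port, and health path below are
# placeholders, not actual Discourse artifact names.
docker run -d --gpus all -p 11434:11434 your-registry/llava-1.6-service

# Point the plugin's caption-service setting at the container, then
# sanity-check that the endpoint answers:
curl -s http://localhost:11434/health
```

You would then configure the plugin to use that local endpoint instead of a hosted API.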


Agreed, however the quality might suffer from hardware limitations. Maybe you could share some recommendations regarding model sizes, quantisation, or minimum VRAM from your experience. (Not sure if they have quantized models at all; their “zoo” seems to have only full models.)


We are running the full model, but the smallest version of it, with Mistral 7B. It’s taking 21GB of VRAM on our single-A100 servers, and it’s run via a container image.

Sadly the ecosystem for multi-modal models isn’t as mature as the text2text one, so we can’t yet leverage inference servers like vLLM or TGI and are left with these one-off microservices. This may change this year, since multimodal is on the vLLM roadmap, but until then we can at least test the waters with these services.


I have some small UX feedback for this. On small images, the “Caption with AI” button covers not only the image itself but also other text in the post, making it hard to review the post while editing.



I am seeing all generated captions (both here and on my site) start with “The image contains” or “An image of” or similar. This seems unnecessary and redundant. Could the prompt be updated to tell it that it doesn’t need to explain that the image is an image?


It is tricky to hone because different models have different tolerances, but one plan we have is to give community owners control over the prompts so they can experiment.


@mattdm You can achieve this simply by pre-seeding the generated answer with “An image of”. That way the LLM thinks it has already generated the introduction and will produce just the remainder.
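The pre-seeding trick above can be sketched in a few lines. This is a hypothetical helper around a completion-style generator (the real call depends on which model or inference server you run; `generate` here is a stand-in):

```python
# Sketch of "pre-seeding" a caption model's response. `generate` is a
# hypothetical stand-in for a completion-style endpoint that returns
# only the continuation of the prompt it was given.
def caption_with_preseed(generate, image, seed="An image of"):
    """Ask the model to continue from `seed` instead of starting fresh."""
    continuation = generate(image, seed)
    return f"{seed} {continuation.strip()}"

# Stub generator standing in for a real LLaVa call:
def fake_generate(image, prefix):
    return "a person wearing a plaid shirt."

print(caption_with_preseed(fake_generate, "photo.jpg"))
# → An image of a person wearing a plaid shirt.
```

Because the model believes it already wrote the opening words, the output never repeats boilerplate like “The image contains”, and you can keep or strip the seed text as you prefer.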