Proofread text inserts the text twice

Moin · November 29, 2024, 9:42am

I have no idea why, and it doesn’t happen in all posts, but it’s reproducible in a specific post where, whenever I use proofread, the content is duplicated.

Lilly · November 30, 2024, 4:41pm

Hmmm, I can reproduce it on the post you linked, but I haven’t been able to find it elsewhere yet. Very odd indeed.

I even tried with this one

but the other one I could repro with different reply text and even with the date field out of the quote. I did notice that if there was no typo in the reply text, it tried to correct the quote.

Arkshine · November 30, 2024, 5:20pm

This is so strange.

When you are not selecting any text, there is something to fix in the quote, and you are a non-staff user, it duplicates.

When the quote content looks ok, it doesn’t duplicate:

sam · December 3, 2024, 3:44am

This is a bug being triggered by Qwen @Falco

{
  "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
  "temperature": 0,
  "stop": [
    "\n</output>"
  ],
  "messages": [
    {
      "role": "system",
      "content": "You are a markdown proofreader. You correct egregious typos and phrasing issues but keep the user's original voice.\nYou do not touch code blocks. I will provide you with text to proofread. If nothing needs fixing, then you will echo the text back.\nYou will find the text between <input></input> XML tags.\nYou will ALWAYS return the corrected text between <output></output> XML tags.\n\n"
    },
    {
      "role": "user",
      "content": "<input>[quote=\"Arkshine, post:1, topic:339163\"]\n:information_source: This component requires Discourse to be current as of [date=2024-11-27 timezone=\"Europe/Paris\"]. \n[/quote]\nDid you update Discourse? You only receive a notification to update when a new beta is released, but new commits are added every day.</input>"
    }
  ]
}

{
  "id": "chatcmpl-752c6aacdc7f496b951592e88d485eb3",
  "object": "chat.completion",
  "created": 1733196730,
  "model": "Qwen/Qwen2.5-32B-Instruct-AWQ",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<input>[quote=\"Arkshine, post:1, topic:339163\"]\n:information_source: This component requires Discourse to be current as of [date=2024-11-27 timezone=\"Europe/Paris\"]. \n[/quote]\nDid you update Discourse? You only receive a notification to update when a new beta is released, but new commits are added daily.</input>\n\n<output>[quote=\"Arkshine, post:1, topic:339163\"]\n:information_source: This component requires Discourse to be current as of [date=2024-11-27 timezone=\"Europe/Paris\"]. \n[/quote]\nDid you update Discourse? You only receive a notification to update when a new beta is released, but new commits are added daily.</output>",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 184,
    "total_tokens": 358,
    "completion_tokens": 174,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

Notice how it returns BOTH <input> and <output> tags, so we have a bug here.

github.com/discourse/discourse-ai

lib/ai_helper/assistant.rb

e3f5e86dc


      
          SANITIZE_REGEX_STR =
            %w[term context topic replyTo input output result]
              .map { |tag| "<#{tag}>\\n?|\\n?</#{tag}>" }
              .join("|")
          
          SANITIZE_REGEX = Regexp.new(SANITIZE_REGEX_STR, Regexp::IGNORECASE | Regexp::MULTILINE)
          
          def sanitize_result(result)
            result.gsub(SANITIZE_REGEX, "")
          end

The sanitize regex only strips the tags themselves, so the text inside both <input> and <output> is kept.
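A minimal sketch of why this ends up duplicating the text, using the same regex as the excerpt above. The `reply` string is a shortened, made-up stand-in for the logged model response, not the real payload:

```ruby
# Same regex construction as in the lib/ai_helper/assistant.rb excerpt above.
SANITIZE_REGEX_STR =
  %w[term context topic replyTo input output result]
    .map { |tag| "<#{tag}>\\n?|\\n?</#{tag}>" }
    .join("|")

SANITIZE_REGEX = Regexp.new(SANITIZE_REGEX_STR, Regexp::IGNORECASE | Regexp::MULTILINE)

def sanitize_result(result)
  result.gsub(SANITIZE_REGEX, "")
end

# Hypothetical, shortened reply that echoes <input> before <output>,
# mirroring the shape of the response logged above.
reply = "<input>new commits are added daily.</input>\n\n" \
        "<output>new commits are added daily.</output>"

sanitized = sanitize_result(reply)

# Only the tags are gone; the sentence is still present twice.
puts sanitized
puts sanitized.scan("new commits are added daily.").length
```

Since only the tag markers are removed, whatever the model echoed inside <input> survives next to the actual <output> content.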

I guess we should be more deliberate with our API: when proofreading, either ask only for the output or do some better prompt engineering.

Also, interestingly, we stopped sending examples even though we have them, @Roman

sam · December 3, 2024, 4:34am

This will fix the core of the regression:

It comes with a side effect though, @Jagster: we stopped sending English examples a while back, and now we will be sending them again. Let us know if this impacts you.

That said @Roman this does not make sense to me:

SANITIZE_REGEX_STR =
            %w[term context topic replyTo input output result]
              .map { |tag| "<#{tag}>\\n?|\\n?</#{tag}>" }
              .join("|")

Should it not be:

(item is for title suggestions, but maybe it’s taking a different path)

SANITIZE_REGEX_STR =
            %w[output item]
              .map { |tag| "<#{tag}>\\n?|\\n?</#{tag}>" }
              .join("|")

Roman · December 3, 2024, 12:39pm

Some of the helper prompts use those tags to provide context. For example:

github.com/discourse/discourse-ai

app/jobs/regular/stream_post_helper.rb

main


      
          reply_to = post.reply_to_post
          
          return unless user.guardian.can_see?(post)
          
          helper_mode = args[:prompt]
          
          if helper_mode == DiscourseAi::AiHelper::Assistant::EXPLAIN
            input = <<~TEXT.strip
              <term>#{args[:text]}</term>
              <context>#{post.raw}</context>
              <topic>#{topic.title}</topic>
              #{reply_to ? "<replyTo>#{reply_to.raw}</replyTo>" : nil}
            TEXT
          else
            input = args[:text]
          end
          
          DiscourseAi::AiHelper::Assistant.new.stream_prompt(
            helper_mode,
            input,
            user,

Some models might include them in the reply, so we remove them.

sam · December 3, 2024, 7:07pm

Not following, can you expand with a full example?

Why do we want to keep the text in input tags in the output when we sanitise the stuff the model gives us?

(OP should be working now, btw.)

Roman · December 3, 2024, 8:31pm

The word “sanitize” is a bit misleading here. We want to solve two different problems:

1. Make sure we get the output and nothing else.
2. Make sure to strip any tags that make the result look unnatural.

The problem here is that we are being too lax with (1). We need to ensure that the relevant part is always wrapped by <output> and </output>, and use nothing else. Once we have this relevant part, we remove all other tags to ensure the result looks clean (2).
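A rough sketch of what that two-step approach could look like. The names `extract_output`, `OUTPUT_REGEX` and `STRIP_TAGS_REGEX` are hypothetical and not taken from the plugin, and falling back to the whole reply when there is no <output> tag is an assumption made only for this sketch:

```ruby
# Hypothetical names throughout; this is not the plugin's actual API.
CONTEXT_TAGS = %w[term context topic replyTo input output result].freeze

# (1) Capture what sits inside <output>...</output>. The closing tag is
# optional because a stop sequence such as "\n</output>" can cut it off.
OUTPUT_REGEX = %r{<output>\n?(.*?)(?:\n?</output>|\z)}mi

# (2) Any leftover helper tags are removed from the captured part.
STRIP_TAGS_REGEX =
  Regexp.new(
    CONTEXT_TAGS.map { |tag| "<#{tag}>\\n?|\\n?</#{tag}>" }.join("|"),
    Regexp::IGNORECASE
  )

def extract_output(raw)
  match = raw.match(OUTPUT_REGEX)
  # Assumption: with no <output> tag at all, fall back to the whole reply.
  relevant = match ? match[1] : raw
  relevant.gsub(STRIP_TAGS_REGEX, "")
end

# An echoed <input> block is dropped instead of being kept next to the output.
puts extract_output("<input>teh text</input>\n\n<output>the text</output>")

# Tags that leak into the output itself are still stripped.
puts extract_output("<output><term>Not following</term> means X</output>")
```

Extracting first and stripping second keeps the two concerns separate: the echoed input never reaches the user, and stray tags inside the output are still cleaned up.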

To expand on the example I provided above, and explain why we currently scrub all these tags, this is what the seeded “explain” prompt looks like:

https://github.com/discourse/discourse-ai/blob/main/db/fixtures/ai_helper/603_completion_prompts.rb#L157

<term> and <replyTo> are used to provide context to the model, while <input> tells it which specific piece of text we want it to focus on.

Problem was that some models were using the same tags in their replies, which made the text look unnatural and weird to users. The end goal here is to remove these tags and produce “clean” text as the result.

For example, when I want to get an explanation of what “Not following” means, I don’t want to see something like this:

<term>Not following</term> in this context means that the user is having trouble understanding the explanation or the point being made. (…)
