Proofread text inserts the text twice

Moin · November 29, 2024, 9:42am

I have no idea why, and it doesn’t happen in all posts, but it’s reproducible in a specific post where, whenever I use proofread, the content is duplicated.

Lilly · November 30, 2024, 4:41pm

Hmmm, I can reproduce it on the post you linked, but I haven’t been able to find it elsewhere yet. Very odd indeed.

I even tried with this one

but the other one I could repro with different reply text and even with the date field out of the quote. I did notice that if there was no typo in the reply text, it tried to correct the quote.

Arkshine · November 30, 2024, 5:20pm

This is so strange.

When you are not selecting any text, there is something to fix in the quote, and you are a non-staff user, it duplicates.

When the quote content looks ok, it doesn’t duplicate.

sam · December 3, 2024, 3:44am

This is a bug being triggered by Qwen @Falco

{
  "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
  "temperature": 0,
  "stop": [
    "\n</output>"
  ],
  "messages": [
    {
      "role": "system",
      "content": "You are a markdown proofreader. You correct egregious typos and phrasing issues but keep the user's original voice.\nYou do not touch code blocks. I will provide you with text to proofread. If nothing needs fixing, then you will echo the text back.\nYou will find the text between <input></input> XML tags.\nYou will ALWAYS return the corrected text between <output></output> XML tags.\n\n"
    },
    {
      "role": "user",
      "content": "<input>[quote=\"Arkshine, post:1, topic:339163\"]\n:information_source: This component requires Discourse to be current as of [date=2024-11-27 timezone=\"Europe/Paris\"]. \n[/quote]\nDid you update Discourse? You only receive a notification to update when a new beta is released, but new commits are added every day.</input>"
    }
  ]
}

{
  "id": "chatcmpl-752c6aacdc7f496b951592e88d485eb3",
  "object": "chat.completion",
  "created": 1733196730,
  "model": "Qwen/Qwen2.5-32B-Instruct-AWQ",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<input>[quote=\"Arkshine, post:1, topic:339163\"]\n:information_source: This component requires Discourse to be current as of [date=2024-11-27 timezone=\"Europe/Paris\"]. \n[/quote]\nDid you update Discourse? You only receive a notification to update when a new beta is released, but new commits are added daily.</input>\n\n<output>[quote=\"Arkshine, post:1, topic:339163\"]\n:information_source: This component requires Discourse to be current as of [date=2024-11-27 timezone=\"Europe/Paris\"]. \n[/quote]\nDid you update Discourse? You only receive a notification to update when a new beta is released, but new commits are added daily.</output>",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 184,
    "total_tokens": 358,
    "completion_tokens": 174,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

Notice how it returns BOTH <input> and <output> tags, so we have a bug here.

github.com

discourse/discourse-ai/blob/e3f5e86dc5d1d75d0cd45c9ece385e5198888d48/lib/ai_helper/assistant.rb#L174-L183


      
          SANITIZE_REGEX_STR =
            %w[term context topic replyTo input output result]
              .map { |tag| "<#{tag}>\\n?|\\n?</#{tag}>" }
              .join("|")
          
          SANITIZE_REGEX = Regexp.new(SANITIZE_REGEX_STR, Regexp::IGNORECASE | Regexp::MULTILINE)
          
          def sanitize_result(result)
            result.gsub(SANITIZE_REGEX, "")
          end

The sanitize regex strips only the tags themselves, so the text inside both <input> and <output> is kept.
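To make the duplication concrete, here is a minimal sketch that runs the regex quoted above against a shortened, made-up stand-in for the model reply (the `reply` string is an assumption for illustration, not the real payload):

```ruby
# Same construction as the SANITIZE_REGEX quoted above.
SANITIZE_REGEX_STR =
  %w[term context topic replyTo input output result]
    .map { |tag| "<#{tag}>\\n?|\\n?</#{tag}>" }
    .join("|")

SANITIZE_REGEX = Regexp.new(SANITIZE_REGEX_STR, Regexp::IGNORECASE | Regexp::MULTILINE)

def sanitize_result(result)
  result.gsub(SANITIZE_REGEX, "")
end

# Hypothetical, shortened reply: the model echoes <input> and then adds <output>.
reply = "<input>new commits are added daily.</input>\n\n" \
        "<output>new commits are added daily.</output>"

sanitized = sanitize_result(reply)

# Only the tags are removed, so the sentence is still present twice.
puts sanitized
```

Because the regex matches the tags and not what sits between them, the sentence survives once from the echoed input and once from the output, which matches the doubled text seen in the composer.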

I guess we should be more deliberate with our API: when proofreading, either ask only for the output or do some better prompt engineering.

Also, interestingly, we stopped sending examples even though we have them, @Roman.

sam · December 3, 2024, 4:34am

This will fix the core of the regression:

It comes with a side effect, though, @Jagster: we stopped sending English examples a while back, and now we will be sending them again. Let us know if this impacts you.

That said, @Roman, this does not make sense to me:

SANITIZE_REGEX_STR =
  %w[term context topic replyTo input output result]
    .map { |tag| "<#{tag}>\\n?|\\n?</#{tag}>" }
    .join("|")

Should it not be:

(item is for title suggestions, but maybe it's taking a different path)

SANITIZE_REGEX_STR =
  %w[output item]
    .map { |tag| "<#{tag}>\\n?|\\n?</#{tag}>" }
    .join("|")

Roman · December 3, 2024, 12:39pm

Some of the helper prompts use those tags to provide context. For example:

github.com

discourse/discourse-ai/blob/main/app/jobs/regular/stream_post_helper.rb#L25


      
          return unless user.guardian.can_see?(post)
          
          prompt = CompletionPrompt.enabled_by_name(args[:prompt])
          
          if prompt.id == CompletionPrompt::CUSTOM_PROMPT
            prompt.custom_instruction = args[:custom_prompt]
          end
          
          if prompt.name == "explain"
            input = <<~TEXT
              <term>#{args[:text]}</term>
              <context>#{post.raw}</context>
              <topic>#{topic.title}</topic>
              #{reply_to ? "<replyTo>#{reply_to.raw}</replyTo>" : nil}
            TEXT
          else
            input = args[:text]
          end
          
          DiscourseAi::AiHelper::Assistant.new.stream_prompt(
            prompt,

Some models might include them in the reply, so we remove them.

sam · December 3, 2024, 7:07pm

Not following, can you expand with a full example?

Why do we want to keep the text in input tags in the output when we sanitise the stuff the model gives us?

(OP should be working now, btw.)

Roman · December 3, 2024, 8:31pm

The word “sanitize” is a bit misleading here. We want to solve two different problems:

1. Make sure we get the output and nothing else.
2. Make sure to strip any tags that make the result look unnatural.

The problem here is that we are being too lax with (1). We need to ensure that the relevant part is always wrapped by <output> and </output>, and use nothing else. Once we have this relevant part, we remove all other tags to ensure the result looks clean (2).
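The two steps could be sketched roughly as follows. This is an illustration only, not the plugin's implementation: `extract_result`, `OUTPUT_REGEX`, and `STRIP_TAGS_REGEX` are hypothetical names, and falling back to the whole reply when no <output> tag is present is an assumption.

```ruby
# Hypothetical names throughout; a sketch, not the discourse-ai implementation.
ALL_TAGS = %w[term context topic replyTo input output result]

STRIP_TAGS_REGEX =
  Regexp.new(ALL_TAGS.map { |tag| "<#{tag}>\\n?|\\n?</#{tag}>" }.join("|"), Regexp::IGNORECASE)

# Capture what follows <output>, up to </output> or the end of the reply.
# The end-of-string branch covers replies cut off by a "\n</output>" stop sequence.
OUTPUT_REGEX = %r{<output>\n?(.*?)(?:\n?</output>|\z)}im

def extract_result(raw)
  # Step 1: keep only the part wrapped in <output>, if the model produced one.
  # Assumption: with no <output> tag at all, fall back to the whole reply.
  match = raw.match(OUTPUT_REGEX)
  relevant = match ? match[1] : raw

  # Step 2: scrub any leftover tags so the result reads naturally.
  relevant.gsub(STRIP_TAGS_REGEX, "").strip
end

# Made-up replies for illustration.
puts extract_result("<input>a b.</input>\n\n<output>a b.</output>")
puts extract_result("<term>Not following</term> means X")
```

With this ordering, an echoed <input> block is discarded in step 1 instead of being unwrapped into the result, while prompts that never use <output> still get their context tags stripped in step 2.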

To expand on the example I provided above, and explain why we currently scrub all these tags, this is what the seeded “explain” prompt looks like:

github.com

discourse/discourse-ai/blob/main/db/fixtures/ai_helper/603_completion_prompts.rb#L157


      
              You are a helpful assistant. I will give you instructions inside <input></input> XML tags.
              You will look at them and reply with a result.
            TEXT
          end
          
          CompletionPrompt.seed do |cp|
            cp.id = -306
            cp.name = "explain"
            cp.prompt_type = CompletionPrompt.prompt_types[:text]
            cp.messages = { insts: <<~TEXT }
              You are a tutor explaining a term to a student in a specific context.
          
              I will provide everything you need to know inside <input> tags, which consists of the term I want you
              to explain inside <term> tags, the context of where it was used inside <context> tags, the title of
              the topic where it was used inside <topic> tags, and optionally, the previous post in the conversation
              in <replyTo> tags.
          
              Using all this information, write a paragraph with a brief explanation
              of what the term means. Format the response using Markdown. Reply only with the explanation and
              nothing more.
            TEXT

Tags such as <term> and <replyTo> provide context to the model, while <input> tells it which piece of text we want it to focus on.

The problem was that some models were using the same tags in their replies, which made the text look unnatural and weird to users. The end goal here is to remove these tags and produce “clean” text as the result.

For example, when I want to get an explanation of what “Not following” means, I don’t want to see something like this:

<term>Not following</term> in this context means that the user is having trouble understanding the explanation or the point being made. (…)
