Proofreading text inserts the text twice

Moin · November 29, 2024, 09:42

I don't know why, and it doesn't happen in every post, but it is reproducible in one specific post: whenever I use proofread, the content gets duplicated.

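Lilly · November 30, 2024, 16:41

Hmm, I can reproduce it in the post you linked, but I haven't found it anywhere else yet. Odd indeed.

I even tried with this image

But the other one I can reproduce with different reply text, even when the date field is not in the quote. I did notice that if the reply text has no typos, it tries to correct the quote instead.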

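Arkshine · November 30, 2024, 17:20

This is so weird.

It duplicates when you have no text selected, the quote contains something to fix, and you are a non-staff user.

When the quoted content looks fine, it does not duplicate: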

sam · December 3, 2024, 03:44

This is a bug triggered by Qwen, @Falco

{
  "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
  "temperature": 0,
  "stop": [
    "\n</output>"
  ],
  "messages": [
    {
      "role": "system",
      "content": "You are a markdown proofreader. You correct egregious typos and phrasing issues but keep the user's original voice.\nYou do not touch code blocks. I will provide you with text to proofread. If nothing needs fixing, then you will echo the text back.\nYou will find the text between <input></input> XML tags.\nYou will ALWAYS return the corrected text between <output></output> XML tags.\n\n"
    },
    {
      "role": "user",
      "content": "<input>[quote=\"Arkshine, post:1, topic:339163\"]\n:information_source: This component requires Discourse to be current as of [date=2024-11-27 timezone=\"Europe/Paris\"]. \n[/quote]\nDid you update Discourse? You only receive a notification to update when a new beta is released, but new commits are added every day.</input>"
    }
  ]
}

{
  "id": "chatcmpl-752c6aacdc7f496b951592e88d485eb3",
  "object": "chat.completion",
  "created": 1733196730,
  "model": "Qwen/Qwen2.5-32B-Instruct-AWQ",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<input>[quote=\"Arkshine, post:1, topic:339163\"]\n:information_source: This component requires Discourse to be current as of [date=2024-11-27 timezone=\"Europe/Paris\"]. \n[/quote]\nDid you update Discourse? You only receive a notification to update when a new beta is released, but new commits are added daily.</input>\n\n<output>[quote=\"Arkshine, post:1, topic:339163\"]\n:information_source: This component requires Discourse to be current as of [date=2024-11-27 timezone=\"Europe/Paris\"]. \n[/quote]\nDid you update Discourse? You only receive a notification to update when a new beta is released, but new commits are added daily.</output>",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 184,
    "total_tokens": 358,
    "completion_tokens": 174,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

Note that it returned both the <input> and the <output> tags, so there is a bug here.

github.com/discourse/discourse-ai

lib/ai_helper/assistant.rb

e3f5e86dc


      
          SANITIZE_REGEX_STR =
            %w[term context topic replyTo input output result]
              .map { |tag| "<#{tag}>\\n?|\\n?</#{tag}>" }
              .join("|")
          
          SANITIZE_REGEX = Regexp.new(SANITIZE_REGEX_STR, Regexp::IGNORECASE | Regexp::MULTILINE)
          
          def sanitize_result(result)
            result.gsub(SANITIZE_REGEX, "")
          end

The sanitize regex strips only the tags themselves, so the text of both the input and the output is kept.
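To make the effect concrete, here is a minimal sketch that runs the sanitizer quoted above against a shortened stand-in for the model's reply. The regex and `sanitize_result` are copied from the snippet; the `reply` string is made up for illustration and is not the real completion.

```ruby
# Same tag list and pattern as the snippet quoted from assistant.rb.
SANITIZE_REGEX_STR =
  %w[term context topic replyTo input output result]
    .map { |tag| "<#{tag}>\\n?|\\n?</#{tag}>" }
    .join("|")

SANITIZE_REGEX = Regexp.new(SANITIZE_REGEX_STR, Regexp::IGNORECASE | Regexp::MULTILINE)

def sanitize_result(result)
  result.gsub(SANITIZE_REGEX, "")
end

# Hypothetical, shortened reply: the model echoes <input> and then adds <output>.
reply = "<input>new commits are added daily.</input>\n\n" \
        "<output>new commits are added daily.</output>"

sanitized = sanitize_result(reply)

# Only the tags are gone; the sentence is still there twice.
puts sanitized
puts sanitized.scan("new commits are added daily.").length
```

Since the tags are removed but the text between them is not, an echoed `<input>` block ends up in the post next to the `<output>` block.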

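I think we should be more explicit with our API: if you are proofreading, ask only for the output, or do some better prompt engineering.

Also, interestingly, we stopped sending examples even though we have them, @Roman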

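sam · December 3, 2024, 04:34

This fixes the core of the regression:

It does come with a side effect though, @Jagster: we had stopped sending English examples, and now we will send them again. If this affects you, please let us know.

That said, @Roman, this does not make sense to me: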

SANITIZE_REGEX_STR =
            %w[term context topic replyTo input output result]
              .map { |tag| "<#{tag}>\n?|\n?<\/#{tag}>" }
              .join("|")

Shouldn't it be:

(item is used for title suggestions, but maybe that goes through a different path)

SANITIZE_REGEX_STR =
            %w[output item]
              .map { |tag| "<#{tag}>\n?|\n?<\/#{tag}>" }
              .join("|")
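As a quick check of what the narrower tag list would do, here is a small sketch that applies it to a shortened, made-up reply of the same shape as the one above. The constant name `NARROW_SANITIZE_REGEX` and the `reply` text are assumptions for illustration only.

```ruby
# Narrowed tag list from the suggestion above; the constant name is hypothetical.
NARROW_SANITIZE_REGEX =
  Regexp.new(
    %w[output item].map { |tag| "<#{tag}>\\n?|\\n?</#{tag}>" }.join("|"),
    Regexp::IGNORECASE | Regexp::MULTILINE
  )

# Hypothetical reply in which the model echoes the input before the output.
reply = "<input>added every day.</input>\n\n<output>added daily.</output>"

narrowed = reply.gsub(NARROW_SANITIZE_REGEX, "")

# The <output> tags are stripped, but the echoed <input> block is left untouched.
puts narrowed
```

With this list alone, the echoed `<input>` block would still reach the user, now with its tags visible, so picking out the output has to happen as a separate step.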

Roman · December 3, 2024, 12:39

Some helper prompts use these tags to provide context. For example:

github.com/discourse/discourse-ai

app/jobs/regular/stream_post_helper.rb

main


      
          reply_to = post.reply_to_post
          
          return unless user.guardian.can_see?(post)
          
          helper_mode = args[:prompt]
          
          if helper_mode == DiscourseAi::AiHelper::Assistant::EXPLAIN
            input = <<~TEXT.strip
              <term>#{args[:text]}</term>
              <context>#{post.raw}</context>
              <topic>#{topic.title}</topic>
              #{reply_to ? "<replyTo>#{reply_to.raw}</replyTo>" : nil}
            TEXT
          else
            input = args[:text]
          end
          
          DiscourseAi::AiHelper::Assistant.new.stream_prompt(
            helper_mode,
            input,
            user,

Some models may include them in their replies, so we strip them.

sam · December 3, 2024, 19:07

Not following, can you expand with a complete example?

When we sanitize what the model gives us, why would we want to keep the text inside the input tags?

(By the way, the OP should work now.)

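Roman · December 3, 2024, 20:31

The word "sanitize" is a bit misleading here. We are solving two different problems:

1. Making sure we get only the output and nothing else.
2. Making sure we remove any tags that make the result look unnatural.

Our current problem is that we handle (1) too loosely. We need to make sure the relevant part is always wrapped in <output> and </output>, and that we use nothing outside of it. Once we have that relevant part, we remove all other tags so the result looks clean (2).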
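The two steps could be sketched roughly as follows. This is only an illustration of the idea, not the plugin's code: `extract_output`, both constant names, and the fallback behavior are assumptions. The closing tag is treated as optional because the request above uses "\n</output>" as a stop sequence, so a reply may end before it.

```ruby
# Hypothetical names; the tag list matches the one in the sanitize regex above.
OUTPUT_BLOCK_REGEX = %r{<output>\n?(.*?)(?:\n?</output>|\z)}mi
LEFTOVER_TAG_REGEX =
  Regexp.new(
    %w[term context topic replyTo input output result]
      .map { |tag| "<#{tag}>\\n?|\\n?</#{tag}>" }
      .join("|"),
    Regexp::IGNORECASE
  )

def extract_output(reply)
  # Step (1): keep only what sits inside <output>...</output>.
  # Assumption: if the model sent no <output> tag at all, use the whole reply.
  match = reply.match(OUTPUT_BLOCK_REGEX)
  relevant = match ? match[1] : reply

  # Step (2): drop any leftover context tags so the text reads naturally.
  relevant.gsub(LEFTOVER_TAG_REGEX, "")
end

# Made-up replies for illustration.
echoed = "<input>added every day.</input>\n\n<output>added daily.</output>"
tagged = "<term>Not following</term> means the user is confused."

puts extract_output(echoed)
puts extract_output(tagged)
```

In this sketch the echoed `<input>` block is discarded in step (1), and step (2) still cleans up replies such as the "explain" case where the model reuses context tags.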

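To expand on the example I gave above and explain why we currently strip all of these tags, here is what the seeded "explain" prompt looks like:

https://github.com/discourse/discourse-ai/blob/main/db/fixtures/ai_helper/603_completion_prompts.rb#L157

<term> and <replyTo> are used to give the model context, while <input> tells the model which specific text we want it to focus on.

The problem is that some models reuse the same tags in their replies, which makes the text look unnatural and odd to users. The end goal is to strip those tags and produce "clean" text as the result.

For example, when I ask for an explanation of what "Not following" means, I don't want to see something like this:

<term>Not following</term> in this context means the user is having trouble understanding the explanation or the point being made. (…)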
