Proofread inserts the text twice

Moin · November 29, 2024, 9:42 AM

I have no idea why, but this does not happen on every post: on certain posts, using proofread duplicates the content.

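Lilly · November 30, 2024, 4:41 PM

Hmm, I could reproduce it on the linked post, but I have not found it anywhere else yet. Really strange.

I also tried it with this.

However, I did notice that it tries to correct the quote even when the quote contains no typo.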

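Arkshine · November 30, 2024, 5:20 PM

This is very strange.

If you have no text selected, the quote contains something to fix, and you are a non-staff user, the content is duplicated.

If the content of the quote is fine, it is not duplicated.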

sam · December 3, 2024, 3:44 AM

This is a bug that is being triggered by Qwen, @Falco.

{
  "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
  "temperature": 0,
  "stop": [
    "\n</output>"
  ],
  "messages": [
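    {
      "role": "system",
      "content": "You are a markdown proofreader. You correct egregious typos and phrasing issues but keep the user's original voice.\nYou do not touch code blocks. I will provide you with text to proofread. If nothing needs fixing, you will echo the text back.\nYou will find the text between <input></input> XML tags.\nYou will always return the corrected text between <output></output> XML tags.\n\n"
    },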
    {
      "role": "user",
      "content": "<input>[quote=\"Arkshine, post:1, topic:339163\"]\n:information_source: This component requires Discourse to be current as of [date=2024-11-27 timezone=\"Europe/Paris\"]. \n[/quote]\nDid you update Discourse? You only receive a notification to update when a new beta is released, but new commits are added every day.</input>"
    }
  ]
}

{
  "id": "chatcmpl-752c6aacdc7f496b951592e88d485eb3",
  "object": "chat.completion",
  "created": 1733196730,
  "model": "Qwen/Qwen2.5-32B-Instruct-AWQ",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<input>[quote=\"Arkshine, post:1, topic:339163\"]\n:information_source: This component requires Discourse to be current as of [date=2024-11-27 timezone=\"Europe/Paris\"]. \n[/quote]\nDid you update Discourse? You only receive a notification to update when a new beta is released, but new commits are added daily.</input>\n\n<output>[quote=\"Arkshine, post:1, topic:339163\"]\n:information_source: This component requires Discourse to be current as of [date=2024-11-27 timezone=\"Europe/Paris\"]. \n[/quote]\nDid you update Discourse? You only receive a notification to update when a new beta is released, but new commits are added daily.</output>",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 184,
    "total_tokens": 358,
    "completion_tokens": 174,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

Notice that the reply contains both the <input> and the <output> tags. So we have a bug here:

github.com/discourse/discourse-ai

lib/ai_helper/assistant.rb

e3f5e86dc


      
          SANITIZE_REGEX_STR =
            %w[term context topic replyTo input output result]
              .map { |tag| "<#{tag}>\\n?|\\n?</#{tag}>" }
              .join("|")
          
          SANITIZE_REGEX = Regexp.new(SANITIZE_REGEX_STR, Regexp::IGNORECASE | Regexp::MULTILINE)
          
          def sanitize_result(result)
            result.gsub(SANITIZE_REGEX, "")
          end

The sanitize regex only strips the tags themselves, so it keeps the content of both the input and the output.
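To make the duplication concrete, here is a minimal sketch that runs the regex from the excerpt above against a shortened stand-in for the Qwen reply. The `reply` string is an abbreviated assumption, not the exact payload.

```ruby
# Regex and method copied from the lib/ai_helper/assistant.rb excerpt above.
SANITIZE_REGEX_STR =
  %w[term context topic replyTo input output result]
    .map { |tag| "<#{tag}>\\n?|\\n?</#{tag}>" }
    .join("|")

SANITIZE_REGEX = Regexp.new(SANITIZE_REGEX_STR, Regexp::IGNORECASE | Regexp::MULTILINE)

def sanitize_result(result)
  result.gsub(SANITIZE_REGEX, "")
end

# Shortened stand-in for the reply above (an assumption for illustration):
# the model echoes the text inside <input>, then repeats it inside <output>.
reply = "<input>new commits are added daily.</input>\n\n" \
        "<output>new commits are added daily.</output>"

sanitized = sanitize_result(reply)

# Only the tags are removed, so the sentence survives twice,
# separated by the blank line that sat between the two blocks.
puts sanitized
```

Because only the tag markers are stripped, whatever the model echoes inside `<input>` ends up in the post next to the corrected text.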

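I think we should either be much more deliberate with the API here and ask only for the output, or do better prompt engineering.

Also, interestingly, we stopped sending examples. We do have examples, though, @Roman.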

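sam · December 3, 2024, 4:34 AM

This fixes the root cause of the regression.

There is a side effect, though, @Jagster: for a while we had stopped sending the English examples, and now we will send them again. Let me know if this affects you.

That said, @Roman, this does not make sense to me: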

SANITIZE_REGEX_STR =
  %w[term context topic replyTo input output result]
    .map { |tag| "<#{tag}>\n?|\n?<\/#{tag}>" }
    .join("|")

Shouldn't it be this instead?

(item is for title suggestions, but that may take a different path)

SANITIZE_REGEX_STR =
  %w[output item]
    .map { |tag| "<#{tag}>\n?|\n?<\/#{tag}>" }
    .join("|")
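For reference, a rough sketch of what a tag list limited to `output` and `item` would do to a reply that echoes `<input>` before `<output>`. The `reply` string is a shortened, made-up stand-in, and the local variable names are only for illustration.

```ruby
# Tag list from the proposal above; the pattern string is written as in the post.
narrow_regex_str =
  %w[output item]
    .map { |tag| "<#{tag}>\n?|\n?<\/#{tag}>" }
    .join("|")
narrow_regex = Regexp.new(narrow_regex_str, Regexp::IGNORECASE | Regexp::MULTILINE)

# Shortened stand-in (an assumption) for a reply that echoes the input first.
reply = "<input>added every day.</input>\n\n<output>added daily.</output>"

result = reply.gsub(narrow_regex, "")

# The <output> markers are gone, but the echoed <input> block is untouched,
# tags included.
puts result
```

In this sketch the echoed input still reaches the result, so narrowing the tag list changes which tags remain visible but does not by itself drop the echoed text.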

Roman · December 3, 2024, 12:39 PM

For context, some of the helper prompts use these tags. For example:

github.com/discourse/discourse-ai

app/jobs/regular/stream_post_helper.rb

main


      
          reply_to = post.reply_to_post
          
          return unless user.guardian.can_see?(post)
          
          helper_mode = args[:prompt]
          
          if helper_mode == DiscourseAi::AiHelper::Assistant::EXPLAIN
            input = <<~TEXT.strip
              <term>#{args[:text]}</term>
              <context>#{post.raw}</context>
              <topic>#{topic.title}</topic>
              #{reply_to ? "<replyTo>#{reply_to.raw}</replyTo>" : nil}
            TEXT
          else
            input = args[:text]
          end
          
          DiscourseAi::AiHelper::Assistant.new.stream_prompt(
            helper_mode,
            input,
            user,

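Some models may include them in their replies, so we strip them.

sam · December 3, 2024, 7:07 PM

Not following. Can you explain with a complete example?

When we sanitize what the model gives us, why would we want to keep the text inside the input tags in the output?

(By the way, the operator is live now.)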

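Roman · December 3, 2024, 8:31 PM

The word "sanitize" is a bit misleading here. We want to solve two different problems:

1. Make sure we get only the output.
2. Remove any tags that would make the result look unnatural.

The problem is that we are too lax about (1). We should always wrap the relevant part in `<output>` and rely on nothing else. Once we have that relevant part, we strip all the other tags so that the result looks clean (2).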
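The two steps described above could be sketched roughly as follows. This is not the discourse-ai implementation: `extract_output`, `strip_tags`, and `clean_reply` are hypothetical names, and falling back to the whole reply when no `<output>` tag is present is an assumption made for the sketch.

```ruby
# Hypothetical constants and helpers for illustration; not taken from discourse-ai.
CONTEXT_TAGS = %w[term context topic replyTo input output result].freeze

STRIP_TAGS_REGEX =
  Regexp.new(
    CONTEXT_TAGS.map { |tag| "<#{tag}>\\n?|\\n?</#{tag}>" }.join("|"),
    Regexp::IGNORECASE
  )

# Step (1): keep only what sits inside <output>...</output>.
# The closing tag is optional because a stop sequence such as "\n</output>"
# can end the reply before the model emits it.
# Assumption: when there is no <output> tag at all, return the reply unchanged.
def extract_output(reply)
  match = reply.match(%r{<output>\n?(.*?)(?:\n?</output>|\z)}mi)
  match ? match[1] : reply
end

# Step (2): drop any leftover context tags so the text reads cleanly.
def strip_tags(text)
  text.gsub(STRIP_TAGS_REGEX, "")
end

def clean_reply(reply)
  strip_tags(extract_output(reply)).strip
end

# Made-up reply in which the model echoes the input before the output.
reply = "<input>added every day.</input>\n\n<output>added daily.</output>"
puts clean_reply(reply)
```

With this ordering the echoed `<input>` block is discarded in step (1) before any tags are stripped, so its content never reaches the result.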

To expand on the example I gave above and explain why we currently strip all of these tags, the seeded "explain" prompt looks like this:

https://github.com/discourse/discourse-ai/blob/main/db/fixtures/ai_helper/603_completion_prompts.rb#L157

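<term> and <replyTo> are used to give the model context, and <input> is used to tell it to focus on a specific piece of text.

The problem was that some models were using the same tags in their replies, which made the text look unnatural and odd to users. The end goal is to strip those tags and produce "clean" text as the result.

For example, if you ask for an explanation of what "Not following" means, you do not want to see something like this:

<term>Not following</term> in this context means that the user is having trouble understanding the explanation or the point being made. (…)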
