AI:embeddings:backfill - 处理 OpenAI 的 400 错误,因 Embeddings 中 Token 超限

我正在尝试使用 OpenAI 嵌入来执行此 rake 任务:

我收到一条错误消息

[:/var/www/discourse# rake ai:embeddings:backfill --trace
** Invoke ai:embeddings:backfill (first_time)
** Invoke environment (first_time)
** Execute environment
** Execute ai:embeddings:backfill
.rake aborted!
Net::HTTPBadResponse: Net::HTTPBadResponse (Net::HTTPBadResponse)
/var/www/discourse/plugins/discourse-ai/lib/inference/open_ai_embeddings.rb:27:in `perform!’
/var/www/discourse/plugins/discourse-ai/lib/embeddings/vector_representations/text_embedding_ada_002.rb:36:in `vector_from’
/var/www/discourse/plugins/discourse-ai/lib/embeddings/vector_representations/base.rb:144:in `generate_representation_from’
/var/www/discourse/plugins/discourse-ai/lib/tasks/modules/embeddings/database.rake:19:in `block (2 levels) in ’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/activerecord-7.0.8/lib/active_record/relation/batches.rb:71:in `each’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/activerecord-7.0.8/lib/active_record/relation/batches.rb:71:in `block in find_each’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/activerecord-7.0.8/lib/active_record/relation/batches.rb:138:in `block in find_in_batches’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/activerecord-7.0.8/lib/active_record/relation/batches.rb:245:in `block in in_batches’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/activerecord-7.0.8/lib/active_record/relation/batches.rb:229:in `loop’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/activerecord-7.0.8/lib/active_record/relation/batches.rb:229:in `in_batches’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/activerecord-7.0.8/lib/active_record/relation/batches.rb:137:in `find_in_batches’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/activerecord-7.0.8/lib/active_record/relation/batches.rb:70:in `find_each’
/var/www/discourse/plugins/discourse-ai/lib/tasks/modules/embeddings/database.rake:17:in `block in ’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/task.rb:281:in `block in execute’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/task.rb:281:in `each’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/task.rb:281:in `execute’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/task.rb:219:in `block in invoke_with_call_chain’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/task.rb:199:in `synchronize’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/task.rb:199:in `invoke_with_call_chain’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/task.rb:188:in `invoke’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/application.rb:182:in `invoke_task’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/application.rb:138:in `block (2 levels) in top_level’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/application.rb:138:in `each’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/application.rb:138:in `block in top_level’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/application.rb:147:in `run_with_threads’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/application.rb:132:in `top_level’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/application.rb:83:in `block in run’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/application.rb:208:in `standard_exception_handling’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/application.rb:80:in `run’
bin/rake:13:in `<top (required)>’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/cli/exec.rb:58:in `load’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/cli/exec.rb:58:in `kernel_load’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/cli/exec.rb:23:in `run’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/cli.rb:451:in `exec’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/vendor/thor/lib/thor/command.rb:28:in `run’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/vendor/thor/lib/thor.rb:527:in `dispatch’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/cli.rb:34:in `dispatch’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/vendor/thor/lib/thor/base.rb:584:in `start’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/cli.rb:28:in `start’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/exe/bundle:28:in `block in <top (required)>’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/friendly_errors.rb:117:in `with_friendly_errors’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/exe/bundle:20:in `<top (required)>’
/usr/local/bin/bundle:25:in `load’
/usr/local/bin/bundle:25:in `’
Tasks: TOP => ai:embeddings:backfill

你不必阅读它,我已经弄清楚问题所在了。

HTTPBadResponse 是从下面这行抛出的:

/logs 页面显示这个:

OpenAI Embeddings failed with status: 400 body: {
  "error": {
    "message": "This model's maximum context length is 8192 tokens, however you requested 8506 tokens (8506 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.",
    "type": "invalid_request_error",
    "param": null,
    "code": null
  }
}

基本上,似乎有什么东西的单词太多了。我不确定在这种情况下“prompt”和“completion”之间的区别。无论如何,这阻止了我进行回填。

我已经更改了站点设置中的最大帖子限制,所以也许是由一个非常长的帖子引起的?但如果是这样,我期望这个帖子的内容被截断,或者这个帖子应该被跳过?无论如何,它完全阻止了回填过程。

1 个赞

谢谢你的报告,周一会查看。

2 个赞

我们正在使用我们自己的 OpenAI Tokenizer 在发送之前截断内容,因此这是一个意外错误。

您能分享有问题的文本吗?

我只能看到堆栈跟踪和 400 错误。有没有什么地方可以查看请求的内容?否则,我不知道是什么文本导致了这个问题。

由于您正在运行 rake 任务,请编辑该文件:

并在第 18 行和第 19 行之间添加一个 puts t.id 来打印主题 ID。

感谢您的指导。我认为我发现了一个非常棘手的极端情况。

问题是 Zalgo 文本

也就是说,这种东西:

这个“hello world”加上所有这些垃圾字符后变成了 607 个字符。

有一个帖子包含了很多这样的内容,所以我删除了它。之后就可以继续回填了。这可能不是一个高优先级的问,但我的论坛上不可能只有我一个人有这样的帖子。

哦,那真有意思。我猜这会触发一个与 OpenAI 分词器有关的问题,导致我们的计数出错。

2 个赞

这可能真的是官方分词器的一个 bug!

我们的计数完全一致!

另外……看看分词计数,Zalgo 文本对 AI 来说是一种非常可怕的攻击,因为它只用很少的价值就极大地增加了分词计数。

@piffy 你有没有可能把你的确切文本粘贴到 https://platform.openai.com/tokenizer 上,看看分词计数是否与 API 所说的相符,这可能是 OpenAI 的一个复现点。

2 个赞


以上是我点击“编辑”时看到的原始帖子内容。

更多背景信息是,这是一个主题未能嵌入的失败,所以我不知道嵌入整个主题的实现细节。但我可以告诉你,删除这个帖子就解决了问题。

我可以把原始消息发给你,我觉得把消息发到这个帖子里可能会在这里重现问题 :laughing:

已修复

1 个赞