Ai:embeddings:backfill - Handling OpenAI's 400 Error for Excessive Tokens in Embeddings

I am trying to perform this rake task using the OpenAI embeddings:

I get an error message

[…]:/var/www/discourse# rake ai:embeddings:backfill --trace
** Invoke ai:embeddings:backfill (first_time)
** Invoke environment (first_time)
** Execute environment
** Execute ai:embeddings:backfill
.rake aborted!
Net::HTTPBadResponse: Net::HTTPBadResponse (Net::HTTPBadResponse)
/var/www/discourse/plugins/discourse-ai/lib/inference/open_ai_embeddings.rb:27:in perform!' /var/www/discourse/plugins/discourse-ai/lib/embeddings/vector_representations/text_embedding_ada_002.rb:36:in vector_from’
/var/www/discourse/plugins/discourse-ai/lib/embeddings/vector_representations/base.rb:144:in generate_representation_from' /var/www/discourse/plugins/discourse-ai/lib/tasks/modules/embeddings/database.rake:19:in block (2 levels) in ’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/activerecord-7.0.8/lib/active_record/relation/batches.rb:71:in each' /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/activerecord-7.0.8/lib/active_record/relation/batches.rb:71:in block in find_each’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/activerecord-7.0.8/lib/active_record/relation/batches.rb:138:in block in find_in_batches' /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/activerecord-7.0.8/lib/active_record/relation/batches.rb:245:in block in in_batches’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/activerecord-7.0.8/lib/active_record/relation/batches.rb:229:in loop' /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/activerecord-7.0.8/lib/active_record/relation/batches.rb:229:in in_batches’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/activerecord-7.0.8/lib/active_record/relation/batches.rb:137:in find_in_batches' /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/activerecord-7.0.8/lib/active_record/relation/batches.rb:70:in find_each’
/var/www/discourse/plugins/discourse-ai/lib/tasks/modules/embeddings/database.rake:17:in block in <main>' /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/task.rb:281:in block in execute’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/task.rb:281:in each' /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/task.rb:281:in execute’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/task.rb:219:in block in invoke_with_call_chain' /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/task.rb:199:in synchronize’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/task.rb:199:in invoke_with_call_chain' /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/task.rb:188:in invoke’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/application.rb:182:in invoke_task' /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/application.rb:138:in block (2 levels) in top_level’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/application.rb:138:in each' /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/application.rb:138:in block in top_level’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/application.rb:147:in run_with_threads' /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/application.rb:132:in top_level’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/application.rb:83:in block in run' /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/application.rb:208:in standard_exception_handling’
/var/www/discourse/vendor/bundle/ruby/3.2.0/gems/rake-13.1.0/lib/rake/application.rb:80:in run' bin/rake:13:in <top (required)>’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/cli/exec.rb:58:in load' /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/cli/exec.rb:58:in kernel_load’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/cli/exec.rb:23:in run' /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/cli.rb:451:in exec’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/vendor/thor/lib/thor/command.rb:28:in run' /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/vendor/thor/lib/thor/invocation.rb:127:in invoke_command’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/vendor/thor/lib/thor.rb:527:in dispatch' /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/cli.rb:34:in dispatch’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/vendor/thor/lib/thor/base.rb:584:in start' /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/cli.rb:28:in start’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/exe/bundle:28:in block in <top (required)>' /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/lib/bundler/friendly_errors.rb:117:in with_friendly_errors’
/usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.5.3/exe/bundle:20:in <top (required)>' /usr/local/bin/bundle:25:in load’
/usr/local/bin/bundle:25:in `’
Tasks: TOP => ai:embeddings:backfill

You don’t have to read it though, I figured out what the problem is.

An HTTPBadResponse is being thrown from the line below:

the /logs page shows this:

OpenAI Embeddings failed with status: 400 body: {
  "error": {
    "message": "This model's maximum context length is 8192 tokens, however you requested 8506 tokens (8506 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.",
    "type": "invalid_request_error",
    "param": null,
    "code": null
  }
}

Basically it seems that there is something that is too many words. I’m not sure what the difference is between the “prompt” and the “completion” in this context. Either way, this is preventing me from backfilling.

I have changed the maximum post limit in the site settings, so maybe it’s being caused by some really long post? In this case though, I would expect this post content to be truncated or perhaps that post should just be skipped over? Either way, it’s blocking the backfilling process entirely.

1 Like

Thanks for the report, will take a look on Monday.

2 Likes

We are truncating the content before sending it away using own OpenAI Tokenizer, so this is an unexpected error.

Can you share the problematic text?

All I can see is the stacktrace and the 400 error. Is there a place I can look to see what the request was? Otherwise I don’t know what text is causing the issue.

Since you are running the rake task, can you edit the file at

and add a puts t.id between lines 18 and 19 to print the topic ID.

Thank you for your guidance. I think I found one hell of an edge case.

The problem was Zalgo text

That is, this stuff:
image

This hello world becomes 607 characters with all the junk on it

There was a post with a bunch of it so I deleted it. The backfill was able to proceed. Probably not a high priority issue but I can’t be the only one with a post like this on their forum.

Ohh that’s interesting. I guess it triggers an issue with the OpenAI tokenizer, that makes our counting be wrong.

2 Likes

This may actually be a bug in the official tokenizer!!

Our counting totally lines up!

Also… looking at token counts, zalgo text is 1 hell of an attack on AI, given it inflates token counts for so little value.

@piffy any chance you can paste in the exact text you had into https://platform.openai.com/tokenizer to see if token counts line up with what the API says, there may be a repro for Open AI here.

2 Likes


Above is the raw post content I see when I click “Edit”

For more context, this was a failure of the topic to be embedded, so I don’t know the implementation details on how a full topic is embedded. But I can tell you that removing this one post took it from not working to working.

I can send you the original message in chat, I feel like posting it in this thread may just recreate the problem here :laughing: