I’m trying to use the Groq moonshotai/kimi-k2-instruct model. According to the documentation on moonshotai/Kimi-K2-Instruct · Hugging Face, this model isn’t compatible with the OpenAI or Gemini tokenizers and appears to use its own custom tokenizer.
Is it possible to configure Discourse to use a custom tokenizer for this model, and if so, how? I don’t see any option under the LLM model settings for using a custom tokenizer.
This model appears to be far superior to GPT-5, so I’m very interested in using it with the Discourse AI bot to see how effective it can be. (It beats GPT-5 on reasoning; multilingual MMLU: 89%; HLE multilingual: 85%.)
Thanks. So I decided to enlist the services of ChatGPT, Gemini and Grok to help me decide which tokenizer to use, i.e. which would be the closest match to the Kimi K2 Instruct TikToken/BPE tokenizer and therefore give the most accurate results from the model.
I must say, modern AI models are fairly representative of human society. They all reasoned out which tokenizer would be best suited and presented their findings; they disagreed on some of the facts, and each had its own opinion on which one is best. Kinda heading in the same direction, but not really a consensus - very much like a human project team. Hilarious!
BTW, Gemini recommended Qwen (because of the relationship between the Chinese founders), Grok recommended Llama3 (based on its similarity to cl100k_base and overall efficiency), while ChatGPT said either Qwen or Llama3.
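In the end, since (as far as I understand) Discourse only uses the configured tokenizer for counting tokens and truncating prompts rather than for the actual inference, what matters most is whose token counts track Kimi’s the closest. Here’s a rough sketch of how I’d compare them empirically, not anything Discourse-specific; the Hugging Face repo IDs, the need for `trust_remote_code=True`, and the sample texts are my assumptions (and the Llama 3 repo is gated, so you’d need access):

```python
# Rough check: which candidate tokenizer's token counts stay closest to the
# real Kimi K2 tokenizer? Repo IDs and sample texts are illustrative assumptions.
from transformers import AutoTokenizer
import tiktoken

samples = [
    "Discourse AI bot test: how many tokens does this sentence use?",
    "Mixed-language text with some 中文 to stress the tokenizer.",
    "def hello():\n    return 'code snippets tokenize differently'",
]

# Reference: Kimi K2's own (TikToken-style) tokenizer; the repo ships custom
# tokenizer code, so trust_remote_code is probably required.
kimi = AutoTokenizer.from_pretrained(
    "moonshotai/Kimi-K2-Instruct", trust_remote_code=True
)

candidates = {
    "Qwen2.5": AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct"),
    "Llama3": AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct"),
}
cl100k = tiktoken.get_encoding("cl100k_base")  # OpenAI baseline for comparison

for text in samples:
    ref = len(kimi.encode(text))
    counts = {name: len(tok.encode(text)) for name, tok in candidates.items()}
    counts["cl100k_base"] = len(cl100k.encode(text))
    deltas = "  ".join(f"{n}={c} ({c - ref:+d})" for n, c in counts.items())
    print(f"kimi={ref:3d}  {deltas}  | {text[:40]!r}")
```

If none of them match exactly, I’d lean towards whichever one slightly overcounts: overestimating token usage just truncates a little early, whereas undercounting risks blowing past the model’s real context limit.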