How to use AI models with custom tokenizers

I’m trying to use the Groq moonshotai/kimi-k2-instruct model. According to the documentation on moonshotai/Kimi-K2-Instruct · Hugging Face, this model isn’t compatible with the OpenAI or Gemini tokenizers and appears to use its own custom tokenizer.

Is it possible to configure Discourse to use a custom tokenizer for this model, and if so, how? I don’t see any option under the LLM model settings for a custom tokenizer.

This model appears to be far superior to GPT-5, so I’m very interested in using it with the Discourse AI bot to see how effective it can be. (It beats GPT-5 on reasoning; multilingual MMLU: 89%; HLE multilingual: 85%.)

TL;DR: pick the closest tokenizer and set the maximum context a few thousand tokens lower, so the difference between the two tokenizers doesn’t affect you.
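
To make that margin concrete, here’s a minimal sketch of the idea. The context size and the undercount percentage are assumptions you’d tune for your own setup, not figures from the Kimi-K2 docs:

```python
# Sketch of the "headroom" idea: if the proxy tokenizer picked in the LLM
# settings (e.g. Llama 3 or Qwen) undercounts relative to Kimi-K2's real
# tokenizer, configure a smaller max context so prompts still fit.

KIMI_K2_CONTEXT = 131_072    # assumed advertised context window, in tokens
ASSUMED_UNDERCOUNT = 0.05    # assume the proxy may undercount by up to ~5%

safe_context = int(KIMI_K2_CONTEXT * (1 - ASSUMED_UNDERCOUNT))
print(safe_context)  # 124518 -> leaves a few thousand tokens of headroom
```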

Thanks. So I decided to enlist the services of ChatGPT, Gemini and Grok to help me decide which tokenizer to use, i.e. which would be the closest match to the Kimi-K2-Instruct TikToken/BPE tokenizer and give the most accurate results from the model.

I must say modern AI models are fairly representative of human society. They all reasoned out which tokenizer would be best suited and presented their findings, but they disagreed on some of the facts and each had their own opinion on which one is best - kinda heading in the same direction but not really a consensus, very much like a human project team - hilarious!!! :rofl:

BTW, Gemini recommended Qwen (based on the relationship between the Chinese founders), Grok recommended Llama 3 (based on its similarity to cl100k_base and its overall efficiency), while ChatGPT said either Qwen or Llama 3 - :joy:
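
If you’d rather measure than poll the chatbots, a rough check is to tokenize a few of your own typical prompts with each candidate and compare counts against the real Kimi-K2 tokenizer. This is just a sketch: the repo names below are the obvious Hugging Face candidates, but the Kimi-K2 tokenizer needs `trust_remote_code` and the Llama 3 repo is gated, so swap in whatever you actually have access to:

```python
# Compare token counts from candidate proxy tokenizers against Kimi-K2's own
# tokenizer to see which one tracks it most closely on your typical text.
from transformers import AutoTokenizer

sample = "Replace this with a few of your own typical prompts and replies."

candidates = {
    "kimi-k2": AutoTokenizer.from_pretrained(
        "moonshotai/Kimi-K2-Instruct", trust_remote_code=True
    ),
    "qwen": AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct"),
    "llama3": AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct"),
}

baseline = len(candidates["kimi-k2"].encode(sample))
for name, tok in candidates.items():
    count = len(tok.encode(sample))
    print(f"{name}: {count} tokens ({count - baseline:+d} vs Kimi-K2)")
```

Whichever candidate stays closest to the Kimi-K2 count on your text is the one to pick in the LLM settings, and the worst-case gap you observe tells you how much headroom to knock off the max context.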