I’ve been struggling to set up Embeddings with Mistral AI, I suspect because Mistral requires a model to be passed. Do you know whether this is possible (and if so, how), or what should be done to make it possible?
Try setting `mistral-embed` in the “Model name” field, which appears after you select “OpenAI” as the “Provider”.
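This works because Mistral exposes an OpenAI-compatible API. As a minimal sketch (the endpoint URL and placeholder key are my assumptions, not Discourse internals), the equivalent direct call looks like this:

```python
from openai import OpenAI

# Point the standard OpenAI client at Mistral's OpenAI-compatible endpoint.
client = OpenAI(
    api_key="YOUR_MISTRAL_API_KEY",        # a Mistral key, not an OpenAI one
    base_url="https://api.mistral.ai/v1",  # assumed Mistral API base URL
)

resp = client.embeddings.create(
    model="mistral-embed",
    input=["Hello from Discourse embeddings"],
)
print(len(resp.data[0].embedding))  # mistral-embed returns 1024-dimensional vectors
```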
Thanks, that works
I’m struggling to figure out which tokenizer would be best for this use case, though. The Mixtral tokenizer isn’t selectable here. Do you have any suggestions?
Here are the token counts for your post above according to several tokenizers:

| Tokenizer | Tokens |
| --- | --- |
| OpenAI | 45 |
| Mixtral | 52 |
| Gemini | 47 |
| E5 | 50 |
| bge-large-en | 49 |
| bge-m3 | 50 |
| mpnet | 49 |
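In case it’s useful, here’s a minimal sketch of how counts like these can be reproduced. The Hugging Face repo ids are my assumptions (and some Mistral repos may require access approval), not necessarily what Discourse ships:

```python
import tiktoken
from transformers import AutoTokenizer

text = "I've been struggling to set up Embeddings with Mistral AI..."

# OpenAI-style count via tiktoken (cl100k_base is the encoding used by
# OpenAI's current embedding models).
print("OpenAI:", len(tiktoken.get_encoding("cl100k_base").encode(text)))

# Open models via Hugging Face tokenizers; repo ids below are assumptions.
for label, repo in [
    ("Mixtral", "mistralai/Mixtral-8x7B-v0.1"),
    ("bge-m3", "BAAI/bge-m3"),
    ("mpnet", "sentence-transformers/all-mpnet-base-v2"),
]:
    tok = AutoTokenizer.from_pretrained(repo)
    print(label, len(tok.encode(text, add_special_tokens=False)))
```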
Looks like `mistral-embed` doesn’t differ much from the others. And since it supports a large 8k-token context window, you should be safe picking any of them and leaving some room to spare by limiting the context window in Discourse to 7k or 7.5k.
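To illustrate the headroom idea, here’s a rough sketch of truncating to a token budget before embedding; the function name and the 7.5k default are mine, not anything Discourse exposes:

```python
from transformers import AutoTokenizer

def truncate_to_token_budget(text: str, tokenizer, budget: int = 7500) -> str:
    """Keep at most `budget` tokens, leaving headroom under the 8k limit."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return tokenizer.decode(ids[:budget])

tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")  # assumed repo id
safe_text = truncate_to_token_budget("some very long post ...", tok)
```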
Looks like `mistral-embed` uses the same tokenizer as the first Mixtral model, and we already ship that anyway, so what do you think about enabling that tokenizer in the embeddings config page, @Roman_Rizzi?
Sure. I don’t see why not if it’s already there. This change will add it to the available options: