Is it possible to train the bot on my community data, if I want to?
For that you would first need the permission of all community members for their writing to be used for that, or else you would be taking a liability risk as some companies such as Microsoft are being sued for doing exactly that without permission.
Generally, forum owners attempt to secure a very liberal license on user content. I’m not a lawyer, but this is a completely different ballpark compared to “crawling information on the Internet and training on that.”
Regardless, here are significant challenges here:
- Fine-tuning is only available on 3.5 models (within the OpenAI ecosystem).
- If you fine-tune, the model becomes significantly more expensive per call.
- Fine-tuning to achieve real value is extremely difficult and would require a mammoth effort in curating. My gut feeling is that it would not come close to RAG[1] performance.
So, while it’s possible, it’s not recommended.
(GPT-4) “RAG” in this context stands for Retrieval Augmented Generation. It’s a technique often used in machine learning, more specifically, in the training of AI models such as chatbots.
RAG combines the benefits of both retrieval-based models and generative models. In other words, it uses a database of pre-existing responses (retrieval) and enhances them with the ability to generate new responses from scratch. This combination usually results in improved performance as the system can pull accurate facts from its library and articulate them in new, coherent sentences.
In the conversation, the user suggests that fine-tuning an AI model to a specific set of community data might not achieve the same level of performance as using a Retrieval Augmented Generation model, implying that the RAG model is more efficient and results in higher quality responses. ↩︎
Fine tuning is not an effective way to add new content to a model. It’s useful for training models to produce output in different formats, or achieve higher performance in specific tasks (e.g. categorisation, content extraction), but it’s not possible to add content.
The best way to think about it is you can fine tune to teach a model new tricks, but not new facts. If you want to reduce hallucination or introduce new content then RAG is the way to go.
No one has actually answered the question. Assuming you have the rights to use the community data how would you train an AI bot with it?
Define what you mean by train?
Fine tune a specific model (gpt 3.5 or llama) and then host a custom model
Or do you mean have it so the bot is aware of content on the forum?
If you just want awareness, then this already ships now
If you want a fine tuned model, you got to hire an AI team
How big part of respose was
- an example of hallucination
- ai/model dependent (very expensive self hosted is very much different than just expensive OpenAI model)
Can I tell AI bot to use a specific data source for RAG, in addition to the whole forum? I have a flat database (I could make it csv) that would be a good starting point for many AI bot questions, then it could RAG the rest of the forum as needed. I don’t want to post the csv on the forum though.
Perhaps you could put the CSV data in a place that the AI could see it that the regular users cannot. (This could be impossible or a terrible idea, but it might work).
Yes, you can upload the file directly to a persona.
Sounds perfect, but I don’t see how to upload to a persona. Or should I put a link to the file in the persona’s system prompt?
When you go to Admin > Plugins > AI > Personas > New Persona there will be a Uploads are at the very end:
Also, depending on how many data you have on CSV, you can also directly paste it into the system prompt.
I’m currently using the latest version, and I don’t see that option either.
Is embedding configured?
2 posts were split to a new topic: Gemini Embeddings are not working