Is it possible to train the bot on my community data, if I want to?
For that you would first need the permission of all community members for their writing to be used for that, or else you would be taking a liability risk as some companies such as Microsoft are being sued for doing exactly that without permission.
Generally, forum owners attempt to secure a very liberal license on user content. I’m not a lawyer, but this is a completely different ballpark compared to “crawling information on the Internet and training on that.”
Regardless, here are significant challenges here:
- Fine-tuning is only available on 3.5 models (within the OpenAI ecosystem).
- If you fine-tune, the model becomes significantly more expensive per call.
- Fine-tuning to achieve real value is extremely difficult and would require a mammoth effort in curating. My gut feeling is that it would not come close to RAG[1] performance.
So, while it’s possible, it’s not recommended.
(GPT-4) “RAG” in this context stands for Retrieval Augmented Generation. It’s a technique often used in machine learning, more specifically, in the training of AI models such as chatbots.
RAG combines the benefits of both retrieval-based models and generative models. In other words, it uses a database of pre-existing responses (retrieval) and enhances them with the ability to generate new responses from scratch. This combination usually results in improved performance as the system can pull accurate facts from its library and articulate them in new, coherent sentences.
In the conversation, the user suggests that fine-tuning an AI model to a specific set of community data might not achieve the same level of performance as using a Retrieval Augmented Generation model, implying that the RAG model is more efficient and results in higher quality responses. ↩︎
Fine tuning is not an effective way to add new content to a model. It’s useful for training models to produce output in different formats, or achieve higher performance in specific tasks (e.g. categorisation, content extraction), but it’s not possible to add content.
The best way to think about it is you can fine tune to teach a model new tricks, but not new facts. If you want to reduce hallucination or introduce new content then RAG is the way to go.
No one has actually answered the question. Assuming you have the rights to use the community data how would you train an AI bot with it?
Define what you mean by train?
Fine tune a specific model (gpt 3.5 or llama) and then host a custom model
Or do you mean have it so the bot is aware of content on the forum?
If you just want awareness, then this already ships now
If you want a fine tuned model, you got to hire an AI team
How big part of respose was
- an example of hallucination
- ai/model dependent (very expensive self hosted is very much different than just expensive OpenAI model)