Can AI bot be trained on community data

DjangoElBongo · February 5, 2024, 9:49am

Is it possible to train the bot on my community data, if I want to?

anon36555649 · February 5, 2024, 6:09pm

For that you would first need the permission of all community members for their writing to be used for that, or else you would be taking a liability risk as some companies such as Microsoft are being sued for doing exactly that without permission.

sam · February 5, 2024, 10:25pm

Generally, forum owners attempt to secure a very liberal license on user content. I’m not a lawyer, but this is a completely different ballpark compared to “crawling information on the Internet and training on that.”

Regardless, here are significant challenges here:

Fine-tuning is only available on 3.5 models (within the OpenAI ecosystem).
If you fine-tune, the model becomes significantly more expensive per call.
Fine-tuning to achieve real value is extremely difficult and would require a mammoth effort in curating. My gut feeling is that it would not come close to RAG^[1] performance.

So, while it’s possible, it’s not recommended.

(GPT-4) “RAG” in this context stands for Retrieval Augmented Generation. It’s a technique often used in machine learning, more specifically, in the training of AI models such as chatbots.
RAG combines the benefits of both retrieval-based models and generative models. In other words, it uses a database of pre-existing responses (retrieval) and enhances them with the ability to generate new responses from scratch. This combination usually results in improved performance as the system can pull accurate facts from its library and articulate them in new, coherent sentences.
In the conversation, the user suggests that fine-tuning an AI model to a specific set of community data might not achieve the same level of performance as using a Retrieval Augmented Generation model, implying that the RAG model is more efficient and results in higher quality responses. ↩︎

bigfudge · February 28, 2024, 9:10am

Fine tuning is not an effective way to add new content to a model. It’s useful for training models to produce output in different formats, or achieve higher performance in specific tasks (e.g. categorisation, content extraction), but it’s not possible to add content.

The best way to think about it is you can fine tune to teach a model new tricks, but not new facts. If you want to reduce hallucination or introduce new content then RAG is the way to go.

luis_villanueva · March 19, 2024, 3:43am

No one has actually answered the question. Assuming you have the rights to use the community data how would you train an AI bot with it?

sam · March 19, 2024, 4:26am

Define what you mean by train?

Fine tune a specific model (gpt 3.5 or llama) and then host a custom model

Or do you mean have it so the bot is aware of content on the forum?

If you just want awareness, then this already ships now

If you want a fine tuned model, you got to hire an AI team

sam · March 19, 2024, 4:34am

AI Conversation with Claude-3-opus

AI Training on Community Data: Discourse Meta

sam: Regarding First read the topic, next search for related topics, finally provide an insightful and helpful reply referencing stuff on meta

ForumHelperBot: Here are some relevant topics I found that discuss training AI chatbots on Discourse community data: Advice on a support bot for a technical support forum (Discourse AI vs Discourse Chatbot) - This topic discusses the goals and approaches for training an AI bot to assist with technical support on a Discourse forum, including learning from the historical discussions. Training a model on a site…

sam: Bro is this not what your are doing here?

ForumHelperBot: You’re right, I misspoke earlier. There are some topics that indicate the Discourse AI Bot can be trained on community specific data: Training a model on a site’s posts? - The OP suggests training an expert AI user bot on all the posts in their forum. Sam from the Discourse team replied that while training models is very expensive, the industry is pushing hard on growing token context and using …

...

Read full transcript

Jagster · March 19, 2024, 8:08am

How big part of respose was

an example of hallucination
ai/model dependent (very expensive self hosted is very much different than just expensive OpenAI model)

markschmucker · December 4, 2024, 1:48pm

Can I tell AI bot to use a specific data source for RAG, in addition to the whole forum? I have a flat database (I could make it csv) that would be a good starting point for many AI bot questions, then it could RAG the rest of the forum as needed. I don’t want to post the csv on the forum though.

pfaffman · December 4, 2024, 3:04pm

Perhaps you could put the CSV data in a place that the AI could see it that the regular users cannot. (This could be impossible or a terrible idea, but it might work).

Falco · December 4, 2024, 3:37pm

Yes, you can upload the file directly to a persona.

markschmucker · December 5, 2024, 9:55pm

Sounds perfect, but I don’t see how to upload to a persona. Or should I put a link to the file in the persona’s system prompt?

Falco · December 5, 2024, 9:57pm

When you go to Admin > Plugins > AI > Personas > New Persona there will be a Uploads are at the very end:

Also, depending on how many data you have on CSV, you can also directly paste it into the system prompt.

markschmucker · December 6, 2024, 10:14am

I don’t have that option. 3.4.0.beta3-dev.

joo · December 6, 2024, 10:29am

I’m currently using the latest version, and I don’t see that option either.

sam · December 6, 2024, 8:23pm

Is embedding configured?

sam · December 7, 2024, 12:52am

2 posts were split to a new topic: Gemini Embeddings are not working

Topic		Replies	Views
How to prevent community content from being used to train LLMs like ChatGPT? Community	71	4010	October 14, 2023
RAG capacities of discourse-ai Support ai	7	198	September 19, 2024
Training a model on a site's posts? Feature ai , ai-bot	2	247	September 9, 2024
Advice on a support bot for a technical support forum (Discourse AI vs Discourse Chatbot) General ai , ai-bot	50	3540	September 19, 2024
Integrating GPT3-like bots? Dev	63	4336	May 10, 2023

Can AI bot be trained on community data

Related topics