AI机器人可以在社区数据上进行培训吗

DjangoElBongo · 2024 年2 月 5 日 09:49

如果我想的话，是否有可能使用我的社区数据来训练机器人？

anon36555649 · 2024 年2 月 5 日 18:09

否则，您将面临法律风险，因为像微软这样的公司因未经许可而这样做而被起诉。

sam · 2024 年2 月 5 日 22:25

通常，论坛所有者会尝试为用户内容获得非常宽松的许可。我不是律师，但这与“抓取互联网上的信息并进行训练”完全是两码事。

无论如何，这里存在严峻的挑战：

微调仅在 3.5 模型（在 OpenAI 生态系统内）上可用。
如果进行微调，模型的每次调用成本将显著增加。
要实现真正的价值，微调极其困难，需要付出巨大的策展努力。我的直觉是，它甚至无法接近 RAG[^1] 的性能。

因此，虽然这是可能的，但不推荐这样做。

[^1]：（GPT-4）在此上下文中，“RAG”代表检索增强生成。这是一种常用于机器学习的技术，更具体地说，用于训练人工智能模型（如聊天机器人）。
RAG 结合了基于检索的模型和生成模型的优点。换句话说，它使用预先存在的响应数据库（检索），并通过从头开始生成新响应的能力来增强它们。这种组合通常会提高性能，因为系统可以从其库中提取准确的事实，并用新的、连贯的句子表达它们。
在对话中，用户建议微调 AI 模型以适应特定的社区数据集可能无法达到与使用检索增强生成模型相同的性能水平，这意味着 RAG 模型更有效，并且能产生更高质量的响应。

bigfudge · 2024 年2 月 28 日 09:10

微调并不是向模型添加新内容的有效方法。它有助于训练模型以不同格式生成输出，或在特定任务（例如分类、内容提取）中实现更高的性能，但无法添加内容。

最好的理解方式是，你可以进行微调来教会模型新的技巧，而不是新的事实。如果你想减少幻觉或引入新内容，那么 RAG 是最佳选择。

luis_villanueva · 2024 年3 月 19 日 03:43

没有人真正回答这个问题。假设您有权使用社区数据，您将如何用它来训练 AI 机器人？

sam · 2024 年3 月 19 日 04:26

请定义您所说的“训练”是什么意思？
微调特定模型（gpt 3.5 或 llama），然后托管自定义模型
或者您的意思是让机器人了解论坛上的内容？
如果您只是想要了解，那么现在已经可以实现了。
如果您想要一个微调模型，您需要聘请一个人工智能团队。

sam · 2024 年3 月 19 日 04:34

AI Conversation with Claude-3-opus

AI Training on Community Data: Discourse Meta

sam: Regarding First read the topic, next search for related topics, finally provide an insightful and helpful reply referencing stuff on meta

ForumHelperBot: Here are some relevant topics I found that discuss training AI chatbots on Discourse community data: Advice on a support bot for a technical support forum (Discourse AI vs Discourse Chatbot) - This topic discusses the goals and approaches for training an AI bot to assist with technical support on a Discourse forum, including learning from the historical discussions. Training a model on a site…

sam: Bro is this not what your are doing here?

ForumHelperBot: You’re right, I misspoke earlier. There are some topics that indicate the Discourse AI Bot can be trained on community specific data: Training a model on a site’s posts? - The OP suggests training an expert AI user bot on all the posts in their forum. Sam from the Discourse team replied that while training models is very expensive, the industry is pushing hard on growing token context and using …

...

Read full transcript

Jagster · 2024 年3 月 19 日 08:08

回应中有多大一部分是

幻觉的例子
依赖于人工智能/模型（非常昂贵的自托管与仅仅昂贵的 OpenAI 模型有很大不同）

markschmucker · 2024 年12 月 4 日 13:48

除了整个论坛外，我还能告诉 AI 机器人使用特定的数据源进行 RAG 吗？我有一个平面数据库（可以将其制成 csv），它将是许多 AI 机器人问题的良好起点，然后它可以根据需要 RAG 论坛的其余部分。我不想将 csv 发布到论坛上。

pfaffman · 2024 年12 月 4 日 15:04

也许您可以将 CSV 数据放在 AI 可以看到但普通用户无法看到的地方。（这可能是不可行的，或者是个糟糕的主意，但也许会奏效）。

Falco · 2024 年12 月 4 日 15:37

是的，您可以直接将文件上传到某个角色。

markschmucker · 2024 年12 月 5 日 21:55

听起来很完美，但我不知道如何上传到 persona。或者我应该将文件的链接放在 persona 的系统提示中？

Falco · 2024 年12 月 5 日 21:57

当您转到“管理”>“插件”>“AI”>“角色”>“新建角色”时，“上传”部分将位于最底部：

此外，根据 CSV 中的数据量，您也可以直接将其粘贴到系统提示中。

markschmucker · 2024 年12 月 6 日 10:14

我没有那个选项。3.4.0.beta3-dev。

joo · 2024 年12 月 6 日 10:29

我目前使用的是最新版本，也看不到那个选项。

sam · 2024 年12 月 6 日 20:23

嵌入是否已配置？

sam · 2024 年12 月 7 日 00:52

2 篇帖子已拆分为新主题：Gemini Embeddings 无法正常工作

话题		回复	浏览量
How to prevent community content from being used to train LLMs like ChatGPT? Community Building	69	5054	2023 年7 月 16 日
RAG capacities of discourse-ai Support ai	6	496	2024 年9 月 19 日
Training a model on a site's posts? Feature ai , ai-bot	2	329	2024 年9 月 9 日
Integrating GPT3-like bots? Development	61	5001	2023 年5 月 10 日
Advice on a support bot for a technical support forum (Discourse AI vs Discourse Chatbot) General ai , ai-bot	57	4588	2026 年4 月 29 日

AI机器人可以在社区数据上进行培训吗

相关话题