How to prevent community content from being used to train LLMs like ChatGPT?

OpenAI have made use of a few datasets for training their models. The dataset that seems most likely to include Discourse content is a filtered version of the Common Crawl dataset. See section 2.2 of this document for details: https://arxiv.org/pdf/2005.14165.pdf. Common Crawl use the CCBot/2.0 user-agent string when crawling a site.

If you would like to keep your Discourse site accessible to the public, but prevent its content from being added to the Common Crawl dataset in the future, you can add CCBot to your Discourse site’s blocked crawler user agents setting. Note that blocking the Common Crawl user agent could have a downside, as described in the article "How to Block OpenAI ChatGPT From Using Your Website Content":

Many datasets, including Common Crawl, could be used by companies that filter and categorize URLs in order to create lists of websites to target with advertising.

Discourse’s use of the blocked crawler user agents setting can be found in lib/crawler_detection.rb in the discourse/discourse repository on GitHub.

Note that Common Crawl respect rules in the robots.txt file, so it could also be blocked by adding the following rule to the file:

User-agent: CCBot
Disallow: /

ChatGPT plugins use the ChatGPT-User user agent when making requests on behalf of users. According to OpenAI's documentation, this user agent is not used for crawling the web to create training datasets: https://platform.openai.com/docs/plugins/bot. It can also be blocked by adding it to the blocked crawler user agents setting, or by adding a Disallow rule for it to the robots.txt file.
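If you go the robots.txt route, it can be worth sanity-checking the rules before relying on them. Here is a minimal sketch using Python's standard `urllib.robotparser`. The `ROBOTS_TXT` contents, the `is_allowed` helper, and the example topic path are my own assumptions for illustration, and Python's parser is not guaranteed to interpret the rules exactly the way Common Crawl or OpenAI do, so treat this as a local check only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: the CCBot rule from above plus an
# equivalent rule for the ChatGPT-User agent.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Return True if Python's robots.txt parser would let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "/t/some-topic/123" is a made-up topic path used only for illustration.
    for agent in ("CCBot/2.0", "ChatGPT-User", "Mozilla/5.0"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, "/t/some-topic/123") else "blocked"
        print(f"{agent}: {verdict}")
```

Keep in mind that robots.txt is only a request. It works for crawlers that choose to respect it, which Common Crawl says it does.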

As others have noted, the most reliable way to prevent your site from being used to train LLMs would be to prevent anonymous access to the site by enabling the login required site setting. To further harden the site, steps could be taken to increase the likelihood that users on your site are human, and not bots. One possible approach would be to integrate a service like Gitcoin Passport with the site’s authentication system. I believe that an open source Gitcoin Passport plugin for Discourse is going to be developed soon.

There may be other, less technical ways of increasing the likelihood that users on the site are human. For example, the site could be set to invite only, and steps could be taken to make sure you only invite users you have reason to believe are human.

I find the philosophy behind all this super interesting, but I’m not going to get into it in this topic.
