How to prevent community content from being used to train LLMs like ChatGPT?

If this article is right

You will need to remove your site from the open internet / block Google / enable login_required.


It’s worth noting that there’s absolutely nothing that requires a crawler to obey robots.txt and faking a user-agent is trivial. There are no laws governing these things. No amount of urgency or seriousness will change this. If you’re concerned about your data being used, all you can do is take your site private and wait for various legal proceedings regarding training data to pan out.
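To make the point about robots.txt concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot name `ExampleAIBot` and the forum URL are made up for illustration. It shows that the check is something the crawler runs on itself: a polite crawler gets told "no", while the same crawler claiming to be a browser gets told "yes", and a rude one can skip the check entirely.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking one crawler to stay away from everything.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return what robots.txt *asks* of this user agent; nothing enforces it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical topic URL on a hypothetical forum.
url = "https://forum.example.com/t/some-topic/123"

honest = is_allowed("ExampleAIBot", url)   # crawler identifies itself truthfully
spoofed = is_allowed("Mozilla/5.0", url)   # same crawler with a faked user agent
```

The only thing that changed between the two calls is the string the crawler chose to send, which is why robots.txt is a request rather than a barrier.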


I expect sites that rely on advertising will see a drop in revenue and we’ll start seeing a lot more content behind paywalls. The quality of the free and open part of the internet will be diluted.

Discourse could actually capitalize on this trend by setting up a subscription service for its hosted customers.


There is already a subscriptions plugin that may be available to some tiers on hosted. Self-hosted sites can already adopt it.

The rub with hiding content is that it will hurt your SEO, so whether it’s worth doing may depend on what your new-user funnel is.

I personally rely on search to pick up new users, so I put only a little content behind an account wall.

For many sites you still need to be discovered!


You seem to me to have two related concerns, @agemo, one being the use of AI in software, and the other being that ordinary people’s interactions on the web may be used to train AI. You are quite concerned about those things, and you want them not to happen.

I can understand that. I expect these concerns are shared by many.

Let me say, there are many things in the world which I am concerned about, and would like to have them be different - but I don’t bring them up here because they are not actionable by people here, or by Discourse as an offering. If I did keep bringing them up, it might be annoying and I might find myself moderated.

Perhaps you feel you are not being heard. But I think what’s really happening in this thread is that the others in this thread believe your concerns are not actionable, not actionable here or by them. Maybe something can be done, but it can’t be done by individuals here. Maybe the answer is a mass movement, a campaign, or a revolution - but I think it’s fair if the moderators here feel that such things are off-topic here.


The reality is that since my interventions to try to pull this topic toward a more solution-based focus, and despite the clumsy moderation, there has been some yield.

There is no solution on offer other than crudely applying solutions derived for other problems, and so it breaks the proposition, because the AI event is forcing people to assume positions that break their entire effort up to the point of the event.

It’s very natural not to want to be part of something that is a direct threat and will leverage your content in direct competition against all your efforts up to that point, for starters, but it doesn’t stop there.

I want to have discussions on certain topics, so I run forums where people can meet and discuss those topics. I chose Discourse, but in my opinion any other solution on the open web would have the same risks and the same results. I want my discussions to be on the open web and to show up in search results.

People can and do have interactions on privacy-conscious platforms like Telegram and Signal, but those are different kinds of offerings built for different reasons. It’s possible that Discourse chat might offer some of what you want - as it happens, I have no interest in that.


People in the same space today have this attention-grabbing, paradigm-shifting new show in town (AI) that appears to promise a seemingly infinite x-factor in terms of potential and consequences, both in equal measure.

All previous activity and assumptions informing past decisions become null and void if the AI has had access to it all, and there is enough anecdotal evidence online to suggest that the data scraping to feed the AI has been going on for 3, maybe 5 or more years, and in the case of DeepMind maybe as early as 2014, when Google purchased it (maybe a forensic sifting of log samples could prove this, or maybe it’s been occulted to prevent this). If you take this to be relatively true, you can see how stark the problem is in terms of technical lead-in times.

All the content may already be scraped and it’s too late, but I’ve factored that into my concerns and representations, and I’m only making note of it here because, as I stated, there is no time machine solution here, only the power of circumspection to inform present and future solutions.

The implication of the question was that there is now a compelling new choice in town that is seen as a solution above all others for many needs, that being AI (ChatGPT-powered tech).

Are you saying that no-one would choose to set up a forum because LLMs offer people everything they want from forums? (That isn’t the topic of this thread, BTW.)

(If you want people to do something for you, I think you need to be clear in what you think the problem is, and what you think they can do for you. I’m seeing that you care deeply, but I don’t know what you want. As with anyone, I have limited time and energy, so I’m not going to work hard at figuring out your thoughts.)

the present "AI" summary of this thread, for posterity

A forum discussion on preventing community content from being used to train language models like ChatGPT centers around making content private by requiring login, blocking scrapers via robots.txt or the blocked crawler user agents Discourse setting, or removing the site from the open internet altogether. While some disagree with preventing the use of public data and believe it is an inevitable part of progress, others argue that content creators should have more control over how their work is used. The discussion explores the philosophical issues around ownership of information and creativity as well as provides practical tips for mitigating the use of data by AI systems.


So far, apart from the solution that is not a solution but a breaking: if the strategy is to lock the door with the login_required setting, then the way to mitigate the negative traffic hit, if you rely on search traffic, is to have something to see but not everything.

WP frontend / Discourse login_required site
(more work, more hosting costs, support etc.)

Things that would also help but aren’t built with exactly this problem in mind:

Published Pages, if developed with a dedicated listing page and some options to configure, could act as a bridging landing page where users can see some public front content with a "register to read more" prompt

– allow published page listing on own page /pub (make home page)
– allow published pages listed on login_required page
– allow custom category or latest on login_required page

I only found Published Pages a couple of days ago as a feature while trying to find a solution to this problem, and iirc even before the AI conundrum, users had requested a similar listing feature for published pages.

A more configurable, purpose-built treatment of published pages is, to my mind, preferable to a whole WP frontend bolt-on, if you need some connection point that is public facing.

List Topic First Post only

Show only the first post of any topic and require login to read comments. I’ve seen something similar suggested at least once and given the thumbs down, but in this context it requires re-evaluating.

Also regard these suggestions as an incomplete list, merely potential band-aids for part of, and not all of the problem.

From your last reply, I see that we’re coming to more or less the same conclusion of dealing with the issue by having a mix of public and private content. I wrote the post below before reading your reply. I’ll publish it anyway to try and help make the case.

I take the OP seriously, both because it asks a legitimate question, and because I may share a concern with its author about how LLMs are going to affect the internet. If I understand your concerns correctly, I think I agree with you that we’re witnessing a fundamental change in how the internet works - instead of people visiting sites directly, LLMs are going to become the go-to interface for interacting with the public part of the internet. There are all sorts of implications to this that probably can’t be usefully dealt with here.

What can be addressed here is the question about how to prevent Discourse content from being used to train LLMs. Discourse provides a few possible approaches.

The first approach is a weak one - keep the site public and try to block any user agents that are being used to scrape data with the blocked crawler user agents site setting. Along with doing this, you could get involved with legal challenges against the tech companies that are scraping the data.
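As a rough illustration of why this approach is weak, here is a sketch of how blocking by user agent works in principle. This is not Discourse's actual implementation. The matching rule (case-insensitive substring) and the list entries are assumptions made for the example.

```python
from typing import Optional, Sequence

# Hypothetical entries of the kind you might put in a blocked user agents list.
BLOCKED_AGENTS = ["ExampleAIBot", "example-scraper"]


def is_blocked(user_agent_header: Optional[str],
               blocked: Sequence[str] = BLOCKED_AGENTS) -> bool:
    """Assumed rule: block when any list entry appears in the User-Agent header."""
    ua = (user_agent_header or "").lower()
    return any(entry.lower() in ua for entry in blocked)


# A scraper that announces itself is refused...
announced = is_blocked("Mozilla/5.0 (compatible; ExampleAIBot/1.0)")
# ...but the same scraper sending a browser-like header is not.
disguised = is_blocked("Mozilla/5.0 (X11; Linux x86_64)")
```

The server only ever sees the header the client chooses to send, so a list like this stops well-behaved crawlers and nobody else.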

The stronger approach is to make all, or parts of your site private. This can be done with the login required site setting, or with category security settings.

The main objection I’m seeing to the above approach is that people want their sites to be discoverable by search engines. I suspect there are ways of dealing with this. The easiest would be to have a public SEO-optimized blog, associated with a private Discourse forum. A more complex solution would be for Discourse to provide functionality that allowed part of a topic’s OP to be public, while the bulk of the topic could only be accessed by members of a Discourse group. This would be similar to how services like Substack deal with content that’s only available to paid subscribers - they display some content that’s accessible to anonymous users and crawlers, then display a signup CTA.
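To sketch what the "part of the OP is public" idea could look like, here is a small example. Discourse does not offer this today as far as this thread establishes, and the function name, the word limit, and the CTA text are all invented for illustration.

```python
# Hypothetical call-to-action text shown to anonymous visitors and crawlers.
SIGNUP_CTA = "Log in or sign up to read the rest of this topic."


def render_for_viewer(first_post: str, is_member: bool, teaser_words: int = 40) -> str:
    """Members get the whole first post; everyone else gets a teaser plus a CTA."""
    if is_member:
        return first_post
    words = first_post.split()
    teaser = " ".join(words[:teaser_words])
    if len(words) > teaser_words:
        teaser += " ..."  # mark that the post continues behind the login wall
    return f"{teaser}\n\n{SIGNUP_CTA}"


post = "one two three four five"
member_view = render_for_viewer(post, is_member=True)
anon_view = render_for_viewer(post, is_member=False, teaser_words=3)
```

The trade-off is the same one discussed above: whatever ends up in the teaser is still on the open web and can be scraped, so the teaser length decides how much you give away in exchange for being discoverable.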

So I guess that along with my concern about how LLMs are going to impact the internet, I’m seeing an opportunity to look at new ways of funding content creators.


Where is this setting at?


