How to prevent community content from being used to train LLMs like ChatGPT?

Yeah, this is a very broad rhetorical point, and I think you missed the implication.

Let me go out on a limb and say your reasoning was formed before the current iteration of AI/ChatGPT, back when the old normal was the operating space.

People in the same space today face a new, attention-grabbing, paradigm-shifting show in town (AI) that appears to promise a seemingly infinite x-factor in potential and consequences, in equal measure.

All previous activity, and the assumptions informing past decisions, become null and void if the AI has had access to it all. There is enough anecdotal evidence online to suggest the data scraping to feed the AI has been going on for three, maybe five or more years; in the case of DeepMind, maybe as early as 2014, when Google purchased it (perhaps a forensic sifting of log samples could prove this, or maybe it has been obscured to prevent that). If you take this to be broadly true, you can see the problem is stark in terms of technical lead-in times.

All the content may already have been scraped, and it may be too late, but I've factored that into my concerns and representations. I'm only noting it here because, as I stated, there is no time-machine solution, only the power of circumspection to inform present and future solutions.

Sorry, I don’t understand any of that.

The implication of the question was that there is now a compelling new choice in town, seen as a solution above all others for many needs: AI (ChatGPT-powered tech).

Are you saying that no one would choose to set up a forum because LLMs offer people everything they want from forums? (That isn't the topic of this thread, BTW.)

(If you want people to do something for you, I think you need to be clear in what you think the problem is, and what you think they can do for you. I’m seeing that you care deeply, but I don’t know what you want. As with anyone, I have limited time and energy, so I’m not going to work hard at figuring out your thoughts.)

Edit to add:

the present "AI" summary of this thread, for posterity

A forum discussion on preventing community content from being used to train language models like ChatGPT centers around making content private by requiring login, blocking scrapers via robots.txt or the blocked crawler user agents Discourse setting, or removing the site from the open internet altogether. While some disagree with preventing the use of public data and believe it is an inevitable part of progress, others argue that content creators should have more control over how their work is used. The discussion explores the philosophical issues around ownership of information and creativity as well as provides practical tips for mitigating the use of data by AI systems.


Suddenly there is a new reason not to choose the old ways that is hard for most to resist.

I'm not the OP, but I empathise with the OP even more now.

  1. Take the OP seriously, which no one was doing.


  2. Recognise why: events like this have profoundly positive and negative repercussions, and I don't detect any serious recognition of the downsides, only a bias toward the perceived upsides. Consequently there is no activity to evaluate and mitigate them, i.e. to support those affected, at a platform level.

Once again, I'm not the OP, but the OP's problem is every Discourse's problem (every public-facing one, that is). It's also a systemic existential threat to the net, and it's platform-agnostic; or

it’s nothing more than “cool new toys” to pragmatically play with.

The latter is not serious in this context. It's purposefully blind. I personally find it irresponsible, which is what makes the AI paradigm even more dangerous.

Single topics won't solve this; it needs leadership. I started out by @-mentioning sam and codinghorror, and that's when all the moderation cataclysm began. Done once, not abused, but you know, other people think better and know best. Wait till the AI really gets its hooks in. :melting_face:

Bottom line: This issue needs to be taken very seriously.

So it may need its own category. It's that huge.

So far, apart from the solution that is not a solution but a breaking of things: if the strategy is to lock the door with the login_required setting, then, to mitigate the negative traffic effects (if you rely on search traffic), the goal is to have something to see, but not everything.

WP frontend / Discourse login_required site
(more work, more hosting costs, support etc.)

Things that would also help but aren’t built with exactly this problem in mind:

Published Pages, if developed with a dedicated listing page and some configuration options, could act as a bridging landing page where users see some public front content with a "register to read more" prompt:

– allow published page listing on own page /pub (make home page)
– allow published pages listed on the login_required page
– allow a custom category or latest on the login_required page

I only found the Published Pages feature a couple of days ago while trying to find a solution to this problem, and IIRC, even before the AI conundrum, users had requested a similar listing feature for published pages.

A more configurable, purpose-built treatment of published pages seems to me preferable to a whole WP frontend bolt-on, if some public-facing connection point is needed.

List Topic First Post only

Show only the first post of any topic and require login to read the replies. I've seen similar suggested at least once and given the thumbs-down, but in this context it requires re-evaluating.

Also, regard these suggestions as an incomplete list, merely potential band-aids for part of the problem, not all of it.

Meanwhile I’ll revert to terrorising this topic with loads of feelz :slight_smile: How are we all feeling about ChatGPT and other LLMs and how they'll impact forums?


From your last reply, I see that we’re coming to more or less the same conclusion of dealing with the issue by having a mix of public and private content. I wrote the post below before reading your reply. I’ll publish it anyway to try and help make the case.

I take the OP seriously, both because it asks a legitimate question, and because I may share a concern with its author about how LLMs are going to affect the internet. If I understand your concerns correctly, I think I agree with you that we’re witnessing a fundamental change in how the internet works - instead of people visiting sites directly, LLMs are going to become the go to interface for interacting with the public part of the internet. There are all sorts of implications to this that probably can’t be usefully dealt with here.

What can be addressed here is the question about how to prevent Discourse content from being used to train LLMs. Discourse provides a few possible approaches.

The first approach is a weak one - keep the site public and try to block any user agents that are being used to scrape data with the blocked crawler user agents site setting. Along with doing this, you could get involved with legal challenges against the tech companies that are scraping the data.
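To make the weak approach concrete, a robots.txt along these lines asks known AI-training crawlers to stay away. The user-agent strings below (GPTBot for OpenAI, CCBot for Common Crawl, Google-Extended for Google's AI products) are the publicly documented ones at the time of writing; the list changes over time, and compliance is entirely voluntary, which is exactly why it is a weak approach and why pairing it with the blocked crawler user agents site setting makes sense:

```
# robots.txt - ask known AI-training crawlers not to scrape the site.
# Compliance is voluntary; well-behaved bots honour it, others may not.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everything else (including ordinary search crawlers) stays allowed.
User-agent: *
Allow: /
```

Discourse generates its own robots.txt, so in practice you would express the same intent through the blocked crawler user agents setting rather than editing the file by hand.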

The stronger approach is to make all, or parts of your site private. This can be done with the login required site setting, or with category security settings.

The main objection I'm seeing to the above approach is that people want their sites to be discoverable by search engines. I suspect there are ways of dealing with this. The easiest would be to have a public, SEO-optimized blog associated with a private Discourse forum. A more complex solution would be for Discourse to provide functionality that allowed part of a topic's OP to be public, while the bulk of the topic could only be accessed by members of a Discourse group. This would be similar to how services like Substack deal with content that's only available to paid subscribers: they display some content that's accessible to anonymous users and crawlers, then display a signup CTA:
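The partial-public-OP idea above could be sketched as follows. This is not an existing Discourse feature; the function names and the paragraph-count cutoff are invented here purely to illustrate the mechanism of serving anonymous visitors (and crawlers) a teaser plus a signup prompt:

```python
# Hypothetical sketch of the "partial public OP" idea: anonymous
# visitors and crawlers see only the first few paragraphs of a
# topic's first post, followed by a login prompt. All names here
# are invented for illustration, not part of Discourse.

def split_first_post(text: str, public_paragraphs: int = 2) -> tuple[str, str]:
    """Split a post into a public teaser and a members-only remainder."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    teaser = "\n\n".join(paragraphs[:public_paragraphs])
    gated = "\n\n".join(paragraphs[public_paragraphs:])
    return teaser, gated

def render_for_anonymous(text: str) -> str:
    """Render what a logged-out visitor (or crawler) would be served."""
    teaser, gated = split_first_post(text)
    if gated:
        return teaser + "\n\n[Log in to read the rest of this topic]"
    return teaser
```

The design choice mirrors the Substack pattern described above: the teaser stays indexable for search engines, while the gated remainder never reaches anonymous requests, and therefore never reaches scrapers that rely on public HTTP access.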

So I guess that along with my concern about how LLMs are going to impact the internet, I’m seeing an opportunity to look at new ways of funding content creators.


Where is this setting at?


Your question is “why would anyone produce anything that could be put on the public internet?”

When you ask the question on the public internet, no one who shares your view can answer your question.


This topic is draining; the AI-based summary covers it just fine. Scroll to the top and click it.

Closing for the next 3 months


This topic was automatically opened after 90 days.