How to prevent community content from being used to train LLMs like ChatGPT?

I steadfastly object to the continued moderation of my attempts to engage with this topic in a deep and serious way. The slow mode is a joke, having to wait an hour each time.

There are a ton of wandering posts by many users that remain untouched. Consistency? No. Bias? Hmm, well, that’s how it seems to this user thus far. I don’t take things personally, but the geriatric pace of moderation stifles discussion, to say the least.

I’m just trying to elevate this most serious and egregious situation at hand, and finally we get an excellent and serious post from @simon.

Superb, and a bullseye: exactly what the OP and others needed to hear first. Many options are buried in the extensive admin section, and I for one hadn’t noticed this feature/option before. Now I can test it out. It would be nice if it could carry more custom information than the default card. Maybe custom text handles that, does anyone know?

Many thanks. :+1:

If this article is right:

You will need to remove your site from the open internet / block Google / enable login_required.

7 Likes

It’s worth noting that there’s absolutely nothing that requires a crawler to obey robots.txt, and faking a user-agent is trivial. There are no laws governing these things. No amount of urgency or seriousness will change this. If you’re concerned about your data being used, all you can do is take your site private and wait for the various legal proceedings regarding training data to pan out.
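For what it’s worth, this is roughly what asking the major AI crawlers to stay away looks like in robots.txt. A minimal sketch only: the user-agent tokens below are the ones the vendors have publicised at the time of writing, so verify them against current documentation, and note that Discourse generates its robots.txt for you (via the crawler-related site settings, I believe) rather than you editing the file by hand. As said above, compliance is entirely voluntary.

```text
# Advisory only – a crawler can ignore these rules or spoof its user agent.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```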

11 Likes

I expect sites that rely on advertising will see a drop in revenue and we’ll start seeing a lot more content behind paywalls. The quality of the free and open part of the internet will be diluted.

Discourse could actually capitalize on this trend by setting up a subscription service for its hosted customers.

6 Likes

There is already a subscriptions plugin that may be available on some hosted tiers. Self-hosted sites can already adopt it.

The rub with hiding content is that it will hit your SEO, so it may depend on what your new-user funnel is.

I personally rely on search to pick up new users, so I put only a little content behind an account wall.

For many sites you still need to be discovered!

2 Likes

You seem to me to have two related concerns, @agemo, one being the use of AI in software, and the other being that ordinary people’s interactions on the web may be used to train AI. You are quite concerned about those things, and you want them not to happen.

I can understand that. I expect these concerns are shared by many.

Let me say, there are many things in the world which I am concerned about, and would like to have them be different - but I don’t bring them up here because they are not actionable by people here, or by Discourse as an offering. If I did keep bringing them up, it might be annoying and I might find myself moderated.

Perhaps you feel you are not being heard. But I think what’s really happening in this thread is that the others in this thread believe your concerns are not actionable, not actionable here or by them. Maybe something can be done, but it can’t be done by individuals here. Maybe the answer is a mass movement, a campaign, or a revolution - but I think it’s fair if the moderators here feel that such things are off-topic here.

3 Likes

It’s happened. It is the thing we can’t change. AI is unleashed now, and that is the event. I never suggested we could roll back time.

The mods thought they understood this topic; they don’t, but they keep moderating my contributions. I’m bored of talking about the moderation instead of the solutions, but they keep doing it, to other users too. Maybe they don’t see the value, or are too comfy.

The reality is that since my interventions to try to pull this topic into a more solution-based focus, despite the clumsy moderation, there has been some yield.

You might think you can’t do anything, but looking at it and recognising that:

a) it’s serious
b) it’s urgent
c) it needs focus

is a start, and so is recognising that you have control over your reaction, but not over the event, which has happened, is now in the past, and affects the present every day into the foreseeable future.

There is no solution on offer other than crudely reusing solutions derived for other problems, and so it breaks the proposition, for the AI event is forcing people to assume positions that break their entire effort up to the point of the event.

It’s very natural not to want to be part of something that is a direct threat and will leverage your content in direct competition against all your efforts up to that point, for starters, but it doesn’t stop there.

I’m going to summarise the whole thing with one simple rhetorical question (you can argue whether it’s rhetorical or not, but you’ll have to acknowledge AI).

Why would anyone even consider deploying an instance of Discourse (or similar) now?

There are so many concerns with this issue. Sometimes one topic (the OP) exemplifies the whole universe of consequences of the problem, and this is certainly one of those. It should not be narrowed, especially when Discourse has no real solution to offer; by its very nature, in this context, the topic is wide open, or else it’s “sorry, since there is no solution for this, the topic is now closed”. Pick one.

Open or close it.

Are we getting this?

This is the point. If there is an acknowledgement that there is no will to address the issue, then say so; otherwise this topic remains and needs to be very wide. That is the level of moderation attenuation required on this subject, because it’s virgin territory.

If there happened to be a checkbox or two in settings that fixed it, we would all go home, but there really isn’t one, yet. There might be some stopgaps, but they are not in the realm of “SOLVED”; on that I think everyone agrees.

Since no solutions have been built in direct response to the concerns of the OP, the issue of AI, and how an admin needs to manage it, my points stand.

If there are, please point them out: post them here, or the solution under development, or whatever. Are we getting this?

Therein lies a responsibility: of a developer, of a user, and of the existing relationship that makes it all work. So we discuss it. Over and over if required.

What I see is zero acknowledgement of how this breaks things until the last couple of posts, even though the OP started back in May, and those posts I celebrated but was moderated for. That’s a joke. AI is actually breaking the net. Again, why bother setting up a Discourse or similar platform? If we can’t discuss it in a serious, genuine, robust manner that fits the demands of the subject, then there is your answer.

The market is moving; all the money, eyeballs, and mania are falling head over heels into OpenAI & Co’s pockets. I see developers here, like everywhere else, stepping up and choosing complete adoption and integration of AI with zero circumspection. ZERO!

This is why such an OP remains cornering and frustrating. Breaking your Discourse is the only surefire solution. Which is not a solution. It’s virtually game over.

My analogy for how developers are reacting to AI, rhetorically: nearly all seem to be busy building all kinds of cool buckets to collect the lava from the volcano eruption (the eruption being the event). The lava is a gift from the volcano god; it brings heat and light, yes, but it also burns through things very fast, and without the bucket you cannot control the bit you have. The bucket hides this fact. It seems safe, cool, and neat, for now.

No. That would not be correct. I’ve outlined why the moderators have got it wrong and how it is far more serious than they countenance, and this could be rather disappointingly symptomatic of the top-down position on the relationship between Discourse and AI… it feels like it’s either “meh” or a shoulder shrug, but feelings can be wrong, so prove me wrong with points of fact.

Some people have understood my points, or at least looked harder at the OP, and made some better contributions, for which I am thankful, as they led me down a few potential paths towards a very crude multi-point solution. It is still a work in progress, and it would require some recognition by the devs, to better map it to the demands AI has raised, to make it more feasible as a live, but still stopgap, measure.

It’s been a tough decade or so for forums online, from traffic to revenue declines. The implications of this event break those charts of dismay, and for many operators may spell nothing short of a final doom event; they’ll simply shut up shop.

I want to have discussions on certain topics, so I run forums where people can meet and discuss those topics. I chose Discourse, but in my opinion any other solution on the open web would have the same risks and the same results. I want my discussions to be on the open web and to show up in search results.

People can and do have interactions on privacy-conscious platforms like Telegram and Signal, but those are different kinds of offerings built for different reasons. It’s possible that Discourse chat might offer some of what you want - as it happens, I have no interest in that.

2 Likes

Yeah, this is a very wide rhetorical point, and I think you missed the implication.

Let me go out on a limb and say that your logical process was conducted at a time before the current iteration of AI/ChatGPT, and that was the old normal operating space.

People in the same space today have this new attention-grabbing, paradigm-shifting show in town (AI) that appears to promise a seemingly infinite x-factor in terms of potential and consequences, in equal measure.

All previous activity and assumptions informing past decisions become null and void if the AI has had access to it all, and there is enough anecdotal evidence online to suggest the data scraping to feed the AI has been going on for 3, maybe 5 or more years; in the case of DeepMind, maybe as early as 2014 when Google purchased it (maybe a forensic sifting of log samples could prove this, or maybe it’s been obscured to prevent that). If you factor this in as even relatively true, you can see the problem is stark in terms of technical lead-in times.

All the content may already be scraped and it may be too late, but I’ve factored that into my concerns and representations. I’m only making note of it here because, as I stated, there is no time-machine solution, only the power of circumspection to inform present and future solutions.

Sorry, I don’t understand any of that.

The implication of the question was that there is now a compelling new choice in town that is seen as a solution above all others for many needs, that being AI (ChatGPT-powered tech).

Are you saying that no-one would choose to setup a forum because LLMs offer people everything they want from forums? (That isn’t the topic of this thread, BTW.)

(If you want people to do something for you, I think you need to be clear in what you think the problem is, and what you think they can do for you. I’m seeing that you care deeply, but I don’t know what you want. As with anyone, I have limited time and energy, so I’m not going to work hard at figuring out your thoughts.)

Edit to add:

the present "AI" summary of this thread, for posterity

A forum discussion on preventing community content from being used to train language models like ChatGPT centers around making content private by requiring login, blocking scrapers via robots.txt or the blocked crawler user agents Discourse setting, or removing the site from the open internet altogether. While some disagree with preventing the use of public data and believe it is an inevitable part of progress, others argue that content creators should have more control over how their work is used. The discussion explores the philosophical issues around ownership of information and creativity as well as provides practical tips for mitigating the use of data by AI systems.

4 Likes

Suddenly there is a new reason not to choose the old ways, one that is hard for most to resist.


I’m not the OP, but I empathise with the OP even more now.

  1. Take the OP seriously, which no one was doing, and

  2. Ask why. As with all events like this, there are profoundly positive and negative repercussions, and I don’t detect any serious recognition of the downsides, only a bias towards the perceived upsides, and thus there is no activity to evaluate and mitigate, i.e. to support those affected, at a platform level.

Once again, I’m not the OP, but the OP’s problem is every public-facing Discourse’s problem. It’s also a systemic, existential threat to the net, and it’s platform-agnostic, or

it’s nothing more than “cool new toys” to pragmatically play with.

The latter is not serious in this context. It’s purposefully blind. I personally find it irresponsible, which is what makes the AI paradigm even more dangerous.

Single topics won’t solve this; it needs leadership. I started out by mentioning @ sam and @ codinghorror, and that’s when all the moderation cataclysm began, done once, not abused, but you know, other people think better and know best. Wait till the AI really gets its hooks in. :melting_face:

Bottom line: This issue needs to be taken very seriously.

So it may need its own category. It’s that huge.

So far, apart from the solution that is not a solution but a breaking, if the strategy is to lock the door with the login_required setting, then in that scenario the way to mitigate the negative traffic hit, if you rely on search traffic, is to have something to see, but not everything.

WP frontend / Discourse login_required site
(more work, more hosting costs, support etc.)

Things that would also help but aren’t built with exactly this problem in mind:

Published Pages, if developed with a dedicated listing page and some options to configure, could act as a bridging landing page where users can see some public-facing content with a “register to read more” prompt:

– allow a published page listing on its own page at /pub (and make it the home page)
– allow published pages to be listed on the login_required page
– allow a custom category or the latest list on the login_required page

I only found Published Pages as a feature a couple of days ago while trying to find a solution to this problem, and IIRC even before the AI conundrum users had requested a similar listing feature for published pages.

A more configurable, purposeful treatment of published pages is, to my mind, preferable to a whole WP frontend bolt-on, if you need to resolve some public-facing connection point.

List Topic First Post only

Show only the first post of any topic and require login to read the replies. I’ve seen similar suggested at least once and given the thumbs down, but in this context it requires re-evaluating.

Also, regard these suggestions as an incomplete list, merely potential band-aids for part of, and not all of, the problem.


Meanwhile I’ll revert to terrorising this topic with loads of feelz :slight_smile: How are we all feeling about ChatGPT and other LLMs and how they'll impact forums?

1 Like

From your last reply, I see that we’re coming to more or less the same conclusion of dealing with the issue by having a mix of public and private content. I wrote the post below before reading your reply. I’ll publish it anyway to try and help make the case.

I take the OP seriously, both because it asks a legitimate question, and because I may share a concern with its author about how LLMs are going to affect the internet. If I understand your concerns correctly, I think I agree with you that we’re witnessing a fundamental change in how the internet works - instead of people visiting sites directly, LLMs are going to become the go-to interface for interacting with the public part of the internet. There are all sorts of implications to this that probably can’t be usefully dealt with here.

What can be addressed here is the question about how to prevent Discourse content from being used to train LLMs. Discourse provides a few possible approaches.

The first approach is a weak one - keep the site public and try to block any user agents that are being used to scrape data with the blocked crawler user agents site setting. Along with doing this, you could get involved with legal challenges against the tech companies that are scraping the data.
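To make that concrete: blocked crawler user agents is a list in the admin site settings, and the tokens people commonly add for this at the time of writing are things like GPTBot, CCBot, and Google-Extended - treat those as examples to verify against each vendor’s current documentation, not an exhaustive list. I believe Discourse also reflects blocked agents in the robots.txt it generates, and a few lines of Python’s standard library will sanity-check what that file tells a given crawler (the forum URL below is a placeholder):

```python
# Check what a site's robots.txt tells specific crawler user agents.
# This only verifies the advisory rules; it says nothing about whether
# a crawler actually honours them or spoofs its user agent.
from urllib.robotparser import RobotFileParser

FORUM = "https://forum.example.com"  # placeholder - use your own site

parser = RobotFileParser(f"{FORUM}/robots.txt")
parser.read()  # fetches and parses the live robots.txt

# Example tokens only - verify against each vendor's documentation.
agents = ["GPTBot", "CCBot", "Google-Extended", "*"]

for agent in agents:
    allowed = parser.can_fetch(agent, f"{FORUM}/latest")
    print(f"{agent:16} {'allowed' if allowed else 'disallowed'} for {FORUM}/latest")
```

Of course, as noted earlier in the thread, this only covers crawlers that identify themselves honestly and choose to obey the rules.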

The stronger approach is to make all, or parts of your site private. This can be done with the login required site setting, or with category security settings.

The main objection I’m seeing to the above approach is that people want their sites to be discoverable by search engines. I suspect there are ways of dealing with this. The easiest would be to have a public, SEO-optimized blog associated with a private Discourse forum. A more complex solution would be for Discourse to provide functionality that allowed part of a topic’s OP to be public, while the bulk of the topic could only be accessed by members of a Discourse group. This would be similar to how services like Substack deal with content that’s only available to paid subscribers - they display some content that’s accessible to anonymous users and crawlers, then display a signup CTA.

So I guess that along with my concern about how LLMs are going to impact the internet, I’m seeing an opportunity to look at new ways of funding content creators.

6 Likes

Where is this setting at?

2 Likes

Your question is “why would anyone produce anything that could be put on the public internet?”

When you ask the question on the public internet, no one who shares your view can answer your question.

5 Likes

This topic is draining; the AI-based summary covers the topic just fine. Scroll to the top and click it.

Closing for the next 3 months

9 Likes

This topic was automatically opened after 90 days.