How to prevent community content from being used to train LLMs like ChatGPT?

Is there agreement that making a category essentially private is a sure way to block not only ordinary bots but also the LLM crawlers, or let’s call 'em AI bots?

Honestly, from engaging with the issue in at least one topic and also searching around on ChatGPT, Discourse, like many other software offerings out there, is not taking the threat and destructive aspects of ChatGPT seriously, IMHO. There needs to be some serious thinking about offering support and features for site owners and admins who do not wish to use any AI.

ChatGPT, and all that it is synonymous with, is one of those situations where the fuse has been lit on both ends. :melting_face:

That is a pretty surefire way, yes.
Completely nefarious actors will still be able to register of course, but it should remove all legitimate crawlers.

Note: I deleted your comment where you tagged a few cofounders, that seems excessive.

Time tells us it’s not excessive. Heads need to wake up. I see a bias that is engendering a big blind spot. This is an industry-wide observation too, but afaict Discourse seems no different.

If the only option is to make your entire forum private again, eh, the “market” has changed, not just one way but in many, many ways, so fundamentally that it needs to be countenanced at some level operationally.

What’s excessive is ChatGPT and its effects. Rapacious doesn’t describe the half of what’s going on, everywhere.

This fundamentally undermines every single forum and all human-created content. You might be comfy now playing with the niceties and philosophical what-ifs, but that time has passed. The thing is in the wild now. Decisions need to be made by everyone with a toe in the world wide web waters.

What I said is excessive (and what I deleted) is pinging two cofounders/ceo for followup a mere 17 hours after asking a purely technical question.

Your fears about LLMs are real and understandable even if I disagree with them.

I understand that, but you fail to understand the urgency: a technical question’s answer has profound outcomes and consequences that are anything but technical in human terms.

So many implications yet everyone sleepwalking, indicative of the lack of concern at all levels.

Thanks for that answer.

Is all we’ve got a hammer to crack a nut, or is the nut actually a zero-point infinity nut and our hammer really a figment of a feather’s imagination?

Does that make sense? :wink:

I think you understand.

If your site allows anonymous users to read information, then you have no control over who gets that information or what they will do with it. My understanding is that Google just changed their policy to say that anything they can read, they can use for their AI.

If your site allows logged in users to read your site you have no control over what those users will do with it.

If your site allows users to log in, you don’t necessarily know that the person using the credentials is the person who created the account. If you want to be sure that no one can use your data in an AI, then you can simply unplug its network connection.

There is a small amount of control when using a reverse proxy, until they change or fake their user agent (blocking by IP address is also possible, but they use wide ranges of addresses, so that way is hard and rocky).
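
For what it's worth, the kind of user-agent filtering a reverse proxy can do might look something like the sketch below. It is written as Python WSGI middleware purely for illustration, not as a real proxy config. The names `BLOCKED_AGENT_SUBSTRINGS`, `is_blocked_agent`, and `block_crawlers` are made up for this example, and the blocklist entries are just sample crawler names. It only helps while the crawler sends an honest user agent.

```python
from typing import Callable, Iterable

# Hypothetical blocklist: sample substrings only, not a complete or official list.
BLOCKED_AGENT_SUBSTRINGS = ("ccbot", "chatgpt-user")


def is_blocked_agent(user_agent: str,
                     blocked: Iterable[str] = BLOCKED_AGENT_SUBSTRINGS) -> bool:
    """Return True if the User-Agent contains a blocked substring (case-insensitive)."""
    ua = user_agent.lower()
    return any(fragment in ua for fragment in blocked)


def block_crawlers(app: Callable) -> Callable:
    """Wrap a WSGI app so requests from blocked user agents receive a 403."""
    def middleware(environ, start_response):
        # WSGI exposes the User-Agent request header as HTTP_USER_AGENT.
        if is_blocked_agent(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

A crawler that sends a browser-like user agent passes straight through this check, which is why this is only a small amount of control.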

Let me know if you manage to develop a magical book that human eyes can see but no camera on earth can photograph

Very curious about this magical tech

As to the forum you are hosting on the Discourse platform, your forum / your rules. Some rules can be automatically enforced others can not (eg: people with blue eyes may not read this forum)

No one is really taking this seriously because, I think, no one wants to admit and grasp the true scale of this event and then have to actually try to do something about it within their domain of control. It’s easier to join the race to the end and incorporate AI into their software, thinking they are performing as the market expects and being on the cutting edge, being vital. This is where the last few decades of excessive moral relativism being allowed free rein at every level enables the great undoing of things, and technology makes it happen at lightning speed, because it’s as if:

everyone has forgotten the reason they are here.

I’m gonna slow this down a bit.

We hear your concerns, we just don’t share them and that’s ok. We can agree to differ. We are making informed decisions. No one is forcing those on you. :slight_smile:

@matenauta exactly

OpenAI have made use of a few datasets for training their models. The dataset that seems most likely to include Discourse content is a filtered version of the Common Crawl dataset. See section 2.2 of this document for details: https://arxiv.org/pdf/2005.14165.pdf. Common Crawl use the CCBot/2.0 user-agent string when crawling a site.

If you would like to keep your Discourse site accessible to the public, but prevent its content from being added to the Common Crawl dataset in the future, you can add CCBot to your Discourse site’s blocked crawler user agents setting. Note that there could be a downside to blocking the Common Crawl user agent, as described in the article “How to Block OpenAI ChatGPT From Using Your Website Content”:

Many datasets, including Common Crawl, could be used by companies that filter and categorize URLs in order to create lists of websites to target with advertising.

Discourse’s use of the blocked crawler user agents setting is here: lib/crawler_detection.rb on the main branch of the discourse/discourse repository on GitHub.

Note that Common Crawl respect rules in the robots.txt file, so it could also be blocked by adding the following rule to the file:

User-agent: CCBot
Disallow: /
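
If you want to sanity-check a rule like the one above before deploying it, here is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `can_crawl` and the `forum.example.com` URL are placeholders invented for this example. This only shows how a well-behaved parser reads the rule; it says nothing about whether a given crawler actually honors it.

```python
from urllib.robotparser import RobotFileParser

# The same rule as above, held in a string so no network access is needed.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL, purely for illustration.
url = "https://forum.example.com/latest"
print("CCBot allowed:", can_crawl(ROBOTS_TXT, "CCBot/2.0", url))
print("Googlebot allowed:", can_crawl(ROBOTS_TXT, "Googlebot", url))
```

With only this rule in the file, the parser should report CCBot as disallowed and leave other user agents unaffected.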

ChatGPT plugins use the ChatGPT-User user agent when making requests on behalf of users. This user agent is not used for crawling the web to create training datasets: https://platform.openai.com/docs/plugins/bot. This user agent could also be blocked by adding it to the blocked crawler user agents setting (or by adding a Disallow rule to the robots.txt file).

As others have noted, the most reliable way to prevent your site from being used to train LLMs would be to prevent anonymous access to the site by enabling the login required site setting. To further harden the site, steps could be taken to increase the likelihood that users on your site are human, and not bots. A possible approach to that would be to integrate a service like Gitcoin Passport with the site’s authentication system. I believe that an open source Gitcoin Passport plug-in for Discourse is going to be developed soon.

There may be other less technical ways of increasing the likelihood that users on the site are human. For example, the site could be set to invite only and steps could be taken to make sure you are only inviting users you have reason to believe are human to the site.

I find the philosophy behind all this super interesting, but I’m not going to get into it in this topic.

I steadfastly object to the continued moderation of my attempts to engage this topic in a deep and serious way. The slow stick is a joke, having to wait an hour each time.

There are a ton of wandering posts by many users that remain. Consistency? No. Bias? Hmmm, well, that’s how it seems to this user thus far. I don’t take things personally, but the geriatric moderation stifles, to say the least.

I’m just trying to elevate this most serious and egregious situation at hand, and finally we get an excellent and serious post from @simon

Superb and bullseye, exactly what the OP and others needed to hear first. Many options are buried in the extensive admin section, and I for one hadn’t noticed this feature/option before. Now I can test it out. It would be nice if it could have more custom information than the default card. Maybe custom text handles that, does anyone know?

Much thanks. :+1:

If this article is right

You will need to remove your site from the open internet / block Google / enable login_required.

It’s worth noting that there’s absolutely nothing that requires a crawler to obey robots.txt and faking a user-agent is trivial. There are no laws governing these things. No amount of urgency or seriousness will change this. If you’re concerned about your data being used, all you can do is take your site private and wait for various legal proceedings regarding training data to pan out.
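
To make the point about user agents concrete: the header is just a string the client chooses. Here is a minimal sketch with Python's standard `urllib.request`; the function name `build_request`, the URL, and the browser-like string are placeholders made up for this example, and the request is only built, never sent.

```python
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build (but do not send) a request carrying a caller-chosen User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Placeholder URL and an arbitrary browser-like string, for illustration only.
req = build_request("https://forum.example.com/latest",
                    "Mozilla/5.0 (X11; Linux x86_64)")

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Nothing on the server side can tell from this header alone whether the client is a browser or a crawler, which is why user-agent blocklists only stop crawlers that identify themselves honestly.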

I expect sites that rely on advertising will see a drop in revenue and we’ll start seeing a lot more content behind paywalls. The quality of the free and open part of the internet will be diluted.

Discourse could actually capitalize on this trend by setting up a subscription service for its hosted customers.

There is already a subscriptions plugin that may be available to some tiers on hosted. Self-hosted sites can already adopt it.

The rub with hiding content is that it will hit your SEO, so it may depend on what your new user funnel is.

I personally rely on search to pick up new users, so I put only a little content behind an account wall.

For many sites you still need to be discovered!

You seem to me to have two related concerns, @agemo, one being the use of AI in software, and the other being that ordinary people’s interactions on the web may be used to train AI. You are quite concerned about those things, and you want them not to happen.

I can understand that. I expect these concerns are shared by many.

Let me say, there are many things in the world which I am concerned about, and would like to have them be different - but I don’t bring them up here because they are not actionable by people here, or by Discourse as an offering. If I did keep bringing them up, it might be annoying and I might find myself moderated.

Perhaps you feel you are not being heard. But I think what’s really happening in this thread is that the others in this thread believe your concerns are not actionable, not actionable here or by them. Maybe something can be done, but it can’t be done by individuals here. Maybe the answer is a mass movement, a campaign, or a revolution - but I think it’s fair if the moderators here feel that such things are off-topic here.

It’s happened. The thing we can’t change. AI is unleashed now and is the event. I never suggested we could roll back time.

The mods thought they understood this topic; they don’t, but they keep modding my contributions. I’m bored of talking about the moderation instead of the solutions, but they keep doing it, or other users do too. Maybe they don’t see the value or are too comfy.

The reality is that since my interventions to try to pull this topic into a more solution-based focus, despite the clumsy moderation, there has been some yield.

You might think you can’t do something, but looking at it and recognising that:

a) it’s serious
b) it’s urgent
c) it needs focus

Is a start, and that you have control over your reaction, but not the event which has happened and is now in the past and affects the present everyday into the foreseeable future.

There is no solution on offer other than crudely using solutions derived for other problems, and so it breaks the proposition, for the AI event is forcing people to assume positions that break their entire effort up to the point of the event.

It’s very natural not to want to be part of something that is a direct threat and will leverage your content in direct competition against all your efforts up to that point, for starters, but it doesn’t stop there.

I’m going to summarise the whole thing with one simple rhetorical question (you can argue if it’s rhetorical or not but you’ll have to acknowledge AI).

Why would anyone even consider deploying an instance of Discourse (or similar) now?

There are so many concerns with this issue. Sometimes one subject (the OP) exemplifies the whole universe of the consequences of the problem, and this is certainly one. It should not get so narrow, especially when Discourse has no real solution to offer. Either the topic, by its very nature in this context, is wide open, or it’s “sorry, since there is no solution for this topic, the topic is now closed”. Pick.

Open or close it.

Are we getting this?

This is the point. If there is an acknowledgment that there is no will to address the issue, then do so, otherwise this topic remains and needs to be very wide, that is the level of moderation attenuation required on this subject, because it’s virgin territory.

If there so happened to be a checkbox or two that fixes it in settings, we all go home, but there really is none, yet. There might be some stop gaps, but they are not in the realm of “SOLVED”, on that I think everyone agrees.

Since no solutions have been built in direct response to address the concerns of the OP and the issue AI and how an admin needs to manage it, then my points stand.

If there are, please point them out, post them here or the solution under development or whatever. Are we getting this?

Therein lies a responsibility, of a developer, of a user, and the existent relationship that make it all work. So we discuss it. Over and over if it requires.

What I see is zero acknowledgment of how this breaks things, until the last couple of posts, since the OP started back in May, and those I celebrated but was moderated for. That’s a joke. AI is actually breaking the net. Again, why bother setting up a Discourse or similar platform? If we can’t discuss it in a serious, genuine, robust manner that fits the demands of the subject, then there is your answer.

The market is moving; all the money, eyeballs and mania are falling head over heels into the pockets of OpenAI & Co. I see developers all over this, here like everywhere else, stepping up and choosing complete adoption and integration of AI with zero circumspection, ZERO!

This is why such an OP remains cornering and frustrating. Breaking your Discourse is the only surefire solution, which is not a solution. It’s virtually game over.

My analogy for how developers are reacting to AI, rhetorically: nearly all seem to be busy building all kinds of cool buckets to collect the lava from the volcano eruption (the eruption being the event, and the reaction being to build a bucket to collect lava). The lava is a gift from the volcano god. It brings heat and light, yes, but it also burns through things very fast, and without the bucket you cannot control the bit you have. But the bucket hides this fact; it seems safe, cool, neat, for now.

No. That would not be correct. I’ve outlined why the moderators have gotten it wrong and how it is way more serious than they countenance, and this could be rather disappointingly symptomatic of the top-down position of the relationship between Discourse and AI… it feels like it’s either meh or a shoulder shrug, but feelings can be wrong, so prove me wrong with points of fact.

Some people have understood my points, or at least looked harder at the OP, and made some better contributions, which I am thankful for, as they led me down a few potential paths to a very crude multi-point solution. It is still a work in progress and would require some recognition by devs to better map to the demands AI has raised, to make it feasible as a live, but still stopgap, measure.

It’s been a tough decade or so for online forums, from traffic to revenue declines. The implications of this event break those charts of dismay, and for many operators it may spell nothing short of a final doom event, and they’ll simply shut up shop.

I want to have discussions on certain topics, so I run forums where people can meet and discuss those topics. I chose Discourse, but in my opinion any other solution on the open web would have the same risks and the same results. I want my discussions to be on the open web and to show up in search results.

People can and do have interactions on privacy-conscious platforms like Telegram and Signal, but those are different kinds of offerings built for different reasons. It’s possible that Discourse chat might offer some of what you want; as it happens, I have no interest in that.
