How to prevent community content from being used to train LLMs like ChatGPT?

This is somewhat exasperating.

I was using the term ‘similar’ somewhat loosely but definitely validly, only in regards to one concept and only to support a specific point. I thought that was obvious?

My point in stating similarity was limited to the concept of ‘feature’ extraction and matching, nothing else, in order to draw a distinction between learning concepts and memorising copies verbatim.

I’m fully aware there are significant differences as well.

You do know I know a human head doesn’t resemble a datacentre, right? :rofl:

Are you saying there is no feature extraction and matching going on in the human brain?

Because that is what it is doing:

“ Learning feature detectors
To enable the perceptual system to make the fine distinctions that are required to control behavior, sensory cortex needs an efficient way of adapting the synaptic weights of multiple layers of feature-detecting neurons.”

Also see Feature detection (nervous system) - Wikipedia

That’s a contradiction. It absolutely isn’t cut & paste and that is the crux of my point.

It’s arguably not even lossy compression:

Yes it can. And again, caveat :sweat_smile: , not to the extent we can.

ChatGPT is generalising. That is what pattern matching, aka feature extraction, is! It is able to configure words in a sensible order which match grammar rules. It has ‘learned’ a complex set of features and is able to construct sentences that make grammatical sense whatever the subject area. It is not storing every possible combination of words and regurgitating exactly one match each time, i.e. not cut and paste! That’s just one demonstration. The responses it gives demonstrate emerging sophistication.

But sure, it isn’t sophisticated enough to “understand” mathematics. Not yet. (And maybe not ever with the current technique?)

I fully recognise the level of sophistication isn’t matching the brain, that it’s limited in scope and the physical implementation of it all is very different. But that doesn’t invalidate my point…

… which was specific!

Next time I will be sure to painstakingly caveat my point to avoid this unnecessary noise. :sweat_smile:


As fascinating and discussion-worthy as the philosophy is, I think the OP is specifically looking for practical tips on how to mitigate this. Could we stay on topic and concentrate on those? :pray:


Fully agree! But we have drifted…

Indeed. There is a real risk of training data being exposed in the LLM output, and when it happens that can be a privacy problem or a copyright problem. I think the appropriate tools are on the one hand data protection law, and on the other hand copyright law, and therefore licensing.

I think it wouldn’t hurt to make the terms and conditions of use disallow certain acts, like data scraping, large-scale download, and inclusion in training data for machine learning. But for enforcement, I would suggest some clarity in the licensing of the content. For effectiveness, a suitably clear license should be part of the default installation, so that most Discourse instances have the same approach to protecting themselves.

I would look to entities such as the EFF for templates of the right sort of policies.


Oh, something important to add. If you restrictively license the content of your forum, you might in the worst case make it difficult or impossible to migrate your forum to a new platform. Don’t do that!

(There’s a social aspect too, although it might be minor. If your forum terms say that a person’s contributions become the property of the forum, that will put some people off. But you need something: you don’t want users who leave to be able to insist that all their posts should be removed. This is a different problem to the topic here, but it shows that terms are important.)


In Western countries, at least, such a term is totally meaningless, and it shows only one thing: the owner of the platform has absolutely no knowledge.


The why is (very) interesting, tho.
Why do you want to know how to do it? To do it, admittedly.
But why? It is quite an extension of the question.

This is a good question. And the forum users themselves are actually becoming the books, here.

I guess one way, which seems to be done on many sites, is to analyze the user’s behavior. If “too many” pages are scanned, especially if it is done “too quickly”, then it’s probably scraping. Some parameters can then be added, like the use of a “hosting IP address” rather than a residential IP address, the fact that a “headless” browser is used, that cookies aren’t accepted, etc.

So yes, all this can be defined and fine-tuned going forward to try to technically block as much scraping as possible. The usual way of doing things is to ask for a CAPTCHA when bot-like behavior is suspected. This allows humans to continue, which wouldn’t be possible if the system was simply blocking the user.
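As a rough sketch of the rate-limiting part of this idea (the zone name, rate, and burst values below are illustrative assumptions, not recommended defaults), NGINX’s `limit_req` module can slow down or reject clients that request pages faster than a human plausibly would:

```nginx
# Define a shared memory zone keyed on client IP,
# allowing about 1 request per second on average.
limit_req_zone $binary_remote_addr zone=scrapers:10m rate=1r/s;

server {
    location / {
        # Allow short bursts (a human opening a few tabs),
        # but reject sustained bot-like request rates with 429.
        limit_req zone=scrapers burst=20 nodelay;
        limit_req_status 429;
    }
}
```

A real setup would typically answer suspicious traffic with a CAPTCHA challenge rather than a flat 429, as described above, but the throttling principle is the same.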

Now, all this can always be circumvented if someone still wants to do it: by avoiding being identified and appearing as many different users, appearing more legitimate on many fronts, rotating residential IPs, etc. It’s almost a sport to know how to scrape what a system is designed to prevent you from scraping. Some people are very good at it. There are plenty of resources out there to do so.

Legitimate entities like the people behind ChatGPT probably won’t go this route. They will also probably be more inclined to respect ToS, come with a straight user agent, etc. To discourage them, the “legal” and simple fact that you say you forbid it may be enough. This won’t work with people who care less about legalities and straightforwardness.

A pretty simple solution is to restrict how much can be viewed as a guest without having to be logged in. But again, like often, you will have a very hard time preventing those who really want to do it if they’re motivated enough. The latter might not be the important people to target in this matter, tho.


I think that is controlled like any other crawler. There are settings to deny access by user agent. If the crawler uses a user agent indicating what they are doing, you can control it.

It’s not clear to me where GPT got its initial data set, or whether or where it’ll get new data. You’d need to figure out what the user agents are, I think.


Does it work beyond robots.txt, at the firewall level?


Discussions on the internet have their days counted, and this thread, and the responses to my genuine question (asked in order to explore the how), are clearly a prelude.

Currently, there is no way to do that I am afraid, as the search is just a web wrapper.

User-agent: OpenAI
Disallow: /


<meta name="robots" content="noindex, nofollow">


And it will follow that rule for sure?


I’d like to weigh in and say this is a great topic; it barely makes the cut as one allowed here, from how I view it, but it does.

I’d say that sums it up nicely

lol, that’s getting into Skynet territory. Will AI do its own thing?

I’d like to offer an example of how, yes, it will.

Many religions are based on the Bible, and the Bible is based on the traditions of men

So yes, the created can surpass the creator.

Someday, if we’re not put to a stop, we could well be the books of a new Bible

You may all be disciples :hugs:


It’s a tool or a toy until it’s not :man_shrugging:


A funny joke, but in the real world the majority of bots don’t follow the rules in robots.txt. It is just a suggestion, not some kind of firewall.


robots.txt contains instructions intended for the crawlers themselves.
It relies on the assumption that they will follow them. Nothing says that will be the case “for sure”.

You can block user agents at the web server level. Most often, NGINX is used with Discourse.
Here, your web server won’t serve any content to these user agents. It is done by adding a few lines to your website’s NGINX config file. Do a web search for nginx block user agent or similar.
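For illustration, a minimal version of such a block might look like this inside the `server` block of the site’s NGINX config (the agent strings below are examples only; check your logs and each crawler’s documentation for what it actually sends):

```nginx
# Case-insensitively match the User-Agent header and refuse to serve content.
# GPTBot, CCBot, and Scrapy are example crawler names; verify the real strings.
if ($http_user_agent ~* (GPTBot|CCBot|Scrapy)) {
    return 403;
}
```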

This is “for sure”, if the crawler shows an honest user agent.


Which is decidedly not “for sure”. :slight_smile:


It does block, for sure, the user agents you want to block :+1:
(EDIT to be :100:% clear: by using NGINX as presented above, not by just relying on robots.txt)

It isn’t a sure solution to the whole problem if you’re dealing with malicious actors who don’t identify themselves correctly. But I guess you perfectly understood that.


This is starting to get a little boring… but no. There are plenty of situations where not even Google follows robots.txt.

It is still a suggestion and no one should ever rely on it.

OK, we are thinking the same.

I see two replies that really scared me. I don’t want to pay, but sooner or later that could be mandatory for the working option.

(I didn’t give my credit card number and always use temporary everything, at least to stay a little off the track.)

But people are paying, and prices jumped to 4× and 10×, then 100×, 24 dollars a day. I work in markets directly, and that’s surreal.

I usually don’t use this device to search the web (I choose CAPTCHAs for a couple of big businesses) because I feel more secure and private browsing in Linux. I suspect someone could think in a similar way, and I respect it if that’s not your case.

Open source is somewhat controlled too. It could sound a little neurotic, but I prefer human conversations in our community, and here we are discussing limits, and maybe using methods to block something that nobody knows where it will stop.

Hallucination was injected; people are cloning themselves. That could break the information and concentrate a great deal of control.

Maybe we are at a good moment to discuss limits, values, and privacy. Not to censor, make complaints, or avoid a good discussion.

If we are OK with this topic, I could share my points and in-depth research; they are not solid, but they are real.

Could AI without OpenAI (which is not open) be possible, and a better tool for communities?

Please move this if you consider it off-topic, or merge it if you want.

I don’t know if this concept could be adapted for a forum, but I run this code in my .htaccess file on my blog.

RewriteCond %{HTTP_USER_AGENT} ^.*(aolbuild|baidu|bingbot|bingpreview|msnbot|duckduckgo|mediapartners-google|googlebot|adsbot-google|teoma|slurp|yandex|Baiduspider|facebookexternalhit|applebot|FeedFetcher-Google).*$ [NC]
RewriteRule ^/?REDIRECT-THIS-URL?$ /TO-THIS-URL [L,R=301,NC]

The idea here is to redirect only these user agents when they visit X page. In my case, I redirect the above user agents when they visit current-event articles, while continuing to make my Biblical content available to all of them. I did this for SEO purposes, which has made a difference. Perhaps there is a way to use something like this to block an AI bot?

The issue with my code is that for every URL, you need another line of code.
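For what it’s worth, one way around the one-line-per-URL problem, sketched here as an untested variation on the snippet above, is to block the matched user agents site-wide with a 403 instead of redirecting specific pages (the agent names are placeholder examples):

```apache
RewriteEngine On
# Match example AI crawler user agents, case-insensitively.
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot) [NC]
# Return 403 Forbidden for every URL instead of redirecting one by one.
RewriteRule ^ - [F,L]
```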


Sure. This is a solution where your web server handles specific user agents a certain way. It is pretty much the same as what I described above. It works as long as the bot identifies itself with a correct user agent.


To sort of piggyback off this topic: does anyone know if the ChatGPT user agent is covered by the crawler settings? I doubt it… perhaps it should be added to the list of “crawlers”.