How to prevent community content from being used to train LLMs like ChatGPT?

GPT and other LLM solutions need training datasets. How can we prevent content from our communities from being used to train such models? Should we add something to our conditions of use?

I thought about this after reading that Reddit will make some changes to stop models from being trained on its data without payment:


Are those projects using harvesters that declare a user agent?
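
If a harvester does announce itself, a robots.txt rule is the usual first step. Here is a minimal sketch, assuming the crawlers identify as `GPTBot` (OpenAI) and `CCBot` (Common Crawl) and that they choose to honor robots.txt, which is voluntary. The forum URL and the helper name `is_allowed` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a community site. The crawler tokens are
# assumptions about what the harvesters send; check each provider's docs.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    topic = "https://forum.example.com/t/some-topic/1"  # placeholder URL
    for agent in ("GPTBot", "CCBot", "Mozilla/5.0"):
        print(agent, "allowed" if is_allowed(agent, topic) else "blocked")
```

This only checks what a well-behaved crawler would conclude from the file. It does nothing against a harvester that ignores robots.txt or sends a browser-like user agent.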


Will it really matter when there are 10 different providers to choose from at cost price?

Learning from the collective works of humanity would seem to be fair enough: it’s what humans do all the time, so why not machines?

Does Reddit charge humans for the things they learn on Reddit?

This sort of smacks of profiteering by Reddit.

And let’s not get into the fact that all the content on Reddit has been given free by users, so why shouldn’t Reddit pay their users?


That seems more like “if I can read a book I get from the library, why can’t I copy it and sell copies to other people?” than “if I can learn from a book, why can’t a computer?” Maybe I’m old, but I’m not ready to think that a bunch of computers running a program is the same as a person.

But I also think that there are already things in place to keep wholesale scraping from taking place. Or maybe indexing sites for search engines is scraping.

These are some interesting times.


Well, it isn’t widely accepted in the human world for someone to enter other people’s homes and workplaces, copy everything, and then re-create it all for their own benefit to make money.

This isn’t an easy question. There is a really big moral, ethical and financial question here that boils down to two things: are copyrights and patents acceptable virtual property or not?

For me this is quite an easy problem, though. Perhaps because I am such a small and basically simple-minded fish. As soon as I have to pay for someone’s business that is then meant to be sold back to me, I’m against it. That’s why I hate all bot traffic so deeply.

Again: the AI question is really much bigger than ChatGPT, and I know and understand that. But why would or should I pay when my content is taught to language models?

A widely known fun fact about ChatGPT

In the Finnish world I’m a really big influencer when the topic is dog feeding. I’ve done this for a little over 30 years and have published a lot of texts. Actually, my site is the biggest informational site (and I would like to say the most important one :wink:) in Finnish.

If I ask anything about dog nutrition in English, ChatGPT gives old and widely inaccurate barf-theories. If I ask the same question in Finnish, I get my own texts back.

That happens because ChatGPT’s way of learning follows the thinking that a million flies can’t be wrong.


Because it’s not copying verbatim.

No-one charges anyone for reading a book about communism in a library and then going on a political talk show advocating communism.

The bots are learning patterns in a similar way that we do.

Also, in a court of law, without having supervised the learning process, how would you know if it has or not?

In the law of copyright it is surely straightforward to prove if someone has copied your work, but here it is neither copying nor is it easy to prove you have had access.

Is anything truly novel on Reddit in any case?!?


Sorry, but I don’t think so. AI remembers patterns and makes some connections, but it can’t intuit, feel or truly create.

AI doesn’t think the way humans do and doesn’t register time, feelings and life.

BTW, I agree with the rest of your point of view. Collaborating, plus finding and sharing use cases, is good for everyone (at least to avoid being displaced, which seems inevitable for people who don’t learn how to feel, intuit or create).

The overall situation reminds me of the Industrial ‘Revolution’ and some dystopian movies :slight_smile:


I’m going to disagree back at you, because you are missing my point.

My use of the term “similar” was justified because they are developing ways to identify things by feature just like humans, as opposed to verbatim copying the data and storing it: it is that distinction I’m pointing out and this is a critical distinction, both logically and potentially legally.

Feelings and emotions are irrelevant to the discussion here: the topic is storage and reproduction of knowledge. And on that topic, AI is almost certainly using similar techniques to the human brain to train itself and then use that model.

And that is how things were developed in this space: they created models that were an approximation of how neural nets appeared to work in our brains and then scaled them up. And lo and behold: it started to behave very like a human - more like any natural language model ever has. This pretty much proves out my point.


That’s impossible when it comes to humans :slight_smile:

(And that’s probably what motivated the OP)

We can still disagree, and I won’t go further. I respect you and am just sharing my point of view.


You are arguing that a Rolls Royce is a better car, but it’s still a car.

AI has now got to the point where it is behaving very like a human. Very sophisticated behaviour is emerging but that is no accident, because scientists have sought to copy the techniques of human learning.

Of course there are other layers to consider and emotions are but one (another huge one is the concept of ‘ego’ and the importance of human-like sensory information, even vestibular, which is thought critical to perception of ‘ego’), but this doesn’t alter the argument here imho.


Nope, I only said that AI can’t learn like humans (acting like is not learning like). That’s nowhere near possible, and I think it is important to keep in mind.

Beyond that, I agree that public data is public. And for me it’s super OK to have differences; that’s what makes us humans (and not AI) :grimacing:


This is simply wrong, imho.

The strides we’ve made in this space are almost certainly because the AI is learning (more) like humans.


Only at the conceptual layer; there is a lot more (!)


@StephaneFe may I ask why you are looking to limit the ‘AI training process’? (That’s human empathy :orange_heart:)


I never claimed there wasn’t a lot (!) more?

I’m just making one central distinction:

Which is that the AI is learning from features (as we do) and not copying exact information. It is learning to generalise and not rely on complete detail in order to make distinctions.

Because of that it doesn’t have to store complete works in high definition, verbatim.

No doubt there are loads of other learning techniques that have not yet been incorporated, but this technique very much has.


Can we focus on the how and not the why?

The topic is not whether it is justified to prevent our data from being used, but how to do it.

Are there effective ways to prevent scraping in general? E.g. requiring sign-in to access most content?
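
One server-side option, besides requiring sign-in, is to refuse requests from declared crawler user agents. Below is a minimal sketch as a generic WSGI middleware. The token list, the names `block_crawlers` and `demo_app`, and the 403 response are all assumptions for illustration; this is not a feature of any particular forum software.

```python
# Assumption: lowercase substrings of crawler user agents to refuse.
BLOCKED_TOKENS = ("gptbot", "ccbot")


def block_crawlers(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app so requests from listed user agents get a 403."""

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked_tokens):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Automated access is not permitted.\n"]
        # Everything else is passed through to the wrapped application.
        return app(environ, start_response)

    return middleware


def demo_app(environ, start_response):
    """Stand-in for the real site, used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, community!\n"]
```

The User-Agent header is trivial to spoof, so this only stops harvesters that identify themselves honestly. A sign-in wall or rate limiting is harder to bypass.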


I think morally and technically it is justified.

I actually find it abhorrent that jazz songs written in the 1930s are subject to copyright, when you could argue that many features of music are inherently human phenomena that no-one should own: take the example of “the circle of fifths” - this is an implicit structure in music that helps form many songs, from simple 3 chord rock songs of the 50s to highly sophisticated Jazz tunes.

And as I’ve suggested, we aren’t talking about storing and regurgitating copyright material here verbatim.

Preventing AI from using features of music like the circle of fifths just because most music is subject to copyright is ridiculous!

You could argue the authors of that music benefitted greatly from the human condition and have already profited handsomely. Why a great grandchild should earn money from a work of their ancestor which is itself based on general knowledge confounds me.


I’m afraid I’m not an expert on this, but I don’t think crawlers can access content if a site is not publicly visible, so if that’s an option for you it may be the most effective way.


This is not at all the case. These tools are in some ways inspired by biological neural concepts, but in actual implementation are not functionally similar. This may sound like a nit-pick, but I think it’s very important, because the argument seems philosophically compelling. Analogies can be very dangerous in that way.

Here are some specific ways computational neural nets are not “learning patterns in a similar way that we do”.

  • our neurons are connected locally and multi-dimensionally, with some dense clusters and other less-connected ones; neural nets are typically arranged in layers, with each layer either fully interconnected or an intentionally-designed “convolutional” layer.
  • biological brains operate asynchronously, with neurons firing at different rates, and with the frequency itself carrying information. Neural nets are basically massively-parallel operations. (This is why they are so well suited to GPGPU computing.)
  • neurons are responsible for both computation and memory. There is no separate storage or retrieval, or function execution. This alone makes a very different kind of processing system.
  • weirdly: brain communication is more binary than what we’re doing with computers: a neuron fires or doesn’t, while an “artificial neuron” usually inputs and outputs ranges of continuous values (represented as floating point). (Again, this is not processing at all similarly to the way we understand brains to function.)
  • learning works differently: in human learning, the connections actually change. (We don’t understand this very well.) In a neural net, the architecture is chosen and fixed, and the “learning” is a matter of adjusting weights. (Ironically, we don’t understand this very well either, really.)
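
To make the last two points concrete, here is a toy sketch of a single “artificial neuron”. It is only an illustration under my own assumptions (a sigmoid activation, squared-error gradient descent, the OR function as training data), not how any production model is built. The output is a continuous float rather than a fire/don’t-fire signal, and training changes the numeric weights while the structure (two inputs, one output) stays fixed.

```python
import math


def sigmoid(x: float) -> float:
    """Squash any real number into the continuous range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron(weights, bias, inputs) -> float:
    """One artificial neuron: weighted sum of inputs, then sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train(samples, epochs: int = 2000, lr: float = 0.5):
    """Adjust the weights by gradient descent; the architecture never changes."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            out = neuron(weights, bias, inputs)
            # Gradient of 0.5 * (out - target)**2 with respect to the pre-activation sum.
            delta = (out - target) * out * (1.0 - out)
            weights = [w - lr * delta * x for w, x in zip(weights, inputs)]
            bias -= lr * delta
    return weights, bias


# Toy training data: the logical OR of two inputs.
OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

if __name__ == "__main__":
    weights, bias = train(OR_SAMPLES)
    for inputs, _ in OR_SAMPLES:
        print(inputs, round(neuron(weights, bias, inputs), 3))
```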

This is also a really useful read: What Is ChatGPT Doing … and Why Does It Work?—Stephen Wolfram Writings


Specifically, it certainly is not learning to generalise. It is, instead, created so that it has the ability to produce answers which appear to generalize.

But it can’t actually generalize at all.

One interesting exercise with ChatGPT is to ask it about multiplication. It will earnestly claim to have an understanding of the algorithm for long multiplication. Indeed, if you ask it to multiply two- or three-digit numbers, it will likely (but, not actually certainly!) give the right answer. But then try five or six digit numbers. It will give answers that look like the right number of digits but will not actually be right.

If you ask it to explain, it will say that it followed an algorithm, and if you ask it to show its work, it will, and it will be nonsense that is shaped like the right answer. You will probably even find, in the steps, completely wrong single digit multiplication. It doesn’t actually “know” that these steps are the same thing as the single-digit multiplication it has just confidently done a few minutes back, because it hasn’t actually generalized any of it.
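
For comparison, here is what actually following the long multiplication algorithm looks like. This is a minimal sketch (the name `long_multiply` is mine, and it assumes non-negative integers given as decimal strings). The point is that the same single-digit step is reused for every pair of digits, so it works for inputs of any length.

```python
def long_multiply(a: str, b: str) -> str:
    """Schoolbook long multiplication on decimal strings."""
    # Work with least-significant digit first, as on paper.
    digits_a = [int(ch) for ch in reversed(a)]
    digits_b = [int(ch) for ch in reversed(b)]
    result = [0] * (len(digits_a) + len(digits_b))

    for i, da in enumerate(digits_a):
        carry = 0
        for j, db in enumerate(digits_b):
            # The only multiplication ever performed is single-digit: da * db.
            total = result[i + j] + da * db + carry
            result[i + j] = total % 10
            carry = total // 10
        # Whatever is left over goes into the next column.
        result[i + len(digits_b)] += carry

    text = "".join(str(d) for d in reversed(result)).lstrip("0")
    return text or "0"


if __name__ == "__main__":
    x, y = "123456", "654321"
    print(long_multiply(x, y), int(x) * int(y))  # the two values should match
```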

And, math is nothing special here. It’s just an easy way to pull back the curtain a bit. The same basic thing happens in trying to get it to write a poem.

Don’t get me wrong! I think we can do some amazing things with AI even as it exists today. But let’s please not form our policies around analogies.


No, they aren’t. They are learning probabilities of how words connect together. And that leads to de facto copy & paste.

We are learning by processing knowledge.
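
As a toy illustration of “probabilities of how words connect”, here is a bigram counter. This is a deliberately simplified sketch: real LLMs use neural networks over tokens, not count tables, and the function name and sample sentence are made up. It does show how, with little training text, the most likely continuation ends up being the training text itself.

```python
from collections import Counter, defaultdict


def bigram_probabilities(text: str) -> dict:
    """Estimate P(next word | current word) from raw counts in text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return {
        word: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for word, followers in counts.items()
    }


if __name__ == "__main__":
    sample = "the dog eats the food the dog sleeps"  # made-up training text
    for word, followers in bigram_probabilities(sample).items():
        print(word, followers)
```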
