How to prevent community content from being used to train LLMs like ChatGPT?

GPT and other LLM solutions need training data. How can we prevent content from our communities from being used to train such models? Should we add something to our conditions of use?

I thought about this after reading that Reddit will make some changes so that models can't be trained on its data without Reddit being paid:

11 Likes

Are those projects using harvesters that identify themselves with a user agent?
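
If they do, that at least gives site owners a hook. OpenAI documents that its crawler identifies as GPTBot, and Common Crawl (whose dumps many models are trained on) uses CCBot; both are stated to honour robots.txt, so a minimal, purely voluntary opt-out would be:

```
# robots.txt at the site root
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

This only stops crawlers that choose to respect it, of course.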

2 Likes

Will it really matter when there are 10 different providers to choose from at cost price?

Learning from the collective works of humanity would seem to be fair enough: it's what humans do all the time, so why not machines?

Does Reddit charge humans for the things they learn on Reddit?

This sort of smacks of profiteering by Reddit.

And let's not get into the fact that all the content on Reddit has been given freely by users, so why shouldn't Reddit pay its users?

6 Likes

That seems more like "if I can read a book I get from the library, why can't I copy it and sell copies to other people?" than "if I can learn from a book, why can't a computer?" Maybe I'm old, but I'm not ready to think that a bunch of computers running a program is the same as a person.

But I also think that there are already things in place to keep wholesale scraping from taking place. Or maybe indexing sites for search engines is scraping.

These are some interesting times.

5 Likes

Well, it isn't widely accepted in the human world that someone enters other people's homes and workplaces, copies everything, and then re-creates it all for their own benefit to make money.

This isn't an easy question. There is a really big moral, ethical, and financial question that boils down to this: are copyrights and patents acceptable virtual property or not?

For me this is quite an easy problem, though. Perhaps because I am such a small and basically simple-minded fish. The moment I have to pay for someone's business that then wants to sell the result back to me, I'm against it. That's why I hate all bot traffic so deeply.

Again: the AI question is really much bigger than ChatGPT, and I know and understand that. But why would or should I pay for my content being taught to language models?

A widely known fun fact about ChatGPT:

In the Finnish-speaking world I'm a really big influencer when the topic is dog feeding. I've done this for a bit over 30 years and created a lot of public texts. Actually, my site is the biggest informative site (and, I would like to say, the most important one :wink: ) in Finnish.

If I ask anything about dog nutrition in English, ChatGPT gives old and widely inaccurate BARF theories. If I ask the same question in Finnish, I get my own texts back.

That happens because ChatGPT's way of learning follows the thinking that a million flies can't be wrong.

5 Likes

Because it's not copying verbatim.

No-one charges anyone for reading a book about communism in a library and then going on a political talk show advocating communism.

The bots are learning patterns in a similar way that we do.

Also, in a court of law, without having supervised the learning process, how would you know whether it has copied or not?

In copyright law it is surely straightforward to prove that someone has copied your work, but here it is neither copying, nor is it easy to prove the model had access.

Is anything truly novel on Reddit in any case?!?

2 Likes

Sorry, but I don't think so. AI remembers patterns and makes some relationships, but it can't intuit, feel, or truly create.

AI doesn't think properly the way humans do, and doesn't register time, feelings, and life.

BTW, I agree with the rest of your point of view. Collaboration, plus finding and sharing use cases, is good for everyone (at least for not being displaced, which seems inevitable for people who don't learn how to feel, intuit, or create).

The overall situation reminds me of the Industrial 'Revolution' and some dystopian movies :slight_smile:

2 Likes

I'm going to disagree back at you, because you are missing my point.

My use of the term "similar" was justified because they are developing ways to identify things by feature, just like humans, as opposed to copying the data verbatim and storing it. It is that distinction I'm pointing out, and it is a critical one, both logically and potentially legally.

Feelings and emotions are irrelevant to the discussion here: the topic is storage and reproduction of knowledge. And on that topic, AI is almost certainly using similar techniques to the human brain to train itself and then use that model.

And that is how things were developed in this space: they created models that approximated how neural nets appeared to work in our brains and then scaled them up. And lo and behold, they started to behave very much like a human, more so than any natural language model ever has. This pretty much proves my point.

3 Likes

That's impossible when it relates to humans :slight_smile:

(And that's probably what motivated the OP.)

We can still disagree, and I won't go further. I respect you and am just sharing my points of view.

2 Likes

You are arguing that a Rolls-Royce is a better car, but it's still a car.

AI has now got to the point where it is behaving very much like a human. Very sophisticated behaviour is emerging, but that is no accident, because scientists have sought to copy the techniques of human learning.

Of course there are other layers to consider, and emotions are but one (another huge one is the concept of 'ego' and the importance of human-like sensory information, even vestibular, which is thought critical to the perception of 'ego'), but this doesn't alter the argument here, imho.

2 Likes

Nope, I only said that AI can't learn like humans (acting like is not learning like). That's not nearly possible, and I think it's important to keep in mind.

Then, I agree that public data is public. And for me it's super OK to have differences; that's what makes us humans (and not AI) :grimacing:

2 Likes

This is simply wrong, imho.

The strides we've made in this space are almost certainly because the AI is learning (more) like humans.

3 Likes

Only at the conceptual layer; there is a lot more (!)

@StephaneFe, may I ask why you are looking to limit the 'AI training process'? (That's human empathy :orange_heart:)

2 Likes

I never claimed there wasn't a lot (!) more.

I'm just making one central distinction:

Which is that the AI is learning from features (as we do) and not copying exact information. It is learning to generalise, not to rely on complete detail in order to make distinctions.

Because of that it doesn't have to store complete works in high definition, verbatim.

No doubt there are loads of other learning techniques that have not yet been incorporated, but this technique very much has.

2 Likes

Can we focus on the how and not the why?

The topic is not to discuss whether it is justified or not to prevent our data from being used, but how to do it.

Are there effective ways to prevent scraping in general? E.g., requiring sign-in to access most content?
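
Something like the following sketch, perhaps (a hypothetical Flask app, just to illustrate the idea; note the user-agent check only deters honest bots, since the header is trivially spoofed):

```python
from flask import Flask, abort, request, session

app = Flask(__name__)
app.secret_key = "change-me"  # needed for sessions; use a real secret

# Self-identified AI crawlers to refuse outright (a list you'd maintain yourself).
AI_BOTS = ("GPTBot", "CCBot")

@app.before_request
def gate_content():
    ua = request.headers.get("User-Agent", "")
    if any(bot in ua for bot in AI_BOTS):
        abort(403)  # refuse crawlers that announce themselves
    # Require sign-in for everything except the login endpoint.
    if "user" not in session and request.endpoint != "login":
        abort(401)

@app.route("/login", methods=["POST"])
def login():
    session["user"] = request.form["username"]  # real authentication omitted
    return "ok"

@app.route("/")
def home():
    return "members-only community content"
```

The trade-off is the obvious one: anything behind sign-in also disappears from search engines and from casual visitors.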

9 Likes

I think morally and technically it is justified.

I actually find it abhorrent that jazz songs written in the 1930s are subject to copyright, when you could argue that many features of music are inherently human phenomena that no one should own. Take the example of "the circle of fifths": this is an implicit structure in music that helps form many songs, from simple three-chord rock songs of the 50s to highly sophisticated jazz tunes.

And as I've suggested, we aren't talking about storing and regurgitating copyrighted material verbatim here.

Preventing AI from using features of music like the circle of fifths just because most music is subject to copyright is ridiculous!

You could argue the authors of that music benefitted greatly from the human condition and have already profited handsomely. Why a great-grandchild should earn money from a work of their ancestor, which is itself based on general knowledge, confounds me.

5 Likes

I'm afraid I'm not an expert on this, but I don't think crawlers can access content if a site is not publicly visible, so if that's an option for you, it may be the most effective way.

9 Likes

This is not at all the case. These tools are in some ways inspired by biological neural concepts, but in actual implementation they are not functionally similar. This may sound like a nit-pick, but I think it's very important, because the argument seems philosophically compelling. Analogies can be very dangerous in that way.

Here are some specific ways computational neural nets are not "learning patterns in a similar way that we do":

  • our neurons are connected locally and multi-dimensionally, with some dense clusters and other less-connected ones; neural nets are typically arranged in layers, with each layer either fully interconnected or an intentionally-designed "convolutional" layer.
  • biological brains operate asynchronously, with neurons firing at different rates, and with the frequency itself carrying information. Neural nets are basically massively-parallel synchronous operations. (This is why they are so well suited to GPGPU computing.)
  • neurons are responsible for both computation and memory. There is no separate storage or retrieval, or function execution. This alone makes for a very different kind of processing system.
  • weirdly, brain communication is more binary than what we're doing with computers: a neuron fires or doesn't, while an "artificial neuron" usually inputs and outputs ranges of continuous values (represented as floating point). (Again, this is not processing at all similarly to the way we understand brains to function.)
  • learning works differently: in human learning, the connections themselves actually change. (We don't understand this very well.) In a neural net, the architecture is chosen and fixed, and the "learning" is a matter of adjusting weights, as the sketch below illustrates. (Ironically, we don't understand this very well either, really.)
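
A minimal sketch of that last point in plain NumPy (a toy XOR task, no framework): the architecture below is fixed up front, and "training" consists solely of nudging the weight values with gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed architecture: 2 inputs -> 3 hidden units -> 1 output.
# The shapes never change; only the values in W1, b1, W2, b2 do.
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)  # continuous activations, not fire/don't-fire spikes
    return h @ W2 + b2, h

# Toy task: XOR.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

lr = 0.5
for _ in range(5000):
    out, h = forward(X)
    err = out - y                    # gradient of squared error w.r.t. output
    dh = (err @ W2.T) * (1 - h**2)   # backpropagate through tanh
    # "Learning": adjust weights and biases. Nothing structural ever changes.
    W2 -= lr * (h.T @ err) / len(X)
    b2 -= lr * err.mean(0)
    W1 -= lr * (X.T @ dh) / len(X)
    b1 -= lr * dh.mean(0)

print(np.round(forward(X)[0], 2))  # should be close to [[0], [1], [1], [0]]
```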

This is also a really useful read: What Is ChatGPT Doing … and Why Does It Work?—Stephen Wolfram Writings

5 Likes

Specifically, it certainly is not learning to generalize. It is, instead, created so that it has the ability to produce answers which appear to generalize.

But it canā€™t actually generalize at all.

One interesting exercise with ChatGPT is to ask it about multiplication. It will earnestly claim to have an understanding of the algorithm for long multiplication. Indeed, if you ask it to multiply two- or three-digit numbers, it will likely (but not actually certainly!) give the right answer. But then try five- or six-digit numbers. It will give answers with the right number of digits that will not actually be right.

If you ask it to explain, it will say that it followed an algorithm, and if you ask it to show its work, it will, and it will be nonsense that is shaped like the right answer. You will probably even find, in the steps, completely wrong single-digit multiplication. It doesn't actually "know" that these steps are the same thing as the single-digit multiplication it has just confidently done a few minutes back, because it hasn't actually generalized any of it.

And math is nothing special here. It's just an easy way to pull back the curtain a bit. The same basic thing happens in trying to get it to write a poem.

Don't get me wrong! I think we can do some amazing things with AI even as it exists today. But let's please not form our policies around analogies.

4 Likes

No, they aren't. They are learning probabilities of how words connect together. And that leads to de facto copy & paste.

We, by contrast, learn by processing knowledge.
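
To make that concrete, here is a toy sketch of the idea: a bigram counter (vastly simpler than an LLM, which conditions on long contexts through a neural net rather than a lookup table, but the "learning" is the same in spirit: estimating the probability of the next word).

```python
import random
from collections import Counter, defaultdict

corpus = "we are learning how words are connecting together".split()

# "Training": count which word follows which, and how often.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def generate(word, n=7):
    out = [word]
    for _ in range(n):
        options = follows.get(out[-1])
        if not options:
            break
        words, counts = zip(*options.items())
        out.append(random.choices(words, weights=counts)[0])  # sample next word
    return " ".join(out)

print(generate("we"))
# With this little training data, the only thing the model can emit is
# (fragments of) its training text: the "de facto copy & paste" effect.
```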

1 Like