How to prevent community content from being used to train LLMs like ChatGPT?

GPT and other LLM solutions need training data. How can we prevent content from our communities from being used to train such models? Should we add something to our conditions of use?

I thought about this after reading that Reddit will make some changes so that models can't be trained on its data without Reddit being paid:

11 Likes

Are those projects using harvesters that identify themselves with a user agent?
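
If they do, that at least gives site owners a hook. OpenAI documents that its crawler identifies as GPTBot, and Common Crawl (whose dumps many models are trained on) uses CCBot; both are stated to honour robots.txt, so a minimal, purely voluntary opt-out would be:

```
# robots.txt at the site root
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

This only stops crawlers that choose to respect it, of course.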

2 Likes

Will it really matter when there are 10 different providers to choose from at cost price?

Learning from the collective works of humanity would seem to be fair enough: it's what humans do all the time, so why not machines?

Does Reddit charge humans for the things they learn on Reddit?

This sort of smacks of profiteering by Reddit.

And let's not get into the fact that all the content on Reddit has been given freely by users, so why shouldn't Reddit pay its users?

6 Likes

That seems more like "if I can read a book I get from the library, why can't I copy it and sell copies to other people?" than "if I can learn from a book, why can't a computer?" Maybe I'm old, but I'm not ready to think that a bunch of computers running a program is the same as a person.

But I also think that there are already things in place to keep wholesale scraping from taking place. Or maybe indexing sites for search engines is scraping.

These are some interesting times.

5 Likes

Well, it isn't widely accepted in the human world that someone enters other people's homes and workplaces, copies everything, and then re-creates it all for their own benefit to make money.

This isn't an easy question. There is a really big moral, ethical, and financial question that boils down to this: are copyrights and patents acceptable virtual property or not?

For me this is quite an easy problem, though. Perhaps because I am such a small and basically simple-minded fish. The moment I have to pay for someone's business that then wants to sell the result back to me, I'm against it. That's why I hate all bot traffic so deeply.

Again: the AI question is really much bigger than ChatGPT, and I know and understand that. But why would or should I pay for my content being taught to language models?

A widely known fun fact about ChatGPT:

In the Finnish-speaking world I'm a really big influencer when the topic is dog feeding. I've done this for a bit over 30 years and created a lot of public texts. Actually, my site is the biggest informative site (and, I would like to say, the most important one :wink: ) in Finnish.

If I ask anything about dog nutrition in English, ChatGPT gives old and widely inaccurate BARF theories. If I ask the same question in Finnish, I get my own texts back.

That happens because ChatGPT's way of learning follows the thinking that a million flies can't be wrong.

5 Likes

Because it's not copying verbatim.

No-one charges anyone for reading a book about communism in a library and then going on a political talk show advocating communism.

The bots are learning patterns in a similar way that we do.

Also, in a court of law, without having supervised the learning process, how would you know whether it has copied or not?

In copyright law it is surely straightforward to prove that someone has copied your work, but here it is neither copying, nor is it easy to prove the model had access.

Is anything truly novel on Reddit in any case?!?

2 Likes

Sorry, but I don't think so. AI remembers patterns and makes some relationships, but it can't intuit, feel, or truly create.

AI doesn't think properly the way humans do, and doesn't register time, feelings, and life.

BTW, I agree with the rest of your point of view. Collaboration, plus finding and sharing use cases, is good for everyone (at least for not being displaced, which seems inevitable for people who don't learn how to feel, intuit, or create).

The overall situation reminds me of the Industrial 'Revolution' and some dystopian movies :slight_smile:

2 Likes

I'm going to disagree back at you, because you are missing my point.

My use of the term "similar" was justified because they are developing ways to identify things by feature, just like humans, as opposed to copying the data verbatim and storing it. It is that distinction I'm pointing out, and it is a critical one, both logically and potentially legally.

Feelings and emotions are irrelevant to the discussion here: the topic is storage and reproduction of knowledge. And on that topic, AI is almost certainly using similar techniques to the human brain to train itself and then use that model.

And that is how things were developed in this space: they created models that approximated how neural nets appeared to work in our brains and then scaled them up. And lo and behold, they started to behave very much like a human, more so than any natural language model ever has. This pretty much proves my point.

3 Likes

That's impossible when it relates to humans :slight_smile:

(And that's probably what motivated the OP.)

We can still disagree, and I won't go further. I respect you and am just sharing my points of view.

2 Likes

You are arguing that a Rolls-Royce is a better car, but it's still a car.

AI has now got to the point where it is behaving very much like a human. Very sophisticated behaviour is emerging, but that is no accident, because scientists have sought to copy the techniques of human learning.

Of course there are other layers to consider, and emotions are but one (another huge one is the concept of 'ego' and the importance of human-like sensory information, even vestibular, which is thought critical to the perception of 'ego'), but this doesn't alter the argument here, imho.

2 Likes

Nope, I only said that AI can't learn like humans (acting like is not learning like). That's not nearly possible, and I think it's important to keep in mind.

Then, I agree that public data is public. And for me it's super OK to have differences; that's what makes us humans (and not AI) :grimacing:

2 Likes

This is simply wrong, imho.

The strides we've made in this space are almost certainly because the AI is learning (more) like humans.

3 Likes

Only at the conceptual layer; there is a lot more (!)

@StephaneFe, may I ask why you are looking to limit the 'AI training process'? (That's human empathy :orange_heart:)

2 Likes

I never claimed there wasn't a lot (!) more.

I'm just making one central distinction:

Which is that the AI is learning from features (as we do) and not copying exact information. It is learning to generalise, not to rely on complete detail in order to make distinctions.

Because of that it doesn't have to store complete works in high definition, verbatim.

No doubt there are loads of other learning techniques that have not yet been incorporated, but this technique very much has.

2 Likes

Can we focus on the how and not the why?

The topic is not to discuss whether it is justified or not to prevent our data from being used, but how to do it.

Are there effective ways to prevent scraping in general? E.g., requiring sign-in to access most content?
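
Something like the following sketch, perhaps (a hypothetical Flask app, just to illustrate the idea; note the user-agent check only deters honest bots, since the header is trivially spoofed):

```python
from flask import Flask, abort, request, session

app = Flask(__name__)
app.secret_key = "change-me"  # needed for sessions; use a real secret

# Self-identified AI crawlers to refuse outright (a list you'd maintain yourself).
AI_BOTS = ("GPTBot", "CCBot")

@app.before_request
def gate_content():
    ua = request.headers.get("User-Agent", "")
    if any(bot in ua for bot in AI_BOTS):
        abort(403)  # refuse crawlers that announce themselves
    # Require sign-in for everything except the login endpoint.
    if "user" not in session and request.endpoint != "login":
        abort(401)

@app.route("/login", methods=["POST"])
def login():
    session["user"] = request.form["username"]  # real authentication omitted
    return "ok"

@app.route("/")
def home():
    return "members-only community content"
```

The trade-off is the obvious one: anything behind sign-in also disappears from search engines and from casual visitors.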

9 Likes

I think morally and technically it is justified.

I actually find it abhorrent that jazz songs written in the 1930s are subject to copyright, when you could argue that many features of music are inherently human phenomena that no one should own. Take the example of "the circle of fifths": this is an implicit structure in music that helps form many songs, from simple three-chord rock songs of the 50s to highly sophisticated jazz tunes.

And as I've suggested, we aren't talking about storing and regurgitating copyrighted material verbatim here.

Preventing AI from using features of music like the circle of fifths just because most music is subject to copyright is ridiculous!

You could argue the authors of that music benefitted greatly from the human condition and have already profited handsomely. Why a great-grandchild should earn money from a work of their ancestor, which is itself based on general knowledge, confounds me.

5 Likes

I'm afraid I'm not an expert on this, but I don't think crawlers can access content if a site is not publicly visible, so if that's an option for you, it may be the most effective way.

9 Likes

This is not at all the case. These tools are in some ways inspired by biological neural concepts, but in actual implementation they are not functionally similar. This may sound like a nit-pick, but I think it's very important, because the argument seems philosophically compelling. Analogies can be very dangerous in that way.

Here are some specific ways computational neural nets are not "learning patterns in a similar way that we do":

  • our neurons are connected locally and multi-dimensionally, with some dense clusters and other less-connected ones; neural nets are typically arranged in layers, with each layer either fully interconnected or an intentionally-designed "convolutional" layer.
  • biological brains operate asynchronously, with neurons firing at different rates, and with the frequency itself carrying information. Neural nets are basically massively-parallel synchronous operations. (This is why they are so well suited to GPGPU computing.)
  • neurons are responsible for both computation and memory. There is no separate storage or retrieval, or function execution. This alone makes for a very different kind of processing system.
  • weirdly, brain communication is more binary than what we're doing with computers: a neuron fires or doesn't, while an "artificial neuron" usually inputs and outputs ranges of continuous values (represented as floating point). (Again, this is not processing at all similarly to the way we understand brains to function.)
  • learning works differently: in human learning, the connections themselves actually change. (We don't understand this very well.) In a neural net, the architecture is chosen and fixed, and the "learning" is a matter of adjusting weights, as the sketch below illustrates. (Ironically, we don't understand this very well either, really.)
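
A minimal sketch of that last point in plain NumPy (a toy XOR task, no framework): the architecture below is fixed up front, and "training" consists solely of nudging the weight values with gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed architecture: 2 inputs -> 3 hidden units -> 1 output.
# The shapes never change; only the values in W1, b1, W2, b2 do.
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)  # continuous activations, not fire/don't-fire spikes
    return h @ W2 + b2, h

# Toy task: XOR.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

lr = 0.5
for _ in range(5000):
    out, h = forward(X)
    err = out - y                    # gradient of squared error w.r.t. output
    dh = (err @ W2.T) * (1 - h**2)   # backpropagate through tanh
    # "Learning": adjust weights and biases. Nothing structural ever changes.
    W2 -= lr * (h.T @ err) / len(X)
    b2 -= lr * err.mean(0)
    W1 -= lr * (X.T @ dh) / len(X)
    b1 -= lr * dh.mean(0)

print(np.round(forward(X)[0], 2))  # should be close to [[0], [1], [1], [0]]
```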

This is also a really useful read: What Is ChatGPT Doing … and Why Does It Work?—Stephen Wolfram Writings

5 Likes

Specifically, it certainly is not learning to generalize. It is, instead, created so that it has the ability to produce answers which appear to generalize.

But it canā€™t actually generalize at all.

One interesting exercise with ChatGPT is to ask it about multiplication. It will earnestly claim to have an understanding of the algorithm for long multiplication. Indeed, if you ask it to multiply two- or three-digit numbers, it will likely (but not actually certainly!) give the right answer. But then try five- or six-digit numbers. It will give answers with the right number of digits that will not actually be right.

If you ask it to explain, it will say that it followed an algorithm, and if you ask it to show its work, it will, and it will be nonsense that is shaped like the right answer. You will probably even find, in the steps, completely wrong single-digit multiplication. It doesn't actually "know" that these steps are the same thing as the single-digit multiplication it has just confidently done a few minutes back, because it hasn't actually generalized any of it.

And math is nothing special here. It's just an easy way to pull back the curtain a bit. The same basic thing happens in trying to get it to write a poem.

Don't get me wrong! I think we can do some amazing things with AI even as it exists today. But let's please not form our policies around analogies.

4 Likes

No, they aren't. They are learning probabilities of how words connect together. And that leads to de facto copy & paste.

We, by contrast, learn by processing knowledge.
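
To make that concrete, here is a toy sketch of the idea: a bigram counter (vastly simpler than an LLM, which conditions on long contexts through a neural net rather than a lookup table, but the "learning" is the same in spirit: estimating the probability of the next word).

```python
import random
from collections import Counter, defaultdict

corpus = "we are learning how words are connecting together".split()

# "Training": count which word follows which, and how often.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def generate(word, n=7):
    out = [word]
    for _ in range(n):
        options = follows.get(out[-1])
        if not options:
            break
        words, counts = zip(*options.items())
        out.append(random.choices(words, weights=counts)[0])  # sample next word
    return " ".join(out)

print(generate("we"))
# With this little training data, the only thing the model can emit is
# (fragments of) its training text: the "de facto copy & paste" effect.
```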

1 Like