
ChatGPT is trained on LibGen, among others, no?

To the best of my knowledge, all of these generators are taking mountains of content without asking the creators, aka pirated materials.



It is, it's libgen + commoncrawl + wikidump + a bunch of other datasets. OpenAI claim that commoncrawl is roughly 60% of its total training corpus and they also claim they use the other datasets listed. They probably also have some sort of proprietary Q&A/search query corpus via Microsoft.


> It is, it's libgen + commoncrawl + wikidump + a bunch of other datasets.

I'm having trouble finding a source for the libgen claim. Is that confirmed or just rumor?


The ChatGPT Prompt book by LifeArchitect.ai is where I saw it: https://docs.google.com/presentation/d/17b_ocq-GL5lhV_bYSShz...


> Informed 'best guess' only.
> Sources: https://lifearchitect.ai/papers/

Doesn't seem too convincing to me.


Copyright doesn't really factor in what went into the creation; it is about what is published and whether that is infringing.


I’ll wager $10 it falls under fair use.


An often-cited example is to write something in the style of "Dr. Seuss". Doesn't this imply that Dr. Seuss's books are in the training data set? How can one find out what other books, screenplays, magazines, etc. are in the training data?


> Doesn't this imply that Dr. Seuss's books are in the training data set?

Or maybe that lots of people online like to write (and challenge each other to write) in the style of Dr. Seuss.


Is it pirated material if it's publicly accessible? It's quite similar to someone reading the web.


It is trained on data from piracy trackers, not just the open web.



