It is, it's libgen + Common Crawl + a Wikipedia dump + a bunch of other datasets. OpenAI claims that Common Crawl is roughly 60% of the total training corpus, and they also claim to use the other datasets listed. They probably also have some sort of proprietary Q&A/search-query corpus via Microsoft.
An often-cited example is asking it to write something in the style of "Dr. Seuss". Doesn't this imply that Dr. Seuss's books are in the training dataset? How can one find out what other books, screenplays, magazines, etc. are in the training data?
To the best of my knowledge, all of these generators are trained on mountains of content taken without asking the creators, i.e., pirated materials.