It is, it's libgen + Common Crawl + a Wikipedia dump + a bunch of other datasets. OpenAI claims that Common Crawl is roughly 60% of the total training corpus, and they also claim to use the other datasets listed. They probably also have some sort of proprietary Q&A/search-query corpus via Microsoft.
An often-cited example is asking it to write something in the style of "Dr. Seuss". Doesn't this imply that Dr. Seuss's books are in the training dataset? How can one find out what other books, screenplays, magazines, etc. are in the training data?
To the best of my knowledge, all of these generators are trained on mountains of content taken without asking the creators, i.e., pirated materials.