Text is an instance of language. Think of it as the difference between the Python language and a large collection of Python programs. The language describes syntactic and semantic rules; the collection is a sampling of possible programs that encodes a significant amount of information about the world. You could learn a lot about the laws of nature, the internet, even human society and laws by examining all the Python programs ever written.
An extreme version of the same idea is the difference between understanding DNA vs the genome of every individual organism that has lived on earth. The species record encodes a ton of information about the laws of nature, the composition and history of our planet. You could deduce physical laws and constants from looking at this information, wars and natural disasters, economic performance, historical natural boundaries, the industrial revolution and a lot more.
If a student studies DNA sequencing, they’ll learn about the compounds that make up DNA, how traits get encoded, etc.
Therefore the student might expect an AI trained on people’s DNA to be able to tell you about whether certain traits are more prevalent in one geography or the other.
However, since DNA responds to changes in environment, the AI would start to see time, population, and geography-based patterns emerge.
The AI could, for example, infer just from a given DNA sequence that a person settled in NYC had ancestors from a given region of the world who left due to an environmental disaster.
To the student this result would look like magic. But in the end, it’s a result of individuals’ DNA having much more information encoded in it than just human traits.
text and language intersect. in some ways, text is a superset of language, mostly due to social, or what are also called pragmatic, factors that complement semantics. also, the semantics/syntax interface is anything but clear-cut, at least in natural human languages.
Any text corpus is a subset of the language, under the normal definition that a language is the set of all possible sentences (or a set of rules to recognize or generate that set of possibilities). This text subset has an intrinsic bias as to which sentences were selected to represent real language use, which would be significant as a training set for an ML model.
So, perhaps you are saying that the text corpus carries more "world" information than the language, because of the implications you can draw from this selection process? The full language tells us how to encode meaning into sentences, but not what sentences are important to a population who uses language to describe their world. So, if we took a fuzz-tester and randomly generated possible texts to train a large language model, we would no longer expect it to predict use by an actual population. It would probably be more like a Markov chain model, generating bizarre gibberish that merely has valid syntax.
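The fuzz-tester idea can be made concrete with a toy sketch: if a language really were just a rule set, you could sample from those rules and get syntactically valid sentences that carry no information about any population's world. Below is a minimal, purely illustrative context-free grammar sampler (the grammar and vocabulary are invented for this example, not drawn from any real corpus):

```python
import random

# A toy context-free grammar: the "rules of the language".
# Every symbol and word here is illustrative, not from a real dataset.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["Det", "Adj", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["the"], ["a"]],
    "Adj": [["green"], ["silent"], ["heavy"]],
    "N":   [["idea"], ["ocean"], ["compiler"]],
    "V":   [["eats"], ["describes"], ["melts"]],
}

def generate(symbol="S"):
    """Expand a symbol by uniformly sampling one of its grammar rules."""
    if symbol not in GRAMMAR:  # terminal: an actual word
        return [symbol]
    rule = random.choice(GRAMMAR[symbol])
    return [word for part in rule for word in generate(part)]

# Prints something grammatical but meaningless, e.g. a sentence about
# compilers melting oceans -- valid syntax, no "world" information.
print(" ".join(generate()))
```

Text generated this way would indeed train a model that knows the shape of sentences but nothing about which sentences a real population would bother to write.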
And this also seems to apply if you train the model on a selection from one population but then try to use the model to predict a different population. Wouldn't it be progressively less able to predict usage as the populations have less overlap in their own biased use of language?
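The population-mismatch intuition can be sketched with the simplest possible model: a unigram word model fit on one population's text assigns higher average surprisal (negative log-likelihood) to text from a population with less vocabulary overlap. The two "corpora" below are invented one-line stand-ins for two populations:

```python
from collections import Counter
import math

def unigram_model(corpus):
    """Estimate word probabilities from a corpus with add-one smoothing."""
    words = corpus.split()
    counts = Counter(words)
    total, vocab = len(words), len(set(words))
    return lambda w: (counts[w] + 1) / (total + vocab + 1)

def avg_surprisal(model, text):
    """Average negative log-likelihood per word (lower = better predicted)."""
    words = text.split()
    return -sum(math.log(model(w)) for w in words) / len(words)

# Toy stand-ins for two populations' biased selections of sentences.
pop_a = "we sail boats we fish the sea we mend nets"
pop_b = "we mine ore we forge steel we build engines"

model_a = unigram_model(pop_a)

print(avg_surprisal(model_a, pop_a))  # in-population: lower
print(avg_surprisal(model_a, pop_b))  # out-of-population: higher surprisal
```

As overlap shrinks (here only "we" is shared), more words fall back to the smoothed unseen-word probability and the model's predictions degrade, which is the unigram version of the effect you describe.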
regarding the relationship: yes, and in most ways it probably is a subset. is there really such a set of rules that generates all possible sentences? in any case i wanted to say that materiality and cultural activity heavily influence what can and will be put into text, and that is not strictly language. "selection process" might capture some of it, though i'm not sure whether all of it!
I think about this as shape and color. No one ever saw a shape that wasn’t colored and likewise there are no colored things that do not have a shape.
Also, displaying text without a font is not possible.
Text is the surface of the ocean where waves emerge, and while they have their own properties and may seem to naively have agency, they are an expression of the underlying ocean.
nicely put! many aspects of text at least historically have much to do with its materiality (also in a cognitive development sense, learning how to write etc.). what we can think about nowadays is that text and speech might not be a necessary materiality of language. language might depend more on conceptual systems. more like a substrate of intelligence and that might as well be nonhuman (to stay on topic).