I see what you are saying, and I made a similar comment.
However it's still an interesting observation that many architectures can arrive at the same performance (even though the training requirements are different).
Naively, you wouldn't expect eg 'x -> a * x + b' to fit the same data as 'x -> a * sin x + b' about equally well. But that's an observation from low dimensions. It seems once you add enough parameters, the exact model doesn't matter too much for practical expressiveness.
I'm faintly reminded of the Church-Turing Thesis; the differences between different computing architectures are both 'real' but also 'just an optimisation'.
> When you start layering in normalisation techniques to minimise overfitting, and especially once you start thinking about more agentic architectures (eg. Deep Q Learning, some of the search space design going into OpenAI's o1), then I don't think the just-an-optimisation perspective can hold much water at all - more computing power simply couldn't solve those problems with older architectures.
You are right, these normalisation techniques help you economise on training data, not just on compute. Some of these techniques can be done independent of the model, eg augmenting your training data with noise. But some others are very model dependent.
I'm not sure how the 'agentic' approaches fit here.
Is multiplication versus sine in the analogy hiding it, perhaps?
I've always pictured it as just "needing to learn" the function terms and the function guts are an abstraction that is learned.
Might just be because I'm a physics dropout with a bunch of whacky half-remembered probably-wrong stuff about how any function can be approximated by ex. fourier series.
So (most) neural nets can be seen as a function of a _fixed_ form with some inputs and lots and lots of parameters.
In my example, a and b were the parameters. The kinds of data you can approximate well with a simple sine wave and the kinds of data you can approximate with a straight line are rather different.
Training your neural net only fiddles with the parameters like a and b. It doesn't do anything about the shape of the function. It doesn't change sine into multiplication etc.
> [...] about how any function can be approximated by ex. fourier series.
Fourier series are an interesting example to bring up! I think I see what you mean.
In theory they work well to approximate any function over either a periodic domain or some finite interval. But unless you take special care, when you apply Fourier analysis naively it becomes extremely sensitive to errors in the phase parameters.
(Special care could eg be done by hacking up your input domain into 'boxes'. That works well for eg audio or video compression, but gives up on any model generalisation between 'boxes', especially for what would happen in a later box.)
Another interesting example is Taylor series. For many simple functions Taylor series are great, but for even moderately complicated ones you need to be careful. See eg how the Taylor serious for the logarithm around x=1 works well, but if you tried it around x=0, you are in for a bad time.
The interesting observation isn't just that there are multiple universal approximators, but that at high enough parameter count, they seem to perform about equally well in how good they are at approximating in practice (but differ in how well they can be trained).
> Training your neural net only fiddles with the parameters like a and b. It doesn't do anything about the shape of the function. It doesn't change sine into multiplication etc.
It definitely can. The output will always be piecewise linear (with ReLU), but the overall shape can change completely.
You can fit any data with enough parameters. What’s tricky is to constrain a model so that it approximates the ground truth well where there are no data points. If a family of functions is extremely flexible and can fit all kinds of data very efficiently I would argue it makes it harder for those functions to have correct values out of distribution.
Definitely. That's a fundamental observation called the bias-variance tradeoff. More flexible models are prone to overfitting, hitting each training point exactly with wild gyrations in between.
Big AI minimizes that problem by using more data. So much data that the model often only sees each data point once and overfitting is unlikely.
But while keeping the data constant, adding more and more parameters is a strategy that works, so what gives? Are the functions getting somehow regularized during training so effectively you could get away with fewer parameters, it's just that we don't have the right model just yet?
More directly than my first attempt: you're continuing the error here. The nave's approach of "it's approximating some function" both maps to reality and makes accurate predictions. The more we couple ourselves to "no no no, it's modeling a precise function", the more we end up wrong, both on how it works in theory and in practice.
Huh? Who says anything about 'precise functions'? And what's a precise function in the first place?
I am saying that training (at least for conventional neural nets) only fiddles with some parameters. But it does not change the shape of the network, no new nodes nor different connections. (Which is almost equivalent to saying training doesn't change the abstract syntax tree, if you were to write the network out as a procedure in, say, Python.)
The geometric shape you get when you print out the function changes, yes.
This reminds me of control systems theory where provided there's feedback, the forward transfer function doesn't matter beyond very basic properties around the origin.
However it's still an interesting observation that many architectures can arrive at the same performance (even though the training requirements are different).
Naively, you wouldn't expect eg 'x -> a * x + b' to fit the same data as 'x -> a * sin x + b' about equally well. But that's an observation from low dimensions. It seems once you add enough parameters, the exact model doesn't matter too much for practical expressiveness.
I'm faintly reminded of the Church-Turing Thesis; the differences between different computing architectures are both 'real' but also 'just an optimisation'.
> When you start layering in normalisation techniques to minimise overfitting, and especially once you start thinking about more agentic architectures (eg. Deep Q Learning, some of the search space design going into OpenAI's o1), then I don't think the just-an-optimisation perspective can hold much water at all - more computing power simply couldn't solve those problems with older architectures.
You are right, these normalisation techniques help you economise on training data, not just on compute. Some of these techniques can be done independent of the model, eg augmenting your training data with noise. But some others are very model dependent.
I'm not sure how the 'agentic' approaches fit here.