What are the prerequisites?

cpp_frog · on June 12, 2021

While I can't give the exact prerequisites, I know that all of the things that appear in the paper relate to:

(1) Linear Algebra

(2) Optimization Theory (Convex Analysis, non-convex optimization) [0], [2]

(3) Probability Theory and Statistics (Measure Theory, Multivariate Statistics) [1], [3], [4], [5]

(4) Analysis, to a lesser extent. (2) and (3) are the most important.

I would give more references, but my background is too theoretical (my field is Numerical Analysis of PDE). I took three or four college classes on each of (1)-(4), and a person with a similar background can recognize the tools without much digging. Maybe some folks here can suggest books that center on applications. I'm trying not to stray into too much theory (e.g. for measure theory, [4] instead of Folland). The paper also seems to make good use of Analysis techniques; see Theorem 2.1.

I love that the paper references the Moore-Penrose pseudo-inverse, an object of study in both statistics and optimization, on which I once had to give a lecture for a course.
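For anyone who has not met the pseudo-inverse before, here is a minimal sketch of what it does, assuming NumPy is available. The matrix `A` and vector `b` are made-up illustration data, not anything from the paper. It shows the pseudo-inverse giving the least-squares solution of an overdetermined system, and checks two of the four Penrose conditions.

```python
import numpy as np

# Hypothetical overdetermined system: 3 equations, 2 unknowns (illustration only).
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0, 4.0])

# Moore-Penrose pseudo-inverse, computed by NumPy via the SVD.
A_pinv = np.linalg.pinv(A)

# For a full-column-rank A, A_pinv @ b is the least-squares solution of A x = b.
x = A_pinv @ b

# Cross-check against NumPy's dedicated least-squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

# Two of the four Penrose conditions: A A+ A = A and A+ A A+ = A+.
penrose_1 = np.allclose(A @ A_pinv @ A, A)
penrose_2 = np.allclose(A_pinv @ A @ A_pinv, A_pinv)

print(x, x_lstsq, penrose_1, penrose_2)
```

The same call works for rank-deficient or wide matrices, where it returns the minimum-norm least-squares solution.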

[0] https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf Convex Optimization, Boyd and Vandenberghe

[1] An Introduction to Multivariate Statistical Analysis, Anderson

[2] Convex Analysis and Monotone Operator Theory in Hilbert Spaces, Bauschke-Combettes

[3] Theory of Multivariate Statistics, Bilodeau-Brenner

[4] The Elements of Integration and Lebesgue Measure, Bartle

[5] Probability: Theory and Examples, Durrett

keithalewis · on June 12, 2021

Mind reading. They use terminology without defining it or giving a reference.

beforeolives · on June 12, 2021

Seriously, I'm struggling to understand things that I already know.

sundarurfriend · on June 12, 2021

Any examples? I haven't come across something like that yet, but I'm only a short way into the article.

keithalewis · on June 12, 2021

The terms "measurable" and "tempered" for starters.

cpp_frog · on June 12, 2021

The term measurable refers to "measurable functions" in measure theory: functions for which the pre-image of every set in the sigma-algebra of the codomain belongs to the sigma-algebra of the domain (https://en.wikipedia.org/wiki/Measurable_function). I do not know how to state it in simpler terms, sorry. When the whole domain has measure 1 (as in a probability space), we call measurable functions random variables, hence their relevance to this topic.
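On finite sets the definition can be checked by brute force, which may make it less abstract. The sketch below is a toy illustration only: the die-roll space, the coarse sigma-algebra, and the helper names `preimage` and `is_measurable` are all made up for this example.

```python
def preimage(f, domain, target_set):
    """All points of the domain that f sends into target_set."""
    return frozenset(x for x in domain if f(x) in target_set)


def is_measurable(f, domain, sigma_domain, sigma_codomain):
    """f is measurable if every measurable set of the codomain
    pulls back to a measurable set of the domain."""
    return all(preimage(f, domain, B) in sigma_domain for B in sigma_codomain)


# Hypothetical sample space: one roll of a die.
omega = frozenset({1, 2, 3, 4, 5, 6})
odd, even = frozenset({1, 3, 5}), frozenset({2, 4, 6})

# A coarse sigma-algebra that can only tell odd from even.
sigma_coarse = {frozenset(), odd, even, omega}

# Codomain {0, 1} with the full power set as its sigma-algebra.
codomain = frozenset({0, 1})
sigma_codomain = {frozenset(), frozenset({0}), frozenset({1}), codomain}


def parity(x):
    return x % 2


def is_high(x):
    return 1 if x >= 4 else 0


# parity only needs odd/even information, so it is measurable.
parity_ok = is_measurable(parity, omega, sigma_coarse, sigma_codomain)

# is_high needs the set {4, 5, 6}, which the coarse sigma-algebra lacks.
high_ok = is_measurable(is_high, omega, sigma_coarse, sigma_codomain)

print(parity_ok, high_ok)
```

Informally, the sigma-algebra on the domain encodes what information is available, and a function is measurable when its value can be determined from that information alone.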

Now, tempered distributions are functions that assign a complex number to each very rapidly decaying function (a Schwartz space function) and that satisfy linearity properties. So a tempered distribution is a function that takes functions and maps them to complex numbers. https://secure.math.ubc.ca/~feldman/m321/distributions.pdf
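In symbols, the standard textbook definitions look roughly like this (the notation is the usual one, not necessarily the paper's). Besides linearity, a tempered distribution is also required to be continuous with respect to the Schwartz space topology, which is omitted here for brevity.

```latex
\begin{align*}
% Schwartz space: smooth functions whose derivatives all decay
% faster than any polynomial grows
\mathcal{S}(\mathbb{R}^n) &= \Big\{ \varphi \in C^\infty(\mathbb{R}^n) :
  \sup_{x \in \mathbb{R}^n} \big| x^\alpha \partial^\beta \varphi(x) \big| < \infty
  \ \text{for all multi-indices } \alpha, \beta \Big\} \\
% A tempered distribution is a (continuous) linear functional on it
T &: \mathcal{S}(\mathbb{R}^n) \to \mathbb{C}, \qquad
  T(a\varphi + b\psi) = a\,T(\varphi) + b\,T(\psi) \\
% Example: the Dirac delta evaluates the test function at the origin
\delta(\varphi) &= \varphi(0)
\end{align*}
```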

keithalewis · on June 13, 2021

Are they talking about the Borel sigma-algebra generated by the open sets of a topological space? What topology is in their mind? Are tempered distributions functions? How does one compose two tempered distributions? (Hint: you can't, and they never actually use tempered distributions.) This is just mathematical masturbation. Everything is finite when implemented on a computer so there is no need for such dainty mathematical niceties unless you are trying to lend credence to pedestrian observations about Chebyshev's inequality.

julbern · on June 14, 2021

If not further specified, the topology is induced by the metric or norm of the space to be considered. Tempered distributions are used in Subsection 3.1, resulting from the observation that the Fourier transform of a shallow neural network involves a Dirac delta.

Some mathematical concepts are needed in order to present rigorous results. While one can argue about the necessity and relevance of these results for real-world applications, they at least explain various aspects of deep learning in restricted settings, leading to a better general understanding and intuition.

ganzuul · on June 12, 2021

For the latter maybe this? https://en.wikipedia.org/wiki/Parallel_tempering

ycreader · on June 13, 2021

Is parallel tempering related to https://en.wikipedia.org/wiki/Bennett_acceptance_ratio ?

keithalewis · on June 13, 2021

Nyet, and nyet. This is why conscientious authors define the terms they use. A tempered distribution is a linear functional on a space of differentiable functions, for example, D_x(f) = f'(x), the derivative of f at x. This is why tempered distributions cannot be composed.
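A rough numerical illustration of that example, under loud assumptions: the derivative is approximated by a central difference with a made-up step `h`, and `derivative_at` is a hypothetical helper name. The point is only the shape of the object: a functional eats a function and returns a number, so feeding one functional to another does not type-check.

```python
import math


def derivative_at(x, h=1e-5):
    """Return the functional f -> f'(x), approximated by a central difference."""
    def functional(f):
        return (f(x + h) - f(x - h)) / (2 * h)
    return functional


D0 = derivative_at(0.0)


def gaussian(t):
    return math.exp(-t * t)


def shifted(t):
    return math.exp(-(t - 1.0) ** 2)


# The functional maps a rapidly decaying function to a single number.
value = D0(shifted)  # exact derivative at 0 is 2 * exp(-1)

# Linearity: D0(2 f + 3 g) agrees with 2 D0(f) + 3 D0(g) up to rounding.
lhs = D0(lambda t: 2 * gaussian(t) + 3 * shifted(t))
rhs = 2 * D0(gaussian) + 3 * D0(shifted)

# Composition fails: D0 expects a function of a real variable,
# but D0 itself is a function of functions.
try:
    D0(D0)
    composition_fails = False
except TypeError:
    composition_fails = True

print(value, lhs, rhs, composition_fails)
```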

In general, the dual of a space of functions is a space of set functions, aka measures. https://keithalewis.github.io/math/dual.html

fspeech · on June 12, 2021

Mostly analysis. If you understand the notation in Section 1, you are obviously set. But even if you don't, you should still be able to get the ideas with a bit of mental translation. In a word, the notation seemed unnecessarily heavy for the level of discussion.

0-_-0 · on June 12, 2021

Deep learning papers often use math in a way that obscures rather than enlightens. And when you finally understand what they are saying, you realize it's not interesting at all, or they made a mistake in the math.

julbern · on June 14, 2021

I would recommend a solid background in linear algebra, probability theory, and analysis. Moreover, for some sections, it is helpful to have experience with functional analysis, optimization, and statistical learning theory.

Some helpful resources are linked here: https://www.reddit.com/r/MachineLearning/comments/najnjg/r_t...

thanksok · on June 12, 2021

Looks like a little bit of everything except the likes of abstract algebra, logic, category theory.

That includes linear algebra, graph theory, probability, algorithms, mathematical analysis, topology, and differential geometry. But the most important prereqs are math maturity and mental toughness/endurance.

SilurianWenlock · on June 12, 2021

mental toughness/endurance haha!

godelski · on June 12, 2021

I skimmed it. Looks like just some basic calc and linear algebra. Nothing that crazy.