I'll admit, I'm entirely out of the loop when it comes to the technical aspects of how these large transformer models actually "work". Luckily, statements like "optical computers could have >8000x energy efficiency advantage" are well within my capacity to understand.
That said... Can anyone parse what's being done here? How/why is it more efficient? Is this essentially the Analog Computer of these giant transformer models?
Finally, how can I build one with fiber optic cable and leds?
The key idea is that you can use physical effects such as attenuating light to implement multiplication with much less energy than the fairly complex arrangement of digital logic needed for the same operation. This is roughly the method they use in the paper.
I've seen a lab-bench prototype of a different implementation; there are a lot of engineering problems to solve, but as the paper points out, the potential payoff is big.
Edit: The other key point is that one of the expensive components in transformers is effectively a giant matrix multiplication, which implies many, many individual multiplications.
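As a rough, entirely hypothetical numerical sketch of what "multiplication as attenuation" means: each weight is a transmissivity in [0, 1], each input is an optical power, and a detector just sums what gets through. The variable names and the [0, 1] pre-scaling are my own assumptions, not anything from the paper:

```python
import numpy as np

# Toy numerical picture of "multiplication as attenuation" (my own sketch,
# not from the paper): a weight is a transmissivity in [0, 1], an input is
# an optical power, and the "product" is simply the power that gets through.
def attenuate(input_power, transmissivity):
    return input_power * transmissivity

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=64)           # input activations as optical powers
W = rng.uniform(0, 1, size=(32, 64))     # weights as transmissivities

# A fully connected layer is then many such attenuations summed on detectors.
y = np.array([sum(attenuate(xi, wij) for xi, wij in zip(x, row)) for row in W])

# Same answer as an ordinary digital matmul, up to float error.
assert np.allclose(y, W @ x)
```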
The optical stuff is still research territory and too early to compare with current digital silicon. Lots of open questions about how to implement multiplication and activation functions, where to store weights, how to move data in and out, how to manufacture all of this, etc. Basically it's all questions.
To get some intuition about the promise, imagine being able to implement the weights of a layer (a fully connected layer is essentially a vector-matrix multiplication) as a 2D hologram, and the compute as pushing an image from an OLED display through it to an image sensor. Multiplication is attenuation in the hologram; summation is just the accumulation of charge on the sensor side. Everything happens all at once, and the number of photons required is potentially very small. An actual working implementation would be both more clever and more practical. The potential to do every multiply in parallel for almost no energy is so attractive that I expect people to chip away at this problem for the foreseeable future.
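Here's a toy numerical version of that display-through-hologram picture, just to make the intuition concrete. The sizes, the mask variable, and the shot-noise model are all assumptions of mine, not anything from an actual implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Display pixel i shines through mask element mask[j, i] onto sensor pixel j,
# and the sensor just integrates charge. Numerically it's one matmul.
n_display, n_sensor = 256, 64
display = rng.uniform(0, 1, size=n_display)            # emitted intensities
mask = rng.uniform(0, 1, size=(n_sensor, n_display))   # weights as transmissivities

charge = mask @ display

# Crude shot-noise model, to see why a small photon budget can still be
# good enough for ML-style workloads.
photons_per_unit = 1_000
noisy = rng.poisson(charge * photons_per_unit) / photons_per_unit
print("max relative error:", np.max(np.abs(noisy - charge) / charge))
```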
To run an AI model you don't need a general-purpose computer or a powerful CPU capable of executing a wide variety of code. You mostly need massive amounts of matrix multiplication, plus some simpler operations.
You also don't need very high precision: many models perform well even using 8-bit floats. That lets you ditch the whole digital approach and implement analog circuits, which, while sometimes less precise, are massively simpler and more energy-efficient.
So they built a mostly-analog device specialized for running ML models, and used optics instead of electronics, which makes certain things much faster on top of the simplification.
There have also been attempts at electronic approaches, such as flash-like structures that store charge as an analog value to hold the model weights and do addition and multiplication right inside the cells.
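A quick back-of-the-envelope sketch of why low precision plus analog noise is tolerable (my own toy example, with plain int8 quantization standing in for 8-bit floats and a made-up 1% multiplicative noise term for the analog cells):

```python
import numpy as np

rng = np.random.default_rng(2)

# Full-precision reference.
W = rng.normal(size=(128, 256)).astype(np.float32)
x = rng.normal(size=256).astype(np.float32)
y_ref = W @ x

# Simple symmetric int8 quantization of the weights (stand-in for 8-bit floats).
scale = np.abs(W).max() / 127.0
W_q = np.round(W / scale).astype(np.int8)
W_deq = W_q.astype(np.float32) * scale

# Pretend each weight lives in an analog cell with ~1% multiplicative error.
analog_noise = 1.0 + rng.normal(scale=0.01, size=W.shape).astype(np.float32)
y_analog = (W_deq * analog_noise) @ x

rel_err = np.linalg.norm(y_analog - y_ref) / np.linalg.norm(y_ref)
print(f"relative error vs. full precision: {rel_err:.2%}")  # typically ~1%
```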
There are lots of fun things you can do with optical circuits. One of the most fascinating to me was running multiple threads of execution through the same circuits at the same time at different wavelengths of light! I would speculate that there are similar sorts of things going on in optical transformers, but honestly trying to get deep enough into both the ML architecture AND the optical architectures implementing them is probably a month of spare time that I don't have at the moment :-).
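For anyone who wants the wavelength-multiplexing idea in concrete terms, here's a deliberately simplified sketch: in an ideal linear medium the channels don't mix, so the shared path effectively carries independent streams that wavelength-selective elements split apart again. The wavelengths, sizes, and "weight bank" names are placeholders I made up:

```python
import numpy as np

rng = np.random.default_rng(3)

# Three independent data streams, labelled by nominal wavelength in nm.
wavelengths = [1530, 1550, 1570]
streams = {lam: rng.uniform(0, 1, size=8) for lam in wavelengths}

# In an ideal linear medium the channels superpose without mixing, so the
# shared waveguide effectively carries all of them side by side.
shared_waveguide = streams

# Each "thread" hits a weight bank that only responds to its own wavelength
# (think of a microring tuned to that channel), giving one result per channel.
weight_banks = {lam: rng.uniform(0, 1, size=8) for lam in wavelengths}
for lam in wavelengths:
    result = shared_waveguide[lam] @ weight_banks[lam]
    print(f"{lam} nm channel result: {result:.3f}")
```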
I know much less about the large transformer models than I do about the optical hardware, but the hardware ideas have been floating around since the 1980s. The development of lasers-on-a-chip made it all a lot more realizable. E.g.
"On-chip optical matrix-vector multiplier for parallel computation" (2013)
Figure 1 has a nice visual of the vector-matrix computation in terms of the diode laser array, the multiplexer, the "microring modulator matrix", and the detection system that reads out the resulting output vector.
> "We have designed and fabricated a prototype of a system capable of performing a multiplication of a M × N matrix A by a N × 1 vector B to give a M × 1 vector C. The mathematical procedure of MVM can be split into multiplications and additions, which is reflected in our design. Figure 1 shows a schematic of the architecture we propose. The elements of B are represented by the power of N modulated optical signals with N different wavelengths (λ1, λ2, …, λN), generated by N modulated laser diodes, either alone or together with N Mach-Zehnder modulators. These signals are multiplexed, passed through a common waveguide, and then projected onto M rows of the modulator matrix by a 1 × M optical splitter. Each element aij of matrix A is represented physically by the transmissivity of the microring modulator located in the ith row and the jth column of the modulator matrix. Each modulator in any one row only manipulates an optical signal with a specific wavelength."
That was 10 years ago; no idea what the current state of the art is.
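For what it's worth, the quoted architecture maps onto a few lines of NumPy if you treat A's entries as transmissivities and B's entries as per-wavelength powers (my own sketch, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(4)

# M x N matrix A as microring transmissivities in [0, 1]; N x 1 vector B as
# the optical powers of the N wavelength channels.
M, N = 4, 6
A = rng.uniform(0, 1, size=(M, N))
B = rng.uniform(0, 1, size=N)

# The splitter sends all N channels to every row; ring (i, j) attenuates only
# wavelength j, and each row's photodetector integrates the power it receives.
C = np.array([np.sum(A[i] * B) for i in range(M)])

assert np.allclose(C, A @ B)   # i.e. the M x 1 result of the matrix-vector product
print(C)
```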