It's a very similar method on the surface, but fundamentally different in that the 'Distilling step-by-step' approach trains a multi-task model.
As I understand it, rather than training the smaller model to produce the CoT/rationale as part of the (decoded) answer, it's trained on two separate tasks: one output target is the label and the other is the rationale, each cued by its own task prefix. The model weights are shared between the two tasks, which is how/why the model is able to have an improved "understanding" of which nuances matter in the labeling task.
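To make the multi-task setup concrete, here's a minimal sketch of the combined objective: a weighted sum of a label loss and a rationale loss computed from the same shared model. The function names and the weighting parameter are my own illustration, not the paper's actual code:

```python
import math

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target under a probability distribution."""
    return -math.log(probs[target_idx])

def distill_step_by_step_loss(label_probs, label_target,
                              rationale_token_probs, rationale_targets,
                              rationale_weight=1.0):
    """Combined multi-task loss for one example.

    label_probs: the model's distribution over labels (label task).
    rationale_token_probs: per-token distributions when decoding the
        rationale (rationale task). Both come from the same shared model,
        so gradients from the rationale term also shape the label task.
    """
    label_loss = cross_entropy(label_probs, label_target)
    rationale_loss = sum(
        cross_entropy(p, t)
        for p, t in zip(rationale_token_probs, rationale_targets)
    ) / len(rationale_targets)
    return label_loss + rationale_weight * rationale_loss
```

Setting `rationale_weight=0` recovers plain label-only distillation, which is one way to see the rationale term as an auxiliary signal rather than something the model must emit at inference time.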
Encoder-decoder models actually have a performance edge, but they aren't well suited to chat use cases because you can't cache past states across turns the way you can with decoder-only models.
Why is that?