
> I was under the impression that encoder-decoder architectures were going the way of the dodo,...

Why is that?



It's definitely a very similar method but fundamentally different in that the 'Distilling step-by-step' approach is a multi-task model.

As I understand it, rather than training the smaller model to produce the CoT/rationale as part of the (decoded) response, it actually has two output layers: one for the label and one for the rationale. The remaining layers are shared, which is why the model ends up with an improved "understanding" of which nuances matter in the labeling task.
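The shared-trunk/two-heads idea described above can be sketched as follows. This is only an illustrative toy (not the paper's code); all layer sizes, names, and the use of plain NumPy are assumptions:

```python
import numpy as np

# Toy multi-task network in the spirit described above: one shared trunk,
# two task-specific output heads. Sizes below are arbitrary assumptions.
rng = np.random.default_rng(0)

D_IN, D_HID = 8, 16
N_LABELS = 3        # label head: logits over task labels
V_RATIONALE = 32    # rationale head: logits over a toy token vocabulary

# Shared parameters: gradients from BOTH losses would flow through these,
# which is what lets the label task benefit from the rationale task.
W_shared = rng.normal(size=(D_IN, D_HID))

# Task-specific heads: separate output layers on top of the shared trunk.
W_label = rng.normal(size=(D_HID, N_LABELS))
W_rationale = rng.normal(size=(D_HID, V_RATIONALE))

def forward(x):
    h = np.tanh(x @ W_shared)            # shared representation
    return h @ W_label, h @ W_rationale  # two heads, one trunk

x = rng.normal(size=(4, D_IN))           # toy batch of 4 inputs
label_logits, rationale_logits = forward(x)
print(label_logits.shape, rationale_logits.shape)  # (4, 3) (4, 32)
```

Training would sum a loss per head (e.g. cross-entropy on each), so a single backward pass updates the shared trunk from both tasks.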


Decoder-only is pretty good on some tasks. Translation, for example, was a classic encoder-decoder task, but it seems decoder-only models can handle it.


If there's no encoder, then what is being decoded?


They actually have a performance edge, but they aren't well suited to chat models because you can't cache past states the way you can with decoder-only models.





