It's a very similar method on the surface, but fundamentally different in that the 'Distilling step-by-step' approach trains a multi-task model.
As I understand it, rather than training the smaller model to produce the CoT/rationale as part of the (decoded) answer, it's trained on two separate tasks: one output target is the label and the other is the rationale, each cued by its own task prefix. The model weights are shared between the two tasks, which is how/why the model is able to have an improved "understanding" of which nuances matter in the labeling task.
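To make the multi-task setup concrete, here's a minimal sketch of the combined objective: a weighted sum of a label loss and a rationale loss computed from the same shared model. The function names and the weighting parameter are my own illustration, not the paper's actual code:

```python
import math

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target under a probability distribution."""
    return -math.log(probs[target_idx])

def distill_step_by_step_loss(label_probs, label_target,
                              rationale_token_probs, rationale_targets,
                              rationale_weight=1.0):
    """Combined multi-task loss for one example.

    label_probs: the model's distribution over labels (label task).
    rationale_token_probs: per-token distributions when decoding the
        rationale (rationale task). Both come from the same shared model,
        so gradients from the rationale term also shape the label task.
    """
    label_loss = cross_entropy(label_probs, label_target)
    rationale_loss = sum(
        cross_entropy(p, t)
        for p, t in zip(rationale_token_probs, rationale_targets)
    ) / len(rationale_targets)
    return label_loss + rationale_weight * rationale_loss
```

Setting `rationale_weight=0` recovers plain label-only distillation, which is one way to see the rationale term as an auxiliary signal rather than something the model must emit at inference time.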
Encoder-decoder models actually have a performance edge, but they aren't well suited to chat use cases because you can't cache past states across turns the way you can with decoder-only models.
Why is that?