
RLHF generally tends to make the model pretty confident, which makes its output entropy a not-so-useful predictor of when the model is going to get something wrong.

You can do a kind of sensitivity analysis to see how sensitive the output is to small perturbations of the weights... but it's computationally expensive.
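Roughly something like this (just a sketch of the idea; the noise scale, number of samples, and divergence measure are arbitrary placeholders, and the repeated forward passes are where the cost comes from):

    import copy
    import torch
    import torch.nn.functional as F

    def weight_sensitivity(model, x, n_samples=8, sigma=1e-3):
        # Perturb the weights with small Gaussian noise a few times and
        # measure how far the output distribution drifts from the
        # unperturbed one.
        model.eval()
        with torch.no_grad():
            base_logp = F.log_softmax(model(x), dim=-1)
            divergences = []
            for _ in range(n_samples):
                noisy = copy.deepcopy(model)
                for p in noisy.parameters():
                    p.add_(sigma * torch.randn_like(p))
                pert_logp = F.log_softmax(noisy(x), dim=-1)
                # KL(base || perturbed): sensitivity of the prediction to
                # this particular weight perturbation
                kl = F.kl_div(pert_logp, base_logp, log_target=True,
                              reduction="batchmean")
                divergences.append(kl.item())
        return sum(divergences) / len(divergences)

Each extra sample is a full forward pass (plus a copy of the weights), which is why doing this per query at inference time gets expensive fast.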

Might be an interesting form of fine-tuning to do distillation where the student's current sensitivity to noise (extracted from the backward pass of gradient descent) is used to shrink the predicted distribution towards uniform. It could be done very cheaply during training and perhaps could avoid the backpropagation cost of doing it at inference time.
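Very loosely, one way it could look (my own interpretation with made-up details: the gradient norm as the sensitivity proxy, the tanh mapping to a smoothing coefficient, and the alpha scale are all assumptions, not a worked-out method):

    import torch
    import torch.nn.functional as F

    def distill_step(student, teacher, x, optimizer, alpha=0.1):
        student_logits = student(x)
        with torch.no_grad():
            teacher_probs = F.softmax(teacher(x), dim=-1)

        # Probe backward pass: read off a cheap sensitivity proxy
        # (here just the gradient norm) without updating anything yet.
        probe_loss = F.cross_entropy(student_logits,
                                     teacher_probs.argmax(dim=-1))
        grads = torch.autograd.grad(probe_loss, student.parameters(),
                                    retain_graph=True)
        grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))

        # Map sensitivity to a smoothing coefficient in [0, 1); higher
        # sensitivity shrinks the target harder towards uniform.
        lam = torch.tanh(alpha * grad_norm)
        uniform = torch.full_like(teacher_probs,
                                  1.0 / teacher_probs.size(-1))
        target = (1 - lam) * teacher_probs + lam * uniform

        # Actual distillation update against the smoothed target.
        loss = F.kl_div(F.log_softmax(student_logits, dim=-1), target,
                        reduction="batchmean")
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The extra cost is one probe backward pass per step during training, rather than any perturbation work at inference time.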


