Encountered it recently. I had two different dataset to evaluate model performan...

Encountered it recently. I had two different dataset to evaluate model performance on from different domains.

One dataset was closer to training data and the other was closer to our business use case. The hypothesis was that performance on the latter dataset would be poorer due to overfitting.

Indeed the accuracy on all categories had reduced. However, overall accuracy was much higher!

This was because the second dataset had higher frequency of easy to predict categories.

If we had just looked at overall number we would have concluded that there was no overfitting to train domain, which was not the case.