So is this just the k-nearest neighbour approach with the Wasserstein distance as the measure? Is it the efficient implementation of the Wasserstein distance that makes this outcome interesting? Or is it the use of a non-parametric model in image classification?
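For anyone unfamiliar, the k-NN-with-Wasserstein idea is simple to sketch. This is a toy illustration only (made-up 1-D data, scipy's 1-D `wasserstein_distance`, and a hypothetical `nearest_label` helper), not the paper's actual pipeline:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Toy 1-nearest-neighbour classifier where the "distance" between two
# samples is the 1-D Wasserstein distance between their value
# distributions. The data below is made up for illustration.
train = {
    "cat": np.array([0.1, 0.2, 0.2, 0.3]),
    "dog": np.array([0.7, 0.8, 0.9, 0.9]),
}

def nearest_label(query):
    # Pick the training sample whose distribution is closest to the
    # query's distribution under the Wasserstein metric.
    return min(train, key=lambda label: wasserstein_distance(train[label], query))

print(nearest_label(np.array([0.15, 0.25, 0.3])))  # -> cat
```

The interesting engineering question is exactly the one above: naive Wasserstein is expensive, so an efficient approximation is what makes nearest-neighbour search over it practical at scale.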
The way I read the article (haven't read the paper), the methods are very similar in concept, in that both rely on word embeddings.
Basically, instead of training on Wikipedia, they train on the categories assigned by Flickr users, and they use a different word embedding algorithm than word2vec. But the principle is similar, imho.
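That shared principle is just similarity search in embedding space: whatever corpus or algorithm produced the vectors, classification reduces to finding the nearest embedding. A minimal sketch with made-up vectors (the words and numbers here are illustrative, not from either paper):

```python
import numpy as np

# Toy vocabulary of 2-D embeddings. In practice these would come from
# word2vec, or from an algorithm trained on Flickr tags instead.
embeddings = {
    "kitten": np.array([0.9, 0.1]),
    "cat":    np.array([0.8, 0.2]),
    "truck":  np.array([0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(word):
    # Return the other vocabulary item closest in cosine similarity.
    return max((w for w in embeddings if w != word),
               key=lambda w: cosine(embeddings[word], embeddings[w]))

print(most_similar("kitten"))  # -> cat
```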