And, thanks to scale, that some of the clusters correspond to high-level concepts. According to the article, earlier attempts mostly managed to detect only low-level concepts like "edge" or "blob".
Also, it was (again, from the article) plausible but not a given that high-level concepts could be learned from unlabeled data.
That "cat" is one of the high-level concepts you get from using random YouTube videos as raw data is both impressive and slightly amusing.
Exactly. Regarding your remark about edge detectors: such self-organizing neural nets are organized into hierarchical layers, and the units in the early layers learn to become detectors of statistically common components of the input image, in the same way that the initial stages of the visual system (retina, lateral geniculate nucleus, V1) perform blob and edge detection. In mathematical terms, these early units learn the conditional principal components of the inputs. The layers stacked on top of these detectors, if correctly organized, build upon this initial abstraction and learn more complex features: for instance, finding these edges in particular positions relative to each other. Eventually, further up the abstraction chain, units detect such statistically frequent features as the shape of a cat's ears (common in YouTube videos, I imagine), and so on.
(Sorry I wrote that fast, I hope it's understandable)
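To make the "early units learn principal components" point concrete, here is a minimal sketch of the simplest version of that idea: a single linear unit trained with Oja's rule, which for zero-mean input converges to the first principal component. This is not the network from the article; the toy 2-D data, the learning rate, and the helper names (`oja_first_component`, `make_toy_data`) are all assumptions made for illustration.

```python
import numpy as np

def oja_first_component(X, lr=0.002, epochs=10, seed=0):
    """Train one linear unit with Oja's rule on the zero-mean rows of X."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            y = w @ x                  # the unit's response to this input
            w += lr * y * (x - y * w)  # Hebbian growth plus a decay term
    return w / np.linalg.norm(w)

def make_toy_data(n=2000, seed=1):
    """Hypothetical input: a 2-D Gaussian stretched along (1, 1)/sqrt(2)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, 2)) * np.array([3.0, 0.5])
    rot = np.array([[1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2)
    X = z @ rot.T
    return X - X.mean(axis=0)

X = make_toy_data()
w = oja_first_component(X)

# Compare with the top eigenvector of the sample covariance matrix.
top = np.linalg.eigh(np.cov(X.T))[1][:, -1]
alignment = abs(w @ top)  # close to 1.0 means the same direction (up to sign)
print(alignment)
```

The unit never sees a covariance matrix; it only sees one input at a time and applies a local update, which is the sense in which a layer of such units can "discover" the dominant statistical structure of its input.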
The performance is due more to the architecture than to the scale. The scale is there to handle all that data, but the feature-learning performance comes from their layered sparse learning technique, which is brilliant. That said, their autoencoder neural network is really learning a decomposition of the data, so the neural part is something of a red herring.
That you get high-level features instead of edges is not the impressive part: you can just as well write a sparse non-negative matrix factorization algorithm that will efficiently learn and represent eyes, lips and noses as features of faces, unsupervised.
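For reference, a sparse NMF of that kind fits in a few lines. The sketch below uses the standard multiplicative updates for the squared-error objective with an added L1 penalty on the activations; the toy "parts" data (three non-overlapping blocks of pixels rather than real faces), the penalty weight, and the function names are assumptions for illustration only.

```python
import numpy as np

def sparse_nmf(V, rank, l1=0.05, iters=500, seed=0, eps=1e-9):
    """Factor non-negative V (pixels x samples) as W @ H with sparse H."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(iters):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + l1 + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # Fix the scale of each part so the L1 penalty on H
        # cannot be dodged by inflating W; W @ H is unchanged.
        norms = np.linalg.norm(W, axis=0) + eps
        W /= norms
        H *= norms[:, None]
    return W, H

def make_parts_data(n_samples=200, seed=1):
    """Hypothetical data: each sample is a sparse mix of 3 disjoint parts."""
    rng = np.random.default_rng(seed)
    parts = np.zeros((12, 3))
    for k in range(3):
        parts[4 * k:4 * (k + 1), k] = 1.0
    active = rng.random((3, n_samples)) < 0.5
    H_true = rng.random((3, n_samples)) * active
    return parts @ H_true

V = make_parts_data()
W, H = sparse_nmf(V, rank=3)
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(rel_error)
```

Each column of `W` is a learned part and each column of `H` says how strongly each part is active in one sample. On face images the same procedure is what tends to yield localized parts, because non-negativity only allows parts to be added together, never cancelled.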
Clustering algorithms operate on features, which typically have to be designed by hand. The appeal of deep learning is that it discovers good features automatically.
Feature sensitivity is typically hand-crafted only because it's the practical thing to do. Neural nets can easily learn visual features. See the LISSOM neural nets for a good example of self-organized learning of features.
It's not revolutionary. There are plenty of clustering algorithms and neural nets.
Really, what differentiates this network is its scale.