OpenEquivariance [1] is another good baseline, with kernels for the Clebsch–Gordan tensor product and convolution, and it is fully open source. Both kernel implementations have been successfully integrated into existing machine learning interatomic potentials, e.g. [2,3].
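For anyone unfamiliar, the operation these kernels accelerate is easiest to see in its smallest instance. Here is a sketch (plain NumPy, nothing from the actual library) of coupling two l=1 vectors into their l=0, l=1, and l=2 parts, which is what the Clebsch–Gordan tensor product does up to normalization:

```python
import numpy as np

# Coupling two l=1 vectors into l=0 + l=1 + l=2 irreps:
# the dot product (scalar), the cross product (pseudovector),
# and the symmetric traceless outer product (5 components).
# This is a toy illustration, not OpenEquivariance's API.
def cg_product_l1_l1(u, v):
    l0 = u @ v                                    # scalar (l=0)
    l1 = np.cross(u, v)                           # pseudovector (l=1)
    sym = 0.5 * (np.outer(u, v) + np.outer(v, u))
    l2 = sym - (np.trace(sym) / 3.0) * np.eye(3)  # traceless symmetric (l=2)
    return l0, l1, l2

l0, l1, l2 = cg_product_l1_l1(np.array([1.0, 0.0, 0.0]),
                              np.array([0.0, 1.0, 0.0]))
print(l0, l1, np.trace(l2))  # 0.0 [0. 0. 1.] ~0.0
```

The optimized kernels do this for many channels and higher l simultaneously, which is where the performance wins come from.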
A related line of work is "Thinking Like Transformers" [1]. They introduce a primitive programming language, RASP, composed of operations that can be modeled with transformer components, and demonstrate how different programs can be written in it, e.g. histograms and sorting. Sasha Rush and Gail Weiss have an excellent blog post on it as well [2]. Follow-on work demonstrated how RASP-like programs can actually be compiled into model weights without training [3].
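To give a flavor of the idea, here is a toy version of the classic RASP "histogram" program in NumPy (the function names are mine, not the paper's actual primitives): a selector builds an attention pattern matching equal tokens, and aggregating a vector of ones over it counts occurrences.

```python
import numpy as np

# Toy RASP-style histogram: for each token, count how many times
# it appears in the sequence. select_eq plays the role of attention;
# summing ones over the selected positions plays the role of aggregate.
def select_eq(tokens):
    toks = np.array(list(tokens))
    return (toks[:, None] == toks[None, :]).astype(float)  # attention pattern

def histogram(tokens):
    attn = select_eq(tokens)
    return attn @ np.ones(len(tokens))  # count of selected positions

print(histogram("hello"))  # [1. 1. 2. 2. 1.]
```

(RASP's aggregate actually averages over selected positions; summing ones is the equivalent "selector width" trick for counting.)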
Huge fan of RASP et al. If you enjoy this space, it might be fun to take a glance at some of my work on HandCrafted Transformers [1], wherein I hand-pick the weights of a transformer model to do longhand addition, similar to how humans learn to do it in grade school.
Link to paper [1]. It looks like the authors construct a basic autoencoder to predict frames in videos of various systems (double pendulum, lava lamp, etc.) and then use the Levina–Bickel algorithm [2] to estimate the "intrinsic dimension" of the autoencoder's latent space. They then refer to the intrinsic dimension as the "minimum number of variables required by the system to accurately capture the motion", e.g. 24 for a video of a fireplace.
Personally, I wonder how much information this actually provides about a system. Since the neural network is non-linear, a single latent variable may theoretically function as more than one state variable.
Chess experts are only good at remembering valid chess positions that could result from a game in play. Their "expertise" vanishes when trying to remember chess boards set up in random (invalid) positions.
It could be that there are a few valid states that can be highly encoded in each of these situations, and the autoencoders found them.
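For reference, the Levina–Bickel estimator mentioned above is short enough to sketch: each point's dimension estimate is the inverse mean log-ratio of its k-th nearest-neighbor distance to the closer neighbor distances, averaged over the sample. A minimal version (brute-force distances, my own variable names):

```python
import numpy as np

# Minimal sketch of the Levina-Bickel MLE intrinsic-dimension estimator:
# m_k(x) = [ (1/(k-1)) * sum_j log(T_k(x) / T_j(x)) ]^-1,
# where T_j(x) is the distance from x to its j-th nearest neighbor.
def levina_bickel(X, k=10):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d.sort(axis=1)
    T = d[:, 1:k + 1]                      # skip the self-distance in column 0
    logs = np.log(T[:, -1:] / T[:, :-1])   # log(T_k / T_j) for j = 1..k-1
    m = 1.0 / logs.mean(axis=1)            # per-point MLE
    return m.mean()                        # average over the sample

# sanity check: points on a 2-D subspace embedded in 10-D
rng = np.random.default_rng(0)
X = np.zeros((500, 10))
X[:, :2] = rng.normal(size=(500, 2))
print(levina_bickel(X))  # close to 2
```

Note that the estimate is only as meaningful as the representation it is run on, which is exactly the concern about nonlinear latents above.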
According to [1], the byte pair encoding for “Apoploe vesrreaitais” (the words producing bird images) is "apo, plo, e</w>, ,ve, sr, re, ait, ais</w>", and Apo-didae & Plo-ceidae are families of birds.
On the other hand, the OpenAI tokenizer gives me a different tokenization: ap - opl - oe [0]. If you capitalize the A, the result is A - pop - loe. The DALL-E 2 paper only specifies that it uses a BPE encoding; I would assume they used the same one as for GPT-3.
[0] https://beta.openai.com/tokenizer
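The discrepancy is unsurprising given how BPE works: the split depends entirely on the learned merge table, so different models tokenize the same string differently. A toy sketch (with a made-up merge table, not OpenAI's) of the core loop:

```python
# Toy BPE: repeatedly apply the highest-priority adjacent merge.
# The merge table here is invented for illustration; real tokenizers
# also differ in pre-tokenization, which is why "Apoploe" can split
# as apo-plo-e under one merge table and ap-opl-oe under another.
def bpe(word, merges):
    ranks = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)
    while True:
        pairs = [(ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(tokens, tokens[1:]))]
        best_rank, i = min(pairs, default=(float("inf"), -1))
        if best_rank == float("inf"):
            break  # no learned merge applies
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

merges = [("a", "p"), ("ap", "o"), ("p", "l"), ("pl", "o")]
print(bpe("apoploe", merges))  # ['apo', 'plo', 'e']
```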
I wonder how their advertised "vector database" works. kNN combined with embeddings from pre-trained deep learning models can be very useful for information retrieval, (e.g. searching for duplicate/similar images or text).
In the past I have used a k-d tree [1] for this, which allows O(log n) searches in the vector space. It seems they are offering a k-d-tree-as-a-service.
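The workflow I mean is roughly this (using SciPy's k-d tree as a stand-in; low-dimensional embeddings for illustration):

```python
import numpy as np
from scipy.spatial import cKDTree

# Sketch of kNN retrieval over embeddings with a k-d tree:
# build the index once, then run cheap nearest-neighbor queries.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 8))      # n items, 8-D embeddings
tree = cKDTree(embeddings)

# query with a slightly perturbed copy of item 42;
# its nearest neighbor should be item 42 itself
query = embeddings[42] + 0.01 * rng.normal(size=8)
dist, idx = tree.query(query, k=5)             # 5 nearest neighbors
print(idx[0])  # 42
```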
Pinecone stores and searches through dense vector embeddings using a proprietary ANN index. It also has live index updates and metadata filtering, which you’d expect from any database but are surprisingly hard to find or do with vector indexes.
As you said, common use cases include deduplication and image search, and especially semantic search (text).
One quick note: k-d trees are great for indexing low-dimensional data, but for high-dimensional embeddings they tend to be a poor indexing choice since you'll end up visiting more nodes in the tree than you'd like. I found [1] to be a great overview of different indexing types for high-dimensional vectors and the advantages of each.
For image retrieval, have you tried using a model trained with contrastive learning (e.g. SimCLR)? This could produce better embeddings for retrieval, since the model is explicitly trained to minimize the Euclidean distance between similar pairs.
Thanks for the reference! Nice outline of various ANN approaches.
I haven't tried SimCLR, but I did try face embedding models trained with contrastive and triplet loss. For applications where precision is the key metric, I do agree that these loss functions are much better overall.
If discovery or recall is what you're after, a generic image classification model trained with binary cross-entropy might be better. For example, performing reverse image search on a photo of a German Shepherd should always return images of GSheps in the first N pages, but showing other dog breeds in later pages and possibly even cats after that would be a desirable feature for many search/retrieval solutions. An embedding model trained with contrastive loss might have this behavior to a certain extent, but a model based on BCE should be better.
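For concreteness, the margin-based contrastive loss being discussed in this thread looks roughly like this (a minimal NumPy sketch, not any particular library's API): matching pairs are pulled together, mismatched pairs are pushed at least `margin` apart, and anything already farther than the margin contributes nothing, which is part of why recall on "related but not identical" items can suffer.

```python
import numpy as np

# Margin-based contrastive loss over pairs of embeddings.
# same[i] = 1 if (a[i], b[i]) is a matching pair, else 0.
def contrastive_loss(a, b, same, margin=1.0):
    d = np.linalg.norm(a - b, axis=1)                 # Euclidean distances
    pos = same * d ** 2                               # pull matches together
    neg = (1 - same) * np.maximum(margin - d, 0) ** 2 # push non-matches apart
    return 0.5 * (pos + neg).mean()

a = np.array([[0.0, 0.0], [0.0, 0.0]])
b = np.array([[0.1, 0.0], [2.0, 0.0]])
same = np.array([1.0, 0.0])   # first pair matches, second does not
print(contrastive_loss(a, b, same))  # 0.0025
```

Note the second (non-matching) pair is already beyond the margin, so it contributes zero loss: there is no gradient signal ordering "dog vs. other dog" against "dog vs. cat", which matches the recall behavior described above.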
Einops looks nice! It reminds me of https://github.com/deepmind/einshape which is another attempt at unifying reshape, squeeze, expand_dims, transpose, tile, flatten, etc under an einsum-inspired DSL.
Somebody also realized that much of the time you can use one single function to describe all 3 of the einops operations. I present to you, einop: https://github.com/cgarciae/einop
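For anyone who hasn't used these libraries, the appeal is that one pattern string replaces a pile of reshape/transpose calls. Spelled out in plain NumPy (the einops patterns appear only as comments, so this runs without einops installed):

```python
import numpy as np

x = np.zeros((2, 4, 4, 3))          # batch, height, width, channels

# rearrange(x, 'b h w c -> b (h w) c')
flat = x.reshape(2, 4 * 4, 3)

# rearrange(x, 'b h w c -> b c h w')
chw = x.transpose(0, 3, 1, 2)

# reduce(x, 'b h w c -> b c', 'mean')
pooled = x.mean(axis=(1, 2))

print(flat.shape, chw.shape, pooled.shape)  # (2, 16, 3) (2, 3, 4, 4) (2, 3)
```

The pattern string documents the axis semantics at the call site, which is the main win over remembering which integer goes where in `transpose`.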
About a year ago I read this and attempted to build a similar model (with more layers ;)) using data I scraped from Hong Kong Jockey Club’s website. Although I used far fewer features, it still produced a profit on held-out races: https://teddykoker.com/2019/12/beating-the-odds-machine-lear.... Obviously there are many caveats when backtesting like this, but I thought it was a fun project!
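For anyone curious what such a backtest looks like, a bare-bones sketch (entirely simulated data, not my actual model or features): bet a flat stake whenever the model's win probability times the decimal odds clears a small edge threshold.

```python
import numpy as np

# Toy expected-value backtest on simulated races.
# p_model: model's win probabilities; odds: decimal odds with noise,
# so the "edge" appears when the market misprices a horse.
rng = np.random.default_rng(0)
n = 1000
p_model = rng.uniform(0.05, 0.4, n)
odds = 1.0 / np.clip(p_model + rng.normal(0, 0.05, n), 0.02, 0.95)
won = rng.uniform(size=n) < p_model        # simulated outcomes

bets = p_model * odds > 1.05               # bet only with a ~5% edge
profit = np.where(won, odds - 1.0, -1.0)   # flat $1 stake payoff
print(round(profit[bets].sum(), 2))
```

The caveats mentioned above (held-out leakage, odds moving after you bet, track take) all live outside this loop, which is exactly why backtests like this flatter the strategy.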
I had a coworker who would prepare for weeks for stakes races and follow a few second tier horses as well. Twitter made some aspects easier as there is a racetrack Twitter community. His specialty was identifying exactas where a long shot would place or win with a favorite.
He’d pay people to film workouts at Belmont and Saratoga and tweak his model (an Excel spreadsheet) based on what he saw. He would develop a sense based on the workouts, weather, etc. and would pick 4-10 races a week.
[1] https://github.com/PASSIONLab/OpenEquivariance
[2] https://arxiv.org/abs/2504.16068
[3] https://arxiv.org/abs/2508.16067