
Sharing some personal thoughts:

Publishing a paper can be an imperfect but useful 'expensive signal' that increases visibility. For vqsort, we initiated a collaboration with two university groups that had published relevant papers: KIT's ips4o for parallel sorting, and Jena's fast AVX2 partition and 2D sorting network.

It needn't be AVX-512, but I'd consider AVX2 to be table stakes for any sorting algorithm. It's not clear that scalar techniques transfer to vector approaches (for example, multi-pivot partitioning did not help), and the speedup from AVX2 vs scalar is at least 3x.

It's not clear to me that special-casing already-sorted inputs is a net positive in practice outside of benchmarks. Sure, we can early-out as soon as a difference is found, but we're still potentially paying for an entire scan through the array, or a slowdown in partitioning (we found that even tracking the min/max had a noticeable cost). This check can instead be done at the level of callers, who can then remember that the dataset is sorted and even skip the is-sorted check on subsequent calls.

In-place seems helpful for benefiting from limited-size but faster memory such as L3 or even HBM.

Finally, why C? As you mention, AVX-512 is not ubiquitous, and we'd prefer portability over duplicating several thousand LOC for each instruction set. That's much easier (via Highway) in C++. Would a C-compatible FFI be enough?



pdqsort paper: https://arxiv.org/abs/2106.05123

fluxsort compiles in C++ just fine.

Pre-sorted data is just the simplest example of adaptivity. Partially-sorted input can't be tracked easily by the caller, and I believe patterns like merging a few sorted arrays by concatenating and then sorting them are pretty common. Fluxsort, glidesort, ipnsort, and vergesort all have methods that merge such patterns quickly. According to the benchmarks in the article, many are also doing better on low-cardinality data.

If you don't consider any of this adaptivity useful, it's very strange to compare against std::sort, which does make some attempt to be adaptive. And strange that your documentation wouldn't mention the difference. But we all know why you're doing that. I suppose you'll keep your fingers in your ears regarding 32-bit radix sort, which has benchmarked at 1450 MB/s for a million elements (2.77 ns/v) on M1: https://github.com/mlochbaum/rhsort/issues/2#issuecomment-15... .

The fulcrum partition from crumsort is an in-place adaptation of fluxsort's out-of-place partition, which is why I mentioned it. It's faster when things stop fitting in L2 in my testing.


> According to the benchmarks in the article, many are also doing better at low-cardinality data.

VQSort handles low cardinality quite well. Unfortunately only a few benchmarks in the article (not including that one) were re-run after our performance bug fix.

> it's very strange to compare to std::sort which does make some attempt to be adaptive.

We chose std::sort mainly because it seems to be widely used.

> But we all know why you're doing that. I suppose you'll keep your fingers in your ears regarding 32-bit radix sort

Huh, I'm curious what the background or context for this remark is. FYI, I published a fast 32-bit radix sort in 2010: https://arxiv.org/abs/1008.2849 . The trend has been towards less and less memory bandwidth per core, and we often see servers bottlenecked on that.





