All primitive performance evaluations should be multidimensional, IMO, with fairness being one of the highest-priority variables.
The importance of fairness varies with the algorithm. For instance, one that alternates between being CPU-bound and disk-bound may use N_workers >>> N_cores. These workers might not care about individual starvation at all.
At the other end, consider a strictly CPU-bound algorithm with N_workers==N_cores, and a rogue-ish worker that ends up hogging an important mutex by spinning around it (e.g. the one written by author of the benchmark.) If the other well-behaved workers briefly lock the mutex once prior to each job, but end up getting woken up only once every 30 locks (as this implementation apparently does,) you might end up with core utilization <<< 100%.
This in turn means that APIs that let you customize the behavior can win out even if they're not #1 on any benchmark.
The importance of fairness varies with the algorithm. For instance, one that alternates between being CPU-bound and disk-bound may use N_workers >>> N_cores. These workers might not care about individual starvation at all.
At the other end, consider a strictly CPU-bound algorithm with N_workers==N_cores, and a rogue-ish worker that ends up hogging an important mutex by spinning around it (e.g. the one written by author of the benchmark.) If the other well-behaved workers briefly lock the mutex once prior to each job, but end up getting woken up only once every 30 locks (as this implementation apparently does,) you might end up with core utilization <<< 100%.
This in turn means that APIs that let you customize the behavior can win out even if they're not #1 on any benchmark.