thanks :)
Eclipse Collections can also do batched loops which might speed this up. Telling Java how many threads you want will probably help as well
Interestingly, I tried plain old for(i) loops and the result was exactly the same. At least for Eclipse Collections, the syntactic sugar for ranges and forEach is completely optimized away, apparently.
You might already know this, but there is one potential caveat with APIs like that when it comes to performance, or at least to measuring their performance. The hot loop (the code that actually iterates over your data points) does not live in your code base, but performance often depends on what the JIT compiler makes of that loop. If the API is used at several locations in your program, the compiler might not be able to generate code that is optimal for your call site and inputs. Instead, it will generate generalized code that works for all inputs but might be slower.
However, when writing benchmarks, there is often no other code around to force the JIT compiler to generate such generalized code.
The following code demonstrates this. If the parameter warmup is set, I invoke the forEach methods with different inputs first (they do the same thing, but each lambda literal compiles to a different method in the Java bytecode). The purpose is to force the compiler to generate generalized code:
    @Param({"true", "false"})
    public boolean warmup;

    @Setup
    public void setup() {
        if (warmup) {
            // Three textually identical blocks: each lambda literal compiles
            // to its own class, so the call site sees distinct implementations.
            Interval.oneTo(20).forEach(
                (IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
                    (IntProcedure) j -> Integer.toString(j)));
            Interval.oneTo(20).forEach(
                (IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
                    (IntProcedure) j -> Integer.toString(j)));
            Interval.oneTo(20).forEach(
                (IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
                    (IntProcedure) j -> Integer.toString(j)));
        }
    }
    @Benchmark
    public void collectionsBlackhole(Blackhole blackhole) {
        Interval.oneTo(20).forEach(
            (IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
                (IntProcedure) j -> blackhole.consume(Integer.toString(j))));
    }

    @Benchmark
    public void collectionsDeadCode() {
        Interval.oneTo(20).forEach(
            (IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
                (IntProcedure) j -> Integer.toString(j)));
    }
And the for loop implementations for reference:
    @Benchmark
    public void loopBlackhole(Blackhole blackHole) {
        for (int j = 0; j < 20; j++) {
            for (int i = 0; i < 1_000_000; i++) {
                blackHole.consume(Integer.toString(i));
            }
        }
    }

    @Benchmark
    public void loopDeadCode() {
        for (int j = 0; j < 20; j++) {
            for (int i = 0; i < 1_000_000; i++) {
                Integer.toString(i);
            }
        }
    }

    @Benchmark
    public void loopBlackholeOnly(Blackhole hole) {
        for (int j = 0; j < 20; j++) {
            for (int i = 0; i < 1_000_000; i++) {
                hole.consume(0xCAFEBABE);
            }
        }
    }
On my desktop machine, this gives me the following results (Java 11/Hotspot/C2):
All results are for single threaded code. Now, the difference is not huge, but significant. Overall it looks like the invocation of toString(int) is not removed and accounts for most of the runtime.
Just to be clear: I am not saying one should stay away from the stream APIs. As soon as the work per item is more than just a few arithmetic operations, there is a good chance the difference in runtime is negligible. But when doing numerical work (aggregations etc.), a simple loop might be the better option.
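To make the trade-off above concrete, here is a minimal sketch (class and variable names are mine, not from the benchmark above) of the same aggregation written both ways. The results are identical; the difference is where the hot loop lives:

```java
import java.util.stream.IntStream;

public class SumVariants {
    public static void main(String[] args) {
        int n = 1_000_000;

        // Stream version: concise, but the hot loop lives inside the
        // stream library code, shared by every caller in the program.
        long streamSum = IntStream.range(0, n).asLongStream().sum();

        // Plain loop: the hot loop is right here in this method, so the
        // JIT can specialize it for exactly this call site.
        long loopSum = 0;
        for (int i = 0; i < n; i++) {
            loopSum += i;
        }

        System.out.println(streamSum == loopSum); // prints true
    }
}
```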
Finally, these differences are of course compiler-dependent. For example, C2 might behave differently than Graal and who knows what the future brings.
I didn't think about this!! That could be a big performance hit to pay for nice APIs.
I think inlining might save us here? I'm not sure what the metrics are around inlining lambdas... Does it look at the size of the code wrapping the lambda or the size of the lambda block itself to decide inlining eligibility?
If it only considers the size of the wrapper, maybe the small forEach method will always get inlined with the block it calls?
EDIT: according to this https://stackoverflow.com/questions/44161545/can-hotspot-inl...
HotSpot will inline lambda method calls as long as the depth doesn't exceed MaxInlineLevel (9 by default). Assuming the API designers made their methods small enough to be inlined, they should all be flattened out.
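A rough way to picture the depth limit (class name and the depth of 12 are my own choices for illustration): each composed lambda adds one more call frame between the outer call and the innermost body, so a deep enough chain exceeds what -XX:MaxInlineLevel allows C2 to flatten:

```java
import java.util.function.IntUnaryOperator;

public class InlineDepthSketch {
    public static void main(String[] args) {
        // Each andThen wraps the previous operator, so applying the final
        // chain walks through 12 nested applyAsInt calls. Once the runtime
        // call depth exceeds -XX:MaxInlineLevel (9 by default on the JDKs
        // discussed here), the innermost calls are no longer inlined.
        IntUnaryOperator chain = IntUnaryOperator.identity();
        for (int depth = 0; depth < 12; depth++) {
            chain = chain.andThen(x -> x + 1);
        }
        System.out.println(chain.applyAsInt(0)); // prints 12
    }
}
```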
Yes, this is strongly related to inlining. Without the warmup code, C2 most likely inlines your lambdas into the code generated for the forEach method. At least I have seen this behavior with similar APIs (you can use a tool called JITWatch to visualize compiled and inlined methods). However, this does not scale.
Last time I checked, C2 inlines at most two implementations at a polymorphic call site (in this case the line that calls your lambda). If you pass in more lambda functions, it won't inline them and might de-optimize existing compiled code to remove previously inlined functions (there are also cases where one implementation that is heavily used gets inlined and the others will be invoked via a function call). Graal does not seem to have the same limit, but when I tested it two years ago it had others, averaging out to the same performance.
Thus, increasing the max inline level (which you can already do in Java 8 and 11) does not necessarily help with this particular problem. What you would want is for the compiler to inline the forEach method and the local lambdas into the methods using the API (the benchmark methods in our case), i.e., you want the compiler to copy the hot loop somewhere higher up in the call stack and then optimize it. But apparently this is easier said than done.
But again, this is probably only relevant for workloads where there is very little work to do for each item.
I looked around and indeed this still appears to be the case. Megamorphic call sites will not be inlined.
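For anyone unfamiliar with the terminology, here is a minimal sketch (names are mine, purely illustrative) of what makes a single call site megamorphic: one profiled call, three distinct lambda classes flowing through it.

```java
import java.util.function.IntUnaryOperator;

public class CallSiteShapes {
    // The op.applyAsInt call below is the single call site the JIT
    // profiles. One observed receiver class makes it monomorphic, two
    // bimorphic, three or more megamorphic -- at which point C2 stops
    // inlining and falls back to a real virtual dispatch per element.
    static long sum(IntUnaryOperator op, int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += op.applyAsInt(i);
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 1_000;
        // Each lambda literal compiles to its own class, so passing all
        // three makes the call site inside sum() megamorphic.
        long a = sum(i -> i + 1, n);
        long b = sum(i -> i * 2, n);
        long c = sum(i -> i - 1, n);
        System.out.println(a + " " + b + " " + c); // prints 500500 999000 498500
    }
}
```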
From reading mailing list messages, the main reason for this shortcoming appears to be C2 is such a mess nobody wants to touch it and they're writing a new compiler instead (Graal).
If optimizations start getting targeted towards Graal I'll be pissed. Oracle gimps the performance of the free version, the one thing you should never ever do with a programming language.
> For example, C2 might behave differently than Graal and who knows what the future brings.
Yes, the GraalVM compiler is more aggressive on some optimizations that can be especially helpful with streams. Here are numbers from my machine (using ops/s, so higher is better!):
C2 (on Java(TM) SE Runtime Environment (build 1.8.0_251-b08)):
Amusingly, on the OP's benchmark GraalVM also does better than C2, and it somehow (I haven't tried to analyze this yet) makes the serial stream version as fast as the parallel one:
(I work in the GraalVM compiler team at Oracle Labs, I don't speak for Oracle, I'm not claiming anything about whether this is a good or bad benchmark, etc.)
The results for the parallel code are indeed interesting. However, the speedup < 2 with C2 is suspicious. Even on an old dual core / 4 thread system (common worker pool should default to 3 workers), I would expect it to be above 2 for this kind of workload.
The last time I looked into Graal (over a year ago), I ran into something I could not make sense of.
We were designing an API with methods similar to Arrays.setAll(double[], IntToDoubleFunction) which we expected to be called with large input arrays from many places in the code base.
The performance of the code generated by C2 dropped as one would expect once you were using more that two functions (lambda functions were no longer inlined and invoked for each element). For simple arithmetic operations the runtime increased by around 5x.
The performance of the code generated by Graal was competitive with C2's mono- and bimorphic performance for a much larger number of input functions. I don't recall the exact number, but I believe it was around 25. However, once we hit that threshold, performance tanked. I believe the difference was closer to 15x than C2's 5x.
We never figured out what caused this.
Do you have an idea? Or, since this is probably outdated data, do you know what the expected behavior is nowadays? Is there maybe any documentation on this?
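For context, the Arrays.setAll scenario described above might look roughly like this (class name and the particular functions are mine; the real API had many more candidate functions):

```java
import java.util.Arrays;
import java.util.function.IntToDoubleFunction;

public class SetAllCandidates {
    public static void main(String[] args) {
        double[] data = new double[1_000];

        // Several distinct IntToDoubleFunction implementations flowing
        // into the same Arrays.setAll call site. In the scenario above,
        // C2 degraded at the 3rd candidate, while Graal held up until
        // roughly the 25th and then dropped off much more sharply.
        IntToDoubleFunction[] candidates = {
            i -> i * 0.5,
            i -> Math.sqrt(i),
            i -> i * i
        };
        for (IntToDoubleFunction f : candidates) {
            Arrays.setAll(data, f);
        }
        System.out.println(data[10]); // last candidate wins: prints 100.0
    }
}
```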
> However, the speedup < 2 with C2 is suspicious. Even on an old dual core / 4 thread system (common worker pool should default to 3 workers), I would expect it to be above 2 for this kind of workload.
The speedup I found on C2 is 2.547 / 1.719 ~= 1.5, the OP's speedup (also on C2) was 0.71 s / 0.38 s ~= 1.9. Not the same factor, but not above 2 either. The benchmark is quite small, so the overhead of setting up the parallel computation might matter. The JDK version might matter as well.
> Graal was competitive with C2's code for mono- and bimorphic scenarios for a much larger number of input functions. I don't recall the exact number, but I believe it was around 25.
If we are talking about a call like Arrays.setAll(array, function) and you have one or two candidates for function, then you are mono- or bimorphic. If you have 25 candidates for function, you are highly polymorphic. I don't understand what you mean by a mono- or bimorphic scenario with up to 25 candidates.
Performance will usually decrease as you transition from a call site with few candidates to one with many candidates. I agree that a factor of 15x looks steep, but it's hard to say more without context. The inliners (Community and Enterprise use different ones) are constantly being tweaked, though that's not my area of expertise. It's very possible that this behaves differently than it did more than a year ago.
If you have a concrete JMH benchmark you can link to and are willing to post to the GraalVM Slack (see signup link from https://www.graalvm.org/community/), that would be great. Alternatively, you could drop me a line at <first . last at oracle dot com>, and I could have a look at a benchmark, without promising anything.
> If we are talking about a call like Arrays.setAll(array, function) and you have one or two candidates for function, then you are mono- or bimorphic. If you have 25 candidates for function, you are highly polymorphic. I don't understand what you mean by a mono- or bimorphic scenario with up to 25 candidates.
Sorry, I phrased that poorly. What I was trying to say was that the moment we introduced a 3rd candidate the performance with C2 decreased by ~5x. Whereas with Graal the performance stayed virtually the same when introducing a 3rd or 4th candidate. And it was performing well. But it did decrease at some point after adding more and more candidates and then the performance hit was much bigger.
I will reach out to my colleagues still working on this next week (long weekend ahead where I live). If we can reproduce this with a current Graal release, I'll share the benchmark – always happy to learn something new, even if it is 'just' why our benchmark is broken. :)
I really appreciate the in-depth replies on HN. Something Reddit lost ages ago. I gotta say your posts have introduced me to a lot of dark magic in the JVM that isn't really documented anywhere unless you count source code. Much appreciated!
Interesting indeed!
I'll try to avoid a long rant, but I won't go near Graal because Oracle is putting many of the performance optimizations behind a paywall (wtf). Who writes a compiler and makes the code slower for the free version? That's like something MS would do in 1995.
If Graal does replace OpenJDK eventually and Oracle keeps the performance optimizations behind an enterprise license, I have a feeling the Java community will fork the project away from Oracle. Just like they did for OpenOffice, MySQL, and Jenkins.
Another comment about Graal. Since performance is so important, it's likely a third party like Red Hat will want to merge code into the community edition that does the same thing as code already present in an "official version", like Red Hat already did for GC with Shenandoah. What is Oracle going to do? Say "no, because it competes with our commercial branch"? I think what's going to happen is what happened with Shenandoah, where the Oracle builds are crippled by excluding features. What happened there? 99% of people switched to AdoptOpenJDK.
I'm not attacking you here, I realize you have no control over the matter. And thanks for Graal, it's super impressive! I'm just salty that the community decided to replace C2 a decade ago, yet Oracle has decided not to release the full replacement back to the community. Many companies differentiate themselves by having commercial versions of tools that perform better; that's fine with me. I'm salty because Graal was how the community decided to replace C2. It's the reference implementation. Now the main benefit of that replacement, better performance, is locked behind a paywall. This is not what anyone intended, and had they known this would happen, they probably would have just rewritten C2 and left it in HotSpot.
Java/C/C#/C++/Rust etc. are roughly at the same end of the performance spectrum (unless you're creating stupid numbers of objects).
Perl/Python/Ruby are dynamic, so expect slower results.
I like the threaded approach you're using.