thanks :)
Eclipse Collections can also do batched loops which might speed this up. Telling Java how many threads you want will probably help as well
Interestingly, I tried plain old for(i) loops and the result was exactly the same. At least for Eclipse Collections, the syntactic sugar for ranges and forEach is completely optimized away, apparently.
You might already know this, but there is one potential caveat with APIs like that when it comes to performance, or at least to measuring their performance. The hot loop (the code that actually iterates over your data points) does not live in your code base, but performance often depends on what the JIT compiler makes of that loop. If the API is used at several locations in your program, the compiler might not be able to generate code that is optimal for your call site and inputs. Instead, it will generate generalized code that works for all inputs but might be slower.
However, when writing benchmarks, there is often no other code around to force the JIT compiler to generate such generalized code.
The following code demonstrates this. If the parameter warmup is set, I invoke the forEach methods with different inputs first (they do the same thing, but each lambda literal compiles to a different method in the Java bytecode). The purpose is to force the compiler to generate generalized code:
    @Param({"true", "false"})
    public boolean warmup;

    @Setup
    public void setup() {
        if (warmup) {
            // Three textually identical blocks: each lambda literal compiles
            // to its own class, so the call site sees distinct implementations.
            Interval.oneTo(20).forEach(
                (IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
                    (IntProcedure) j -> Integer.toString(j)));
            Interval.oneTo(20).forEach(
                (IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
                    (IntProcedure) j -> Integer.toString(j)));
            Interval.oneTo(20).forEach(
                (IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
                    (IntProcedure) j -> Integer.toString(j)));
        }
    }
    @Benchmark
    public void collectionsBlackhole(Blackhole blackhole) {
        Interval.oneTo(20).forEach(
            (IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
                (IntProcedure) j -> blackhole.consume(Integer.toString(j))));
    }

    @Benchmark
    public void collectionsDeadCode() {
        Interval.oneTo(20).forEach(
            (IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
                (IntProcedure) j -> Integer.toString(j)));
    }
And the for loop implementations for reference:
    @Benchmark
    public void loopBlackhole(Blackhole blackHole) {
        for (int j = 0; j < 20; j++) {
            for (int i = 0; i < 1_000_000; i++) {
                blackHole.consume(Integer.toString(i));
            }
        }
    }

    @Benchmark
    public void loopDeadCode() {
        for (int j = 0; j < 20; j++) {
            for (int i = 0; i < 1_000_000; i++) {
                Integer.toString(i);
            }
        }
    }

    @Benchmark
    public void loopBlackholeOnly(Blackhole hole) {
        for (int j = 0; j < 20; j++) {
            for (int i = 0; i < 1_000_000; i++) {
                hole.consume(0xCAFEBABE);
            }
        }
    }
On my desktop machine, this gives me the following results (Java 11/Hotspot/C2):
All results are for single threaded code. Now, the difference is not huge, but significant. Overall it looks like the invocation of toString(int) is not removed and accounts for most of the runtime.
Just to be clear: I am not saying one should stay away from the stream APIs. As soon as the work per item is more than just a few arithmetic operations, there is a good chance the difference in runtime is negligible. But when doing numerical work (aggregations etc.), a simple loop might be the better option.
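To make the trade-off above concrete, here is a minimal sketch (class and variable names are mine, not from the benchmark above) of the same aggregation written both ways. The results are identical; the difference is where the hot loop lives:

```java
import java.util.stream.IntStream;

public class SumVariants {
    public static void main(String[] args) {
        int n = 1_000_000;

        // Stream version: concise, but the hot loop lives inside the
        // stream library code, shared by every caller in the program.
        long streamSum = IntStream.range(0, n).asLongStream().sum();

        // Plain loop: the hot loop is right here in this method, so the
        // JIT can specialize it for exactly this call site.
        long loopSum = 0;
        for (int i = 0; i < n; i++) {
            loopSum += i;
        }

        System.out.println(streamSum == loopSum); // prints true
    }
}
```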
Finally, these differences are of course compiler-dependent. For example, C2 might behave differently than Graal and who knows what the future brings.
I didn't think about this!! That could be a big performance hit to pay for nice APIs.
I think inlining might save us here? I'm not sure what the metrics are around inlining lambdas... Does it look at the size of the code wrapping the lambda or the size of the lambda block itself to decide inlining eligibility?
If it only considers the size of the wrapper, maybe the small forEach method will always get inlined with the block it calls?
EDIT: according to this https://stackoverflow.com/questions/44161545/can-hotspot-inl...
HotSpot will inline lambda method calls as long as the depth doesn't exceed MaxInlineLevel (9 by default). Assuming the API designers made their methods small enough to be inlined, they should all be flattened out.
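A rough way to picture the depth limit (class name and the depth of 12 are my own choices for illustration): each composed lambda adds one more call frame between the outer call and the innermost body, so a deep enough chain exceeds what -XX:MaxInlineLevel allows C2 to flatten:

```java
import java.util.function.IntUnaryOperator;

public class InlineDepthSketch {
    public static void main(String[] args) {
        // Each andThen wraps the previous operator, so applying the final
        // chain walks through 12 nested applyAsInt calls. Once the runtime
        // call depth exceeds -XX:MaxInlineLevel (9 by default on the JDKs
        // discussed here), the innermost calls are no longer inlined.
        IntUnaryOperator chain = IntUnaryOperator.identity();
        for (int depth = 0; depth < 12; depth++) {
            chain = chain.andThen(x -> x + 1);
        }
        System.out.println(chain.applyAsInt(0)); // prints 12
    }
}
```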
Yes, this is strongly related to inlining. Without the warmup code, C2 most likely inlines your lambdas into the code generated for the forEach method. At least I have seen this behavior with similar APIs (you can use a tool called JITWatch to visualize compiled and inlined methods). However, this does not scale.
Last time I checked, C2 inlines at most two implementations at a polymorphic call site (in this case the line that calls your lambda). If you pass in more lambda functions, it won't inline them and might de-optimize existing compiled code to remove previously inlined functions (there are also cases where one implementation that is heavily used gets inlined and the others will be invoked via a function call). Graal does not seem to have the same limit, but when I tested it two years ago it had others, averaging out to the same performance.
Thus, increasing the max inline level (which you can already do in Java 8 and 11) does not necessarily help with this particular problem. What you would want is for the compiler to inline the forEach method and the local lambdas into the methods using the API (the benchmark methods in our case), i.e., you want the compiler to copy the hot loop somewhere higher up in the call stack and then optimize it. But apparently this is easier said than done.
But again, this is probably only relevant for workloads where there is very little work to do for each item.
I looked around and indeed this still appears to be the case. Megamorphic call sites will not be inlined.
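For anyone unfamiliar with the terminology, here is a minimal sketch (names are mine, purely illustrative) of what makes a single call site megamorphic: one profiled call, three distinct lambda classes flowing through it.

```java
import java.util.function.IntUnaryOperator;

public class CallSiteShapes {
    // The op.applyAsInt call below is the single call site the JIT
    // profiles. One observed receiver class makes it monomorphic, two
    // bimorphic, three or more megamorphic -- at which point C2 stops
    // inlining and falls back to a real virtual dispatch per element.
    static long sum(IntUnaryOperator op, int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += op.applyAsInt(i);
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 1_000;
        // Each lambda literal compiles to its own class, so passing all
        // three makes the call site inside sum() megamorphic.
        long a = sum(i -> i + 1, n);
        long b = sum(i -> i * 2, n);
        long c = sum(i -> i - 1, n);
        System.out.println(a + " " + b + " " + c); // prints 500500 999000 498500
    }
}
```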
From reading mailing list messages, the main reason for this shortcoming appears to be C2 is such a mess nobody wants to touch it and they're writing a new compiler instead (Graal).
If optimizations start getting targeted towards Graal I'll be pissed. Oracle gimps the performance of the free version, the one thing you should never ever do with a programming language.
> For example, C2 might behave differently than Graal and who knows what the future brings.
Yes, the GraalVM compiler is more aggressive on some optimizations that can be especially helpful with streams. Here are numbers from my machine (using ops/s, so higher is better!):
C2 (on Java(TM) SE Runtime Environment (build 1.8.0_251-b08)):
Amusingly, on the OP's benchmark GraalVM also does better than C2, and it somehow (I haven't tried to analyze this yet) makes the serial stream version as fast as the parallel one:
(I work in the GraalVM compiler team at Oracle Labs, I don't speak for Oracle, I'm not claiming anything about whether this is a good or bad benchmark, etc.)
The results for the parallel code are indeed interesting. However, the speedup < 2 with C2 is suspicious. Even on an old dual core / 4 thread system (common worker pool should default to 3 workers), I would expect it to be above 2 for this kind of workload.
The last time I looked into Graal (over a year ago), I ran into something I could not make sense of.
We were designing an API with methods similar to Arrays.setAll(double[], IntToDoubleFunction) which we expected to be called with large input arrays from many places in the code base.
The performance of the code generated by C2 dropped as one would expect once you were using more that two functions (lambda functions were no longer inlined and invoked for each element). For simple arithmetic operations the runtime increased by around 5x.
The performance of the code generated by Graal was competitive with C2's mono- and bimorphic performance for a much larger number of input functions. I don't recall the exact number, but I believe it was around 25. However, once we hit that threshold, performance tanked. I believe the difference was closer to 15x than C2's 5x.
We never figured out what caused this.
Do you have an idea? Or, since this is probably outdated data, do you know what the expected behavior is nowadays? Is there maybe any documentation on this?
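For context, the Arrays.setAll scenario described above might look roughly like this (class name and the particular functions are mine; the real API had many more candidate functions):

```java
import java.util.Arrays;
import java.util.function.IntToDoubleFunction;

public class SetAllCandidates {
    public static void main(String[] args) {
        double[] data = new double[1_000];

        // Several distinct IntToDoubleFunction implementations flowing
        // into the same Arrays.setAll call site. In the scenario above,
        // C2 degraded at the 3rd candidate, while Graal held up until
        // roughly the 25th and then dropped off much more sharply.
        IntToDoubleFunction[] candidates = {
            i -> i * 0.5,
            i -> Math.sqrt(i),
            i -> i * i
        };
        for (IntToDoubleFunction f : candidates) {
            Arrays.setAll(data, f);
        }
        System.out.println(data[10]); // last candidate wins: prints 100.0
    }
}
```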
> However, the speedup < 2 with C2 is suspicious. Even on an old dual core / 4 thread system (common worker pool should default to 3 workers), I would expect it to be above 2 for this kind of workload.
The speedup I found on C2 is 2.547 / 1.719 ~= 1.5, the OP's speedup (also on C2) was 0.71 s / 0.38 s ~= 1.9. Not the same factor, but not above 2 either. The benchmark is quite small, so the overhead of setting up the parallel computation might matter. The JDK version might matter as well.
> Graal was competitive with C2's code for mono- and bimorphic scenarios for a much larger number of input functions. I don't recall the exact number, but I believe it was around 25.
If we are talking about a call like Arrays.setAll(array, function) and you have one or two candidates for function, then you are mono- or bimorphic. If you have 25 candidates for function, you are highly polymorphic. I don't understand what you mean by a mono- or bimorphic scenario with up to 25 candidates.
Performance will usually decrease as you transition from a call site with few candidates to one with many candidates. I agree that a factor of 15x looks steep, but it's hard to say more without context. The inliners (Community and Enterprise use different ones) are constantly being tweaked, though that's not my area of expertise. It's very possible that this behaves differently than it did more than a year ago.
If you have a concrete JMH benchmark you can link to and are willing to post to the GraalVM Slack (see signup link from https://www.graalvm.org/community/), that would be great. Alternatively, you could drop me a line at <first . last at oracle dot com>, and I could have a look at a benchmark, without promising anything.
> If we are talking about a call like Arrays.setAll(array, function) and you have one or two candidates for function, then you are mono- or bimorphic. If you have 25 candidates for function, you are highly polymorphic. I don't understand what you mean by a mono- or bimorphic scenario with up to 25 candidates.
Sorry, I phrased that poorly. What I was trying to say was that the moment we introduced a 3rd candidate the performance with C2 decreased by ~5x. Whereas with Graal the performance stayed virtually the same when introducing a 3rd or 4th candidate. And it was performing well. But it did decrease at some point after adding more and more candidates and then the performance hit was much bigger.
I will reach out to my colleagues still working on this next week (long weekend ahead where I live). If we can reproduce this with a current Graal release, I'll share the benchmark – always happy to learn something new, even if it is 'just' why our benchmark is broken. :)
I really appreciate the in-depth replies on HN. Something Reddit lost ages ago. I gotta say your posts have introduced me to a lot of dark magic in the JVM that isn't really documented anywhere unless you count source code. Much appreciated!
Interesting indeed!
I'll try to avoid a long rant, but I won't go near Graal because Oracle is putting many of the performance optimizations behind a paywall (wtf). Who writes a compiler and makes the code slower for the free version? That's like something MS would do in 1995.
If Graal does replace OpenJDK eventually and Oracle keeps the performance optimizations behind an enterprise license, I have a feeling the Java community will fork the project away from Oracle. Just like they did for OpenOffice, MySQL, and Jenkins.
Another comment about Graal. Since performance is so important, it's likely a third party like Red Hat will want to merge code into the community edition that does the same thing as code already present in an "official version", like Red Hat already did for GC with Shenandoah. What is Oracle going to do? Say "no, because it competes with our commercial branch"? I think what's going to happen is what happened with Shenandoah, where the Oracle builds are crippled by excluding features. What happened there? 99% of people switched to AdoptOpenJDK.
I'm not attacking you here, I realize you have no control over the matter. And thanks for Graal, it's super impressive! I'm just salty that the community decided to replace C2 a decade ago, yet Oracle has decided not to release the full replacement back to the community. Many companies differentiate themselves by having commercial versions of tools that perform better; that's fine with me. I'm salty because Graal was how the community decided to replace C2. It's the reference implementation. Now the main benefit of that replacement, better performance, is locked behind a paywall. This is not what anyone intended, and had they known this would happen, they probably would have just rewritten C2 and left it in HotSpot.
Java/C/C#/C++/Rust etc. are roughly at the same end of the performance spectrum (unless you're creating stupid numbers of objects).
Perl/Python/Ruby are dynamic, so expect slower results.
I like the threaded approach you're using.