
This whole "the JVM inlines things for you and sometimes avoids the allocation of objects altogether" goes so against my intuition after more than a decade of writing Java, mostly for Android. The idea that you should allocate as few short-lived objects as possible has stuck with me. So these new APIs do make me a bit uncomfortable; they make me think "oh man, this looks like it's going to have performance issues due to too much time spent in GC deleting all those objects".

I do understand that the regular (HotSpot?) JVM must be much smarter with optimizations than the Android runtime, but still.



Android Java is not like proper Java, and that includes how ART works. At least they dropped Dalvik, which was comparable to early JVM implementations from the pre-JIT days, think Java 1.2.

Then there is the whole issue that Google isn't as invested in making the same kinds of optimizations as the Java community, and would rather call into C++ via native methods than optimize ART beyond "good enough".

Finally, the sorry state of proper Java support comes from their beef with Sun and Oracle initially, and now from their agenda to push Kotlin no matter what; every modern Java feature that lands takes away one more selling point for Kotlin, which doesn't help.

Hence most of their Java vs. Kotlin samples are based on Java 7.

If the Android team really cared, they would provide equal support for both languages and then let the community pick whichever one it is fond of.

Even the recently announced improved support in Android 13 is based on a Java 11 LTS subset, when the most recent LTS version is 17 and Java 19 is a couple of months away.


Fwiw, as a Java fanboy, I do think Kotlin/Compose is way better for creating GUIs than what Java currently has to offer. I recently went and tried some Swing again and it wasn't enjoyable, but like you said, it's like this for a reason.


Yeah, let Google pretend Groovy didn't do it first.

https://groovy-lang.org/swing.html


IIRC, the 'modern' way to create Java GUIs is JavaFX, which should offer a better development experience.


It's a little better, but I still don't think it's on par with Compose.


> The idea that you should allocate as few short-lived objects as possible has stuck with me.

I don’t know about Android, but since generational garbage collectors have become the default, the rule is that short-lived objects are dirt cheap, because they’ll be collected all at once in O(1) with the young generation. And since allocating them is usually just a pointer increment, they can be as cheap as stack objects.
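
For illustration, a minimal sketch (names made up) of the kind of allocation this makes cheap:

    // A short-lived value allocated on every iteration. With a generational
    // collector the dead instances cost essentially nothing to reclaim, and
    // HotSpot's escape analysis may scalar-replace them so no heap
    // allocation happens at all.
    record Point(double x, double y) {
        double lengthSquared() { return x * x + y * y; }
    }

    static double total(double[] xs, double[] ys) {
        double sum = 0;
        for (int i = 0; i < xs.length; i++) {
            sum += new Point(xs[i], ys[i]).lengthSquared(); // instance dies immediately
        }
        return sum;
    }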


> since allocating them is usually just a pointer increment, they can be as cheap as stack objects.

I don't think that's quite true. Even with a copying/moving GC, you still need to traverse all the live objects and then copy all of them. Asymptotically as cheap as stack objects maybe, but in reality the overhead is greater. GCs also have some amount of overhead due to all the synchronization they need to do.


> you still need to traverse all the live objects and then copy all of them.

The live objects traversed are generally not short-lived objects, because those aren’t live anymore! Meaning, short-lived objects typically aren’t traversed and therefore don’t contribute to the GC cost like longer-lived objects do.


I’m curious how the JVM efficiently allocates many small objects. How does it avoid memory fragmentation? Or does it request a large block of memory from the OS at the start? (last q is probably easily googled)


The JVM allocates small objects by incrementing a pointer in the “young generation” region. The GC later moves all objects that are still alive from that region to a different region. The “young” region can then be reused from scratch. The moving of objects effectively defragments (compacts) that region of the heap. Modern GCs use multiple per-thread and/or per-core regions, i.e. there are generally multiple “young” regions, not just a single one. Memory is allocated from the OS in large chunks.
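
Conceptually it looks something like this (a toy model, not actual HotSpot code; real collectors add alignment, slow paths, card marking, etc.):

    // Bump-pointer allocation in a thread-local allocation buffer (TLAB).
    final class Tlab {
        private final byte[] buffer;  // carved out of a "young" region
        private int top = 0;          // the bump pointer

        Tlab(int sizeInBytes) { buffer = new byte[sizeInBytes]; }

        int allocate(int sizeInBytes) {
            if (top + sizeInBytes > buffer.length) return -1; // full: new TLAB or GC
            int address = top;
            top += sizeInBytes;       // the whole "allocation" is this increment
            return address;
        }
    }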

There is intermediate fragmentation due to dead-but-not-yet-collected objects. Together with the use of different generational regions, GC languages thus require more memory (a rule of thumb is twice the memory of a non-GC program), but memory is cheap, and not having to reference-count and deallocate each object individually can conversely have performance benefits.


Ok thanks, that helped clear some things up. I didn’t realize generational GC reuses the same contiguous memory blocks for the young regions. Makes sense from a fragmentation and resident page locality perspective.

Do you know if the early generation region settings can be tweaked (alignment, size, number of threads/regions)? I’m wondering what happens if you “overflow” these areas by generating too many objects


There are many parameters you can tweak, for sure. Overflowing an area probably triggers an immediate GC run, and if that doesn’t free up enough space, an additional “young” region is used. You only need to specify a high enough maximum value for the total memory consumption of all generations.
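
For example, on HotSpot (the flag names are real; the concrete sizes are just for illustration):

    # Size the heap and the young generation explicitly:
    java -Xms4g -Xmx4g -XX:NewSize=1g -XX:MaxNewSize=1g -XX:SurvivorRatio=8 -jar app.jar

    # With G1 (the default since Java 9), young sizing is adaptive, and you
    # tune region size and pause goals instead:
    java -XX:+UseG1GC -XX:G1HeapRegionSize=4m -XX:MaxGCPauseMillis=50 -jar app.jar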

One situation you can run into is generating and dropping objects faster than the GC can collect them, or rather, the process spending most of its time in GC rather than in actual program execution; after a while, an error is raised (on HotSpot, a java.lang.OutOfMemoryError reporting "GC overhead limit exceeded"). That situation almost always indicates a bug in your program, and the error helps you see why the program was stalling. Of course, that behavior is also configurable.
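
A contrived repro (hypothetical class name; depending on collector and heap size you may see a plain "Java heap space" error first):

    import java.util.ArrayList;
    import java.util.List;

    // Live data creeps toward the heap limit, so every GC cycle reclaims
    // less and less until the collector gives up.
    public class GcOverheadDemo {
        public static void main(String[] args) {
            List<long[]> retained = new ArrayList<>();
            while (true) {
                retained.add(new long[1024]); // ~8 KB that stays reachable forever
            }
        }
    }

    // run with a small heap, e.g.: java -Xmx64m GcOverheadDemo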


Excellent. I appreciate the information. I’m just getting into the JVM dev ecosystem so I have a lot to learn


The latest GCs are _fast_. I benchmarked ZGC and IIRC it ate up something like 4GB/s of garbage (24 threads doing nothing but allocating) with about 2 milliseconds of stop-the-world pause time over around 3 minutes of runtime.


Note that benchmarking GCs properly is really hard. Changes in size distribution, tree shape, and lifespan can lead to drastically different results (and to make matters worse, the code that is easiest to write as a benchmark tends to be exactly the kind of code that GCs handle really well).


Looks like it's complicated. It preallocates memory on the heap, but on Linux this would be virtual memory whose pages have to be loaded later. Also, more memory will have to be requested if the initial amount is insufficient. And this doesn't include any stack or swap resources the JVM might use.


I think this problem occurs at many places in the industry. Developer learns a thing, then keeps trying to apply it two decades later, long after it became completely obsolete.

That's how you get things like people unrolling a loop by hand. Compilers have known how to do it for ages, and have built-in data about what works best on what CPU. 99% of the time they do a better job than a programmer that developed a pattern back when they were writing for a 386.

These days it often works better to write very straightforward, non-clever code and let the compiler figure out how to optimize it.
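
For instance, the boring version of a sum is exactly what the compiler wants to see (a trivial sketch):

    // Straightforward code: the JIT (C2) will typically unroll this loop
    // and, where the target CPU supports it, auto-vectorize it. A
    // hand-unrolled version mostly just obscures the intent.
    static long sum(int[] a) {
        long s = 0;
        for (int x : a) s += x;
        return s;
    }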


Here’s an example of a simple C++ function, manually vectorized + unrolled for optimal performance on modern processors: https://stackoverflow.com/a/59495197/126995

Of all the C++ compilers, only clang generates similar machine code from a straightforward source, and only with some special compiler switches, most importantly -ffast-math. A dot product of 2 vectors is a trivially simple algorithm, sum( a[ i ] * b[ i ], i = 0..N-1 ); I wouldn't expect clang to auto-vectorize more complicated stuff. Finally, C++ compilers are designed to work offline, so the optimizer can afford to spend substantial CPU time searching for the best optimization; a JIT compiler like the Java runtime's simply doesn't have time for that.
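
For what it's worth, Java's answer here is the incubating Vector API (jdk.incubator.vector), where you spell out the vector shape yourself instead of hoping the auto-vectorizer finds it. A rough sketch of the same dot product:

    import jdk.incubator.vector.FloatVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    class Dot {
        static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;

        // Compile/run with: --add-modules jdk.incubator.vector
        static float dot(float[] a, float[] b) {
            var acc = FloatVector.zero(S);
            int i = 0;
            for (int bound = S.loopBound(a.length); i < bound; i += S.length()) {
                var va = FloatVector.fromArray(S, a, i);
                var vb = FloatVector.fromArray(S, b, i);
                acc = va.fma(vb, acc);            // lanewise a*b + acc
            }
            float sum = acc.reduceLanes(VectorOperators.ADD);
            for (; i < a.length; i++) sum += a[i] * b[i]; // scalar tail
            return sum;
        }
    }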


That's not quite what I'm saying though. There's of course going to be exceptions, especially in highly specialized cases.

> Finally, C++ compilers are designed to work offline, so the optimizer can afford to spend substantial CPU time searching for the best optimization; a JIT compiler like the Java runtime's simply doesn't have time for that.

Not necessarily. A JIT can also see exactly what the program is doing, see that 1000 out of a million lines of code are performance critical, and throw all the effort on optimizing that, while armed with stats about how the program works in practice and which branches are taken how often.

You also don't need to wait for it, since there's no reason why that work can't be done on a separate thread without blocking execution.

It can also generate native code for your specific CPU, so it may well do much better than GCC there.


I agree about the scalar code; I know from experience that JIT compilers can be awesome, and sometimes they indeed outperform AOT.

Automatic vectorization, on the other hand…

Modern SIMD arrived in 1999, with SSE1 in the Pentium III. In the 20+ years that followed, very smart compiler developers tried to improve their automatic vectorizers, yet they have achieved only limited success so far.

They do a good job when all of the following are true: (1) purely vertical operations, (2) no branches or conditions, (3) the numbers being handled are either FP32 or FP64.

I think building a sufficiently good automatic vectorizer is a borderline impossible task. Even when the runtime is very sophisticated, like modern Java's, recompiling progressively better versions of the code based on real-time profiler data, the problem is still extremely hard to solve.

For instance, here's a fast way to sort 4 floats with SSE: https://godbolt.org/z/c97Yf5js8 I don't believe a compiler could have possibly figured out those shuffles and blends from any reasonable scalar implementation.


Android’s Java is pretty much a bastard child of OpenJDK - it is so behind on features that it is truly a shame.


Android 13 started to bring in some OpenJDK 11 features... but ugh, nothing that we can widely rely on yet.

Core Java 11 features will be backported to Android 12 with ART as an APEX module, but that hasn't quite happened yet as far as I can see.


Which is why the language of choice for Android is Kotlin, which gets all the modern features without needing changes to the runtime.


It might look like it, yet it is deceptive.

Unless Kotlin decides to stop following JVM capabilities and interoperability with modern Java features, those capabilities need to exist in Android as well.

Otherwise, the JVM Kotlin libraries won't have a place on Android, unless they get coded twice, with Kotlin's version of #ifdef.

Google's surprise decision to update Java support to core Java 11 LTS was most likely triggered by the Java ecosystem now treating Java 11 LTS as its minimum version; they want to keep those libraries somehow available on Android.


Actually, "desugaring" is a thing. It's a backport of those new parts of the standard library that the dex converter inserts into your app. Currently, it's only Java 8, but I had no trouble building an app that uses Java 17. You just can't use any features that rely on new classes or methods, like records. But purely syntactical ones do work flawlessly.

I don't like Kotlin because of how much it tries to do in the language itself with maximally terse syntax, and because of its asinine defaults like final-by-default classes and immutable-by-default everything. Java's sugar is much better thought out.



