Very interesting writeup.
Distributed systems problem solving is always a very interesting process. It very frequently uncovers areas ripe for instrumentation improvement.
The Erlang ecosystem seemed very mature and well iterated. It almost seemed like the "Rails of distributed systems", with things like Mnesia.
The one downside was that, while I was working on grokking the system, the limits and observability of some of these built-in solutions were not so clear. What happens when a mailbox exceeds its limit? Does the data get dropped? Or, how do you recover from a network segmentation? These proved somewhat challenging to reproduce and troubleshoot (as distributed problems can be).
There are answers for all of these interesting scenarios, but in some cases it would almost have been simpler to use an external technology (Redis, etc.) with established scalability/observability.
I do say this knowing that there was plenty I did not get time to learn about the ecosystem in the depth that I desired, but was curious how more experienced Erlang engineers viewed the problem.
The Erlang Manual does a pretty good job of describing the system limits and how to configure them, as well as the failure modes of various components.
By default, a mailbox will continue to fill up until the process reaches its configured max heap size (which by default is unlimited, i.e. the process heap will grow until the system runs out of memory, eventually crashing the node it's running on). However, you can configure this on a process-by-process basis by specifying a max heap size and what to do when that limit is reached. This is described in the docs, but as you mentioned, it's not necessarily apparent to newcomers.
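As a sketch of what that per-process configuration looks like (the option names come from the `erlang:process_flag/2` docs; the size value here is just illustrative):

```erlang
%% Inside the process you want to limit:
process_flag(max_heap_size, #{size => 1000000,       % limit in words, not bytes
                              kill => true,          % kill the process when exceeded
                              error_logger => true}),% emit an error report first
%% With message_queue_data set to on_heap (the default), queued
%% messages count toward this limit once they reach the heap.
process_flag(message_queue_data, on_heap).
```

With `kill => false` the process only gets the error report, which can be useful while tuning the limit.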
But aside from that scenario, I think a lot of the interesting failure scenarios are really sensitive to what the system is doing. For example, network partitioning can either be a non-issue, or critical, depending on how you are managing state in the system. As a result, I don't think there is too much that really digs deep into those problems because it turns out to be really hard to document that kind of knowledge in a generic fashion - or at least that's how it feels to me. Everyone I've worked with has built up a toolbox of techniques they use for the task at hand, and do their best to share them when they can. It's unfortunate there isn't really a one-stop shop of such information out there though.
I think it's probably also good advice for newcomers to remember that you don't have to use something just because it's there (like Mnesia) versus something you are already running or are more familiar with which solves the same problem (e.g. Redis).
Agreed on all parts.
The Erlang and OTP manuals were very nicely written, and I was able to reason about most aspects of the system pretty well from reading them.
I did a bit more research after writing up my comment (my mind got a bit too focused on it to let it go) and found this great resource about handling various system load scenarios: https://ferd.ca/handling-overload.html
I'll +1 your pragmatic comment on not adopting tools just because they're there.
Again, I no longer work in Erlang, but the systems, the architecture, and the problem solving still pique my interest.
Now I'm off to look up production use-cases where Mnesia was the most pragmatic solution.
Fred's blog, and Learn You Some Erlang for Great Good are invaluable, but on the topic of production systems, his ebook Erlang In Anger (https://www.erlang-in-anger.com/) is excellent as well - honestly it's hard to overstate just how much good he's done for the community in terms of documenting and philosophizing about Erlang, architecture and operating production systems. He's solid gold!
The ability to format Mnesia tables in such a way that exporting them over SNMP is trivial was an absolute joy for me to work with about 12 years ago!
I was able to stand up a quick management solution for a rather complex system as a one-person team using a combination of Erlang, Mnesia, and port drivers to various back-ends written in Python, C, C++ and Haskell. It was the most productive I think I've ever been on any project in my entire career so far.
And I'd love to get back to that feeling of just kicking ass every day.
1) the lack of a useful backpressure mechanism on mailbox size is a long standing issue in Erlang. It tends to not be too big a problem in practice, since overflows come from simple bugs that stop your system immediately and are found during development, rather than subtle conflicts that lurk around.
2) Erlang clustering isn't really supposed to withstand splits. Erlang was built for phone switches before the internet, and while it is "distributed", the clustering is between a few boards or boxes that are wired together in the same rack, maybe over a hardwired LAN, not through the vicissitudes of a routed network like the internet connecting remote cities. So the cluster has to keep running if a node is out, but the idea is that the node has crashed (or maybe equivalently, the network wire has been cut), not that the connection between nodes has somehow become flaky in a way that can be fixed with retries. Of course ordinary network connections are normally supervised under OTP and they do get restarted.
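A minimal sketch of watching for exactly that "node has crashed" signal, using `net_kernel:monitor_nodes/1` (the node names and reactions are illustrative):

```erlang
%% Subscribe to node up/down events from the distribution layer.
ok = net_kernel:monitor_nodes(true),
receive
    {nodedown, Node} ->
        %% From this node's point of view, a cut wire and a crashed
        %% peer look identical; react accordingly.
        io:format("lost contact with ~p~n", [Node]);
    {nodeup, Node} ->
        io:format("~p (re)joined the cluster~n", [Node])
end.
```

Note this tells you a connection dropped, not why, which is exactly the ambiguity being described above.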
I actually really like the "mailboxes are on the receiver's heap" thing. One of the big reasons why Go's channels bother me is that they're bounded. This makes things far more complicated, because they're not bound to a goroutine, and an unresponsive goroutine can't be terminated.
> What happens when a mailbox exceeds it's limit? Does the data get dropped?
I thought there was some movement towards limits on mailboxes, but I can't find any documentation now, so I'm not sure if that happened? If not (or if you haven't configured it anyway), there is no explicit limit; your mailboxes can grow until you run out of memory, either by hitting a ulimit, by malloc failing, or maybe when your OS just kills processes (and probably the BEAM process, because it's the biggest). In the first two cases, you'll get a nice crash dump from BEAM, but in all cases all messages are dropped, as BEAM is dead. Edit: I see there's a process_flag(max_heap_size, MaxHeapSize) to set the maximum size of the heap, and if process_flag(message_queue_data, on_heap), the default, is set, messages will eventually end up on the heap. However, the maximum heap size is only checked during garbage collection, and IIRC GC can't be triggered when a message is added to the mailbox, only while the process is running, or if explicitly requested for the process (with erlang:garbage_collect/0 or /1); if your process ends up blocked for a long time (or possibly forever), it could still accumulate a large mailbox without being killed by the heap size limit.
You can (and should!) regularly call process_info(Pid, message_queue_len) to observe the message queue length of all processes, and alert on large queues. You can then inspect the messages themselves and consider an appropriate response.
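A rough sketch of that kind of sweep (the 10000 cutoff is an arbitrary illustrative threshold):

```erlang
%% Find processes whose mailbox exceeds a threshold. process_info/2
%% returns undefined for dead pids, which the generator pattern
%% silently filters out.
Suspects = [{Pid, Len}
            || Pid <- erlang:processes(),
               {message_queue_len, Len} <- [process_info(Pid, message_queue_len)],
               Len > 10000].
```

Something like this can run in a periodic monitoring process and feed whatever alerting you already have.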
> Or, how to recover from a network segmentation? These proved somewhat challenging to reproduce and troubleshoot (as distributed problems can be).
Recovering from network segmentation is application dependent, and can often be tricky. Some applications can just reconnect and call it a day. Other applications may have accepted writes on both sides of the segmentation, and need some sort of reconciliation process. Mnesia has hooks for this, but I don't remember seeing any examples, and the default logic is to just continue segmented even after the segmentation is done; this is usually not what you want, but at least it's consistent? I think it should be fairly easy to simulate and trigger network segmentation: just drop packets between selected hosts; although you'll need more work if you want to simulate stuff like congestion between hosts or congestion on only some paths between hosts (LACP is very nice, but debugging congestion on only some paths isn't as nice).
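For a cruder simulation than packet dropping, you can force a split from within the cluster itself by tearing down the distribution connection (the node name here is illustrative):

```erlang
%% Tear down the distribution connection to a peer, exercising
%% nodedown handling and any reconciliation logic.
erlang:disconnect_node('other@host'),
%% ... run your partition-handling code, then let it reconnect:
net_adm:ping('other@host').  % returns pong on success, pang on failure
```

This only simulates a clean break, not flaky or congested links, so it complements rather than replaces packet-level fault injection.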
On this particular issue, where I worked, we had a policy of flushing mailboxes that were too big (usually 1 million messages, which isn't the Erlang way, and wasn't in public OTP, but keeps a node running at least), and we wouldn't have tried to log all of the messages in a mailbox, because 1 million messages or whatever is way too many to log. Pretty printing with no limits is dangerous, even if it doesn't include a ton of references to the same big thing. We also didn't tend to use anonymous functions/closures, but that's just a happy accident: we were using Erlang before crash dumps had line numbers, and anonymous functions are hard to track down, so it's easier to give them a real name and use that instead. Of course, there's some places where closures are way more convenient than explicitly passing Terms to Funs, so it's not that we never used them, just they were rare, and unlikely to show up many times in a single logging statement, like in this case.