Using Erlang hot code updates

268 points by lawik 8 months ago

jhgg 8 months ago

When I worked at Discord, we used BEAM hot code loading pretty extensively, built a bunch of tooling around it to apply and track hot-patches to nodes (which in turn could update the code on >100M processes in the system.) It allowed us to deploy hot-fixes in minutes (full tilt deploy could complete in a matter of seconds) to our stateful real-time system, rather than the usual ~hour long deploy cycle. We generally only used it for "emergency" updates though.

The tooling would let us patch multiple modules at a time, which basically wrapped `:rpc.call/4` and `Code.eval_string/1` to propagate the update across the cluster, which is to say, the hot-patch was entirely deployed over erlang's built-in distribution.

davisp 8 months ago

This matches my experience. I spent a decade operating Erlang clusters and using hot code upgrades is a superpower for debugging a whole class of hard to track bugs. Although, without the tracking for cluster state it can be its own footgun when a hotpatch gets unpatched during a code deploy.
As for relups, I once tried starting a project to make them easier but eventually decided that the number of bazookas pointed at each and every toe made them basically a non-starter for anything that isn’t trivial. And if its trivial it was already covered by the nl (network load, send a local module to all nodes in the cluster and hot load it) style tooling.
- scotty79 8 months ago
  
  > Although, without the tracking for cluster state it can be its own footgun when a hotpatch gets unpatched during a code deploy.
  This and everything else said sounds so much like PHP+FTP workflow. It's so good.
- SexxxyKitty17 8 months ago
  
  [flagged]
- SexyKitty17 8 months ago
  
  [flagged]
stouset 8 months ago

Can someone explain how this is not genuinely terrifying from a security perspective?
- nelsonic 8 months ago
  
  Where is the security problem? All code commits and builds can still be signed. All of this is just a more efficient way of deploying changes without dropping existing connections.
  Are you suggesting that hot code replacement is somehow a attack vector? Ericsson has been using this method for decades on critical infrastructure to patch switches without dropping live calls/connections it works.
  No need to fear Erlang/BEAM.
  - stouset 8 months ago
    
    My interpretation of the GP was that a code change in one node can be automagically propagated out to a cluster of participating Erlang nodes.
    As a security person, this seems inherently dangerous. I asked why it is safe, because I presumed I’m missing something due to the lack of ever hearing about exploitation in the wild.
    
    toast0 8 months ago
    
    An Erlang dist cluster has no barriers between connected nodes. But a multithreaded application has no barriers between its threads either.
    If someone can exploit one Erlang node, they can easily take over the cluster. But in a more typical horizontally scaled system, usually if they can get into one node, they can get into all the other nodes running the same software the same way.
    Security wise, I think of the whole cluster as one unit. There's no meaningful way to separate it, so it's just one thing. Best not to let anyone in who can't be trusted, because either they have access or they don't; there's no limited access.
    But given that, may as well push code updates over dist in a straight forward way, because it's possible, so it may as well be straight forward.
    
    badpenny 8 months ago
    
    Why is it any more dangerous than a conventional update, which also needs to be propagated?
    
    stouset 8 months ago
    
    A conventional update takes place out of band.
    If someone were to exploit a running Erlang process, the description of this feature sounds to me like they would have access to code paths that allow pushing new code to other Erlang processes on cooperating nodes.
    
    vermilingua 8 months ago
    
    Yes, but if they can exploit one process they can exploit any of the other nodes anyway, so there's nothing to be gained but a bit of convenience.
- ramchip 8 months ago
  
  Erlang distribution shouldn't be used between nodes that aren't in the same security boundary, it promises and provides no isolation whatsoever. It's kind of inherent to what it does: it makes a bunch of nodes behave as part of a single large system, so compromising one node compromises the system as a whole.
  In a use case like clustering together identical web servers, or message broker nodes like RabbitMQ, I don't think it's all that scary. It gives an attacker easier lateral movement, but that doesn't gain them a whole lot if all the nodes have the same permissions, operate on the same data, etc.
  Depending on risk appetite and latency requirements you can also isolate clusters at the deployment / datacenter level. RabbitMQ for instance uses Erlang clustering within a deployment (nodes physically close together, in the same or nearly the same configuration) and a separate federation protocol between clusters. This acts as a bulkhead to isolate problems and attackers.
- aunderscored 8 months ago
  
  It's the same amount of terrifying as a regular deploy, you need to ensure that you limit access as needed

elcritch 8 months ago

Code reloading on embedded Nerves devices is fantastic. If you have non-trivial hardware or state you can just hot load new code to test a fix live. Great for integration testing.

I literally used hot code reloading a few weeks back to fix a 4-20 mA circuit on a new beta firmware while a client was watching in remote Colorado. Told them I was “fixing a config”. Tested it on our device and then they checked it out over a satellite PLC system. Then I made an update Nerves FW, uploaded it. Made the client happy!

Note that I’ve found that using scp to copy the files to /tmp and then use Code.compile to work better than copy and paste in IEx. The error messages get proper line numbers.

It’s also very simple to write a helper function to compile all the code in /tmp and then delete it. I’ve got a similar one in my project that scp’s any changed elixir files in my project over. It’s pretty nice.

rozap 8 months ago

I used to work on a pretty big elixir project that had many clients with long lived connections that ran jobs that weren't easily resumable. Our company had a language agnostic deployment strategy based on docker, etc which meant we couldn't do hot code updates even though they would have saved our customers some headache.

Honestly I wish we had had the ability to do both. Sometimes a change is so tricky that the argument that "hot code updates are complicated and it'll cause more issues than it will solve" is very true, and maybe a deploy that forces everyone to reconnect is best for that sort of change. But often times we'd deploy some mundane thing where you don't have to worry about upgrading state in a running gen server or whatever, and it'd be nice to have minimal impact.

Obviously that's even more complexity piled onto the system, but every time I pushed some minor change and caused a retry that (in a perfect world at least...) didn't need to retry, I winced a bit.

ElevenLathe 8 months ago

I work in gaming and have experienced the opposite side of this: many of our services have more than one "kind" of update, each with its own caveats and gotchas, so that it takes an expert in the whole system (meaning really almost ALL of our systems) to determine which would be the least impactful possible one if nothing goes wrong. Not only is there a lot of complexity and lost productivity in managing this process ("Are we sure this change is zero downtime-able?" "Does it need a schema reload?" etc) but we often get it wrong. The result is that, in practice, anything even remotely questionable gets done during a full downtime where we kick players out.
It's sometimes helpful to have the option to just restart one little corner of the full system, to minimize impact, but it is helpful to customer experience (if we don't screw it up) and very much the opposite for developer experience (it's crippling to velocity to need to discuss each change with multiple experts and determine the appropriate type of release).
- rozap 8 months ago
  
  No doubt that traditional deployments are much better for dev experience at (sometimes) the cost of customer experience.
  - toast0 8 months ago
    
    I disagree. Hot loading means I can have a very short cycle on an issue, and move onto something else. Having to think about the implications of hot loading is worth it for the rapid cycle time and not having to hold as many changes in my mind at once.
  - ElevenLathe 8 months ago
    
    One thing that would help both is deployment automation that could examine the desired changes and work out the best way to deploy them without human input. For distributed systems, this would require rock-solid contracts between individual services for all relevant scenarios, and would also require each update to be specified completely in code (or at least something machine readable), ideally in one commit. This is a level of maturity that seems elusive in gaming.

hauxir 8 months ago

We use hot code upgrades on kosmi.io with great success.

It's absolute magic and allows for very rapid development and ease of deploying fixes and updates.

We do use have to use distillery though and have had to resort to a bunch of custom glue bash scripts which I wish was more standardized because it's such a killer feature.

Due to Elixirs efficiency, everything is running on a single node despite thousands of concurrents so haven't really experienced how it handles multiple nodes.

edude03 8 months ago

Nerves and hot code reloading got me into erlang after I watched a demo of patching code on a flying drone ~8 years ago.

While I can't imagine hot reloading is super practicle in production, it does highlight that erlang/beam/otp has great primitives for building reliable production systems.

atonse 8 months ago

I have told so many people about that video over the years. It was one of the most amazing demonstrations of a programming language/ecosystem that I've ever seen.
Yet I've never been able to find it again.
trq01758 8 months ago

Probably not the same video, but the one on this topic on youtube is https://youtu.be/XQS9SECCp1I
- atonse 8 months ago
  
  Ah I thought that was the one. Now I understand why I couldn't find it. But the other one was during a talk. Same effect though.
opnitro 8 months ago

Do you have a link?

throwaway81523 8 months ago

You have to be very very very careful when preparing relups. The alternative on Linux is to launch an entire new server on the same machine, then transfer the session data and the open sockets to it through IPC. I once asked Joe Armstrong whether this was as good as relups and why Erlang went the relup route. I don't remember the exact words and don't want to misquote him, but he basically said it was fine, and Erlang went with relups and hot patching because transferring connections (I guess they would have been hardware interfaces rather than sockets) wasn't possible when they designed the hot patch system.

Hot patching is a bit unsatisfying because you are still running the same VM afterwards. WIth socket migration you can launch a new VM if you want to upgrade your Erlang version. I don't know of a way to do it with existing software, but in principle using something like HAProxy with suitable extensions, it should be possible to even migrate connections across machines.

toast0 8 months ago

State migration is possible, and yeah, if you want to upgrade BEAM, state migration would be effective, whereas hot loading is not. If your VM gets pretty big, you might need to be careful about memory usage though, the donor VM is likely not going to shrink as fast as the heir VM grows. If you were so inclined, C does allow for hot loading too, but I think it'd be pretty hard to bend BEAM into something that you could hot load to upgrade.
Migrating socket state across machines is possible too, but I don't think it's anywhere close to mainstream. HAProxy is a lovely tool, but I'm pretty sure I saw something in its documentation that explicitly states that sort of thing is out of scope; they want to deal with user level sockets.
Linux has a TCP Repair feature which can be used as part of socket migration; but you'll also need to do something to forward packets to the new destination. Could be arping for the address from a new machine, or something fancier that can switch proportionally or ??? there's lots of options, depending on your network.
As much as I'd love to have a use case for TCP migration, it's a little bit too esoteric for me ... reconnecting is best avoided when possible, but I'm counting TCP migration as non-possible for purposes of the rule of thumb.
- throwaway81523 8 months ago
  
  TCP migration on the same machine is real and it's not that big a deal, if that's what you meant by TCP migration. Doing it across machines is at best a theoretical possibility, I would agree. I have been wanting to look into CRIU more carefully, but I believe it uses TCP Repair that you mentioned. I'm unfamiliar with it though.
  The saying in the Erlang crowd is that a non-distributed system can't be really reliable, since the power cord is a single point of failure. So a non-painful way to migrate across machines would be great. It just hasn't been important enough (I guess) for make anyone willing to deal with the technical obstacles.
  I wonder whether other OS's have supported anything like that.
  I worked on a phone switch (programmed in C) a long time ago that let you do both software and hardware upgrades (swap CPU boards etc.) while keeping connections intact, but the hardware was specially designed for that.
  - toast0 8 months ago
    
    > I wonder whether other OS's have supported anything like that.
    I don't think I've seen it, but I don't see everything, and it'd be pretty esoteric. From my memory of working with the FreeBSD tcp stack, I suspect it wouldn't be too hard to make something like this work there, too; other than the security aspects, but could probably do something like ok to 'repair' a connection that matches a listen socket you also pass or something. But you'd really need the use case to make the hassle worth it, and I don't think most regular server applications are enough to warrant it.
  - tguvot 8 months ago
    
    there were/are some linux patches/daemons to synchronize connection state/etc across servers
throwaway81523 8 months ago

Self-followup/correction to avoid misattributing something to Joe that I'm not sure he said. I don't remember him specifically saying there were technical obstacles to migrating connections from one BEAM to another. My main question to him was whether socket migration (such as with SCM_RIGHTS messages on Linux) was a viable alternative to relups. I expected him to say relups were better because [whatever] but instead he said migration was perfectly fine. I do think starting a new BEAM in such situations fits fine with the Erlang spirit of restarting crashed processes so that you start in a known state, rather than trying to recover inside the process.

robocat 8 months ago

Background to the article: https://underjord.io/unpacking-elixir-iot-embedded-nerves.ht...

Seems like they deploy Elixir on embedded Linux. The embedded Linux distro is Nerves which replaces systemd and boots to the BEAM VM instead as process 1, putting Elixir as close to the metal as they can.

I know nothing about any of the above (assumption is I'm fool enough to try and simplify) plus I know I've misused the concepts I wrote but that's my point so read the article. All simplifications are salads

rkangel 8 months ago

Don't worry - this is an accurate and concise summary of what Nerves is.
What I would add is: you do not have bash (or sh, or any other "conventional" shell). The BEAM is your userspace.
You can SSH in, but what you end up with is an IEx prompt (the Elixir REPL). This is surprisingly fine once you get used to it (and once you've built a few helpers for your usecase).

aeturnum 8 months ago

IMO Hot Code Updates are a tantalizing tool that can be useful at times but are extremely easy to foot-gun and have little support. I suspect that the reason why no one has built a nice, formal framework for organizing and fanning out hot code changes to erlang nodes is that it's very hard to do well, involves making some educated guesses about the halting problem, and generally doesn't help you much unless you're already in a real bind.

Most of the benefits of hot code updates (with better understanding of the boundaries of changes) can be found through judicious rolling restarts that things like k8s make easier these days. Any time you have the capacity to hot patch code on a node, you probably have the capacity to hot patch the node's setup as well.

That said I think that someone could use the code reloading abilities of erlang to make a genuinely unparalleled production problem diagnostic toolkit - where you can take apart a problem as it is happening in real time. The same kinds of people who are excited about time traveling debugging should be excited about this imo.

whorleater 8 months ago

WhatsApp very long ago used to hot reload across all nodes with a ssh script to incrementally deploy during the day

toast0 8 months ago

> Both have described hot code updates as something that people should learn and use. I imagine Whatsapp’s initial engineering crew would agree. They did pretty well.

Yeah. Hot loading is clearly better than anything else when you've got a million clients connected and you want to make a code change. Of course, we didn't have any of these fancy 'release' tools, we just used GNU Make to rsync the code to prod and run erlc. Then you can grab a debug shell and l(module). (we did write utilities to see what code was modified, and to provide the right incantations so we wouldn't load if it would kill processes)

rybosome 8 months ago

> Hot loading is clearly better than anything else when you've got a million clients connected and you want to make a code change.
In the contexts in which I’ve worked, this was solved by issuing a command to the server to enter a lame-duck mode and stop accepting new connections, then restarting the process with updated code after all existing connections ended.
This worked in our case because connections had a TTL with a “reasonable” time, couldn’t have been more than an hour. We could always wait it out.
I suppose hot reloading is more necessary when you have connections without a set TTL.
- toast0 8 months ago
  
  That way works, but it means you're spending that much more time on a deploy.
  For a small change, you can hot load your change and be done in minutes. This means you can push several small changes in an hour. Which helps get things done rapidly.
  It's also really nice to be able to see that the change works as expected (or not) at full load right away. If you've got to wait for connections to accumulate on the new server, that takes longer without hot load too.
  Some changes can't be effectively hot loaded[1], and for those you do need to do something to kick out users and let them reconnect elsewhere, and you could do all your updates that way, but it means a lot more client time spent reconnecting.
  On the one hour TTL. Sometimes that's reasonable, but sometimes it's really not. Someone downloading a large file on a slow connection is better served by letting the download continue to trickle for hours than forcing them to reconnect and resume. A real time call is better served by letting it run until the participants are done. For someone on a low end phone, staying connected for as long as they can is probably better than forcing a reconnect where they'll need to generate new ephemeral keys and do a key exchange exercise.
  [1] At the very least, BEAM updates and kernel changes are much more easily done by restarting. But not all userspace Erlang changes are easy to make hot loadable, either.

arnon 8 months ago

A few years ago, the biggest problem with Erlang's hot code updates was getting the files updated on all of the nodes. Has this been solved or improved in any way?

comboy 8 months ago

I don't think updating files is the problem. The biggest issue with hot code updates seems to be that they can create states that cannot be replicated in either release on its own.
- ketralnis 8 months ago
  
  This is my experience. About 25% of the time I'd encounter a bug that's impossible to reproduce without both versions of the code in memory, and end up restarting the node anyway dropping requests in the process. Whereas if I'd have architected around not having hot code updates I could built it in a way that never has to drop requests
- faizshah 8 months ago
  
  In general, you can save your team a lot of ops trouble just by periodically restarting your long running services from scratch instead of trying to keep alive a process or container for a long time.
  I’m still new to the erlang/elixir community and I haven’t run it in prod yet but this is my experience coming from Java, Node, and Python.
toast0 8 months ago

There's about a thousand different ways to update files on servers?
You can build os packages, and push those however you like.
You can use rsync.
You could push the files over dist, if you want.
You could probably do something cool with bittorrent (maybe that trend is over?)
If you write Makefiles to push, you can use make -j X to get low effort parallelization, which works ok if your node count isn't too big, and you don't need as instant as possible updates.
Erlang source and beam files don't tend to get very large. And most people's dist clusters aren't very large either; I don't think I've seen anyone posting large cluster numbers lately, but I'd be surprised if anyone was pushing to 10,000 nodes at once. Assuming they're well connected, pushing to 10,000 nodes takes some prep, but not that much; if you're driving it from your laptop, you probably want an intermediate pusher node in your datacenter, so you can push once from home/office internet to the pusher node, and then fork a bunch of pushers in the datacenter to push to the other hosts. If you've got multiple locations and you're feeling fancy, have a pusher node at each location, push to the pusher node nearest you; that pushes to the node at each location and from there to individual nodes.
Other issues are more pressing; like making sure you write your code so it's hotload friendly, and maybe trying to test that to confirm you won't use the immense power of hotloading to very rapidly crash all your server processes.
- samgranieri 8 months ago
  
  I think Twitter once cobbled together a BitTorrent based deployment strategy for Capistrano called murder, that was a cool read from their eng blog back in the day.
  I wish I had used a pusher node to deploy things when a colleague was using almost all the upstream bandwidth in the office making a video call when my bosses were giving demo and the fix I coded for an issue discovered during the demo could not deploy via Capistrano

samgranieri 8 months ago

I tried to do this back in 2017 as an elixir newbie with distillery, but for some reason just went with standard deploys with distillery. Now it’s just using mix release to build elixir apps in a docker image deployed to k8s.

Thaxll 8 months ago

Hot code update is one of those thing I don't understand, just use a rolling deployment, problem solved. You have a new version of the code without loosing any connection.

It's one of those thing that sound nice on paper but a actually couple your runtime with ci/cd, if you have anything else beside Erlang what do you do? You now need a second solution to deploy code.

AlphaWeaver 8 months ago

I'm not sure that rolling deployments guarantee you won't lose connections, depending on the type of connection. Imagine your customer is downloading a large file over a single TCP connection, and you want to upgrade the application mid-download.
With rolling deployments, your only choice is to wait until that connection drains by completing or failing the download. If that doesn't fit your use case, you're out of options.
If your application is an Erlang app, you could hot code reload an unaffected part of the application while the download finishes. Or, if the part of the application handling the download is an OTP pattern that supports hot code reloading (like a gen_server) you could even make changes to that module and release e.g. speed improvements mid download stream. This is why Erlang shines in applications like telephony, which it was originally designed for.
- fsckboy 8 months ago
  
  >With rolling deployments, your only choice is to wait until that connection drains by completing or failing the download. If that doesn't fit your use case, you're out of options.
  one of the cool things about unix is (and perhaps windows can do this in the right modes, idk), the running copy of a program is a link to the code on the disk (a link is a reference to the file, without the file name). You can delete a running program from the disk and replace it with a new program, but the running copy will continue and not be aware that you've done that. You don't need to wait till the program finishes anything.
  on an every day basis, this is what happens when you run software updates while you are still using your machine, even if your currently active programs get updated. you'll sometimes notice this in a program like Firefox, it will lose its ability to open new tabs; that's because they go out of their way to do that, they wouldn't have to if they wanted to avoid it, just fork existing processes.
  - AlphaWeaver 8 months ago
    
    Right, but in this example, to "pick up" the code after you have updated it, you still have to trigger a restart of the program somehow. Controlling that handoff can prove challenging if you're just swapping out the underlying binary.
  - toast0 8 months ago
    
    > one of the cool things about unix is (and perhaps windows can do this in the right modes, idk), the running copy of a program is a link to the code on the disk (a link is a reference to the file, without the file name). You can delete a running program from the disk and replace it with a new program, but the running copy will continue and not be aware that you've done that. You don't need to wait till the program finishes anything.
    An even cooler thing is the running code is just mmaped into memory. One of the nifty things about mmaped files is if you change the backing file, you change it everywhere.
    Not my recommended way to hot load code, but it might work in a pinch.
    unlink, replace, start a new one, have the old one stop listening does work for many things. Some OSes have/had issues with dropping a couple pending connections sometimes; or you have to learn the secret handshake to do it right. A bigger problem is if your daemon is sized to fit your memory, you might not be able to run a draining and a filling daemon at once.
    It also doesn't really solve the issue of changing code for existing connections, it's a point in time migration. Easier to reason about, for sure, but not always as effective.
fiddlerwoaroof 8 months ago

The other perspective on this is that, at some level of your system you are always doing a hot code reload: terraform, kubernetes, etc. are taking a new deployment description in and reconciling it with the existing state of the world. Wouldn’t it be nice if this process was just more code in your preferred programming language rather than YAML soup?
BEAM encourages you to structure your program as isolated interacting processes and so it’s not that far from a container runtime in itself.
- may13noon 8 months ago
  
  I like this comment
rozap 8 months ago

But what if you have long lived stateful connections? And you don't want a deploy to take forever?
Ofc you can say "don't do that" but sometimes it's just the way it is...
But I agree, 99% of the time a rolling update is easier and works fine.
nthh 8 months ago

It could be useful if you have an embedded device that you don't want to miss data from, but for most deployments I would agree.