In my experience, no current approach solves this directly, and there is no straightforward fix. In particular, splitting the services across different machines does not make the issue go away.
We also experience heavy drops and the “the site is extremely busy so you are seeing it as someone that isn’t logged in” message when a big event happens (such as a game, as @ljpp said), and that drags down the whole site, not only the people inside that topic.
I tried two different setups, a separated one and a “big machine”, and both have this type of issue. My instances are monitored with Prometheus and the logs are visible in Grafana, so I have very granular visibility into hardware and container performance, and I can confirm that neither setup made a difference: the issue happens anyway.
If you put a big machine behind it you may delay the problem a little, but you will still get the errors and session drops while the machine shows almost no usage, be it disk, CPU or RAM. This happens with both the “default install” and the “two container” install.
With separate machines the issue is the same, regardless of whether the machines are of the same type or one is “CPU-Optimized” and the other “Disk-Optimized”, etc. On top of that, you add the connection between the two machines as an extra point of failure. That connection will inevitably add some lag. How much depends on how you set it up and on “how far away” the two machines are from each other, but you will get the same behavior either way.
As a note, this type of behavior also happens with things like the Babel plugin. However, it seems to me that the Babel plugin can handle a lot more “simultaneous” writes, even though its “chats” are actually hidden topics. The difference is in how they are presented and “refreshed”/“pulled”. This difference in behavior has led me to conclude that the problem is at the application level: a front-end kind of issue that ends up “crashing” the app (front-end is not my area of expertise, unlike back-end), triggered by the combination of people posting and people staying on a topic waiting for it to “self update” with tens of messages in a single minute.
To that you also have to add the human factor: when people feel the site is “sluggish” or that a topic “isn’t updating as fast as it should be”, they will F5 the hell out of it, adding more load. But good luck “educating” anyone in that regard.
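To put rough numbers on the load described above, here is a back-of-envelope sketch in Python. It is not how Discourse actually accounts for requests. It assumes, purely for illustration, that every viewer sitting on the topic makes one request per new post, and that each manual refresh costs a fixed number of requests. The function name, its parameters and the example figures are all hypothetical.

```python
def requests_per_minute(viewers, posts_per_minute,
                        refreshers=0, refreshes_per_minute=0,
                        requests_per_refresh=1):
    """Rough estimate of requests per minute hitting one busy topic.

    Assumption (not measured): each viewer fetches every new post once,
    and each manual F5 costs `requests_per_refresh` requests.
    """
    # Load from people simply sitting on the topic waiting for updates
    update_fetches = viewers * posts_per_minute
    # Extra load from impatient users reloading the whole page
    manual_refreshes = refreshers * refreshes_per_minute * requests_per_refresh
    return update_fetches + manual_refreshes


# Hypothetical event: 500 viewers, 30 posts per minute
baseline = requests_per_minute(500, 30)

# Same event, but 100 of those viewers hit F5 six times per minute,
# with each reload assumed to cost 20 requests
with_f5 = requests_per_minute(500, 30, refreshers=100,
                              refreshes_per_minute=6,
                              requests_per_refresh=20)

print(baseline, with_f5)
```

Under these made-up figures, a fifth of the viewers refreshing manually nearly doubles the request count, and none of that is something bigger hardware obviously fixes.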