Since migrating my large forum to Discourse this year I’ve been seeing infrequent crashes: the cloud VM becomes inaccessible via SSH and leaves a call trace on the virtual console. The crashes happen roughly every 3 to 6 weeks, with no discernible pattern. I was initially running Discourse on Clear Linux, because that’s what I was using to squeeze a bit more performance out of the system during the long and intensive migration of the old forum to Discourse. But I started to suspect that Clear Linux might be less stable due to all of its arcane performance optimizations, so I migrated my Discourse instance to Debian 12 Bookworm around the time of its release, about 6 weeks ago.
Unfortunately today the Debian system had its first crash. Here’s the sequence of events:
kernel: Voluntary context switch within RCU read-side critical section!
kernel: CPU: 3 PID: 3235204 Comm: postmaster Tainted: G D 6.1.0-10-amd64 #1 Debian 6.1.37-1
journalctl shows the last log entry at 06:40:50. But the OS and Discourse still kept running. The last entry was just standard chatter from the Dockerized mail agent I run on the same VM.
~08:30 I checked that Discourse was up and running normally.
08:46 in Discourse error log: Unexpected error in Message Bus : ActiveRecord::ConnectionNotEstablished : connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: could not fork new process for connection: Cannot allocate memory
08:53 in Discourse error log: Failed to process hijacked response correctly : ActiveRecord::ConnectionNotEstablished : connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: could not fork new process for connection: Cannot allocate memory
09:01 in Discourse error log: Failed to handle exception in exception app middleware : ActiveRecord::StatementInvalid : PG::ObjectNotInPrerequisiteState: ERROR: lost connection to parallel worker
Last post on Discourse was at 09:17.
09:22 in Discourse error log: 'Track Visit' is still running after 90 seconds on db default, this process may need to be restarted!
09:22 in Discourse error log: Redis::TimeoutError (Connection timed out)
There were more similar entries in the Discourse logs up until I noticed the site was down, at around 11:20.
When I couldn’t log in via SSH I took these screenshots from the virtual console viewer and hard-rebooted the VM:
I’ve been administering Linux servers for a long time, and this chain of events doesn’t make sense to me. The Discourse logs are a fairly obvious indication of an out-of-memory event, and the virtual console confirms that a component of my Dockerized mail server on the same VM got axed by the OOM killer. But there is no record of that OOM action in journalctl, which apparently quit working well before the other systems started failing. What appears to be the first event, at 05:00:22, mentions the postmaster process (from PostgreSQL in the Discourse app container) several times, yet the DB didn’t go down completely until at least 09:17, when there was still a successful post on Discourse.
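For anyone wanting to check the same thing, this is roughly how I’d go looking for OOM-killer traces after the fact; a sketch only, and it assumes journald keeps a persistent journal so that -b -1 can reach the previous boot:

# kernel messages from the boot that crashed
#> journalctl -k -b -1 | grep -iE 'out of memory|oom-killer'
# check whether journald keeps logs across reboots and whether
# rate limiting could have dropped messages under memory pressure
#> grep -E 'Storage|RateLimit' /etc/systemd/journald.conf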
Currently, after running all day, the system is showing normal memory usage; this is roughly where it normally sits:
#> free -m
               total        used        free      shared  buff/cache   available
Mem:            7751        4965         129        1832        4773        2785
Swap:           3875        2879         996
The only slightly uncommon thing about my configuration is that the swap space is provided by Zram instead of a swap file or swap partition. I’ve been using Zram for years and have never had a problem. Also, I installed the VM from scratch with the Debian installer ISO in order to have an XFS root filesystem instead of the standard ext4 that the cloud provider’s Debian images use.

The host is Hetzner, and after my initial Clear Linux installation of Discourse I created a different VM for the migration to Debian, so presumably I’m on a different hypervisor node now and I don’t think it’s a hardware problem. So I wonder: was this just a simple out-of-memory condition, or have I found an edge case in the combination of kernel 6.1 + Zram + XFS + KVM/virtio? I’d appreciate any insight you might have.
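For reference, the Zram setup is nothing exotic. A sketch of that kind of setup, assuming Debian’s zram-tools package (my exact values may differ slightly; 50% of RAM would line up with the ~3.9 GB of swap in the free output above):

# /etc/default/zramswap (from the zram-tools package)
ALGO=zstd        # compression algorithm for the zram device
PERCENT=50       # size the zram device at 50% of RAM
PRIORITY=100     # prefer zram over any disk-based swap

#> systemctl restart zramswap.service   # apply the settings
#> swapon --show                        # confirm /dev/zram0 is the active swap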
Hmm. I would tend to agree, except for the kernel errors that started first. The VM had been running since 06/Jul without a single kernel oops until this morning. Here’s the full output from that moment; notice the page_fault_oops, handle_mm_fault and xfs_filemap_map_pages entries:
I kind of think the same thing, except that this is somewhat of a recurring issue; it feels a bit too consistent to be random. I suspect that Hetzner probably doesn’t use ECC RAM; that’s probably how they can offer so much for the price. Even their dedicated servers apparently don’t (or didn’t) have ECC. But even so, Hetzner is generally regarded as quite reliable in terms of their infrastructure.
My hunch is this: try to get rid of both Zram and XFS (one by one) and see what happens, with Zram as my first suspect. Discourse should run fine with regular swap and ext4. These optimizations might be fun, but right now they’re adding complexity to your installation. Once your instance runs fine you can add them back one by one and see where things break.
As a general rule, try to stick as close to a recommended install as possible first, then add your own smart stuff.
Thanks for the reply. I think I’ll try disabling Zram and adding a 2 GB swap file. The filesystem change would require completely rebuilding the VM with a fresh installation of Debian, and XFS really shouldn’t ever cause problems.
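Roughly the plan, sketched out (assuming the Zram swap comes from zram-tools’ zramswap.service; the exact commands are a sketch, not yet tested):

# stop and disable the zram-based swap
#> systemctl disable --now zramswap.service

# create a conventional 2 GB swap file; dd rather than fallocate,
# since swapon(8) recommends avoiding preallocated files with holes
#> dd if=/dev/zero of=/swapfile bs=1M count=2048 status=progress
#> chmod 600 /swapfile
#> mkswap /swapfile
#> swapon /swapfile

# make it permanent across reboots
#> echo '/swapfile none swap sw 0 0' >> /etc/fstab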
I wish that were true but don’t get me started on XFS. I have wasted at least 200 hours of my life in the past decade on XFS causing memory problems in the kernel.
Well, looks like @RGJ was absolutely right about XFS. Thanks for pointing me in the right direction. (I’ve been using XFS as my first-choice filesystem since around 2002, so I’ve always taken for granted that it’s rock solid, which it is as a filesystem, but apparently there are memory-related bugs.) The same issue occurred again after disabling Zram, and then Debian released an update for the 6.1 kernel that includes a patch for crashes with XFS: