I’ve got a Discourse server which has been failing a lot over recent days, and I’m looking for some help in understanding the problem and fixing it.
I’m seeing a lot of OOM killer events, some of which have taken the server down so badly that a reboot was required even to get SSH access.
Disk space is low, with the Redis data on disk having grown to around 20GB at present.
CPU usage is unusually high. Whenever I’ve looked, it is a single ruby process. Looking at its activity with strace shows very little (i.e. it’s not busy making system calls). Looking with ltrace, I see a lot of calls to malloc, memcpy and strlen. I can see some of the contents of the data being manipulated, and it looks like junk data. E.g.
memcpy(0xc000da80, "\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337"..., 87) = 0xc000da80
Note the sequence of byte values: the bytes shown run consecutively from octal 300 to 337 (0xC0 to 0xDF).
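To make the pattern explicit, here is a minimal Python sketch (my own, nothing to do with Discourse itself) that decodes the octal escapes from the ltrace line above and checks whether each byte is one more than the previous one. The helper name `decode_octal_escapes` is just something I made up for illustration, and it only covers the 32 bytes ltrace actually printed; the rest of the 87-byte copy is elided in the trace.

```python
import re

# The string argument exactly as ltrace printed it (octal escapes).
# ltrace truncated the display after 32 bytes, so the rest is unknown.
ltrace_arg = (
    r"\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317"
    r"\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337"
)


def decode_octal_escapes(text):
    """Turn a run of three-digit octal escapes (\\NNN) into raw bytes."""
    return bytes(int(m, 8) for m in re.findall(r"\\([0-7]{3})", text))


data = decode_octal_escapes(ltrace_arg)

# True if every byte is exactly one greater than the byte before it.
is_consecutive = all(b - a == 1 for a, b in zip(data, data[1:]))

print(data.hex())
print(len(data), hex(data[0]), hex(data[-1]), is_consecutive)
# prints: 32 0xc0 0xdf True
```

So the visible part of the buffer is a plain incrementing run of byte values rather than text or anything obviously structured.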
Does anyone know what’s likely to be going on here, or have any suggestions on how to interrogate this further?