Memory is running out and Discourse stops working

Another update on this issue.

I have been working pretty relentlessly on this issue. On Friday I completed step #5 of my 7-step plan, which gives me great visibility into memory usage in our enterprise.

https://github.com/discourse/discourse_docker/commit/03b50438d73dbe6076a5a4179e336afaef2b28c2

I noticed that despite all efforts, memory was still climbing. It was even climbing on containers that are pretty much inactive at the moment (brand new customers).

Having this kind of information is a godsend; it allows one to test various theories.

I spent a bit of time thinking about the trend in the graph. It is constantly going up and totally unrelated to traffic. This ruled out pg and redis as prime candidates (though clearly anything is possible), which left me looking at other C extensions.
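
For anyone who wants to put rough numbers on "constantly going up and unrelated to traffic", here is a minimal sketch. It is not what produced the graph; the helper names `mean`, `slope` and `correlation` are made up for illustration, and it assumes you already have equally spaced memory and traffic samples as plain arrays.

```ruby
# Hypothetical helpers, not part of Discourse or discourse_docker.

def mean(values)
  values.sum.to_f / values.size
end

# Least-squares slope of the samples against their index (0, 1, 2, ...).
# A steadily positive slope on an idle container hints at a leak.
# Needs at least two samples, otherwise the result is NaN.
def slope(samples)
  xs = (0...samples.size).to_a
  mx = mean(xs)
  my = mean(samples)
  num = xs.zip(samples).sum { |x, y| (x - mx) * (y - my) }
  den = xs.sum { |x| (x - mx)**2 }
  num / den
end

# Pearson correlation between two equally long series.
# A value near 0 between memory and traffic suggests the growth is not
# driven by load. NaN if either series is constant.
def correlation(xs, ys)
  mx = mean(xs)
  my = mean(ys)
  cov = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  sx = Math.sqrt(xs.sum { |x| (x - mx)**2 })
  sy = Math.sqrt(ys.sum { |y| (y - my)**2 })
  cov / (sx * sy)
end
```

Feed `slope` the memory readings for one container, and `correlation` the memory readings alongside request counts for the same intervals.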

Previous profiling had already excluded a managed leak: the number of objects in the Ruby heap was simply not growing, and neither was the number of large objects.
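
As a rough illustration of that kind of check (a sketch only, not the actual profiling code): compare the live object count in the Ruby heap with the process RSS over time. If RSS keeps growing while the live object count stays flat, the growth is probably outside the managed heap, for example in a C extension. The helper names below are hypothetical, and the RSS read assumes Linux with `/proc` available.

```ruby
# Hypothetical helpers, not part of Discourse.

# Resident set size in KB, read from /proc (Linux only); nil elsewhere.
def rss_kb
  status = File.read("/proc/self/status")
  status[/^VmRSS:\s+(\d+)\s+kB/, 1]&.to_i
rescue SystemCallError
  nil
end

# Snapshot of Ruby-managed live objects plus process RSS.
def heap_snapshot
  GC.start # collect first so only still-reachable objects are counted
  counts = ObjectSpace.count_objects
  {
    live_objects: counts[:TOTAL] - counts[:FREE],
    rss_kb: rss_kb
  }
end

# Difference between two snapshots; rss_kb is nil when RSS is unavailable.
def growth(before, after)
  rss_delta =
    if before[:rss_kb] && after[:rss_kb]
      after[:rss_kb] - before[:rss_kb]
    end
  {
    live_objects: after[:live_objects] - before[:live_objects],
    rss_kb: rss_delta
  }
end
```

Take one snapshot, let the process run for a few hours, take another, and look at `growth(before, after)`.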

So I thought about message_bus and the way it relies on EventMachine for 30 second polling and other bits and pieces.

I remembered I upgraded EventMachine recently.

https://github.com/discourse/discourse/commit/d1dd0d888a950d6121afdb764aeeaaa35757ede7#diff-e79a60dc6b85309ae70a6ea8261eaf95

Funny thing is that commit was all about limiting memory growth.

Anyway, it appears there is a memory leak in the EventMachine gem, that was recently merged in by @tmm1.

So, I went back to that container set and upgraded one of the containers in the set to the latest version of EventMachine last night, just before I went to sleep.

In the morning I could see this picture:

So I am very cautiously optimistic here. I applied the fix to our code:

https://github.com/discourse/discourse/commit/43375c8b15f95ac3eb4a797b6a99d20f354cc1e6

We are now deploying this to all our customers, then I will be watching memory for the next 24 hours.

If all looks good we will push a new beta tomorrow. If not, well, I will continue hunting this down.

EDIT: a few hours later, this is looking like the real deal across our entire infrastructure.
