Another update on this issue.
I have been working pretty relentlessly on this issue. On Friday I completed step #5 of my 7-step plan, which gives me great visibility into memory usage across our enterprise.
I noticed that despite all efforts, memory was still climbing. It was even climbing on containers that are pretty much inactive at the moment (brand new customers).
Having this kind of information is a godsend; it allows one to test various theories.
I spent a bit of time thinking about the trend in the graph. It is constantly going up and totally unrelated to traffic. This ruled out pg and redis as prime candidates (though clearly anything is possible), which left me looking at other C extensions.
Previous profiling had already excluded a managed leak: the number of objects in the Ruby heap was simply not growing, and neither was the number of large objects.
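The kind of heap check that rules out a managed leak can be sketched roughly like this (illustrative only; the actual profiling used in this investigation may have been more involved):

```ruby
# If a leak lived in Ruby-managed objects, these counters would climb
# between samples. A flat delta pushes suspicion down to C extensions.

def heap_snapshot
  GC.start # stabilize counts before sampling
  counts = ObjectSpace.count_objects
  {
    live_objects: counts[:TOTAL] - counts[:FREE],
    heap_pages: GC.stat(:heap_allocated_pages)
  }
end

before = heap_snapshot
100_000.times { "short-lived string" } # churn that should be fully collected
after = heap_snapshot

puts "live object delta: #{after[:live_objects] - before[:live_objects]}"
```

A steadily growing delta here would point at the Ruby heap; a flat one, as observed, means the growth lives below the Ruby object layer.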
So I thought about message_bus and the way it relies on EventMachine for 30-second polling and other bits and pieces.
I remembered I upgraded EventMachine recently.
Funny thing is that commit was all about limiting memory growth.
Anyway, it appears there is a memory leak in the EventMachine gem, introduced by a commit that was recently merged in by @tmm1.
So, I went back to that container set and upgraded one of the containers in the set to the latest version of EventMachine last night, just before I went to sleep.
In the morning I could see this picture:
So I am very cautiously optimistic here. I applied the fix to our code:
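The actual change isn't shown here, but as a sketch, pinning the gem in the Gemfile to a patched release would look something like this (the version constraint is illustrative, not the actual pin used):

```ruby
# Hypothetical Gemfile sketch: require an EventMachine release that
# includes the leak fix. The exact version here is an assumption.
gem 'eventmachine', '>= 1.0.4'
```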
We are now deploying this to all our customers, then I will be watching memory for the next 24 hours.
If all looks good we will push a new beta tomorrow. If not, well, I will continue hunting this down.
EDIT a few hours later: this is looking like the real deal across our entire infrastructure.