Learning that it started in beta5/6 prompted me to look further back.
Initially I thought it was something we did in the last couple of weeks, but after graphing the memory usage of a month-old build against a current build, nothing stuck out, except that all my recent memory work has made our baseline much better.
I also noticed we had a rogue, quite old sidekiq process using 2GB of memory.
I did notice this fairly recent report on Google Groups about multithreading issues with pg.
Our web workers use 5 threads, which could be triggering some of this.
My plan is:
- Downgrade to old “good” version of pg (done). This unfortunately means this issue is back.
- Amend internal logic in unicorn so we do not run 5 threads and instead do everything from the master thread.
- Create a standalone app that reproduces the memory issue under the latest pg gem and report it to the pg authors
- Work with the pg authors to resolve it, so we can upgrade to the latest version again.
- Deploy extensive memory profiling to our internal infrastructure (in progress) so we can catch this in the future
- Work on cutting down the redis memory requirement, which is quite high now
- Consider building protection against rogue memory usage into our base image
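
As a rough illustration of the last item, protection against rogue memory usage could start as a small watchdog that compares each process's resident memory against a limit. This is only a sketch under stated assumptions: the helper names (`rss_kb`, `over_limit?`, `rogue_pids`) are hypothetical, the 2GB limit simply mirrors the rogue sidekiq figure above, and it assumes a Unix `ps` is available. It is not what is actually deployed.

```ruby
# Hypothetical limit: 2GB expressed in kilobytes (ps reports RSS in KB).
RSS_LIMIT_KB = 2 * 1024 * 1024

# Resident set size of a process in kilobytes.
# Returns nil if the process is gone or `ps` is unavailable.
def rss_kb(pid)
  out = `ps -o rss= -p #{Integer(pid)}`.strip
  out.empty? ? nil : out.to_i
rescue Errno::ENOENT
  nil
end

# True when a known RSS value exceeds the limit; unknown (nil) never counts.
def over_limit?(rss, limit_kb = RSS_LIMIT_KB)
  !rss.nil? && rss > limit_kb
end

# Select the pids whose current RSS is above the limit.
def rogue_pids(pids, limit_kb = RSS_LIMIT_KB)
  pids.select { |pid| over_limit?(rss_kb(pid), limit_kb) }
end

# Possible use from a supervisor loop (worker_pids is a hypothetical list):
#   rogue_pids(worker_pids).each { |pid| Process.kill("TERM", pid) }
```

Sending `TERM` rather than `KILL` gives a worker the chance to shut down cleanly before its supervisor replaces it; whether that is appropriate depends on how the workers handle signals.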