Memory is running out and Discourse stops working

### Possibly of note
It appears this same string is repeated several times

Regexp: size 9844 (?-mix:(?:[a-zA-Z][\-+.a-zA-Z\d]*:(?:(?:\/\/(?:(?:(?:[\-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*@)?(?:(?:[a-zA-Z0-9\-.]|%\h\h)+|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\])(?::\d*)?|(?:[\-_.!~*'()a-zA-Z\d$,;:@&=+]|%[a-fA-F\d]{2})+)(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()

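One way to confirm that the same pattern really is being compiled repeatedly (a sketch, assuming MRI, where `ObjectSpace.each_object` is available) is to tally live Regexp objects by their source string; a pattern like the 9844-byte URI regexp above would then show a count greater than one:

```ruby
# Sketch: group live Regexp objects by source so duplicated compilations
# stand out. MRI-only; run inside the suspect process (e.g. via a console).
counts = Hash.new(0)
ObjectSpace.each_object(Regexp) { |re| counts[re.source] += 1 }

# Show the five most-duplicated patterns, largest count first.
counts.sort_by { |_, n| -n }.first(5).each do |src, n|
  puts format("%4dx %6d bytes  %s...", n, src.bytesize, src[0, 40])
end
```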

I don’t know if this is related, but I’m getting server error 500s when I try to post replies. The site itself is up and appears to be working, but posting messages doesn’t work.

I had restarted since I posted the original, but here’s the first one since reboot yesterday:

CitizensCode2GB-app pid:2380 unicorn worker 1 -E production -c config/unicorn.conf.rb

GC STATS:
count: 53
heap_used: 2358
heap_length: 4071
heap_increment: 1713
heap_live_num: 497618
heap_free_num: 461613
heap_final_num: 0
total_allocated_object: 8271702
total_freed_object: 7774084

Objects:
TOTAL: 959228
FREE: 461563
T_STRING: 271227
T_DATA: 63980
T_ARRAY: 54434
T_NODE: 39463
T_HASH: 25141
T_OBJECT: 24011
T_CLASS: 9875
T_REGEXP: 3486
T_ICLASS: 3100
T_MODULE: 1450
T_RATIONAL: 895
T_STRUCT: 554
T_MATCH: 13
T_BIGNUM: 12
T_FILE: 12
T_FLOAT: 11
T_COMPLEX: 1

Process Info:
Name:	ruby
State:	S (sleeping)
Tgid:	2380
Ngid:	0
Pid:	2380
PPid:	4722
TracerPid:	0
Uid:	1000	1000	1000	1000
Gid:	33	33	33	33
FDSize:	64
Groups:	33 
VmPeak:	  483504 kB
VmSize:	  483500 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	  235200 kB
VmRSS:	  234256 kB
VmData:	  261876 kB
VmStk:	     136 kB
VmExe:	    2736 kB
VmLib:	   25892 kB
VmPTE:	     952 kB
VmSwap:	       0 kB
Threads:	7
SigQ:	0/31457
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000008301801
SigCgt:	0000000182006646
CapInh:	00000000a80425fb
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	00000000a80425fb
Seccomp:	0
Cpus_allowed:	3
Cpus_allowed_list:	0-1
Mems_allowed:	00000000,00000001
Mems_allowed_list:	0
voluntary_ctxt_switches:	2954
nonvoluntary_ctxt_switches:	5218


Classes:
Class report omitted use ?full=1 to include it
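A dump like the one above can be produced from inside a running MRI process; this is a minimal sketch using `GC.stat` and `ObjectSpace.count_objects` (key names vary across Ruby versions, and Discourse's actual stats endpoint may differ):

```ruby
# Sketch: print GC counters and a per-type object census, roughly in the
# shape of the report above. Exact GC.stat keys depend on the Ruby version.
puts "GC STATS:"
GC.stat.each { |key, value| puts "#{key}: #{value}" }

puts
puts "Objects:"
ObjectSpace.count_objects.sort_by { |_, n| -n }.each do |type, count|
  puts "#{type}: #{count}"
end
```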

This problem has been bothering me for more than a month, ever since I started using Discourse.

Every time I try to install ImageMagick over SSH, it shows “Fatal Error ************”.

There are also some other problems that I think are related. Sometimes when I post a thread and upload photos, it says “Sorry, there was an error uploading that file. Please try again.”, but other times it works with no problem.

These problems showed up on both beta5 and beta6. I installed Discourse on DigitalOcean (1 GB plan) following the exact process on discourse.org. The only difference is that I didn’t set up swap at first, but installed it later. Still not solved.

I really want to know what is going on. Or does this script only run on a 2 GB VPS plan or above?

No, there were just some memory bugs in the very latest releases that @sam is working on.

9 hours later:

CitizensCode2GB-app pid:2380 unicorn worker 1 -E production -c config/unicorn.conf.rb

GC STATS:
count: 111
heap_used: 2358
heap_length: 4071
heap_increment: 1713
heap_live_num: 544639
heap_free_num: 414592
heap_final_num: 0
total_allocated_object: 29252624
total_freed_object: 28707985

Objects:
TOTAL: 959228
FREE: 414542
T_STRING: 296617
T_DATA: 68783
T_ARRAY: 63205
T_NODE: 44307
T_HASH: 25507
T_OBJECT: 23697
T_CLASS: 12034
T_REGEXP: 3721
T_ICLASS: 3522
T_MODULE: 1746
T_RATIONAL: 894
T_STRUCT: 603
T_MATCH: 13
T_BIGNUM: 13
T_FILE: 12
T_FLOAT: 11
T_COMPLEX: 1

Process Info:
Name:	ruby
State:	S (sleeping)
Tgid:	2380
Ngid:	0
Pid:	2380
PPid:	4722
TracerPid:	0
Uid:	1000	1000	1000	1000
Gid:	33	33	33	33
FDSize:	64
Groups:	33 
VmPeak:	  624932 kB
VmSize:	  624928 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	  380996 kB
VmRSS:	  380252 kB
VmData:	  401140 kB
VmStk:	     136 kB
VmExe:	    2736 kB
VmLib:	   25984 kB
VmPTE:	    1232 kB
VmSwap:	       0 kB
Threads:	7
SigQ:	0/31457
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000008301801
SigCgt:	0000000182006646
CapInh:	00000000a80425fb
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	00000000a80425fb
Seccomp:	0
Cpus_allowed:	3
Cpus_allowed_list:	0-1
Mems_allowed:	00000000,00000001
Mems_allowed_list:	0
voluntary_ctxt_switches:	18709
nonvoluntary_ctxt_switches:	35623


Classes:
Class report omitted use ?full=1 to include it

We think this is the pg (Postgres) gem, which was recently upgraded (to fix the rare encoding errors we saw); the bug / memory leak is in the C code there.

The information that it started in beta5/6 prompted me to look further back.

Initially I thought it was something we did in the last couple of weeks, but after graphing the memory performance of a month-old build against a current build, nothing stood out, except that all my recent memory work made our baseline much better.

I also noticed we had a rogue Sidekiq process which was quite old, with runaway memory usage of 2 GB.

I did notice this fairly recent report on Google Groups about multithreading issues with pg.

Our web workers use 5 threads, which could be triggering some of this.
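One generic way to stop several threads from touching a non-thread-safe C extension (a hypothetical sketch, not Discourse's actual code) is to funnel all such calls through a single thread via a queue:

```ruby
# Sketch: instead of 5 worker threads each calling into the extension,
# they push jobs onto a queue that a single dedicated thread drains.
JOBS = Queue.new

worker = Thread.new do
  while (job = JOBS.pop)   # nil acts as the shutdown sentinel
    job.call
  end
end

results = Queue.new
5.times { |i| JOBS.push(-> { results.push(i * i) }) }  # stand-in for real work
JOBS.push(nil)  # tell the drain thread to stop
worker.join
```

The trade-off is that the single thread becomes a serialization point, which is acceptable when the extension calls are short-lived.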

My plan is:

  1. Downgrade to the old “good” version of pg (done). This unfortunately means this issue is back.
  2. Amend internal logic in unicorn so we do not run 5 threads and instead do everything from the master thread.
  3. Create a standalone app that reproduces the memory issue under the latest pg gem and report it to the pg maintainers.
  4. Work with the pg authors to resolve it, so we can again upgrade to the latest.
  5. Deploy extensive memory profiling to our internal infrastructure (in progress) so we can catch this in future.
  6. Work on cutting down Redis memory requirements, which are quite high now.
  7. Consider building protection against runaway memory usage into our base image.
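The last item could, hypothetically, look something like this (a sketch, not Discourse's implementation; Linux-only, and the threshold is invented): read a worker's resident set size from /proc and ask it to exit gracefully when it grows too large, so the unicorn master forks a fresh one.

```ruby
# Hypothetical watchdog sketch: restart a worker whose RSS passes a limit.
MAX_RSS_KB = 400_000  # assumed limit; tune per host

# Resident set size in kB, read from the Linux proc filesystem.
def rss_kb(pid = Process.pid)
  File.read("/proc/#{pid}/status")[/^VmRSS:\s+(\d+)/, 1].to_i
end

# Returns true if it signaled the worker; QUIT asks a unicorn worker to
# finish its in-flight request and exit cleanly.
def restart_if_bloated(pid = Process.pid)
  return false unless rss_kb(pid) > MAX_RSS_KB
  Process.kill(:QUIT, pid)
  true
end
```

A check like this could run periodically from the master process, or from an out-of-band hook after each request.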
Updating as per public notice, keeping a log here as per previous activity.

Before: https://gist.github.com/DeanMarkTaylor/a3d3c36658bd95807c0f
System Update: https://gist.github.com/DeanMarkTaylor/09897ff1e207809e159c
After: https://gist.github.com/DeanMarkTaylor/ba150541457bf60a0cdc

Noting that during the system update the following message was displayed:

The following packages have been kept back:
  lxc-docker

Can you give me a specific process to solve this problem? I am not familiar with Ruby on Rails, and the problem is happening more and more often.

My site is brand new, with only 1 user and fewer than 100 posts. I followed the tutorial on discourse.org to install on the DigitalOcean 1 GB plan and did nothing more, but my memory usage is super high (with the site doing nothing).

Did you rebuild the container to get to 1.2.0.beta8?

Another update on this issue.

I have been working pretty relentlessly on this issue. On Friday I completed step #5 of my 7-step plan, which gives me great visibility into memory usage across our enterprise.

https://github.com/discourse/discourse_docker/commit/03b50438d73dbe6076a5a4179e336afaef2b28c2

I noticed that despite all efforts, memory was still climbing. It was even climbing on containers that are pretty much inactive at the moment (brand-new customers).

Having this kind of information is a godsend; it allows one to test various theories.

I spent a bit of time thinking about the trend in the graph. It is constantly going up and totally unrelated to traffic. This ruled out pg and Redis as prime candidates (though clearly anything is possible), which left me looking at other C extensions.

Previous profiling had already excluded a managed leak: the number of objects in the Ruby heap was simply not growing, and the number of large objects was not growing either.
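That distinction can be checked directly (a sketch assuming Linux /proc): if the Ruby-managed counts stay flat between samples while VmRSS keeps climbing, the leak must be in native (C extension) allocations rather than in Ruby objects.

```ruby
# Sketch: sample the managed heap and the whole-process RSS together.
# Compare successive samples over time to tell a managed leak from a
# native one.
objs = ObjectSpace.count_objects
live = objs[:TOTAL] - objs[:FREE]  # Ruby-managed slots in use
rss  = File.read("/proc/self/status")[/^VmRSS:\s+(\d+)/, 1].to_i  # whole process, kB
puts "live Ruby objects: #{live}, process RSS: #{rss} kB"
```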

So I thought about message_bus and the way it relies on EventMachine for 30-second polling and other bits and pieces.

I remembered I upgraded EventMachine recently.

https://github.com/discourse/discourse/commit/d1dd0d888a950d6121afdb764aeeaaa35757ede7#diff-e79a60dc6b85309ae70a6ea8261eaf95

Funny thing is that commit was all about limiting memory growth.

Anyway, it appears there is a memory leak in the EventMachine gem that was recently merged in by @tmm1.

So, I went back to that container set and upgraded one of the containers in the set to the latest version of EventMachine last night, just before I went to sleep.

In the morning I could see this picture:

So I am very cautiously optimistic here. I applied the fix to our code:

https://github.com/discourse/discourse/commit/43375c8b15f95ac3eb4a797b6a99d20f354cc1e6

We are now deploying this to all our customers, then I will be watching memory for the next 24 hours.

If all looks good we will push a new beta tomorrow. If not, well, I will continue hunting this down.

EDIT, a few hours later: this is looking like the real deal across our entire infrastructure.

Confirmed that this plugged the leak; the break in the graph is when I updated:

(this is just a log of activity on my instance as I have been reporting in this thread)

### Prior to @sam’s EventMachine update
I noted the ImageMagick message in admin today; attempting to check memory stats returned blank pages, although the forum was responsive.

I thought I would take this time to update to the latest with the EventMachine update.

My instance had been running for approx. 71 hours before noticing the memory issue again.
(between approx. 2015-02-13T16:58:24 and 2015-02-16T16:01:01)

### Instance Updated
The instance is now updated to:
Discourse 1.2.0.beta8 - version 3cad4824d74e8ff3cfdc20d79f300904d73cd73e

Here are the usual before and afters:
Before: https://gist.github.com/DeanMarkTaylor/9ceabc4ff4960f847570
After: https://gist.github.com/DeanMarkTaylor/face341110492c4552ec

This is now resolved, confirmed via multiple sources.

If you are having trouble updating to latest:

cd /var/discourse
git pull
./launcher rebuild app