Memory is running out and Discourse stops working

The thing about this is that all the numbers except RSS here are in line with what we would expect. Ruby heaps look good, VmPeak looks good.

Can you gather multiple snapshots at, say, 3-hour intervals from the same PID and paste them here?
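
For example, a rough way to capture those snapshots (just a sketch of the idea, not an official Discourse tool) is a tiny script that dumps /proc/&lt;pid&gt;/status for one fixed worker PID on a timer, so VmRSS / VmSize for the same process can be compared over time:

```ruby
# Rough sketch: snapshot /proc/<pid>/status for one fixed unicorn worker PID
# every 3 hours; each snapshot is written to its own timestamped file.
pid      = ARGV[0] or abort "usage: ruby snapshot.rb <unicorn worker pid>"
interval = 3 * 60 * 60   # 3 hours

loop do
  stamp = Time.now.strftime("%Y%m%d-%H%M%S")
  File.write("memstat-#{pid}-#{stamp}.txt", File.read("/proc/#{pid}/status"))
  sleep interval
end
```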

@DeanMarkTaylor I fixed the localization stuff; we no longer load all locales, we just load them on demand as we need them.
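
A minimal sketch of the on-demand pattern (illustrative only, not Discourse's actual implementation): instead of parsing every locale file at boot, each locale's YAML is loaded the first time a lookup for that locale happens.

```ruby
require 'i18n'
require 'yaml'

# Illustrative lazy backend: defer loading a locale's YAML until first lookup.
class LazyLocaleBackend < I18n::Backend::Simple
  def initialize(locale_dir)
    super()
    @locale_dir = locale_dir   # assumed layout: <dir>/en.yml, <dir>/fr.yml, ...
    @loaded = {}
  end

  protected

  def lookup(locale, key, scope = [], options = {})
    load_locale(locale) unless @loaded[locale]
    super
  end

  def load_locale(locale)
    path = File.join(@locale_dir, "#{locale}.yml")
    store_translations(locale, YAML.load_file(path)[locale.to_s] || {}) if File.exist?(path)
    @loaded[locale] = true
  end
end

# I18n.backend = LazyLocaleBackend.new("config/locales")
# I18n.t("greeting", locale: :fr)   # fr.yml is parsed only on first use
```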


Each time I refresh /admin/memory_stats I get a new PID. Do you want me to take a screenshot of the htop screen, or is there a better way? If it helps:

There are actually only 3 processes involved there, so reload it a few times and you will get the same PID.

Memory looks OK in the screenshot you just sent. You have one unicorn consuming 315MB, which is fine; it has already clocked an hour of CPU time, so I assume it has been running for quite a few hours.

I’ll update in that case, @sam.

##Before Update
###System Info

  System information as of Wed Feb 11 05:05:39 EST 2015

  System load:  0.08               Processes:              120
  Usage of /:   49.5% of 39.25GB   Users logged in:        0
  Memory usage: 65%                IP address for eth0:    
  Swap usage:   2%                 IP address for docker0: 

###Memory Stats
https://gist.github.com/DeanMarkTaylor/00e65b9dda0f4513086c
###Host Top

Tasks: 119 total,   1 running, 118 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.9 us,  1.8 sy,  0.0 ni, 91.3 id,  0.9 wa,  0.1 hi,  0.0 si,  0.0 st
KiB Mem:   2049988 total,  1981412 used,    68576 free,     1376 buffers
KiB Swap:  2097148 total,   502476 used,  1594672 free.   306772 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20   0   33492    968    100 S   0.0  0.0   2:09.63 init
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.78 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   1:47.82 ksoftirqd/0
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
    7 root      20   0       0      0      0 S   0.0  0.0  25:47.11 rcu_sched
    8 root      20   0       0      0      0 S   0.0  0.0  12:24.95 rcuos/0
    9 root      20   0       0      0      0 S   0.0  0.0  11:33.04 rcuos/1
   10 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
   11 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/0
   12 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/1
   13 root      rt   0       0      0      0 S   0.0  0.0   0:35.58 migration/0
   14 root      rt   0       0      0      0 S   0.0  0.0   0:21.10 watchdog/0
   15 root      rt   0       0      0      0 S   0.0  0.0   0:16.24 watchdog/1

###Instance Top

top - 07:14:20 up 24 days,  3:04,  0 users,  load average: 0.64, 0.26, 0.20
Tasks:  37 total,   1 running,  36 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.9 us,  1.8 sy,  0.0 ni, 91.3 id,  0.9 wa,  0.1 hi,  0.0 si,  0.0 st
KiB Mem:   2049988 total,  1980556 used,    69432 free,     3700 buffers
KiB Swap:  2097148 total,   502676 used,  1594472 free.   307648 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20   0   21092     96     64 S   0.0  0.0   0:00.04 boot
   28 root      20   0     188     12      0 S   0.0  0.0   0:02.35 runsvdir
   29 root      20   0     168     12      0 S   0.0  0.0   0:00.02 runsv
   30 root      20   0     168     12      0 S   0.0  0.0   0:00.00 runsv
   31 root      20   0     168     12      0 S   0.0  0.0   0:00.01 runsv
   32 root      20   0     168     12      0 S   0.0  0.0   0:00.01 runsv
   33 root      20   0     168     16      0 S   0.0  0.0   0:00.01 runsv
   34 root      20   0     168     12      0 S   0.0  0.0   0:00.00 runsv
   35 root      20   0     168     12      0 S   0.0  0.0   0:00.01 runsv
   36 discour+  20   0   30752   1700    472 S   0.0  0.1   1:20.89 unicorn_launche
   37 postgres  20   0  386204  12492  11984 S   0.0  0.6   0:09.94 postmaster
   38 root      20   0  121152    356    128 S   0.0  0.0   0:00.04 nginx
   39 syslog    20   0  180148    688    260 S   0.0  0.0   0:01.00 rsyslogd
   41 redis     20   0  431496 126476    664 S   0.0  6.2   5:57.01 redis-server
   42 root      20   0   26776    272    152 S   0.0  0.0   0:00.73 cron
   43 root      20   0   61364    116     12 S   0.0  0.0   0:00.05 sshd
   54 www-data  20   0  122116   1528    488 S   0.0  0.1   0:23.39 nginx
   55 www-data  20   0  122184   1620    560 S   0.0  0.1   0:22.30 nginx
   56 www-data  20   0  122088   1500    640 S   0.0  0.1   0:22.11 nginx
   57 www-data  20   0  122100   1760    724 S   0.0  0.1   0:25.52 nginx
   58 www-data  20   0  121332    336     36 S   0.0  0.0   0:01.92 nginx
   61 postgres  20   0  386440 166292 165548 S   0.0  8.1   0:10.14 postmaster
   62 postgres  20   0  386336  74600  73940 S   0.0  3.6   0:12.68 postmaster
   63 postgres  20   0  386336   7800   7344 S   0.0  0.4   0:19.65 postmaster
   64 postgres  20   0  387156   3180   2432 S   0.0  0.2   0:04.60 postmaster
   65 postgres  20   0  103908    940    248 S   0.0  0.0   0:13.42 postmaster
   66 discour+  20   0  449576 142456    892 S   0.0  6.9   3:36.46 ruby
  119 discour+  20   0 1221376 344980   2692 S   0.0 16.8  13:05.36 ruby
  129 discour+  20   0 1224464 360064   3292 S   0.0 17.6  12:53.68 ruby
  142 discour+  20   0 1231636 367464   3184 S   0.0 17.9  13:11.65 ruby
29749 postgres  20   0  394588  11108   8716 S   0.0  0.5   0:00.06 postmaster
29750 postgres  20   0  395040  72564  69584 S   0.0  3.5   0:00.37 postmaster
29800 postgres  20   0  394588  10996   8620 S   0.0  0.5   0:00.04 postmaster
29847 root      20   0   21272   1988   1576 S   0.0  0.1   0:00.01 bash
29859 discour+  20   0   20036    836    624 S   0.0  0.0   0:00.00 sleep
29860 root      20   0   22976   1404   1020 R   0.0  0.1   0:00.00 top
31161 discour+  20   0 1218332 201044   2488 S   0.0  9.8   7:11.33 ruby

##After Update
###Build

root@forum:/var/discourse# ./launcher rebuild app

Begin: ~2015-02-12T07:15:56.103858
End: ~2015-02-12T07:26:59.543223

###Discourse Version
Discourse 1.2.0.beta6 - https://github.com/discourse/discourse version 5f8e604abc4a99df267b2d4e6544678040ab1ea6

###System Info

  System information as of Thu Feb 12 02:02:44 EST 2015

  System load:  0.11               Processes:              120
  Usage of /:   49.7% of 39.25GB   Users logged in:        0
  Memory usage: 82%                IP address for eth0:    
  Swap usage:   23%                IP address for docker0: 

###Memory Stats
https://gist.github.com/DeanMarkTaylor/07a461fd710135f3ea20

###Host Top

top - 02:39:16 up 24 days,  3:29,  1 user,  load average: 0.30, 0.30, 0.43
Tasks: 117 total,   1 running, 116 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.9 us,  1.8 sy,  0.0 ni, 91.3 id,  0.9 wa,  0.1 hi,  0.0 si,  0.0 st
KiB Mem:   2049988 total,  1965492 used,    84496 free,    13096 buffers
KiB Swap:  2097148 total,    42196 used,  2054952 free.   758272 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20   0   33492   1736    664 S   0.0  0.1   2:10.69 init
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.78 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   1:47.91 ksoftirqd/0
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
    7 root      20   0       0      0      0 S   0.0  0.0  25:50.30 rcu_sched
    8 root      20   0       0      0      0 S   0.0  0.0  12:26.52 rcuos/0
    9 root      20   0       0      0      0 S   0.0  0.0  11:34.35 rcuos/1
   10 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
   11 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/0
   12 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/1
   13 root      rt   0       0      0      0 S   0.0  0.0   0:35.93 migration/0
   14 root      rt   0       0      0      0 S   0.0  0.0   0:21.11 watchdog/0
   15 root      rt   0       0      0      0 S   0.0  0.0   0:16.26 watchdog/1
   16 root      rt   0       0      0      0 S   0.0  0.0   0:38.39 migration/1
   17 root      20   0       0      0      0 S   0.0  0.0   0:35.07 ksoftirqd/1
   19 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/1:0H
   20 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 khelper
   21 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kdevtmpfs
   22 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 netns
   23 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 writeback
   24 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kintegrityd
   25 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 bioset
   27 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kblockd
   28 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 ata_sff
   29 root      20   0       0      0      0 S   0.0  0.0   0:00.00 khubd
   30 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 md
   31 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 devfreq_wq
   35 root      20   0       0      0      0 S   0.0  0.0   0:11.46 khungtaskd
   36 root      20   0       0      0      0 S   0.0  0.0  88:40.83 kswapd0
   37 root      25   5       0      0      0 S   0.0  0.0   0:00.00 ksmd
   38 root      39  19       0      0      0 S   0.0  0.0   1:05.92 khugepaged
   39 root      20   0       0      0      0 S   0.0  0.0   0:00.00 fsnotify_mark
   40 root      20   0       0      0      0 S   0.0  0.0   0:00.00 ecryptfs-kthrea
   41 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 crypto
   53 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kthrotld
   55 root      20   0       0      0      0 S   0.0  0.0   0:00.00 vballoon
   56 root      20   0       0      0      0 S   0.0  0.0   0:00.00 scsi_eh_0
   57 root      20   0       0      0      0 S   0.0  0.0   0:00.00 scsi_eh_1
   78 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 deferwq
   79 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 charger_manager
  121 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kpsmoused
  134 root      20   0       0      0      0 S   0.0  0.0   2:05.92 jbd2/vda1-8
  135 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 ext4-rsv-conver

###Instance Top

top - 07:40:25 up 24 days,  3:30,  0 users,  load average: 0.25, 0.28, 0.42
Tasks:  39 total,   1 running,  38 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.9 us,  1.8 sy,  0.0 ni, 91.3 id,  0.9 wa,  0.1 hi,  0.0 si,  0.0 st
KiB Mem:   2049988 total,  1977248 used,    72740 free,    19988 buffers
KiB Swap:  2097148 total,    42176 used,  2054972 free.   759972 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20   0   21092   1496   1240 S   0.0  0.1   0:00.04 boot
   27 root      20   0     188     36     20 S   0.0  0.0   0:00.02 runsvdir
   28 root      20   0     168      4      0 S   0.0  0.0   0:00.00 runsv
   29 root      20   0     168      4      0 S   0.0  0.0   0:00.00 runsv
   30 root      20   0     168      4      0 S   0.0  0.0   0:00.01 runsv
   31 root      20   0     168      4      0 S   0.0  0.0   0:00.00 runsv
   32 root      20   0     168      4      0 S   0.0  0.0   0:00.00 runsv
   33 root      20   0     168      4      0 S   0.0  0.0   0:00.00 runsv
   34 root      20   0     168      4      0 S   0.0  0.0   0:00.00 runsv
   35 root      20   0  121152   3712   2444 S   0.0  0.2   0:00.05 nginx
   36 root      20   0   26776   1264   1012 S   0.0  0.1   0:00.01 cron
   37 syslog    20   0  180148   1304    888 S   0.0  0.1   0:00.03 rsyslogd
   38 postgres  20   0  386204  19724  18348 S   0.0  1.0   0:00.21 postmaster
   39 redis     20   0  427400 396436   1376 S   0.0 19.3   0:09.02 redis-server
   40 root      20   0   61364   2644   1968 S   0.0  0.1   0:00.03 sshd
   42 discour+  20   0   29768   3860   1472 S   0.0  0.2   0:01.14 unicorn_launche
   53 www-data  20   0  122080   3084   1032 S   0.0  0.2   0:00.27 nginx
   54 www-data  20   0  121776   2916   1052 S   0.0  0.1   0:00.11 nginx
   55 www-data  20   0  122080   3140   1060 S   0.0  0.2   0:00.28 nginx
   56 www-data  20   0  121752   2912   1176 S   0.0  0.1   0:00.18 nginx
   57 www-data  20   0  121332   1924    472 S   0.0  0.1   0:00.00 nginx
   60 postgres  20   0  386336   5212   3720 S   0.0  0.3   0:00.03 postmaster
   61 postgres  20   0  386336   3960   2552 S   0.0  0.2   0:00.18 postmaster
   62 postgres  20   0  386336   2676   1296 S   0.0  0.1   0:00.26 postmaster
   63 postgres  20   0  387156   3004   1036 S   0.0  0.1   0:00.04 postmaster
   64 postgres  20   0  103908   1928    356 S   0.0  0.1   0:00.12 postmaster
   65 discour+  20   0  392232 156404   9744 S   0.0  7.6   0:15.67 ruby
   95 discour+  20   0  488716 165048   5840 S   0.0  8.1   0:06.15 ruby
  123 discour+  20   0  423188 168916   5496 S   0.0  8.2   0:13.89 ruby
  130 discour+  20   0  414996 164064   5632 S   0.0  8.0   0:10.00 ruby
  138 discour+  20   0  995072 177876   8060 S   0.0  8.7   0:12.52 ruby
  859 postgres  20   0  406156 135496 123140 S   0.0  6.6   0:03.89 postmaster
  860 postgres  20   0  406160  90216  78296 S   0.0  4.4   0:02.78 postmaster
  999 postgres  20   0  406428 134840 122644 S   0.0  6.6   0:02.74 postmaster
 1023 postgres  20   0  397976 122332 115852 S   0.0  6.0   0:02.13 postmaster
 1167 postgres  20   0  397332  79632  74216 S   0.0  3.9   0:03.69 postmaster
 1209 root      20   0   21272   1984   1576 S   0.0  0.1   0:00.01 bash
 1221 root      20   0   22976   1400   1020 R   0.0  0.1   0:00.00 top
 1222 discour+  20   0   20036    832    620 S   0.0  0.0   0:00.00 sleep

###Notes
Translations do seem to be gone from the initial memory load - nice one.

###Possibly of note
It appears this same string is repeated several times; a quick way to check for such duplicate patterns is sketched after the dumps below:

Regexp: size 9844 (?-mix:(?:[a-zA-Z][\-+.a-zA-Z\d]*:(?:(?:\/\/(?:(?:(?:[\-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*@)?(?:(?:[a-zA-Z0-9\-.]|%\h\h)+|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\])(?::\d*)?|(?:[\-_.!~*'()a-zA-Z\d$,;:@&=+]|%[a-fA-F\d]{2})+)(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()

Regexp: size 9844 (?-mix:(?:[a-zA-Z][\-+.a-zA-Z\d]*:(?:(?:\/\/(?:(?:(?:[\-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*@)?(?:(?:[a-zA-Z0-9\-.]|%\h\h)+|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\])(?::\d*)?|(?:[\-_.!~*'()a-zA-Z\d$,;:@&=+]|%[a-fA-F\d]{2})+)(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()

Regexp: size 9844 (?-mix:(?:[a-zA-Z][\-+.a-zA-Z\d]*:(?:(?:\/\/(?:(?:(?:[\-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*@)?(?:(?:[a-zA-Z0-9\-.]|%\h\h)+|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\])(?::\d*)?|(?:[\-_.!~*'()a-zA-Z\d$,;:@&=+]|%[a-fA-F\d]{2})+)(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()

Regexp: size 9844 (?-mix:(?:[a-zA-Z][\-+.a-zA-Z\d]*:(?:(?:\/\/(?:(?:(?:[\-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*@)?(?:(?:[a-zA-Z0-9\-.]|%\h\h)+|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\])(?::\d*)?|(?:[\-_.!~*'()a-zA-Z\d$,;:@&=+]|%[a-fA-F\d]{2})+)(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()
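
A hypothetical check (not part of the original report) for this kind of duplication is to group live Regexp objects by their source and list the patterns that exist more than once:

```ruby
# Group live Regexp objects by source and print the most-duplicated patterns.
dupes = ObjectSpace.each_object(Regexp)
                   .group_by(&:source)
                   .select { |_source, objs| objs.size > 1 }
                   .sort_by { |_source, objs| -objs.size }

dupes.first(10).each do |source, objs|
  puts "#{objs.size} copies, #{source.bytesize} bytes: #{source[0, 60]}..."
end
```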

I don’t know if this is related, but I’m getting 500 server errors when I try to post replies. The site itself is up and appears to be working, but posting messages doesn’t work.

I had restarted since I posted the original, but here’s the first one since the reboot yesterday:

CitizensCode2GB-app pid:2380 unicorn worker[1] -E production -c config/unicorn.conf.rb

GC STATS:
count: 53
heap_used: 2358
heap_length: 4071
heap_increment: 1713
heap_live_num: 497618
heap_free_num: 461613
heap_final_num: 0
total_allocated_object: 8271702
total_freed_object: 7774084

Objects:
TOTAL: 959228
FREE: 461563
T_STRING: 271227
T_DATA: 63980
T_ARRAY: 54434
T_NODE: 39463
T_HASH: 25141
T_OBJECT: 24011
T_CLASS: 9875
T_REGEXP: 3486
T_ICLASS: 3100
T_MODULE: 1450
T_RATIONAL: 895
T_STRUCT: 554
T_MATCH: 13
T_BIGNUM: 12
T_FILE: 12
T_FLOAT: 11
T_COMPLEX: 1

Process Info:
Name:	ruby
State:	S (sleeping)
Tgid:	2380
Ngid:	0
Pid:	2380
PPid:	4722
TracerPid:	0
Uid:	1000	1000	1000	1000
Gid:	33	33	33	33
FDSize:	64
Groups:	33 
VmPeak:	  483504 kB
VmSize:	  483500 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	  235200 kB
VmRSS:	  234256 kB
VmData:	  261876 kB
VmStk:	     136 kB
VmExe:	    2736 kB
VmLib:	   25892 kB
VmPTE:	     952 kB
VmSwap:	       0 kB
Threads:	7
SigQ:	0/31457
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000008301801
SigCgt:	0000000182006646
CapInh:	00000000a80425fb
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	00000000a80425fb
Seccomp:	0
Cpus_allowed:	3
Cpus_allowed_list:	0-1
Mems_allowed:	00000000,00000001
Mems_allowed_list:	0
voluntary_ctxt_switches:	2954
nonvoluntary_ctxt_switches:	5218


Classes:
Class report omitted use ?full=1 to include it

This problem has been plaguing me for more than a month, ever since I started using Discourse.

Every time I try to install ImageMagick over SSH, it shows “Fatal Error ************”.

There are also some other problems which I think are related to this. Sometimes when I post a thread and upload photos, it says “Sorry, there was an error uploading that file. Please try again.”, but other times it works with no problem.

These problems show up on both beta5 and beta6. I installed Discourse on DigitalOcean (1GB plan), following the exact process on discourse.org. The only difference is that I didn’t set up swap at first, but added it later. Still not solved.

I really want to know what is going on. Or does this script only run on a 2GB or larger VPS plan?

No, there were just some memory bugs in the very latest releases that @sam is working on.


9 hours later:

CitizensCode2GB-app pid:2380 unicorn worker[1] -E production -c config/unicorn.conf.rb

GC STATS:
count: 111
heap_used: 2358
heap_length: 4071
heap_increment: 1713
heap_live_num: 544639
heap_free_num: 414592
heap_final_num: 0
total_allocated_object: 29252624
total_freed_object: 28707985

Objects:
TOTAL: 959228
FREE: 414542
T_STRING: 296617
T_DATA: 68783
T_ARRAY: 63205
T_NODE: 44307
T_HASH: 25507
T_OBJECT: 23697
T_CLASS: 12034
T_REGEXP: 3721
T_ICLASS: 3522
T_MODULE: 1746
T_RATIONAL: 894
T_STRUCT: 603
T_MATCH: 13
T_BIGNUM: 13
T_FILE: 12
T_FLOAT: 11
T_COMPLEX: 1

Process Info:
Name:	ruby
State:	S (sleeping)
Tgid:	2380
Ngid:	0
Pid:	2380
PPid:	4722
TracerPid:	0
Uid:	1000	1000	1000	1000
Gid:	33	33	33	33
FDSize:	64
Groups:	33 
VmPeak:	  624932 kB
VmSize:	  624928 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	  380996 kB
VmRSS:	  380252 kB
VmData:	  401140 kB
VmStk:	     136 kB
VmExe:	    2736 kB
VmLib:	   25984 kB
VmPTE:	    1232 kB
VmSwap:	       0 kB
Threads:	7
SigQ:	0/31457
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000008301801
SigCgt:	0000000182006646
CapInh:	00000000a80425fb
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	00000000a80425fb
Seccomp:	0
Cpus_allowed:	3
Cpus_allowed_list:	0-1
Mems_allowed:	00000000,00000001
Mems_allowed_list:	0
voluntary_ctxt_switches:	18709
nonvoluntary_ctxt_switches:	35623


Classes:
Class report omitted use ?full=1 to include it

We think this is the pg (postgres) gem, which was recently upgraded (to fix the rare encoding errors we saw); the bug / memory leak is in the C code there.

The information that it started in beta5/6 prompted me to look further back.

Initially I thought it was something we did in the last couple of weeks, but after graphing the memory performance of a month-old build against a current build, nothing stuck out, except that all my recent memory work made our baseline way better.

I also noticed we had a rogue Sidekiq process which was quite old, with runaway memory usage of 2GB.

I did notice this fairly recent report on Google Groups about multithreading issues with pg.

Our web workers use 5 threads, which could be triggering some of this.

My plan is:

  1. Downgrade to the old “good” version of pg (done). This unfortunately means this issue is back.
  2. Amend the internal logic in unicorn so we do not run 5 threads and instead do everything from the master thread.
  3. Create a standalone app that reproduces the memory issue under the latest pg gem and report it to the pg project (a sketch of such a reproduction is below this list).
  4. Work with the pg authors to resolve it, so we can upgrade to the latest version again.
  5. Deploy extensive memory profiling to our internal infrastructure (in progress) so we can catch this in the future.
  6. Work on cutting down the Redis memory requirement, which is quite high now.
  7. Consider building protection against rogue memory usage into our base image.
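
For step 3, a standalone reproduction might look roughly like the sketch below. This is only an illustration of the approach (hammer the pg gem from several threads, mirroring our 5-thread web workers, and watch whether RSS climbs while Ruby-level object counts stay flat); the database name and query are placeholders, not the actual repro script.

```ruby
require 'pg'

THREADS    = 5
ITERATIONS = 10_000

# Read the current resident set size of this process from /proc.
def rss_kb
  File.read("/proc/#{Process.pid}/status")[/VmRSS:\s+(\d+) kB/, 1].to_i
end

puts "start RSS: #{rss_kb} kB"

THREADS.times.map {
  Thread.new do
    conn = PG.connect(dbname: 'discourse_test')       # placeholder database
    ITERATIONS.times { conn.exec("SELECT repeat('x', 10000)") }
    conn.close
  end
}.each(&:join)

GC.start
puts "end RSS:   #{rss_kb} kB, live objects: " \
     "#{ObjectSpace.count_objects[:TOTAL] - ObjectSpace.count_objects[:FREE]}"
```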

Updating as per the public notice, keeping a log here as with previous activity.

Before: https://gist.github.com/DeanMarkTaylor/a3d3c36658bd95807c0f
System Update: https://gist.github.com/DeanMarkTaylor/09897ff1e207809e159c
After: https://gist.github.com/DeanMarkTaylor/ba150541457bf60a0cdc

Noting that during the system update the following message was displayed:

The following packages have been kept back:
  lxc-docker

Can you give me a specific process to solve this problem? I am not familiar with Ruby on Rails, and this problem keeps getting worse.

My site is brand new, with only 1 user and fewer than 100 posts. I followed the tutorial on discourse.org to install on a DigitalOcean 1GB plan and did nothing more, but my memory usage is super high.

Did you rebuild the container to get to 1.2.0.beta8?


Another update on this issue.

I have been working pretty relentlessly on this issue. On Friday I completed step #5 of my 7-step plan, which gives me great visibility into memory usage across our enterprise.

https://github.com/discourse/discourse_docker/commit/03b50438d73dbe6076a5a4179e336afaef2b28c2

I noticed that despite all efforts, memory was still climbing. It was climbing even on containers that are pretty much inactive at the moment (brand-new customers).

Having this kind of information is a godsend; it allows one to test various theories.

I spent a bit of time thinking about the trend in the graph. It is constantly going up and totally unrelated to traffic. This ruled out pg and redis as prime candidates (though clearly anything is possible), which left me looking at other C extensions.

Previous profiling had already excluded a managed leak: the number of objects in the Ruby heap was simply not growing, and the number of large objects was not growing either.
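
For illustration, one way to make that distinction (not necessarily the exact profiling used) is to sample the process RSS alongside Ruby-level object counts; if RSS climbs while the live object count stays flat, the growth is coming from native memory in C extensions:

```ruby
# Compare RSS growth against the count of live Ruby objects over a window.
def memory_sample
  counts = ObjectSpace.count_objects
  {
    rss_kb:    File.read("/proc/#{Process.pid}/status")[/VmRSS:\s+(\d+) kB/, 1].to_i,
    live_objs: counts[:TOTAL] - counts[:FREE],
    gc_runs:   GC.count
  }
end

before = memory_sample
# ... let the worker serve traffic for a few hours ...
after  = memory_sample
puts "RSS grew #{after[:rss_kb] - before[:rss_kb]} kB, " \
     "live objects grew #{after[:live_objs] - before[:live_objs]}"
```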

So I thought about message_bus and the way it relies on EventMachine for 30-second polling and other bits and pieces.

I remembered I upgraded EventMachine recently.

https://github.com/discourse/discourse/commit/d1dd0d888a950d6121afdb764aeeaaa35757ede7#diff-e79a60dc6b85309ae70a6ea8261eaf95

The funny thing is that the commit was all about limiting memory growth.

Anyway, it appears there was a memory leak in the EventMachine gem, and a fix was recently merged in by @tmm1.

So, I went back to that container set and upgraded one of the containers in the set to the latest version of EventMachine last night, just before I went to sleep.

In the morning I could see this picture:

So I am very cautiously optimistic here. I applied the fix to our code:

https://github.com/discourse/discourse/commit/43375c8b15f95ac3eb4a797b6a99d20f354cc1e6

We are now deploying this to all our customers, and then I will be watching memory for the next 24 hours.

If all looks good we will push a new beta tomorrow. If not, well, I will continue hunting this down.

EDIT: a few hours later, this is looking like the real deal across our entire infrastructure.


Confirmed that this plugged the leak - the break in the graph is when I updated:


(this is just a log of activity on my instance as I have been reporting in this thread)

###Prior to @sam’s EventMachine update:
I noted the ImageMagick message in admin today; attempting to check the memory stats returned blank pages, although the forum was responsive.

I thought I would take this opportunity to update to the latest version, which includes the EventMachine fix.

My instance had been running for approx. 71 hours before I noticed the memory issue again (between approx. 2015-02-13T16:58:24 and 2015-02-16T16:01:01).

###Instance Updated
The instance is now updated to:
Discourse 1.2.0.beta8 - version 3cad4824d74e8ff3cfdc20d79f300904d73cd73e

Here are the usual before and afters:
Before: https://gist.github.com/DeanMarkTaylor/9ceabc4ff4960f847570
After: https://gist.github.com/DeanMarkTaylor/face341110492c4552ec

This is now resolved, confirmed via multiple sources.

If you are having trouble updating to the latest version:

cd /var/discourse
git pull
./launcher rebuild app