Memory is running out and Discourse stops working

The big numbers don’t really help here; I need to do some heap dumping to isolate the culprit.

@sam, is there an app dump or similar you could add so we can get consistent dumps of data (in-memory heaps) in the form you want?

Working on some memory snapshotting stuff at the moment. We have /admin/memory_stats now, which I have been improving, but there is no cross-process version of it and no snapshotting yet.

It’s probably going to take me another day or so to get all the bits in.
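For anyone wanting to grab those stats from the shell rather than the browser, here is a rough sketch; it assumes an admin API key and that the admin route accepts the usual api_key/api_username query parameters, and the hostname is a placeholder:

# Fetch the quick report; append &full=1 for the heavier full report
curl -s "https://forum.example.com/admin/memory_stats?api_key=YOUR_KEY&api_username=YOUR_ADMIN_USERNAME"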


As you requested. I have excluded the output of redis-cli KEYS '*' | awk '{print $1}' | sort, because that was just pages of what looked like unimportant keys and I doubt it would fit in a post here.

root@forum:/# redis-cli INFO
# Server
redis_version:2.8.18
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:44fa2530a359c310
redis_mode:standalone
os:Linux 3.13.0-32-generic x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.8.2
process_id:30
run_id:6dc9fae563192bd69ec93e510f181d94294ea147
tcp_port:6379
uptime_in_seconds:19182
uptime_in_days:0
hz:10
lru_clock:14250637
config_file:/etc/redis/redis.conf

# Clients
connected_clients:29
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:1

# Memory
used_memory:386369688
used_memory_human:368.47M
used_memory_rss:413036544
used_memory_peak:398744176
used_memory_peak_human:380.27M
used_memory_lua:35840
mem_fragmentation_ratio:1.07
mem_allocator:jemalloc-3.6.0

# Persistence
loading:0
rdb_changes_since_last_save:1189
rdb_bgsave_in_progress:0
rdb_last_save_time:1423536511
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:4
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok

# Stats
total_connections_received:37
total_commands_processed:180994
instantaneous_ops_per_sec:6
total_net_input_bytes:29630357
total_net_output_bytes:71038502
instantaneous_input_kbps:48.32
instantaneous_output_kbps:0.08
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:708
evicted_keys:0
keyspace_hits:94843
keyspace_misses:4208
pubsub_channels:1
pubsub_patterns:0
latest_fork_usec:29471

# Replication
role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:38.97
used_cpu_user:40.13
used_cpu_sys_children:44.74
used_cpu_user_children:217.81

# Keyspace
db0:keys=164423,expires=2466,avg_ttl=6070294389
root@forum:/# redis-cli KEYS '*' | awk '{print $1}' | wc -l
164424
root@forum:/#
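Side note on measuring this: the db0 line above already implies roughly 162,000 non-expiring keys (keys minus expires), but here is a sketch for cross-checking that without the blocking KEYS command. It is slow, since it shells out once per key, but safe against a live server:

# Count keys that have no expiry; redis-cli --scan uses SCAN under the hood,
# and TTL returns -1 for a key that exists but has no expiry set
redis-cli --scan | while read -r key; do
  [ "$(redis-cli TTL "$key")" = "-1" ] && echo "$key"
done | wc -l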

For comparison, mine shows:

db0:keys=758,expires=270,avg_ttl=127271656996

That’s a lot more keys that don’t expire on yours than on mine…


Perhaps because I have a few more topics/posts?

Very likely, I think we have a couple of redis keys per topic for view tracking etc.
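One way to see which kind of key dominates is to group key names by their leading segment. A minimal sketch, assuming the keys use ':' or '_' style separators (which may not hold for every key Discourse writes):

# Show the 20 most common key prefixes and how many keys each accounts for
redis-cli --scan | awk -F'[:_]' '{print $1}' | sort | uniq -c | sort -rn | head -20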

I am pretty sure this will fix the issue:

https://github.com/discourse/discourse/commit/acda6ebd605d36aa5a265fe36f3cb023bf438db0

Please confirm.

Note:

I will be adding a memory governor to unicorn to ensure that leaking processes get cycled. I do notice that after a month of uptime, sidekiq processes can often climb to 1GB of RSS; the governor will force a cycle once a process has more than 30 minutes of uptime and more than 400MB of RSS (configurable).
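A minimal sketch for eyeballing that RSS growth from inside the container (ps reports RSS in kilobytes, so ~400MB shows up as ~409600):

# Per-process RSS and uptime for unicorn/sidekiq; the bracketed first letter
# keeps grep from matching its own process
ps -eo pid,rss,etime,args | grep -E '[u]nicorn|[s]idekiq'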


So this is a new bug related to the additional view tracking data you added?

Yes, it’s related to view tracking.

Thanks @sam!

So I have run the rebuild from the command line:

root@forum:/var/discourse# ./launcher rebuild app
WARNING: No swap limit support
Updating discourse docker
remote: Counting objects: 11, done.
...

~Build Start: 2015-02-10T07:37:02.391921
~Build End: 2015-02-10T07:48:42.554345

Current system status

  System information as of Mon Feb  9 21:50:43 EST 2015

  System load:  0.1                Processes:              121
  Usage of /:   63.1% of 39.25GB   Users logged in:        1
  Memory usage: 78%                IP address for eth0:    
  Swap usage:   2%                 IP address for docker0: 

Current version: Discourse 1.2.0.beta6 - https://github.com/discourse/discourse version 0ce6524153b7e5545a834d2edf1a6b0de1cdcaef

Now we play the waiting game.

Memory stats for reference; they might be useful for comparison against the non-working state if it occurs again.
(I was unable to run with ?full=1 as it ends up reporting Bad Gateway.)

forum pid:128 unicorn worker -E production -c config/unicorn.conf.rb

GC STATS:
count: 51
heap_used: 2657
heap_length: 2937
heap_increment: 280
heap_live_num: 570066
heap_free_num: 510811
heap_final_num: 0
total_allocated_object: 5559315
total_freed_object: 4989249

Objects:
TOTAL: 1080875
FREE: 510762
T_STRING: 295318
T_ARRAY: 72002
T_DATA: 66176
T_OBJECT: 47373
T_NODE: 41502
T_HASH: 25286
T_CLASS: 11559
T_REGEXP: 3680
T_ICLASS: 3382
T_MODULE: 1658
T_STRUCT: 1234
T_RATIONAL: 892
T_MATCH: 15
T_BIGNUM: 14
T_FLOAT: 11
T_FILE: 10
T_COMPLEX: 1

Process Info:
Name:	ruby
State:	S (sleeping)
Tgid:	128
Ngid:	0
Pid:	128
PPid:	66
TracerPid:	0
Uid:	1000	1000	1000	1000
Gid:	33	33	33	33
FDSize:	64
Groups:	33 
VmPeak:	 1044208 kB
VmSize:	 1044204 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	  226068 kB
VmRSS:	  226068 kB
VmData:	  822488 kB
VmStk:	     136 kB
VmExe:	    2736 kB
VmLib:	   25964 kB
VmPTE:	     996 kB
VmSwap:	       0 kB
Threads:	7
SigQ:	0/15845
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000008301801
SigCgt:	0000000182006646
CapInh:	00000000a80425fb
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	00000000a80425fb
Seccomp:	0
Cpus_allowed:	3
Cpus_allowed_list:	0-1
Mems_allowed:	00000000,00000001
Mems_allowed_list:	0
voluntary_ctxt_switches:	1080
nonvoluntary_ctxt_switches:	2636


Classes:
Class report omitted use ?full=1 to include it
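For reference, the GC and object counts above line up with Ruby’s GC.stat and ObjectSpace.count_objects. A sketch of pulling the same numbers by hand, assuming a Docker version new enough for exec and the stock discourse_docker layout:

# Run a one-off Rails runner inside the app container as the discourse user
docker exec -it app su discourse -c \
  'cd /var/www/discourse && RAILS_ENV=production bundle exec rails runner "p GC.stat; p ObjectSpace.count_objects"'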

Looking through the code changes, the traffic-related changes occurred ~5-6 days ago:

However, if the following topic is related, it occurred ~10 days ago:
https://meta.discourse.org/t/memory-usage-and-blank-pages/24511

This is more of an FYI to make sure bases are covered.

There’s also an issue where we are being sloppy with Redis and storing:

  1. Way, way too many “permanent” keys
  2. Keys that are too large (on the key side, the key names are absurdly long)

So we need to fix that as well; I believe it is an independent issue.
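On the “absurdly long” point, a quick sketch for finding the worst offenders (prints the 10 longest key names with their lengths):

# Sort keys by name length, longest first
redis-cli --scan | awk '{ print length($0), $0 }' | sort -rn | head -10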


Just noticed this in admin on current build:

EDIT: Created separate bug:

FYI: this issue was originally reported/experienced by me on a build from ~6 days ago; I believe that was prior to the tracking being added?

Hey Sam, my memory usage dipped briefly after updating (assuming it’s because the server restarted), but it’s creeping upward at exactly the same rate as before (~1MB/min).
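For the record, a crude way to capture that creep over time, just a sketch (the interval and log path are arbitrary):

# Append a timestamped free(1) snapshot every minute for later comparison
while true; do
  { date '+%F %T'; free -m; echo; } >> /root/mem-watch.log
  sleep 60
done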

What does /admin/memory_stats show you?

I spent more time on this today and cut down our initial boot memory by 10%.

I am not noticing any extreme growth in the Ruby heaps; they seem fine, so I’m not seeing a leak there. I also added a large-object report to ?full=true so we can catch any objects that just keep growing.

Also RSS and Peak seem to be pretty stable.


Well done!

I mentioned before that I have been unable to run with ?full=1.

Would you like me to update now?
Or should I capture the current status first (what commands would you like me to run?) and then update?
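In the meantime, a sketch of the kind of pre-update snapshot that is usually worth keeping for before/after comparison (nothing beyond the commands already used in this topic plus ps/free):

# Capture memory basics before rebuilding
free -m > pre-update-free.txt
ps -eo pid,rss,etime,args --sort=-rss | head -20 > pre-update-ps.txt
redis-cli INFO memory > pre-update-redis.txt
redis-cli DBSIZE >> pre-update-redis.txt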