My forum is crashing every two hours

(katherine) #1

I have a self-hosted Discourse on a DO droplet that ran smoothly for the past 6 months, but today it has been crashing every two hours or so. Everything is up to date as far as I know, so I’m not really sure how best to go about troubleshooting. Any help would be greatly, greatly appreciated!!

I haven’t gotten any errors in my site logs since early this morning, even though the forum has crashed a few times since then.

I did get this early this morning: Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

with backtrace:

    /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.3/lib/redis/client.rb:121:in `call'
    /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.3/lib/redis.rb:494:in `block in del'
    /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.3/lib/redis.rb:58:in `block in synchronize'
    /usr/local/lib/ruby/2.3.0/monitor.rb:214:in `mon_synchronize'
    /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.3/lib/redis.rb:58:in `synchronize'
    /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.3/lib/redis.rb:493:in `del'
    /var/www/discourse/lib/discourse_redis.rb:192:in `block in del'
    /var/www/discourse/lib/discourse_redis.rb:146:in `ignore_readonly'
    /var/www/discourse/lib/discourse_redis.rb:190:in `del'
    /var/www/discourse/lib/distributed_mutex.rb:24:in `ensure in synchronize'
    /var/www/discourse/lib/distributed_mutex.rb:25:in `synchronize'
    /var/www/discourse/lib/scheduler/manager.rb:294:in `lock'
    /var/www/discourse/lib/scheduler/manager.rb:247:in `tick'
    /var/www/discourse/config/initializers/100-sidekiq.rb:35:in `block (2 levels) in <top (required)>'

Redis logs (the container log also includes PostgreSQL entries):

LOG:  duration: 787.213 ms  execute <unnamed>: SELECT COUNT(*) AS count_all, DATE(created_at) AS date_created_at FROM "email_logs" WHERE "email_logs"."skipped" = 'f' AND (created_at BETWEEN '2017-02-16 00:00:00.000000' AND '2017-03-16 23:59:59.999999') GROUP BY DATE(created_at)  ORDER BY DATE(created_at)
2017-03-16 17:54:20 UTC [397-2] discourse@discourse LOG:  duration: 132.612 ms  execute <unnamed>: SELECT COUNT(*) FROM "email_logs"
2017-03-16 17:56:14 UTC [127-1] discourse@discourse LOG:  duration: 131.163 ms  statement: UPDATE posts
	                SET avg_time = (x.gmean / 1000)
	                FROM (SELECT post_timings.topic_id,
	                             round(exp(avg(ln(msecs)))) AS gmean
	                      FROM post_timings
	                      INNER JOIN posts AS p2
	                        ON p2.post_number = post_timings.post_number
	                          AND p2.topic_id = post_timings.topic_id
	                          AND p2.user_id <> post_timings.user_id
	                      GROUP BY post_timings.topic_id, post_timings.post_number) AS x
	                WHERE (x.topic_id = posts.topic_id
	                  AND x.post_number = posts.post_number
	                  AND (posts.avg_time <> (x.gmean / 1000)::int OR posts.avg_time IS NULL)) AND (posts.topic_id IN (SELECT id FROM topics where bumped_at > '2017-03-14 17:56:14.249910'))
50:M 16 Mar 17:58:33.081 * 10 changes in 300 seconds. Saving...
50:M 16 Mar 17:58:33.093 * Background saving started by pid 4883
4883:C 16 Mar 17:58:35.217 * DB saved on disk
4883:C 16 Mar 17:58:35.229 * RDB: 44 MB of memory used by copy-on-write
50:M 16 Mar 17:58:35.326 * Background saving terminated with success
50:M 16 Mar 18:03:36.028 * 10 changes in 300 seconds. Saving...
50:M 16 Mar 18:03:36.033 * Background saving started by pid 5220
5220:C 16 Mar 18:03:39.632 * DB saved on disk
5220:C 16 Mar 18:03:39.636 * RDB: 43 MB of memory used by copy-on-write
50:M 16 Mar 18:03:39.736 * Background saving terminated with success

(Jeff Atwood) #2

Are you on latest? I would update to latest.

And you followed our official install guide? Any third party plugins? Do you have swap configured?

(katherine) #3

I am on latest.
I did follow the official install.
No third-party plugins.
Yes, I have swap configured.

I’m also getting the same error on some rejected emails, but I’m not sure how that could be connected.

(Rafael dos Santos Silva) #4

Please give the output of the following commands:

df -h
free -m
docker info

(Jay Pfaffman) #5

Every time I see this I get that much closer to creating a ./help script that’ll delete old docker and apt images and then print the output of these commands.
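
A rough sketch of the read-only half of such a script, assuming Python 3 is available on the host. It only runs the three commands requested above and collects their output; the cleanup step is deliberately left out, and the names `collect_diagnostics` and `format_report` are made up for illustration.

```python
import shutil
import subprocess

# The three commands requested earlier in this thread.
DIAGNOSTIC_COMMANDS = (
    ("df", "-h"),
    ("free", "-m"),
    ("docker", "info"),
)


def collect_diagnostics(commands=DIAGNOSTIC_COMMANDS):
    """Run each command and return a dict mapping its command line to its output."""
    report = {}
    for argv in commands:
        label = " ".join(argv)
        # Skip commands that are not installed instead of raising an exception.
        if shutil.which(argv[0]) is None:
            report[label] = "(command not found)"
            continue
        result = subprocess.run(list(argv), capture_output=True, text=True, check=False)
        # Keep stderr too, since warnings are often printed there.
        report[label] = result.stdout + result.stderr
    return report


def format_report(report):
    """Format the report as 'output of <command> -->' sections, ready to paste."""
    sections = [
        f"output of {label} -->\n\n{text.rstrip()}" for label, text in report.items()
    ]
    return "\n\n".join(sections)
```

Running `print(format_report(collect_diagnostics()))` on the server should produce one section per command that can be pasted straight into a reply.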

(katherine) #6

output of df -h -->

Filesystem      Size  Used Avail Use% Mounted on
udev            981M     0  981M   0% /dev
tmpfs           201M  3.3M  197M   2% /run
/dev/vda1        40G   13G   26G  33% /
tmpfs          1001M  744K 1000M   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs          1001M     0 1001M   0% /sys/fs/cgroup
none             40G   13G   26G  33% /var/lib/docker/aufs/mnt/0423714fa51102bf1a71981127dfb7d9118e9bb4712d0fcaec8491ba6d36f657
shm              64M     0   64M   0% /var/lib/docker/containers/de2f8f4514b97c386a8014e087deee693839c35b441c3d24e2fd6935af1f5f63/shm
none             40G   13G   26G  33% /var/lib/docker/aufs/mnt/0b46972ec34b3e6df9a8f8dae96d38791aca450723d07c830e5f26d72e352fb2
shm              64M  4.0K   64M   1% /var/lib/docker/containers/5e69613f388aee058b87e8c59c422e1e9b3cf182399ac97f5fd2d4461b3dbe49/shm
tmpfs           201M     0  201M   0% /run/user/0

output of free -m -->

              total        used        free      shared  buff/cache   available
Mem:           2000        1639          78          68         282         126
Swap:          2047          76        1971

output of docker info -->

Containers: 2
 Running: 2
 Paused: 0
 Stopped: 0
Images: 5
Server Version: 17.03.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 23
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 977c511eda0925a723debdc94d09459af49d082a
runc version: a01dafd48bc1c7cc12bdb01206f9fea7dd6feb70
init version: 949e6fa
Security Options:
  Profile: default
Kernel Version: 4.4.0-66-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.953 GiB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
WARNING: No swap limit support
Experimental: false
Insecure Registries:
Live Restore Enabled: false

(Rafael dos Santos Silva) #7

Hmmm everything looks good :thinking:

Please check whether there’s a giant Sidekiq queue.

PS: Nice to see a Brazilian Discourse about such a cool topic!!

(Matt Palmer) #8

Check the Redis container logs (./launcher logs app by default) for why Redis is failing to save. It should be reporting all sorts of errors if an RDB save failed.
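
A minimal sketch of how that scan could be automated, assuming the container log has been saved to a text file and Python 3 is at hand. Redis marks each log line with a one-character level after the timestamp (`*` for notices, as in the lines posted above, and `#` for warnings), so the hypothetical helper `redis_warnings` below keeps only the `#` lines. The function name and the regular expression are illustrative and not part of Discourse or Redis.

```python
import re

# Redis log lines look like "50:M 16 Mar 17:58:33.081 * message":
# pid, role letter, timestamp, a one-character level marker, then the message.
# Newer Redis versions also print the year, so it is matched optionally.
REDIS_LINE = re.compile(
    r"^(?P<pid>\d+):(?P<role>[A-Z]) "
    r"(?P<when>\d{1,2} \w{3} (?:\d{4} )?\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[.\-*#]) (?P<message>.*)$"
)


def redis_warnings(log_text):
    """Return (timestamp, message) pairs for Redis lines logged at warning level."""
    hits = []
    for line in log_text.splitlines():
        match = REDIS_LINE.match(line.strip())
        # "#" is the warning marker; PostgreSQL lines never match the pattern.
        if match and match.group("level") == "#":
            hits.append((match.group("when"), match.group("message")))
    return hits
```

Feeding it the saved output of `./launcher logs app` would list any warnings with their timestamps. The successful saves shown earlier in the thread are all `*` lines, so they would be skipped.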

(katherine) #9

Doesn’t seem like there’s anything in the sidekiq queue… hmm

P.S. >> Thank you!! We like it :slight_smile:

(katherine) #10

Checked them and I can’t seem to find anything that looks like an error since my last rebuild… maybe it’s fixed? Really strange though, because I don’t have any explanation for why this happened.

(Matt Palmer) #11

The rebuild would have nuked any previous logs that indicated the problem. Next time the problem happens, before you do a rebuild, capture the logs. There’ll definitely be something in there, almost certainly either a disk full or out-of-memory error (between them, they account for about 99.99% of RDB save fails).
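
One way to make that capture a habit is sketched below, assuming Python 3 is available on the host. The helper name `capture_logs` and the output file naming are hypothetical; the only command taken from this thread is `./launcher logs app`.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def capture_logs(command, dest_dir, cwd=None):
    """Run `command` and save its combined stdout/stderr to a timestamped file.

    Returns the path of the file that was written.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # A UTC timestamp in the file name keeps successive captures apart.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_path = dest / f"container-logs-{stamp}.txt"
    with out_path.open("wb") as fh:
        # stderr is merged into stdout so nothing the command prints is lost.
        subprocess.run(command, cwd=cwd, stdout=fh, stderr=subprocess.STDOUT, check=False)
    return out_path
```

For example, `capture_logs(["./launcher", "logs", "app"], "saved-logs", cwd="/var/discourse")` could be run before a rebuild. The `/var/discourse` path is an assumption about where the launcher lives, so adjust it to match your install.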

(katherine) #12

Ah okay, thanks that is very helpful!