Could be, if a large amount of key-value churn is causing frequent dumps of the entire Redis datastore to disk.
Nope, because dumping the DB happens in a separate process from serving requests. Only CPU starvation and memory pressure (swapping) can cause Redis to slow down. Neither of those appears to be the case, based on the iostat data provided (CPU 50+% idle, IOPS fairly low).
That’s a respectable rate of operations, but it shouldn’t be enough to cause Redis to collapse under its own weight.
Are you collecting all this data while the system is misbehaving, or during correct operation? Stats while the system is running properly aren’t as likely to show problems.
Yes, this is while it’s working. I’ve got 1-second stats running now, so hopefully I can catch it misbehaving.
My multisite instance fairly reliably has the database die when I bootstrap the web container while it’s rebuilding the CSS. It’s running at tens of ops/sec, and once it dies, redis-cli hangs and won’t give me any stats. I’ll try to see what’s up with that next.
What can one tweak in the Redis config, and where?
Hmm, yeah, I guess once Redis stops talking to the world, you’re going to have a hard time connecting to it to get stats… although if you run `redis-cli` while everything is hunky-dory, then stay connected, you should be able to keep running `info stats` (assuming Redis hasn’t completely lost its tiny mind).
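If you want to keep a record of those numbers while you wait for it to misbehave, here’s a minimal Python sketch (stdlib only; `parse_redis_info` is just a name I made up) that turns the text `info stats` returns into a dict you can log:

```python
def parse_redis_info(raw):
    """Parse the key:value lines of a Redis INFO response into a dict.

    Section headers like "# Stats" and blank lines are skipped; all
    values are kept as strings, since INFO doesn't declare types.
    """
    stats = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        stats[key] = value
    return stats

# Example with a fragment of INFO output:
sample = """# Stats
total_connections_received:1420
instantaneous_ops_per_sec:42
"""
print(parse_redis_info(sample)["instantaneous_ops_per_sec"])  # prints 42
```

You could feed it the output of `redis-cli info stats` run from a script once a second and append the interesting fields to a file with a timestamp.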
Another thing to check is the Redis logs; from memory, I think we’ve got Redis logging to Docker, so `./launcher logs app` is what you want there. If Redis is really playing up hard, that should tell you what’s what.
I’d also like to definitively rule out swapping as a contributing factor; can you collect the output of `free -m` from just before and just after the problem occurs, and run `sar -W 1` (on the host) before and after?
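If catching it by hand is tricky, a throwaway loop like this (the log path is just an example) will capture `free -m` once a second, so the before/after snapshots are already on disk when it next falls over:

```shell
#!/bin/sh
# Append a timestamped "free -m" snapshot once per second.
# /tmp/redis-mem.log is an example path; use somewhere persistent.
while true; do
    { date; free -m; echo; } >> /tmp/redis-mem.log
    sleep 1
done
```

Run `sar -W 1 | tee /tmp/redis-swap.log` in a second terminal so both logs cover the crash window.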
Well, I got the multisite instance to crash the database again during a rebuild, while it was compiling the CSS.
sar showed a couple of seconds of 30 swaps, but free still said only about 150M of the 2027M of swap was used. ops/sec was less than 100 right up until it died. I’m pretty stumped on this one. I keep thinking it’s my imagination, but it always dies on a rebuild, and most nights it crashes a couple of times. It didn’t crash the one night I had all of the backups turned off.
Meanwhile, the big sidekiq backlog site keeps functioning.
Redis::TimeoutError problems appear to be caused by a bug in Docker. I’ve just done a writeup on my findings from another misbehaving machine, which might be useful to you.
After further investigation, I’ve confirmed a bug in Docker, and it would be entirely capable of causing all of the seemingly unrelated symptoms you described. The linked post has more details and a rundown of how to downgrade Docker, which I’m pretty confident will solve all these problems.
Do you know if the Docker bug is still around in 18.03.1~ce-0~ubuntu? I am seeing the following and am wondering if this is the same issue again:
```text
Unexpected error in Message Bus : Connection timed out
Job exception: Connection timed out
Failed to process job: Connection timed out
["/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/connection/hiredis.rb:58:in `rescue in read'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/connection/hiredis.rb:53:in `read'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:260:in `block in read'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:248:in `io'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:259:in `read'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:118:in `block in call'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:229:in `block (2 levels) in process'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:366:in `ensure_connected'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:219:in `block in process'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:304:in `logging'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:218:in `process'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:118:in `call'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis.rb:2448:in `block in _eval'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis.rb:45:in `block in synchronize'",
 "/usr/local/lib/ruby/2.5.0/monitor.rb:226:in `mon_synchronize'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis.rb:45:in `synchronize'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis.rb:2447:in `_eval'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis.rb:2499:in `evalsha'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/message_bus-2.1.5/lib/message_bus/backends/redis.rb:381:in `cached_eval'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/message_bus-2.1.5/lib/message_bus/backends/redis.rb:141:in `publish'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/message_bus-2.1.5/lib/message_bus.rb:253:in `publish'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/message_bus-2.1.5/lib/message_bus.rb:490:in `block in new_subscriber_thread'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/message_bus-2.1.5/lib/message_bus/timer_thread.rb:102:in `do_work'",
 "/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/message_bus-2.1.5/lib/message_bus/timer_thread.rb:30:in `block in initialize'"]
Unexpected error in Message Bus : Connection timed out
Job exception: Connection timed out
```

and:

```text
E, [2018-06-27T11:55:52.507207 #31436] ERROR -- : master loop error: Connection timed out (Redis::TimeoutError)
E, [2018-06-27T11:55:52.507357 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/connection/hiredis.rb:58:in `rescue in read'
E, [2018-06-27T11:55:52.507423 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/connection/hiredis.rb:53:in `read'
E, [2018-06-27T11:55:52.507464 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:260:in `block in read'
E, [2018-06-27T11:55:52.507503 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:248:in `io'
E, [2018-06-27T11:55:52.507536 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:259:in `read'
E, [2018-06-27T11:55:52.507560 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:118:in `block in call'
E, [2018-06-27T11:55:52.507583 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:229:in `block (2 levels) in process'
E, [2018-06-27T11:55:52.507609 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:366:in `ensure_connected'
E, [2018-06-27T11:55:52.507630 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:219:in `block in process'
E, [2018-06-27T11:55:52.507655 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:304:in `logging'
E, [2018-06-27T11:55:52.507674 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:218:in `process'
E, [2018-06-27T11:55:52.507698 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:118:in `call'
E, [2018-06-27T11:55:52.507716 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis.rb:889:in `block in get'
E, [2018-06-27T11:55:52.507736 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis.rb:45:in `block in synchronize'
E, [2018-06-27T11:55:52.507753 #31436] ERROR -- : /usr/local/lib/ruby/2.4.0/monitor.rb:214:in `mon_synchronize'
E, [2018-06-27T11:55:52.507781 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis.rb:45:in `synchronize'
E, [2018-06-27T11:55:52.507804 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis.rb:888:in `get'
E, [2018-06-27T11:55:52.507822 #31436] ERROR -- : /var/www/discourse/lib/discourse_redis.rb:188:in `block (3 levels) in <class:DiscourseRedis>'
E, [2018-06-27T11:55:52.507838 #31436] ERROR -- : /var/www/discourse/lib/discourse_redis.rb:153:in `ignore_readonly'
E, [2018-06-27T11:55:52.507855 #31436] ERROR -- : /var/www/discourse/lib/discourse_redis.rb:188:in `block (2 levels) in <class:DiscourseRedis>'
E, [2018-06-27T11:55:52.507876 #31436] ERROR -- : /var/www/discourse/app/jobs/regular/run_heartbeat.rb:15:in `last_heartbeat'
E, [2018-06-27T11:55:52.507896 #31436] ERROR -- : config/unicorn.conf.rb:172:in `check_sidekiq_heartbeat'
E, [2018-06-27T11:55:52.507915 #31436] ERROR -- : config/unicorn.conf.rb:199:in `master_sleep'
E, [2018-06-27T11:55:52.507933 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/unicorn-5.4.0/lib/unicorn/http_server.rb:294:in `join'
E, [2018-06-27T11:55:52.507954 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/unicorn-5.4.0/bin/unicorn:126:in `<top (required)>'
E, [2018-06-27T11:55:52.507971 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/bin/unicorn:23:in `load'
E, [2018-06-27T11:55:52.507987 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/bin/unicorn:23:in `<main>'
Failed to report error: Connection timed out 3 Job exception: Connection timed out
```
`apt-cache showpkg docker-ce` shows me versions 17.09 and 17.12, but 17.10 is not available in my repo. Are you aware of any other version besides 17.10 which does not have problems with Discourse? (Assuming mine is the same problem.)
Edit: Some more info: Discourse had been running rock-solid for a month, no problems. Yesterday evening I rebuilt the app and did an upgrade / dist-upgrade during a short maintenance window. 18 hours later the forum died. Host resources are fine.
Now I restored a snapshot from before the upgrade, and saw that I was running the exact same Docker version before the problems started. This pretty much excludes the Docker version as the culprit. The other variables are the app upgrade from 1.9 to 2.1 and the updated host OS. Still, from the logs, I don’t see how the app could cause such low-level problems (can’t read Redis). So what is going on here? Maybe some dependencies of Docker changed during the update, and they’re causing the problem? Something inside the container going wrong? I am clueless…
Edit 2: This is what was going on from a resources point of view when it crashed:
But there wasn’t much going on in the forum at that point. Something spiked memory, though, albeit not into a critical range.
I’d recommend uninstalling the docker-ce that comes with Ubuntu and installing Docker this way:

```shell
wget -qO- https://get.docker.com/ | sh
```
That’s how it was installed in the first place. But see the edit I made (after your post). I believe that the Docker version can nearly be excluded at this point (at least in my case)…
Ugh, site crashed again! It has been crashing about every 18 hours since the update to 2.1, always because of Redis (the Redis logs show nothing abnormal). I even removed all plugins, and it’s still crashing. Does anyone have any good tips on how I could investigate this?
Are you running out of memory?
According to monitoring, no (see chart above). I was also wondering whether Linux’s out-of-memory killer might be randomly killing processes, but I have no indication that this is the case. It is also interesting that the server typically runs for about 18 hours and then goes down (with Redis connection errors in the log well before that). Any idea how I could verify whether processes are being killed due to memory pressure?
oom-killer events appear in syslog: “Out of memory: Kill process BLAH BLAH”
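Something along these lines will surface any such events (log locations vary by distro; `/var/log/syslog` is the Ubuntu default):

```shell
#!/bin/sh
# Search the kernel ring buffer and syslog for OOM-killer activity.
# The pattern matches lines like:
#   Out of memory: Kill process 31436 (redis-server) score 900 ...
dmesg | grep -i "out of memory"
grep -i "out of memory" /var/log/syslog /var/log/syslog.1 2>/dev/null
```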
Thanks! Found nothing in syslog, so I guess I can check that off as well…
Was just reading through the official docker install guide, and found this statement regarding the convenience script that you suggested:
> Using these scripts is not recommended for production environments
Are you sure installing Docker with this script for a production forum is a good idea?
It’s not a good idea, but it is the least-worst idea.
I’m sure that I’ve done hundreds of installs, and the vast majority of them are working just fine.
If you know a better way to install Docker for your purposes, you should do that. I generally try to support people who know nothing about system administration. It makes no sense to me, but last I checked, the version of Docker installed by default on Ubuntu was wildly out of date.
I guess what I am asking is the following: the convenience script installs from the Edge channel, while Docker recommends the Stable channel for production. Have you found any specific advantages of running Edge as opposed to Stable in the context of Discourse? As I do most of my setups with Ansible anyway, I am not worried about convenience; I’m just looking for what works best.
A couple years back, the stable channel seemed not so useful, and far behind.
Perhaps it’s time to reconsider using the docker that’s in the default Ubuntu distribution. Perhaps when 18.04 seems safe. @mpalmer?