Redis connection timed out


(Jay Pfaffman) #1

I just rebuilt a site with 8GB RAM. Apparently Sidekiq had died a few days ago and is now trying to catch up on 500K queued jobs.

It runs for a while and then redis stops responding. Logs have errors like these:

Error connecting to Redis on localhost:6379 (Redis::TimeoutError) subscribe failed, reconnecting in 1 second. Call stack:

/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:345:in `rescue in establish_connection'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:331:in `establish_connection'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:101:in `block in connect'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:293:in `with_reconnect'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:100:in `connect'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:276:in `with_socket_timeout'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:133:in `call_loop'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/subscribe.rb:43:in `subscription'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/subscribe.rb:12:in `subscribe'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis.rb:2775:in `_subscription'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis.rb:2143:in `block in subscribe'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis.rb:58:in `block in synchronize'
/usr/local/lib/ruby/2.4.0/monitor.rb:214:in `mon_synchronize'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis.rb:58:in `synchronize'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis.rb:2142:in `subscribe'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/message_bus-2.1.1/lib/message_bus/backends/redis.rb:336:in `global_subscribe'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/message_bus-2.1.1/lib/message_bus.rb:525:in `global_subscribe_thread'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/message_bus-2.1.1/lib/message_bus.rb:473:in `block in new_subscriber_thread'
Job exception: Connection timed out

Job exception: Connection timed out

Job exception: Error connecting to Redis on localhost:6379 (Redis::TimeoutError)

Job exception: Connection timed out

Job exception: Connection timed out

I’ve got this set now:

  db_shared_buffers: "2048MB"
  db_work_mem: "50MB"

I’m going to try doubling db_shared_buffers next. Edit: but that didn’t help either.

My multisite instance has also been having the database die several times a day. I thought it was coincident with backups, but I’m not convinced that’s true.

Has something changed about redis in the past couple weeks that could affect this?


Web interface timeouts (redis timeouts)
(Rafael dos Santos Silva) #2

Those are PostgreSQL settings, not Redis.

Is this site high traffic? What are the Redis memory size and ops per second under load?

I once found a self-hosted site failing hard to keep up because of the Redis persistence process; tweaking it made everything smooth again.


(Jay Pfaffman) #3

Thanks, @Falco.

That sounds like what I need to fix, but I know not of what you speak. It’s about 15K page views/day, but right now there are 400K sidekiq jobs clogging things up.

Uh, where do I find redis size and ops per second?


#4

I/O operations per second can be checked using the iostat command.
It’s unlikely to be installed by default, so you might have to install it first.
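For example, a minimal sketch (assuming Debian/Ubuntu, where iostat ships in the sysstat package, and assuming the busy device is vda, as in the output Jay posts later in this thread):

```shell
# iostat comes from the sysstat package, which is usually not installed
# by default; on Debian/Ubuntu:
#   sudo apt-get install -y sysstat
#
# tps_of is a hypothetical helper that picks the tps (transfers per
# second) column out of iostat's device lines; "vda" is an assumption.
tps_of() {
  awk '$1 == "vda" { print $2 }'
}

# On a host with sysstat installed you would run (uncomment):
# iostat 1 | tps_of
# Demo on a canned iostat device line:
printf 'vda 24.24 140.35 1943.41 132914772 1840484564\n' | tps_of   # → 24.24
```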


(Jay Pfaffman) #5
Linux 4.4.0-78-generic (site.com)   12/18/2017      _x86_64_        (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          18.90    0.01    6.54    0.13    1.11   73.31

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              24.24       140.35      1943.41  132914772 1840484564

#6

Could you please run the command below for about 10 seconds and then share the output?

iostat 1

We should be looking at the tps column.


(Jay Pfaffman) #7

thanks, @nixie!

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          44.00    0.00    6.50    0.25    0.50   48.75

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              28.00        16.00       248.00         16        248

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          34.94    0.00    4.81    0.25    0.76   59.24

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              27.00        16.00       316.00         16        316

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          31.14    0.00    4.81    0.25    0.51   63.29

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              22.00        24.00       256.00         24        256

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          32.66    0.00    5.32    0.00    0.25   61.77

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              36.00       104.00       264.00        104        264

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          42.32    0.00    6.55    0.25    0.50   50.38

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              22.00         8.00       196.00          8        196

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          32.33    0.00    4.76    0.25    0.50   62.16

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              22.00        24.00       208.00         24        208

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          35.19    0.00    6.84    0.00    1.01   56.96

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              19.00        16.00       192.00         16        192

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          34.78    0.00    3.58    0.00    0.00   61.64

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              28.00        48.00       208.00         48        208

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          41.94    0.00    5.37    0.77    0.51   51.41

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              70.00       464.00       220.00        464        220

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          51.75    0.00    9.50    0.25    1.00   37.50

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              53.47       229.70       150.50        232        152

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          38.01    0.00    6.38    2.55    1.53   51.53

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              26.26        64.65       185.86         64        184

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          36.00    0.00    6.75    0.00    2.75   54.50

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              19.80        23.76       312.87         24        316

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          42.24    0.00    6.36    0.00    1.53   49.87

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              10.00         0.00       104.00          0        104

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          33.17    0.00    5.45    0.25    1.98   59.16

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              19.00         0.00       412.00          0        412

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          31.47    0.00    6.09    0.25    0.51   61.68

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              35.00         8.00       304.00          8        304

Sometimes it’s >200 and a few times it’s >2000.


(Matt Palmer) #8

Disk IOPS aren’t going to be relevant to Redis’ operation, because it’s an in-memory data store. What Falco’s asking for is Redis operations per second. From memory, redis-cli info stats from inside the container should do the trick.
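A minimal sketch of pulling that counter out, assuming the standard discourse_docker layout (./launcher enter app from /var/discourse gets you a shell in the container); the ops_per_sec helper is just an awk filter over INFO's key:value lines, not anything Redis ships:

```shell
# INFO output is "key:value" lines terminated with CRLF, so strip the
# carriage returns before matching on the key.
ops_per_sec() {
  tr -d '\r' | awk -F: '$1 == "instantaneous_ops_per_sec" { print $2 }'
}

# Against a live Redis, inside the container (uncomment):
# redis-cli info stats | ops_per_sec
# Demo on a canned INFO fragment:
printf 'total_commands_processed:433486\r\ninstantaneous_ops_per_sec:837\r\n' | ops_per_sec   # → 837
```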


(Jay Pfaffman) #9
# Stats
total_connections_received:56
total_commands_processed:433486
instantaneous_ops_per_sec:837
total_net_input_bytes:34247556
total_net_output_bytes:25502898
instantaneous_input_kbps:59.92
instantaneous_output_kbps:8.48
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:518
evicted_keys:0
keyspace_hits:265122
keyspace_misses:94064
pubsub_channels:1
pubsub_patterns:0
latest_fork_usec:9114
migrate_cached_sockets:0

Looks like it’s hanging in the 300-600 range, occasionally going over 1000. And the watched pot syndrome is in effect. This is the longest the site has stayed up.


(Jeff Atwood) #10

Could be, if a large amount of key-value churn is causing frequent dumps of the entire Redis datastore to disk.


(Matt Palmer) #11

Nope, because dumping the DB happens in a separate process from serving requests. Only CPU starvation and memory pressure (swapping) can cause Redis to slow down. Neither of those appear to be the case, based on the iostat data provided (CPU 50+% idle, IOPS fairly low).

That’s a respectable rate of operations, but it shouldn’t be enough to cause Redis to collapse under its own weight.

Are you collecting all this data while the system is misbehaving, or during correct operation? Stats while the system is running properly aren’t as likely to show problems.


(Jay Pfaffman) #12

Yes, this is while it’s working. I’ve got 1-second stats running now, so hopefully I can catch it misbehaving.

My multisite instance fairly reliably has the database die when I do a bootstrap of the web container while it’s rebuilding the CSS. It’s running in the tens of ops/sec, and once it dies, redis-cli hangs and won’t give me any stats. I’ll try to see what’s up with that next.

What can one tweak in the Redis config, and where?


(Matt Palmer) #13

Hmm, yeah, I guess once Redis stops talking to the world, you’re going to have a hard time connecting to it to get stats… although if you run redis-cli while everything is hunky-dory, then stay connected, you should be able to keep running info stats (assuming Redis hasn’t completely lost its tiny mind).

Another thing to check is the Redis logs; from memory, I think we’ve got Redis logging to Docker, so ./launcher logs app is what you want there. If Redis is really playing up hard, that should tell you what’s what.

I’d also like to definitively rule out swapping as a contributing factor; can you collect the output of free -m from just before and just after the problem occurs, and run sar -W 1 (on the host) before and after :fire:.
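A rough way to keep those samples flowing into a file so the readings just before a hang survive (the one-second interval, the swap.log name, and the stamp helper are all arbitrary choices here, not anything Discourse provides):

```shell
# stamp prefixes each input line with a timestamp so samples can be
# lined up against the crash time in the other logs.
stamp() {
  while IFS= read -r line; do
    printf '%s %s\n' "$(date '+%H:%M:%S')" "$line"
  done
}

# On the host (uncomment to run; sar -W needs the sysstat package):
# while true; do
#   free -m | awk '/^Swap:/ { print "swap_used_mb=" $3 }' | stamp
#   sleep 1
# done >> swap.log
# And in a second terminal: sar -W 1 >> sar.log
# Demo on a canned sample:
printf 'swap_used_mb=150\n' | stamp
```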


(Jay Pfaffman) #14

Well, I got the multisite instance to crash the database again during a rebuild when compiling css. sar showed a couple seconds of 30 swaps, but free still said only about 150M of the 2027M of swap available. ops/sec was less than 100 up until it died. I’m pretty stumped on this one. I keep thinking that it’s my imagination, but it always dies on a rebuild and most nights it crashes a couple times. It didn’t crash the one night I had all of the backups turned off.

Meanwhile, the big sidekiq backlog site keeps functioning.


(Matt Palmer) #15

Hi Jay,

The Redis::TimeoutError problems appear to be caused by a bug in Docker. I’ve just done a writeup on my findings from another misbehaving machine, which might be useful to you.


(Matt Palmer) #16

After further investigation, I’ve confirmed a bug in Docker, and it would be entirely capable of causing all of the seemingly unrelated symptoms you described. The linked post has more details and a rundown of how to downgrade Docker, which I’m pretty confident will solve all these problems.


(Allen) #17

Do you know if the Docker bug is still around in 18.03.1~ce-0~ubuntu? I am seeing the following and am wondering if this is the same issue again:

production.log
Unexpected error in Message Bus : Connection timed out
Job exception: Connection timed out

Failed to process job: Connection timed out

/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/connection/hiredis.rb:58:in `rescue in read'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/connection/hiredis.rb:53:in `read'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:260:in `block in read'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:248:in `io'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:259:in `read'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:118:in `block in call'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:229:in `block (2 levels) in process'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:366:in `ensure_connected'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:219:in `block in process'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:304:in `logging'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:218:in `process'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:118:in `call'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis.rb:2448:in `block in _eval'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis.rb:45:in `block in synchronize'
/usr/local/lib/ruby/2.5.0/monitor.rb:226:in `mon_synchronize'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis.rb:45:in `synchronize'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis.rb:2447:in `_eval'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis.rb:2499:in `evalsha'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/message_bus-2.1.5/lib/message_bus/backends/redis.rb:381:in `cached_eval'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/message_bus-2.1.5/lib/message_bus/backends/redis.rb:141:in `publish'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/message_bus-2.1.5/lib/message_bus.rb:253:in `publish'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/message_bus-2.1.5/lib/message_bus.rb:490:in `block in new_subscriber_thread'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/message_bus-2.1.5/lib/message_bus/timer_thread.rb:102:in `do_work'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/message_bus-2.1.5/lib/message_bus/timer_thread.rb:30:in `block in initialize'
Unexpected error in Message Bus : Connection timed out
Job exception: Connection timed out

unicorn.stderr.log

E, [2018-06-27T11:55:52.507207 #31436] ERROR -- : master loop error: Connection timed out (Redis::TimeoutError)
E, [2018-06-27T11:55:52.507357 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/connection/hiredis.rb:58:in `rescue in read'
E, [2018-06-27T11:55:52.507423 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/connection/hiredis.rb:53:in `read'
E, [2018-06-27T11:55:52.507464 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:260:in `block in read'
E, [2018-06-27T11:55:52.507503 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:248:in `io'
E, [2018-06-27T11:55:52.507536 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:259:in `read'
E, [2018-06-27T11:55:52.507560 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:118:in `block in call'
E, [2018-06-27T11:55:52.507583 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:229:in `block (2 levels) in process'
E, [2018-06-27T11:55:52.507609 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:366:in `ensure_connected'
E, [2018-06-27T11:55:52.507630 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:219:in `block in process'
E, [2018-06-27T11:55:52.507655 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:304:in `logging'
E, [2018-06-27T11:55:52.507674 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:218:in `process'
E, [2018-06-27T11:55:52.507698 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:118:in `call'
E, [2018-06-27T11:55:52.507716 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis.rb:889:in `block in get'
E, [2018-06-27T11:55:52.507736 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis.rb:45:in `block in synchronize'
E, [2018-06-27T11:55:52.507753 #31436] ERROR -- : /usr/local/lib/ruby/2.4.0/monitor.rb:214:in `mon_synchronize'
E, [2018-06-27T11:55:52.507781 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis.rb:45:in `synchronize'
E, [2018-06-27T11:55:52.507804 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis.rb:888:in `get'
E, [2018-06-27T11:55:52.507822 #31436] ERROR -- : /var/www/discourse/lib/discourse_redis.rb:188:in `block (3 levels) in <class:DiscourseRedis>'
E, [2018-06-27T11:55:52.507838 #31436] ERROR -- : /var/www/discourse/lib/discourse_redis.rb:153:in `ignore_readonly'
E, [2018-06-27T11:55:52.507855 #31436] ERROR -- : /var/www/discourse/lib/discourse_redis.rb:188:in `block (2 levels) in <class:DiscourseRedis>'
E, [2018-06-27T11:55:52.507876 #31436] ERROR -- : /var/www/discourse/app/jobs/regular/run_heartbeat.rb:15:in `last_heartbeat'
E, [2018-06-27T11:55:52.507896 #31436] ERROR -- : config/unicorn.conf.rb:172:in `check_sidekiq_heartbeat'
E, [2018-06-27T11:55:52.507915 #31436] ERROR -- : config/unicorn.conf.rb:199:in `master_sleep'
E, [2018-06-27T11:55:52.507933 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/unicorn-5.4.0/lib/unicorn/http_server.rb:294:in `join'
E, [2018-06-27T11:55:52.507954 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/unicorn-5.4.0/bin/unicorn:126:in `<top (required)>'
E, [2018-06-27T11:55:52.507971 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/bin/unicorn:23:in `load'
E, [2018-06-27T11:55:52.507987 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/bin/unicorn:23:in `<main>'
Failed to report error: Connection timed out 3 Job exception: Connection timed out

apt-cache showpkg docker-ce shows me versions 17.09 and 17.12, but 17.10 is not available in my repo. Are you aware of any other version besides 17.10 that doesn’t have problems with Discourse? (assuming mine is the same problem)

Edit: Some more info: Discourse had been running rock-solid for a month, no problems. Yesterday evening I rebuilt the app and did an upgrade / dist-upgrade during a short maintenance window. 18 hours later the forum died. Host resources are fine.

Now I restored a snapshot from before the upgrade, and saw that I was running the exact same Docker version before the problems started. This pretty much excludes the Docker version as the culprit. The other variables are the app upgrade from 1.9 to 2.1 and the updated host OS. Still, from the logs, I don’t see how the app could cause such low-level problems (can’t read Redis). So what is going on here? Maybe some dependencies of Docker changed during the update and are causing the problem? Something inside the container going wrong? I am clueless…

Edit 2: This is what was going on from a resources point of view when it crashed:


But there wasn’t much going on in the forum at that point. Something spiked memory, though, albeit not into a critical range.


Regular Discourse crashes since update from 1.9 to 2.1
(Jay Pfaffman) #18

I’d recommend uninstalling the docker-ce that comes with Ubuntu and installing Docker this way:

   wget -qO- https://get.docker.com/ | sh

(Allen) #19

That’s how it was installed in the first place. But see the edit I made (after your post). I believe the Docker version can pretty much be excluded at this point (at least in my case)…


(Allen) #20

Ugh, the site crashed again! It has been crashing about every 18 hours since the update to 2.1, always because of Redis (the Redis logs show nothing abnormal). I even removed all plugins, and it’s still crashing. Does anyone have good tips on how I could investigate this?