Redis connection timed out


(Jay Pfaffman) #1

I just rebuilt a site with 8GB RAM. Apparently Sidekiq had died a few days ago and is now trying to catch up on 500K queued jobs.

It runs for a while and then redis stops responding. Logs have errors like these:

Error connecting to Redis on localhost:6379 (Redis::TimeoutError) subscribe failed, reconnecting in 1 second. Call stack:

/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:345:in `rescue in establish_connection'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:331:in `establish_connection'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:101:in `block in connect'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:293:in `with_reconnect'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:100:in `connect'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:276:in `with_socket_timeout'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/client.rb:133:in `call_loop'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/subscribe.rb:43:in `subscription'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis/subscribe.rb:12:in `subscribe'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis.rb:2775:in `_subscription'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis.rb:2143:in `block in subscribe'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis.rb:58:in `block in synchronize'
/usr/local/lib/ruby/2.4.0/monitor.rb:214:in `mon_synchronize'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis.rb:58:in `synchronize'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.5/lib/redis.rb:2142:in `subscribe'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/message_bus-2.1.1/lib/message_bus/backends/redis.rb:336:in `global_subscribe'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/message_bus-2.1.1/lib/message_bus.rb:525:in `global_subscribe_thread'
/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/message_bus-2.1.1/lib/message_bus.rb:473:in `block in new_subscriber_thread'
Job exception: Connection timed out

Job exception: Connection timed out

Job exception: Error connecting to Redis on localhost:6379 (Redis::TimeoutError)

Job exception: Connection timed out

Job exception: Connection timed out

I’ve got this set now:

  db_shared_buffers: "2048MB"
  db_work_mem: "50MB"

I’m going to try doubling db_shared_buffers next. Edit: but that didn’t help either.

My multisite instance has also been having the database die several times a day. I thought it was coincident with backups, but I’m not convinced that’s true.

Has something changed about redis in the past couple weeks that could affect this?


Web interface timeouts (redis timeouts)
(Rafael dos Santos Silva) #2

Those are PostgreSQL settings, not Redis.

Is this site high traffic? What are the Redis memory size and ops per second under load?

I once found a self-hosted site failing hard to keep up because of the Redis persistence process; tweaking it made everything smooth again.


(Jay Pfaffman) #3

Thanks, @Falco.

That sounds like what I need to fix, but I know not of what you speak. It’s about 15K page views/day, but right now there are 400K sidekiq jobs clogging things up.

Uh, where do I find redis size and ops per second?


#4

I/O operations per second can be checked using the iostat command.
It’s unlikely to be installed by default, so you might have to install it first.
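For example, a minimal sketch (assuming Debian/Ubuntu, where iostat ships in the sysstat package, and assuming the busy device is vda, as in the output Jay posts later in this thread):

```shell
# iostat comes from the sysstat package, which is usually not installed
# by default; on Debian/Ubuntu:
#   sudo apt-get install -y sysstat
#
# tps_of is a hypothetical helper that picks the tps (transfers per
# second) column out of iostat's device lines; "vda" is an assumption.
tps_of() {
  awk '$1 == "vda" { print $2 }'
}

# On a host with sysstat installed you would run (uncomment):
# iostat 1 | tps_of
# Demo on a canned iostat device line:
printf 'vda 24.24 140.35 1943.41 132914772 1840484564\n' | tps_of   # → 24.24
```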


(Jay Pfaffman) #5
Linux 4.4.0-78-generic (site.com)   12/18/2017      _x86_64_        (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          18.90    0.01    6.54    0.13    1.11   73.31

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              24.24       140.35      1943.41  132914772 1840484564

#6

Could you please run the command below for about 10 seconds and then share the output?

iostat 1

We should be looking at the tps column.


(Jay Pfaffman) #7

thanks, @nixie!

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          44.00    0.00    6.50    0.25    0.50   48.75

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              28.00        16.00       248.00         16        248

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          34.94    0.00    4.81    0.25    0.76   59.24

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              27.00        16.00       316.00         16        316

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          31.14    0.00    4.81    0.25    0.51   63.29

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              22.00        24.00       256.00         24        256

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          32.66    0.00    5.32    0.00    0.25   61.77

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              36.00       104.00       264.00        104        264

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          42.32    0.00    6.55    0.25    0.50   50.38

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              22.00         8.00       196.00          8        196

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          32.33    0.00    4.76    0.25    0.50   62.16

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              22.00        24.00       208.00         24        208

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          35.19    0.00    6.84    0.00    1.01   56.96

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              19.00        16.00       192.00         16        192

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          34.78    0.00    3.58    0.00    0.00   61.64

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              28.00        48.00       208.00         48        208

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          41.94    0.00    5.37    0.77    0.51   51.41

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              70.00       464.00       220.00        464        220

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          51.75    0.00    9.50    0.25    1.00   37.50

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              53.47       229.70       150.50        232        152

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          38.01    0.00    6.38    2.55    1.53   51.53

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              26.26        64.65       185.86         64        184

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          36.00    0.00    6.75    0.00    2.75   54.50

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              19.80        23.76       312.87         24        316

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          42.24    0.00    6.36    0.00    1.53   49.87

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              10.00         0.00       104.00          0        104

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          33.17    0.00    5.45    0.25    1.98   59.16

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              19.00         0.00       412.00          0        412

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          31.47    0.00    6.09    0.25    0.51   61.68

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda              35.00         8.00       304.00          8        304

Sometimes it’s >200 and a few times it’s >2000.


(Matt Palmer) #8

Disk IOPS aren’t going to be relevant to Redis’ operation, because it’s an in-memory data store. What Falco’s asking for is Redis operations per second. From memory, redis-cli info stats from inside the container should do the trick.
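A minimal sketch of pulling that counter out, assuming the standard discourse_docker layout (./launcher enter app from /var/discourse gets you a shell in the container); the ops_per_sec helper is just an awk filter over INFO's key:value lines, not anything Redis ships:

```shell
# INFO output is "key:value" lines terminated with CRLF, so strip the
# carriage returns before matching on the key.
ops_per_sec() {
  tr -d '\r' | awk -F: '$1 == "instantaneous_ops_per_sec" { print $2 }'
}

# Against a live Redis, inside the container (uncomment):
# redis-cli info stats | ops_per_sec
# Demo on a canned INFO fragment:
printf 'total_commands_processed:433486\r\ninstantaneous_ops_per_sec:837\r\n' | ops_per_sec   # → 837
```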


(Jay Pfaffman) #9
# Stats
total_connections_received:56
total_commands_processed:433486
instantaneous_ops_per_sec:837
total_net_input_bytes:34247556
total_net_output_bytes:25502898
instantaneous_input_kbps:59.92
instantaneous_output_kbps:8.48
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:518
evicted_keys:0
keyspace_hits:265122
keyspace_misses:94064
pubsub_channels:1
pubsub_patterns:0
latest_fork_usec:9114
migrate_cached_sockets:0

Looks like it’s hanging in the 300-600 range, occasionally going over 1000. And the watched pot syndrome is in effect. This is the longest the site has stayed up.


(Jeff Atwood) #10

Could be, if a large amount of key-value churn is causing frequent dumps of the entire Redis datastore to disk.


(Matt Palmer) #11

Nope, because dumping the DB happens in a separate process from serving requests. Only CPU starvation and memory pressure (swapping) can cause Redis to slow down. Neither of those appear to be the case, based on the iostat data provided (CPU 50+% idle, IOPS fairly low).

That’s a respectable rate of operations, but it shouldn’t be enough to cause Redis to collapse under its own weight.

Are you collecting all this data while the system is misbehaving, or during correct operation? Stats while the system is running properly aren’t as likely to show problems.


(Jay Pfaffman) #12

Yes, this is while it’s working. I’ve got 1-second stats running now, so hopefully I can catch it misbehaving.

My multisite instance fairly reliably has the database die when I do a bootstrap of the web container while it’s rebuilding the CSS. It’s running in the tens of ops/sec, and once it dies, redis-cli hangs and won’t give me any stats. I’ll try to see what’s up with that next.

What can one tweak in the Redis config, and where?


(Matt Palmer) #13

Hmm, yeah, I guess once Redis stops talking to the world, you’re going to have a hard time connecting to it to get stats… although if you run redis-cli while everything is hunky-dory, then stay connected, you should be able to keep running info stats (assuming Redis hasn’t completely lost its tiny mind).

Another thing to check is the Redis logs; from memory, I think we’ve got Redis logging to Docker, so ./launcher logs app is what you want there. If Redis is really playing up hard, that should tell you what’s what.

I’d also like to definitively rule out swapping as a contributing factor; can you collect the output of free -m from just before and just after the problem occurs, and run sar -W 1 (on the host) before and after :fire:.
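A rough way to keep those samples flowing into a file so the readings just before a hang survive (the one-second interval, the swap.log name, and the stamp helper are all arbitrary choices here, not anything Discourse provides):

```shell
# stamp prefixes each input line with a timestamp so samples can be
# lined up against the crash time in the other logs.
stamp() {
  while IFS= read -r line; do
    printf '%s %s\n' "$(date '+%H:%M:%S')" "$line"
  done
}

# On the host (uncomment to run; sar -W needs the sysstat package):
# while true; do
#   free -m | awk '/^Swap:/ { print "swap_used_mb=" $3 }' | stamp
#   sleep 1
# done >> swap.log
# And in a second terminal: sar -W 1 >> sar.log
# Demo on a canned sample:
printf 'swap_used_mb=150\n' | stamp
```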


(Jay Pfaffman) #14

Well, I got the multisite instance to crash the database again during a rebuild when compiling css. sar showed a couple seconds of 30 swaps, but free still said only about 150M of the 2027M of swap available. ops/sec was less than 100 up until it died. I’m pretty stumped on this one. I keep thinking that it’s my imagination, but it always dies on a rebuild and most nights it crashes a couple times. It didn’t crash the one night I had all of the backups turned off.

Meanwhile, the big sidekiq backlog site keeps functioning.


(Matt Palmer) #15

Hi Jay,

The Redis::TimeoutError problems appear to be caused by a bug in Docker. I’ve just done a writeup on my findings from another misbehaving machine, which might be useful to you.


(Matt Palmer) #16

After further investigation, I’ve confirmed a bug in Docker, and it would be entirely capable of causing all of the seemingly unrelated symptoms you described. The linked post has more details and a rundown of how to downgrade Docker, which I’m pretty confident will solve all these problems.


(Allen) #17

Do you know if the Docker bug is still around in 18.03.1~ce-0~ubuntu? I am seeing the following and am wondering if this is the same issue again:

production.log
Unexpected error in Message Bus : Connection timed out
Job exception: Connection timed out

Failed to process job: Connection timed out

/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/connection/hiredis.rb:58:in `rescue in read'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/connection/hiredis.rb:53:in `read'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:260:in `block in read'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:248:in `io'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:259:in `read'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:118:in `block in call'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:229:in `block (2 levels) in process'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:366:in `ensure_connected'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:219:in `block in process'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:304:in `logging'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:218:in `process'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:118:in `call'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis.rb:2448:in `block in _eval'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis.rb:45:in `block in synchronize'
/usr/local/lib/ruby/2.5.0/monitor.rb:226:in `mon_synchronize'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis.rb:45:in `synchronize'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis.rb:2447:in `_eval'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/redis-4.0.1/lib/redis.rb:2499:in `evalsha'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/message_bus-2.1.5/lib/message_bus/backends/redis.rb:381:in `cached_eval'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/message_bus-2.1.5/lib/message_bus/backends/redis.rb:141:in `publish'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/message_bus-2.1.5/lib/message_bus.rb:253:in `publish'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/message_bus-2.1.5/lib/message_bus.rb:490:in `block in new_subscriber_thread'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/message_bus-2.1.5/lib/message_bus/timer_thread.rb:102:in `do_work'
/var/www/discourse/vendor/bundle/ruby/2.5.0/gems/message_bus-2.1.5/lib/message_bus/timer_thread.rb:30:in `block in initialize'
Unexpected error in Message Bus : Connection timed out
Job exception: Connection timed out

unicorn.stderr.log

E, [2018-06-27T11:55:52.507207 #31436] ERROR -- : master loop error: Connection timed out (Redis::TimeoutError)
E, [2018-06-27T11:55:52.507357 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/connection/hiredis.rb:58:in `rescue in read'
E, [2018-06-27T11:55:52.507423 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/connection/hiredis.rb:53:in `read'
E, [2018-06-27T11:55:52.507464 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:260:in `block in read'
E, [2018-06-27T11:55:52.507503 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:248:in `io'
E, [2018-06-27T11:55:52.507536 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:259:in `read'
E, [2018-06-27T11:55:52.507560 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:118:in `block in call'
E, [2018-06-27T11:55:52.507583 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:229:in `block (2 levels) in process'
E, [2018-06-27T11:55:52.507609 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:366:in `ensure_connected'
E, [2018-06-27T11:55:52.507630 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:219:in `block in process'
E, [2018-06-27T11:55:52.507655 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:304:in `logging'
E, [2018-06-27T11:55:52.507674 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:218:in `process'
E, [2018-06-27T11:55:52.507698 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis/client.rb:118:in `call'
E, [2018-06-27T11:55:52.507716 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis.rb:889:in `block in get'
E, [2018-06-27T11:55:52.507736 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis.rb:45:in `block in synchronize'
E, [2018-06-27T11:55:52.507753 #31436] ERROR -- : /usr/local/lib/ruby/2.4.0/monitor.rb:214:in `mon_synchronize'
E, [2018-06-27T11:55:52.507781 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis.rb:45:in `synchronize'
E, [2018-06-27T11:55:52.507804 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-4.0.1/lib/redis.rb:888:in `get'
E, [2018-06-27T11:55:52.507822 #31436] ERROR -- : /var/www/discourse/lib/discourse_redis.rb:188:in `block (3 levels) in <class:DiscourseRedis>'
E, [2018-06-27T11:55:52.507838 #31436] ERROR -- : /var/www/discourse/lib/discourse_redis.rb:153:in `ignore_readonly'
E, [2018-06-27T11:55:52.507855 #31436] ERROR -- : /var/www/discourse/lib/discourse_redis.rb:188:in `block (2 levels) in <class:DiscourseRedis>'
E, [2018-06-27T11:55:52.507876 #31436] ERROR -- : /var/www/discourse/app/jobs/regular/run_heartbeat.rb:15:in `last_heartbeat'
E, [2018-06-27T11:55:52.507896 #31436] ERROR -- : config/unicorn.conf.rb:172:in `check_sidekiq_heartbeat'
E, [2018-06-27T11:55:52.507915 #31436] ERROR -- : config/unicorn.conf.rb:199:in `master_sleep'
E, [2018-06-27T11:55:52.507933 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/unicorn-5.4.0/lib/unicorn/http_server.rb:294:in `join'
E, [2018-06-27T11:55:52.507954 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/gems/unicorn-5.4.0/bin/unicorn:126:in `<top (required)>'
E, [2018-06-27T11:55:52.507971 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/bin/unicorn:23:in `load'
E, [2018-06-27T11:55:52.507987 #31436] ERROR -- : /var/www/discourse/vendor/bundle/ruby/2.4.0/bin/unicorn:23:in `<main>'
Failed to report error: Connection timed out 3 Job exception: Connection timed out

apt-cache showpkg docker-ce shows me versions 17.09 and 17.12, but 17.10 is not available in my repo. Are you aware of any other version besides 17.10 that doesn’t have problems with Discourse? (assuming mine is the same problem)

Edit: Some more info: Discourse had been running rock-solid for a month, no problems. Yesterday evening I rebuilt the app and did an upgrade / dist-upgrade during a short maintenance window. 18 hours later the forum died. Host resources are fine.

Now I restored a snapshot from before the upgrade, and saw that I was running the exact same Docker version before the problems started. This pretty much excludes the Docker version as the culprit. The other variables are the app upgrade from 1.9 to 2.1 and the updated host OS. Still, from the logs, I don’t see how the app could cause such low-level problems (can’t read Redis). So what is going on here? Maybe some dependencies of Docker changed during the update and are causing the problem? Something inside the container going wrong? I am clueless…

Edit 2: This is what was going on from a resources point of view when it crashed:


But there wasn’t much going on in the forum at that point. Something spiked memory, though, albeit not into a critical range.


Regular Discourse crashes since update from 1.9 to 2.1
(Jay Pfaffman) #18

I’d recommend uninstalling the docker-ce that comes with Ubuntu and installing Docker this way:

   wget -qO- https://get.docker.com/ | sh

(Allen) #19

That’s how it was installed in the first place. But see the edit I made (after your post). I believe the Docker version can pretty much be excluded at this point (at least in my case)…


(Allen) #20

Ugh, the site crashed again! It has been crashing about every 18 hours since the update to 2.1, always because of Redis (the Redis logs show nothing abnormal). I even removed all plugins, and it’s still crashing. Does anyone have good tips on how I could investigate this?