连接外部 Redis 时出错

您好,社区

我们在 Kubernetes 中设置了 Discourse。我们正在运行一个包含 Discourse 容器镜像的部署,使用外部 Redis 和外部 PostgreSQL。
为了管理 Redis 高可用性,我们在由 Sentinel 管理的 StatefulSet 中部署了 3 个节点。此外,我们在 Discourse 和 Redis 之间添加了一个 HAProxy 部署,配置如下:

global
      log stdout format raw local0
      maxconn 102400
      maxsessrate 500
      maxconnrate 2000
      maxsslrate 2000

    resolvers mydns
      parse-resolv-conf
      hold valid 5s
      accepted_payload_size 8192

    defaults
      log global
      mode tcp
      timeout client 2h
      timeout connect 360s
      timeout server 2h
      default-server init-addr none
      retries 3

    frontend stats
        bind *:1024
        mode http
        http-request use-service prometheus-exporter if { path /metrics }
        stats enable
        stats uri /stats
        stats refresh 10s

    listen redis_master
      bind *:6379
      mode tcp

      option tcp-check
      tcp-check connect
      tcp-check send AUTH\ XXXXXXXXXXX\r\n

      tcp-check expect string +OK
      tcp-check send PING\r\n

      tcp-check expect string +PONG
      tcp-check send info\ replication\r\n

      tcp-check expect string role:master
      tcp-check send QUIT\r\n

      tcp-check expect string +OK
      server redis_node_0 discoursev2-redis-node-0.discoursev2-redis-headless.discoursev2.svc.cluster.local:6379 resolvers mydns check inter 1s fall 1 rise 1
      server redis_node_1 discoursev2-redis-node-1.discoursev2-redis-headless.discoursev2.svc.cluster.local:6379 resolvers mydns check inter 1s fall 1 rise 1
      server redis_node_2 discoursev2-redis-node-2.discoursev2-redis-headless.discoursev2.svc.cluster.local:6379 resolvers mydns check inter 1s fall 1 rise 1

    listen redis_slave
      bind *:6380
      mode tcp
      option tcp-check
      tcp-check connect
      tcp-check send AUTH\ cUvNwwd4ujTbqshTbYjeTTrX8tDlKAcYvvAUOGuk\r\n

      tcp-check expect string +OK
      tcp-check send PING\r\n

      tcp-check expect string +PONG
      tcp-check send QUIT\r\n

      tcp-check expect string +OK
      server redis_node_0 discoursev2-redis-node-0.discoursev2-redis-headless.discoursev2.svc.cluster.local:6379 resolvers mydns check inter 1s fall 1 rise 1
      server redis_node_1 discoursev2-redis-node-1.discoursev2-redis-headless.discoursev2.svc.cluster.local:6379 resolvers mydns check inter 1s fall 1 rise 1
      server redis_node_2 discoursev2-redis-node-2.discoursev2-redis-headless.discoursev2.svc.cluster.local:6379 resolvers mydns check inter 1s fall 1 rise 1

我们遇到的问题是,每次我们对 Discourse 部署进行滚动更新时,Unicorn 线程都无法连接到 Redis。它们会重试连接,过一段时间后,所有线程最终都能连接上。以下是我们日志中看到的一些错误:

/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/client.rb:398:in `rescue in establish_connection'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/unicorn-6.1.0/lib/unicorn/http_server.rb:684:in `init_worker_process'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/unicorn-6.1.0/lib/unicorn/http_server.rb:547:in `spawn_missing_workers'
config/unicorn.conf.rb:299:in `block in reload'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/unicorn-6.1.0/lib/unicorn/http_server.rb:721:in `worker_loop'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/client.rb:379:in `establish_connection'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/client.rb:115:in `block in connect'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/client.rb:295:in `reconnect'
/var/www/discourse/lib/discourse.rb:906:in `after_fork'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/unicorn-6.1.0/bin/unicorn:128:in `<top (required)>'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/unicorn-6.1.0/bin/unicorn:25:in `<main>'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/client.rb:344:in `with_reconnect'
E, [2024-07-24T15:42:22.524194 #531] ERROR -- : Error connecting to Redis on discoursev2-haproxy.discoursev2.svc.cluster.local:6379 (Socket::ResolutionError) (Redis::CannotConnectError)
/var/www/discourse/lib/discourse_redis.rb:217:in `reconnect'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/unicorn-6.1.0/bin/unicorn:25:in `load'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/client.rb:114:in `connect'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/unicorn-6.1.0/lib/unicorn/http_server.rb:143:in `start'
I, [2024-07-24T15:42:23.499254 #422]  INFO -- : Loading Sidekiq in process id 422
2024-07-24 15:42:23 +0000: Starting Prometheus global reporter pid: 439
Booting Sidekiq 6.5.12 with Sidekiq::RedisConnection::RedisAdapter options {:host=>"discoursev2-haproxy.discoursev2.svc.cluster.local", :port=>6379, :password=>"REDACTED", :namespace=>"sidekiq"}
E, [2024-07-24T15:42:23.530765 #810] ERROR -- : Error connecting to Redis on discoursev2-haproxy.discoursev2.svc.cluster.local:6379 (Socket::ResolutionError) (Redis::CannotConnectError)
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/client.rb:398:in `rescue in establish_connection'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/client.rb:115:in `block in connect'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/unicorn-6.1.0/lib/unicorn/http_server.rb:684:in `init_worker_process'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/unicorn-6.1.0/lib/unicorn/http_server.rb:547:in `spawn_missing_workers'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/unicorn-6.1.0/bin/unicorn:25:in `<main>'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/client.rb:344:in `with_reconnect'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/client.rb:295:in `reconnect'
config/unicorn.conf.rb:299:in `block in reload'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/client.rb:379:in `establish_connection'
/var/www/discourse/lib/discourse.rb:906:in `after_fork'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/unicorn-6.1.0/bin/unicorn:25:in `load'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/unicorn-6.1.0/bin/unicorn:128:in `<top (required)>'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/client.rb:114:in `connect'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/unicorn-6.1.0/lib/unicorn/http_server.rb:721:in `worker_loop'
/var/www/discourse/lib/discourse_redis.rb:217:in `reconnect'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/unicorn-6.1.0/lib/unicorn/http_server.rb:143:in `start'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/unicorn-6.1.0/lib/unicorn/http_server.rb:143:in `start'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/client.rb:344:in `with_reconnect'
config/unicorn.conf.rb:299:in `block in reload'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/client.rb:398:in `rescue in establish_connection'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/client.rb:379:in `establish_connection'
/var/www/discourse/vendor/bundle/ruby/3.3.0/gems/redis-4.8.1/lib/redis/client.rb:295:in `reconnect'
/var/www/discourse/lib/discourse_redis.rb:217:in `reconnect'
E, [2024-07-24T15:42:24.540440 #887] ERROR -- : Error connecting to Redis on discoursev2-haproxy.discoursev2.svc.cluster.local:6379 (Socket::ResolutionError) (Redis::CannotConnectError)

我们还尝试移除 HAProxy,直接指向 Redis 主节点,也观察到了同样的问题。此外,我们还禁用了 Discourse 和 Redis 之间的 Istio 网关。在这两种情况下,过一段时间后,所有 Unicorn 线程都能成功连接到 Redis。

有人能就这个问题提供一些指导吗?我们需要对 Redis 进行任何特殊的设置吗?这是我们当前的设置:

appendonly yes
appendfsync no
aof-rewrite-incremental-fsync yes
save ""

最后,我们想知道:

  1. Redis 高可用模式的推荐方法是什么?据我们所知,Redis 集群不受支持。
  2. 如果使用单节点 Redis 设置,在 Redis 丢失几秒钟的情况下会有什么后果?如果我们使用无状态 Redis,每次重启都会丢失所有信息,Discourse 会重新填充内容吗?

提前致谢。

您好,社区各位成员:

我想就我之前发布的关于在 Kubernetes 中设置 Discourse 的帖子进行跟进,我们在部署回滚期间遇到了 Unicorn 线程和 Redis 之间的连接问题。尽管尝试了各种配置,包括移除 HAProxy 和禁用 Istio 网关,但问题仍然存在。

我们非常感谢您能提供任何指导或建议来解决此问题。具体来说:

  1. 我们的设置是否需要任何特殊的 Redis 配置?
  2. 考虑到 Redis Cluster 不受支持,确保 Redis 高可用性的推荐方法是什么?
  3. 如果我们选择单节点 Redis 设置,短暂的 Redis 故障会产生什么影响?在所有数据在重启时丢失的无状态 Redis 环境中,Discourse 会重新填充数据吗?

在我们继续排查此问题时,您的见解将非常有价值。

提前感谢您的支持。