Losing redis connection

(Michael - DiscourseHosting.com) #1

Since 1.7 we sometimes see a process lose its Redis connection, and from that moment on it is unable to reconnect until it is restarted.

This is what we are seeing in the logs.

Failed to report error: no implicit conversion of nil into String 4 TypeError (no implicit conversion of nil into String) /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.1/lib/redis/connection/hiredis.rb:19:in 'connect' web-exception


Unexpected error while processing request: no implicit conversion of nil into String
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.1/lib/redis/connection/hiredis.rb:19:in `connect'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.1/lib/redis/connection/hiredis.rb:19:in `connect'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.1/lib/redis/client.rb:336:in `establish_connection'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.1/lib/redis/client.rb:101:in `block in connect'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.1/lib/redis/client.rb:293:in `with_reconnect'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.1/lib/redis/client.rb:100:in `connect'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.1/lib/redis/client.rb:364:in `ensure_connected'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.1/lib/redis/client.rb:221:in `block in process'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.1/lib/redis/client.rb:306:in `logging'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.1/lib/redis/client.rb:220:in `process'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.1/lib/redis/client.rb:120:in `call'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.1/lib/redis.rb:1794:in `block in zrangebyscore'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.1/lib/redis.rb:58:in `block in synchronize'
        /usr/local/rvm/rubies/ruby-2.3.1/lib/ruby/2.3.0/monitor.rb:214:in `mon_synchronize'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.1/lib/redis.rb:58:in `synchronize'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.1/lib/redis.rb:1793:in `zrangebyscore'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/message_bus-2.0.2/lib/message_bus/backends/redis.rb:195:in `backlog'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/message_bus-2.0.2/lib/message_bus.rb:291:in `backlog'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/message_bus-2.0.2/lib/message_bus/client.rb:122:in `block in backlog'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/message_bus-2.0.2/lib/message_bus/client.rb:120:in `each'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/message_bus-2.0.2/lib/message_bus/client.rb:120:in `backlog'
        /var/www/discourse/vendor/bundle/ruby/2.3.0/gems/message_bus-2.0.2/lib/message_bus/rack/middleware.rb:135:in `call'

Now there are no memory issues, no config file changes, and last but not least: redis is working perfectly fine. It can be accessed by the other running processes as well. This only happens to a single process on a machine.

Yes, unsupported install, but does anyone have any idea what might be causing this? It almost feels like a memory corruption issue. This is only happening on 3 or 4 of our servers.

(Régis Hanol) #2

Looks similar to

(Michael - DiscourseHosting.com) #3

Yes, that is very similar and maybe exactly the same thing.

(Sam Saffron) #4

Are you running the exact same version of Redis that the Discourse Docker image ships?

(Michael - DiscourseHosting.com) #5

Probably not, 3.0.6.
Which version does the Docker image ship with?

But to be honest, given the error, it looks like Discourse doesn’t even reach Redis: it can’t connect because one of the connection parameters somehow turns out to be empty (nil).
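For reference, this TypeError is what Ruby raises whenever nil is passed where a String is required, which would be consistent with a nil host or path reaching the connect call. A minimal sketch in plain Ruby, assuming a hypothetical `describe_target` helper (it is not part of the redis gem, and this is not the gem's actual connect code):

```ruby
# Hypothetical helper: builds a target label from a host and a port.
# It only illustrates how a nil host triggers the same TypeError.
def describe_target(host, port)
  "redis://" + host + ":" + port.to_s
end

# Returns the label, or a description of the TypeError if host is nil.
def try_target(host, port)
  describe_target(host, port)
rescue TypeError => e
  "TypeError: #{e.message}"
end

puts try_target("localhost", 6379)
puts try_target(nil, 6379)
```

The second call fails before any network activity happens, which matches the observation that the process never gets as far as talking to Redis.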

(Sam Saffron) #6

Yeah, our image is on 3.0.6 at the moment. Not sure what would be causing this issue.

(Matt Palmer) #7

IIRC, I’ve seen an error very much like this in a few cases where the site was misconfigured. The most vivid recollection is a situation where the master redis config wasn’t set, but the slave was, so failback attempts did not go well.
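A rough sketch of the kind of sanity check that would catch that misconfiguration up front. The settings hash and its key names (`:master_host`, `:slave_host`) are assumptions made for illustration, not Discourse's actual configuration API:

```ruby
# Hypothetical check: if a slave host is configured, the master host must be
# set explicitly too, otherwise a failback attempt has no host to connect to.
def redis_failover_problems(settings)
  problems = []
  slave_set  = !settings[:slave_host].to_s.empty?
  master_set = !settings[:master_host].to_s.empty?
  if slave_set && !master_set
    problems << "slave_host is set but master_host is empty"
  end
  problems
end

# Misconfigured: slave set, master missing.
puts redis_failover_problems({ master_host: nil, slave_host: "10.0.0.2" }).inspect
# Correctly configured: both set.
puts redis_failover_problems({ master_host: "10.0.0.1", slave_host: "10.0.0.2" }).inspect
```

Running a check like this at boot would surface the problem immediately, rather than in whichever single process first attempts a failback.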

(Michael - DiscourseHosting.com) #8

Yes, it does seem to be a misconfiguration.

But what is happening is that this suddenly occurs for one single process within a larger pool. All other processes stay happy and fine at that time.

Another thing is that it doesn’t even get to the point where it is making the connection. It complains that (read: crashes because) the connection parameters are not ok. The same parameters that it used to make a connection before.

(Matt Palmer) #9

Ayup, that’s the behaviour I’ve seen.

(Michael - DiscourseHosting.com) #10

You were correct @mpalmer, thank you for pointing me in the right direction.
It’s related to this issue (the fix is not present in 1.7.x).

(Matt Palmer) #11

Well then, one more reason to upgrade! Alternatively, from memory, that fix should be relatively easy for you to backport.

(Philip Colmer) #12

We won’t be, in our situation, because I’ve deployed Discourse on AWS and followed instructions in various blogs to “break out” the different components. So, in our case, redis is on an AWS ElastiCache service. The engine version compatibility is 3.2.4.

AWS is running this as a 3 node configuration and it is also reporting 1 shard.

I’m now beginning to think that, at least as far as Redis is concerned, maybe I should have left this inside the Docker image?