@sam has deployed a workaround in Discourse for this problem; if you rebuild with the latest discourse_docker
changes, redis and pg logs should go to files rather than to Docker's log stream, and the bug shouldn't(!) be triggered.
Once Docker releases a good version, should this file be removed?
Yep, once the bug is fixed the pin should be removed.
Docker is now on version 17.12.0-ce; maybe this is fixed?
Yep, looks like it might have been fixed accidentally. There’s no indication in the bug report that it was fixed deliberately.
We had that issue as well, and when freeing up some space didn’t fix it, I upgraded Docker to 17.12.0-ce. It worked for the next (almost) four days, then hung again after the backup, and it’s been doing that again for the last two days. See Neos Project if you’re interested.
Is this fixed for everybody else? Or do the issues continue for anyone?
I pinned several sites to 17.10 and upgraded another to 17.12 yesterday.
I’ve not had any problems since.
We’re going to need a lot more info if you’re after assistance tracking this down. Nature of the “hang”, logs of all shapes and sizes, that sort of thing.
I had the same experience you had and I’ve gone back to 17.10.
I know, but I was just curious whether people with any of the symptoms mentioned in this topic still had issues. I’m tired of being the only one whose problems never get solved while everybody else’s are.
Now, Discourse itself is probably reading this support forum and got scared, because since I posted here it hasn’t crashed for 120 hours. It seems 17.12 might indeed have fixed this.
What happened in those crashes: the Docker host is running fine and the app is up, but nginx returns a gateway timeout. As for logs: if there were anything interesting, I’d gladly share it, but there are no errors in the Discourse logs. That’s why I ended up here: the “Docker bug with long log lines” explanation, combined with “the backup does log long lines”, seemed like a bull’s-eye match.
There is an indication in the unicorn logs when the “long log lines” bug strikes: you end up with Redis::TimeoutError exceptions being raised. If you’re not seeing those, then it isn’t this bug, or you’re running an unfamiliar configuration which doesn’t present in the same manner.
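If you want to sanity-check that yourself, a minimal Ruby sketch along these lines will count those exceptions per day in the unicorn log (the log path below assumes a standard standalone install; adjust it to wherever your shared volume actually lives):

# Count Redis::TimeoutError lines per day in unicorn.stderr.log.
# The path is an assumption for a standard standalone install; adjust as needed.
log_path = "/var/discourse/shared/standalone/log/rails/unicorn.stderr.log"

counts = Hash.new(0)
File.foreach(log_path) do |line|
  next unless line.include?("Redis::TimeoutError")
  # Lines look like: E, [2018-01-02T03:43:02.934928 #386] ERROR -- : ...
  if (m = line.match(/\[(\d{4}-\d{2}-\d{2})T/))
    counts[m[1]] += 1
  end
end

counts.sort.each { |day, n| puts "#{day}: #{n} Redis::TimeoutError lines" }

If that prints nothing for the days you hung, this probably isn’t your bug.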
Ah, ok. Found that log now, and indeed, on the day the last hang happened (2018-01-03) there are tons of Redis::TimeoutError lines in the unicorn.stderr.log.
E, [2018-01-02T03:43:02.934928 #386] ERROR -- : app error: Connection timed out (Redis::TimeoutError)
E, [2018-01-02T03:43:32.745829 #398] ERROR -- : app error: Connection timed out (Redis::TimeoutError)
E, [2018-01-02T03:43:32.995622 #386] ERROR -- : app error: Connection timed out (Redis::TimeoutError)
E, [2018-01-02T03:44:02.793192 #398] ERROR -- : app error: Connection timed out (Redis::TimeoutError)
E, [2018-01-02T03:44:03.038986 #386] ERROR -- : app error: Connection timed out (Redis::TimeoutError)
E, [2018-01-02T03:44:30.984465 #13850] ERROR -- : Error connecting to Redis on localhost:6379 (Redis::TimeoutError) (Redis::CannotConnectError)
E, [2018-01-02T03:44:31.020092 #13854] ERROR -- : Error connecting to Redis on localhost:6379 (Redis::TimeoutError) (Redis::CannotConnectError)
E, [2018-01-02T03:44:38.106718 #13865] ERROR -- : Error connecting to Redis on localhost:6379 (Redis::TimeoutError) (Redis::CannotConnectError)
E, [2018-01-02T03:44:38.101883 #13869] ERROR -- : Error connecting to Redis on localhost:6379 (Redis::TimeoutError) (Redis::CannotConnectError)
E, [2018-01-02T03:44:44.179293 #13880] ERROR -- : Error connecting to Redis on localhost:6379 (Redis::TimeoutError) (Redis::CannotConnectError)
E, [2018-01-02T03:44:45.189178 #13886] ERROR -- : Error connecting to Redis on localhost:6379 (Redis::TimeoutError) (Redis::CannotConnectError)
E, [2018-01-02T03:44:50.234502 #13894] ERROR -- : Error connecting to Redis on localhost:6379 (Redis::TimeoutError) (Redis::CannotConnectError)
E, [2018-01-02T03:44:56.294789 #13910] ERROR -- : Error connecting to Redis on localhost:6379 (Redis::TimeoutError) (Redis::CannotConnectError)
Failed to report error: Connection timed out 2 Error connecting to Redis on localhost:6379 (Redis::TimeoutError) subscribe failed, reconnecting in 1 second. Call stack ["/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.3/lib/redis/client.rb:345:in `rescue in establish_connection'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.3/lib/redis/client.rb:331:in `establish_connection'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.3/lib/redis/client.rb:101:in `block in connect'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.3/lib/redis/client.rb:293:in `with_reconnect'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.3/lib/redis/client.rb:100:in `connect'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.3/lib/redis/client.rb:276:in `with_socket_timeout'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.3/lib/redis/client.rb:133:in `call_loop'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.3/lib/redis/subscribe.rb:43:in `subscription'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.3/lib/redis/subscribe.rb:12:in `subscribe'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.3/lib/redis.rb:2765:in `_subscription'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.3/lib/redis.rb:2143:in `block in subscribe'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.3/lib/redis.rb:58:in `block in synchronize'", "/usr/local/lib/ruby/2.4.0/monitor.rb:214:in `mon_synchronize'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.3/lib/redis.rb:58:in `synchronize'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/redis-3.3.3/lib/redis.rb:2142:in `subscribe'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/message_bus-2.0.2/lib/message_bus/backends/redis.rb:304:in `global_subscribe'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/message_bus-2.0.2/lib/message_bus.rb:513:in `global_subscribe_thread'", "/var/www/discourse/vendor/bundle/ruby/2.4.0/gems/message_bus-2.0.2/lib/message_bus.rb:461:in `block in new_subscriber_thread'"]
E, [2018-01-02T03:45:04.359632 #13937] ERROR -- : Error connecting to Redis on localhost:6379 (Redis::TimeoutError) (Redis::CannotConnectError)
E, [2018-01-02T03:45:12.451658 #13954] ERROR -- : Error connecting to Redis on localhost:6379 (Redis::TimeoutError) (Redis::CannotConnectError)
E, [2018-01-02T03:45:18.529028 #13967] ERROR -- : Error connecting to Redis on localhost:6379 (Redis::TimeoutError) (Redis::CannotConnectError)
and so forth…
Oh dear… if you’re 100% sure you were running Docker 17.12 at the time, including the containerd-shim (which is where the problem is), then Houston, we have a problem. The fact that it’s stopped being a problem, though, suggests that either the bug is now far harder to trigger, or that kicking the container back to life was enough to restart the containerd-shim on the 17.12 version, and now everything is hunky-dory forever more.
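If you want to see what’s actually running, a rough Ruby sketch over /proc like the one below will list the shim processes with an approximate start time; the process-name match (“containerd-shim”) and the use of the /proc directory timestamp as a start-time proxy are assumptions, so treat the output as indicative only.

# List running containerd-shim processes with an approximate start time,
# to spot any shim that predates the Docker upgrade. Run as root on the host.
Dir.glob("/proc/[0-9]*/cmdline").each do |path|
  cmd = File.read(path).tr("\0", " ").strip rescue next
  next unless cmd.include?("containerd-shim")
  pid = path.split("/")[2]
  # The mtime of /proc/<pid> is a rough proxy for the process start time.
  started = File.stat("/proc/#{pid}").mtime rescue next
  puts format("%-8s started %s  %s", pid, started, cmd[0, 80])
end

If every shim listed there started after your upgrade to 17.12, the optimistic reading above is probably the right one.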
Well… I updated Docker on 2017-12-29 via the Ubuntu package manager. Whether that restarts everything cleanly and completely, I can’t tell. The fact that I might need to know that is a problem in itself…
Ok, but it might be that the containerd-shim (whatever that is) has now been restarted with the updated version and the bug is gone for good. Thanks for the help!