Web interface timeouts (redis timeouts)

(Guido Drehsen) #1

Last night I did the update to the latest release, v2.0.0.beta3 +125.

About an hour later the problems started.
Even app restarts only helped for a few minutes; after around five minutes the Redis timeouts always recurred. Even a server reboot did not help. (I had also installed all Linux patches.)

I have now done a “docker restart app” at least five times in a row, after every timeout, and it has now been running for at least 20 minutes without any problem.
Sorry to say, but I have no clue why it is working now and did not work before.
Nevertheless, I wanted to inform you. I had seen that this kind of problem had also occurred a couple of times in December: redis timeouts 1 and redis timeouts 2

So you can close the support case, it is just for the records.

```text
Job exception: Connection timed out

Redis::TimeoutError (Connection timed out)
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.5/lib/redis/connection/hiredis.rb:58:in `rescue in read'
Redis::TimeoutError (Connection timed out)
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.5/lib/redis/connection/hiredis.rb:58:in `rescue in read'
Job exception: Connection timed out

Job exception: Connection timed out

Failed to process job: Connection timed out ["/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.5/lib/redis/connection/hiredis.rb:58:in `rescue in read'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.5/lib/redis/connection/hiredis.rb:54:in `read'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.5/lib/redis/client.rb:262:in `block in read'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.5/lib/redis/client.rb:250:in `io'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.5/lib/redis/client.rb:261:in `read'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.5/lib/redis/client.rb:120:in `block in call'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.5/lib/redis/client.rb:231:in `block (2 levels) in process'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.5/lib/redis/client.rb:367:in `ensure_connected'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.5/lib/redis/client.rb:221:in `block in process'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.5/lib/redis/client.rb:306:in `logging'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.5/lib/redis/client.rb:220:in `process'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.5/lib/redis/client.rb:120:in `call'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.5/lib/redis.rb:2399:in `block in _eval'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.5/lib/redis.rb:58:in `block in synchronize'", "/usr/local/lib/ruby/2.3.0/monitor.rb:214:in `mon_synchronize'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.5/lib/redis.rb:58:in `synchronize'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.5/lib/redis.rb:2398:in `_eval'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/redis-3.3.5/lib/redis.rb:2450:in `evalsha'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/message_bus-2.1.2/lib/message_bus/backends/redis.rb:380:in `cached_eval'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/message_bus-2.1.2/lib/message_bus/backends/redis.rb:140:in `publish'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/message_bus-2.1.2/lib/message_bus.rb:248:in `publish'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/message_bus-2.1.2/lib/message_bus.rb:485:in `block in new_subscriber_thread'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/message_bus-2.1.2/lib/message_bus/timer_thread.rb:102:in `do_work'", "/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/message_bus-2.1.2/lib/message_bus/timer_thread.rb:80:in `block (2 levels) in queue'"]
Job exception: Connection timed out
```

(Matt Palmer) #2

The thing you should have taken away from the topics you linked is that this is fundamentally a bug in Docker. So, what version of Docker are you running, and have you tried downgrading to 17.10 to see if the issue goes away? We have tried to work around the problem within Discourse, but there are limits to what we can manage. What does `docker logs app` show?
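For reference, the version and recent container output can be pulled with standard Docker CLI commands. This is a minimal sketch run as root on the host; `app` is the default Discourse container name, so adjust if yours differs:

```shell
# Show the installed Docker version (the bug discussed here is version-dependent)
docker --version

# Show the last 200 lines of the Discourse container's logs and follow new output
docker logs --tail 200 -f app
```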

(Sam Saffron) #3

yeah, you have to start with:

```shell
cd /var/discourse
git pull
./launcher rebuild app
```

(Matt Palmer) #4

I thought you (or someone) fiddled things so that an update to a post-logging-fix version prompted for an on-the-console upgrade?

(Sam Saffron) #5

In theory, yes, it should … hmmm :face_with_monocle:

(Guido Drehsen) #6

Docker version 18.02.0-ce, build fc4de44

and no, I did not try to downgrade. After a couple of app restarts the error was gone.

(Guido Drehsen) #7

Well, that of course was the first thing I did after they informed me that the server had timeouts. :wink:
It did not help, but after a couple of app restarts the problem was gone, and the server has now been up and running for about 20 hours without further problems.

(John Vernon) #8

I was about to start another topic when this one was suggested.

I have tried downgrading Docker to 17.09 (even though it looked like 17.12, which I had installed before, may not have the problem). The issue still persists.

I’m running a web-only setup with RDS and ElastiCache for the DB and Redis nodes. Additionally, I have an AWS ALB configured as a load balancer. When it is working, it is blazingly fast, but it gets periods of timeouts. External monitoring is showing about 70% availability.

There are no errors, and I am on the latest version (git pull and rebuild have been done several times).
Load balancer, web servers, Redis, and DB are all under 1% CPU usage and have plenty of free drive space and RAM.

production_error.log is empty.
production_stderr.log does seem to have this repeating cycle, but the timing doesn’t seem to correspond to our timeouts:

```text
/var/www/discourse/app/models/topic_tracking_state.rb:249: warning: circular argument reference - user_id
I, [2018-03-11T03:00:58.276994 #59]  INFO -- : listening on addr= fd=15
I, [2018-03-11T03:01:05.575280 #59]  INFO -- : master process ready
I, [2018-03-11T03:01:08.806643 #176]  INFO -- : worker=0 ready
I, [2018-03-11T03:01:10.785891 #204]  INFO -- : worker=1 ready
I, [2018-03-11T03:01:12.318838 #255]  INFO -- : worker=2 ready
I, [2018-03-11T03:01:12.386080 #316]  INFO -- : worker=3 ready
```

The nginx error logs are also free from any errors
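For anyone checking the same files, the logs can be followed from the host while a timeout happens. This is a sketch assuming the standard install layout under /var/discourse/shared/standalone and the default `app` container name; adjust paths to your setup:

```shell
# Follow the Rails production and stderr logs for new errors during a stall
tail -f /var/discourse/shared/standalone/log/rails/production.log \
        /var/discourse/shared/standalone/log/rails/production_stderr.log

# Check the NGINX error log inside the container as well
docker exec app tail -n 100 /var/log/nginx/error.log
```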

I’ve been fighting this issue for nearly 24 hours now. Hoping someone else has some suggestions!

Here is my app.yml from the web front end servers:

  - "templates/sshd.template.yml"
  - "templates/web.template.yml"
  - "templates/web.ssl.template.yml"

## which TCP/IP ports should this container expose?
## If you want Discourse to share a port with another webserver like Apache or nginx,
## see https://meta.discourse.org/t/17247 for details
  - "80:80"   # http
  - "443:443" # https
  - "2222:22"

  db_default_text_search_config: "pg_catalog.english"

  LANG: en_US.UTF-8
  DISCOURSE_DB_NAME: 'discourse'
  DISCOURSE_HOSTNAME: 'community.droneup.com'
  DISCOURSE_SMTP_ADDRESS: 'email-smtp.us-east-1.amazonaws.com'

## The Docker container is stateless; all data is stored in /shared
  - volume:
      host: /var/discourse/shared/standalone
      guest: /shared
  - volume:
      host: /var/discourse/shared/standalone/log/var-log
      guest: /var/log

## Plugins go here
## see https://meta.discourse.org/t/19157 for details
    - exec:
        cd: $home/plugins
          - git clone https://github.com/discourse/docker_manager.git
          - git clone https://github.com/discourse/discourse-solved.git
          - git clone https://github.com/discourse/discourse-adplugin.git

## Any custom commands to run after building
  - exec: echo "Beginning of custom commands"
  ## If you want to set the 'From' email address for your first registration, uncomment and change:
  ## After getting the first signup email, re-comment the line. It only needs to run once.
  #- exec: rails r "SiteSetting.notification_email='info@unconfigured.discourse.org'"
  - exec: echo "End of custom commands"

(Matt Palmer) #9

Debug it from first principles: use logs, instrumentation, and direct observation (packet captures in the first instance, strace if required) to narrow down the component that’s stalling, and then dig deeper into the operation of that problematic component with rbtrace or strace as appropriate to figure out exactly what’s going wrong.
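As a concrete starting point for the approach above, the Redis traffic and a stalled worker can be observed with standard tools. This is a sketch run as root on the host; the port assumes a default Redis setup, and `<pid>` is a placeholder for a suspect worker process:

```shell
# Capture Redis traffic (default port 6379) to see whether requests leave the
# box and whether replies come back during a stall
tcpdump -i any -nn port 6379 -w redis.pcap

# Attach strace to a suspect worker (replace <pid>) and watch its syscalls;
# a process stuck in a read() on the Redis socket points at the network layer
strace -f -tt -p <pid> -e trace=network,read,write
```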