Debugging discourse downtime

Hi all – I had about 30 minutes of downtime yesterday with my DigitalOcean Droplet install of Discourse (as measured by external monitoring via Pingdom), and am wondering what the next steps are to diagnose and prevent recurrence.

What I’ve observed so far:

  • My Discourse install is 1.9.0.beta13. Installed plugins are at the bottom of this post.
  • A monitoring agent I had installed on the box (Librato) stopped reporting for the first ~15 minutes of the downtime
  • DigitalOcean’s graphs indicated a dip in CPU + network usage during the downtime, but didn’t indicate any Droplet level events occured
  • In the Discourse logs, I see 56 occurrences of the error “Job exception: execution expired” timestamped about the same time that Pingdom reported the site was back up again. The errors are for “Jobs::UserEmail”, and have an application version of 2db66072d7f7faef8f75078e2b84ad2635855636. Stack trace posted below.

Does this picture give any clues as to what happened? Not sure what to look at next.

Stack trace for the execution expired errors:

/usr/local/lib/ruby/2.3.0/net/smtp.rb:542:in `initialize'
/usr/local/lib/ruby/2.3.0/net/smtp.rb:542:in `open'
/usr/local/lib/ruby/2.3.0/net/smtp.rb:542:in `tcp_socket'
/usr/local/lib/ruby/2.3.0/net/smtp.rb:552:in `block in do_start'
/usr/local/lib/ruby/2.3.0/timeout.rb:101:in `timeout'
/usr/local/lib/ruby/2.3.0/net/smtp.rb:551:in `do_start'
/usr/local/lib/ruby/2.3.0/net/smtp.rb:521:in `start'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/mail-2.6.6/lib/mail/network/delivery_methods/smtp.rb:111:in `deliver!'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/mail-2.6.6/lib/mail/message.rb:2149:in `do_delivery'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/mail-2.6.6/lib/mail/message.rb:237:in `block in deliver'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/actionmailer-5.1.4/lib/action_mailer/base.rb:558:in `block in deliver_mail'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/activesupport-5.1.4/lib/active_support/notifications.rb:166:in `block in instrument'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/activesupport-5.1.4/lib/active_support/notifications/instrumenter.rb:21:in `instrument'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/activesupport-5.1.4/lib/active_support/notifications.rb:166:in `instrument'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/actionmailer-5.1.4/lib/action_mailer/base.rb:556:in `deliver_mail'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/mail-2.6.6/lib/mail/message.rb:237:in `deliver'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/actionmailer-5.1.4/lib/action_mailer/message_delivery.rb:96:in `block in deliver_now'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/actionmailer-5.1.4/lib/action_mailer/rescuable.rb:15:in `handle_exceptions'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/actionmailer-5.1.4/lib/action_mailer/message_delivery.rb:95:in `deliver_now'
/var/www/discourse/lib/email/sender.rb:184:in `send'
/var/www/discourse/app/jobs/regular/user_email.rb:41:in `execute'
/var/www/discourse/app/jobs/base.rb:134:in `block (2 levels) in perform'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/rails_multisite-1.1.0.rc4/lib/rails_multisite/connection_management.rb:71:in `with_connection'
/var/www/discourse/app/jobs/base.rb:129:in `block in perform'
/var/www/discourse/app/jobs/base.rb:125:in `each'
/var/www/discourse/app/jobs/base.rb:125:in `perform'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/processor.rb:188:in `execute_job'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/processor.rb:170:in `block (2 levels) in process'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/middleware/chain.rb:128:in `block in invoke'
/var/www/discourse/lib/sidekiq/pausable.rb:79:in `call'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/middleware/chain.rb:130:in `block in invoke'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/middleware/chain.rb:133:in `invoke'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/processor.rb:169:in `block in process'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/processor.rb:141:in `block (6 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/job_retry.rb:97:in `local'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/processor.rb:140:in `block (5 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq.rb:36:in `block in <module:Sidekiq>'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/processor.rb:136:in `block (4 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/processor.rb:204:in `stats'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/processor.rb:131:in `block (3 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/job_logger.rb:7:in `call'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/processor.rb:130:in `block (2 levels) in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/job_retry.rb:72:in `global'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/processor.rb:129:in `block in dispatch'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/logging.rb:44:in `with_context'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/logging.rb:38:in `with_job_hash_context'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/processor.rb:128:in `dispatch'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/processor.rb:168:in `process'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/processor.rb:85:in `process_one'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/processor.rb:73:in `run'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/util.rb:16:in `watchdog'
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/sidekiq-5.0.5/lib/sidekiq/util.rb:25:in `block in safe_thread'

Installed plugins on my Discourse:

discourse-details: 0.4	
discourse-narrative-bot: 0.0.1
discourse-nginx-performance-report:  0.1
discourse-presence: 1.0
discourse-solved: 0.1
docker_manager: 0.1
lazyYT: 1.0.1
poll:  0.9
1 Like

The stacktrace has to do with Discourse not being able to talk to your SMTP server. Since this is done in a background job I don’t think this is what caused your downtime.

The monitoring agent not reporting for the first ~15 minutes of the downtime suggests that this might be a system issue rather than a Discourse issue.

Maybe @supermathie has ideas what to look next?

5 Likes

Did the server reboot? (check uptime)

Did any messages get posted in the kernel logs? (check dmesg)

Any relevant entries in syslog? (less /var/log/syslog)

The stack traces seem to be a symptom of your downtime, I suspect you lost your networking for the duration of the outage:

/usr/local/lib/ruby/2.3.0/net/smtp.rb:542:in `initialize'
/usr/local/lib/ruby/2.3.0/net/smtp.rb:542:in `open'
/usr/local/lib/ruby/2.3.0/net/smtp.rb:542:in `tcp_socket'
...
8 Likes

Thanks guys!

System did not reboot (uptime 218 days). Unfortunately it looks like syslog and dmesg aren’t configured on my box to retain logs more than a day, so I don’t see anything from around then.

I do see a lot of messages (every 10 minutes or so), still ongoing, like this:

Oct 24 10:35:03 forum kernel: [18839676.542030] type=1400 audit(1508855703.196:112881): apparmor="DENIED" operation="ptrace" profile="docker-default" pid=27630 comm="ps" requested_mask="trace" denied_mask="trace" peer="docker-default

Not sure how that would be related, though.

Anyway, if it happens again, I know now to check syslog + dmesg right after the incident – I appreciate the help.

3 Likes

Did this ever happen again?

1 Like

Nope! Pingdom says it’s been up for the last 3 months (and I think 3 months ago I took it down for maintenance). Delightfully stable. Thanks so much for following up!

4 Likes