Restarted VM; now `/admin/upgrade` page fails to load, JS requests fail, some avatar images 404ing

Background info: site is lot.almost-dead.net, version 2.8.0.beta4, hosted on Google Cloud/Compute; I followed the official docker-based guide to set it up. (I’m proficient with browser front-end tech and understand the broad strokes of cloud hosting, but am familiar with AWS and not Google.)

Preliminary cause: I stopped the VM instance and then restarted it (attempting to adjust the env vars exposed to Discourse).

When the VM came back and the site was back up, I noticed that JS assets were timing out, along with some of the user avatars. On the admin panel, the Community Health area doesn’t load; the spinner continues to spin, and in Chrome Dev Tools the Network tab shows the /message-bus/.../poll requests all failing. The Upgrade page (/admin/upgrade) fails almost immediately and Chrome shows the ERR_FAILED code. When browsing topics, I see POST requests failing with ERR_CONNECTION_REFUSED and JS-initiated GET requests failing with ERR_FAILED. (This is from a browser which has logged-in cookies set for my admin account.)

Loading the site from a fresh browser instance shows Chrome’s ERR_CONNECTION_REFUSED error.

What I’ve tried:

  • server-side rebuild — sudo ./launcher rebuild app seems to succeed, but there is no change in the site’s behavior
    • also tried commenting out plugins and rebuilding, still no change
  • safe mode — requesting /safe-mode results in Chrome’s ERR_FAILED page immediately

Any suggestions?

have you tried

apt-get update
apt-get upgrade

then a rebuild ?

Tried just now, it appears to be the same as before.

No, your site is not back up, it’s completely down. You’re partly looking at cache. For someone who never visited your site, it doesn’t respond at all. This is likely something on the network/firewall level, or nginx is not starting / crashing.

3 Likes

Okay that makes sense.

Once I sudo ./launcher enter app it appears that nginx is running…

root@adn-prod-app:/var/www/discourse# ps aux | grep nginx
root       548  0.0  0.0   2156    64 ?        Ss   07:04   0:00 runsv nginx
root       558  0.0  0.1  55236  2524 ?        S    07:04   0:00 nginx: master process /usr/sbin/nginx
www-data   567  0.0  0.2  55996  5068 ?        S    07:04   0:00 nginx: worker process
www-data   568  0.0  0.0  55996  1628 ?        S    07:04   0:00 nginx: worker process
www-data   569  0.0  0.0  55792  1680 ?        S    07:04   0:00 nginx: cache manager process
root     23179  0.0  0.0   6140   884 pts/1    S+   21:23   0:00 grep nginx

I am not familiar enough with Google Cloud to know what to look for in their network/firewall settings… I do note that the VM Instance has the tags “http-server” and “https-server” and that their firewall system appears to use those tags to apply “default-allow-http” and “default-allow-https” built-in rules to instances with the tags. This sounds correct to me, and nslookup shows that the subdomain I’m using is resolving to the external IP address listed in Google’s UX, however the outside world still can’t access the instance.

In /var/discourse/shared/standalone/log/rails/production.log I see errors about connecting to Redis; is there a way I can repair that?

Job exception: Error connecting to Redis on localhost:6379 (Errno::EADDRNOTAVAIL)
Job exception: Error connecting to Redis on localhost:6379 (Errno::EADDRNOTAVAIL)
Job exception: Error connecting to Redis on localhost:6379 (Errno::EADDRNOTAVAIL)
Job exception: Error connecting to Redis on localhost:6379 (Errno::EADDRNOTAVAIL)
Job exception: Error connecting to Redis on localhost:6379 (Errno::EADDRNOTAVAIL)
Job exception: Error connecting to Redis on localhost:6379 (Errno::EADDRNOTAVAIL)
Error connecting to Redis on localhost:6379 (Errno::EADDRNOTAVAIL) subscribe failed, reconnecting in 1 second. Call stack /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/redis-4.4.0/lib/redis/client.rb:384:in `rescue in establish_connection'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/redis-4.4.0/lib/redis/client.rb:365:in `establish_connection'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/redis-4.4.0/lib/redis/client.rb:117:in `block in connect'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/redis-4.4.0/lib/redis/client.rb:330:in `with_reconnect'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/redis-4.4.0/lib/redis/client.rb:116:in `connect'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/redis-4.4.0/lib/redis/client.rb:403:in `ensure_connected'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/redis-4.4.0/lib/redis/client.rb:255:in `block in process'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/redis-4.4.0/lib/redis/client.rb:342:in `logging'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/redis-4.4.0/lib/redis/client.rb:254:in `process'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/redis-4.4.0/lib/redis/client.rb:148:in `call'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/rack-mini-profiler-2.3.3/lib/mini_profiler/profiling_methods.rb:85 :in `block in profile_method'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/redis-4.4.0/lib/redis.rb:959:in `block in get'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/redis-4.4.0/lib/redis.rb:70:in `block in synchronize'
/usr/local/lib/ruby/2.7.0/monitor.rb:202:in `synchronize'
/usr/local/lib/ruby/2.7.0/monitor.rb:202:in `mon_synchronize'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/redis-4.4.0/lib/redis.rb:70:in `synchronize'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/redis-4.4.0/lib/redis.rb:958:in `get'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/message_bus-3.3.6/lib/message_bus/backends/redis.rb:361:in `process_global_backlog'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/message_bus-3.3.6/lib/message_bus/backends/redis.rb:269:in `block in global_subscribe'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/message_bus-3.3.6/lib/message_bus/backends/redis.rb:282:in `global_subscribe'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/message_bus-3.3.6/lib/message_bus.rb:781:in `global_subscribe_thread'
/var/www/discourse/vendor/bundle/ruby/2.7.0/gems/message_bus-3.3.6/lib/message_bus.rb:729:in `block in new_subscriber_thread'
Job exception: Error connecting to Redis on localhost:6379 (Errno::EADDRNOTAVAIL)
Creating scope :open. Overwriting existing method Poll.open.

Hi,
long shot, but last time I saw some redis error it was somehow related to (well let’s say in the same topic as) the ssl certificates (missing?).
Maybe ./launcher logs app could help ?

Ah, there are some nginx warnings in here, followed by this message which seems like it’s missing an indentifier: Reload error for :

$ sudo ./launcher logs app
run-parts: executing /etc/runit/1.d/00-ensure-links
run-parts: executing /etc/runit/1.d/00-fix-var-logs
run-parts: executing /etc/runit/1.d/01-cleanup-web-pids
run-parts: executing /etc/runit/1.d/anacron
run-parts: executing /etc/runit/1.d/cleanup-pids
Cleaning stale PID files
run-parts: executing /etc/runit/1.d/copy-env
run-parts: executing /etc/runit/1.d/letsencrypt
[Tue 14 Sep 2021 10:44:41 PM UTC] Domains not changed.
[Tue 14 Sep 2021 10:44:41 PM UTC] Skip, Next renewal time is: Tue Oct  5 00:05:09 UTC 2021
[Tue 14 Sep 2021 10:44:41 PM UTC] Add '--force' to force to renew.
[Tue 14 Sep 2021 10:44:42 PM UTC] Installing key to:/shared/ssl/lot.almost-dead.net.key
[Tue 14 Sep 2021 10:44:42 PM UTC] Installing full chain to:/shared/ssl/lot.almost-dead.net.cer
[Tue 14 Sep 2021 10:44:42 PM UTC] Run reload cmd: sv reload nginx
warning: nginx: unable to open supervise/ok: file does not exist
[Tue 14 Sep 2021 10:44:42 PM UTC] Reload error for :
[Tue 14 Sep 2021 10:44:42 PM UTC] Domains not changed.
[Tue 14 Sep 2021 10:44:42 PM UTC] Skip, Next renewal time is: Mon Oct  4 00:06:04 UTC 2021
[Tue 14 Sep 2021 10:44:42 PM UTC] Add '--force' to force to renew.
[Tue 14 Sep 2021 10:44:43 PM UTC] Installing key to:/shared/ssl/lot.almost-dead.net_ecc.key
[Tue 14 Sep 2021 10:44:43 PM UTC] Installing full chain to:/shared/ssl/lot.almost-dead.net_ecc.cer
[Tue 14 Sep 2021 10:44:43 PM UTC] Run reload cmd: sv reload nginx
warning: nginx: unable to open supervise/ok: file does not exist
[Tue 14 Sep 2021 10:44:43 PM UTC] Reload error for :
Started runsvdir, PID is 546
ok: run: redis: (pid 554) 0s
ok: run: postgres: (pid 560) 0s
chgrp: invalid group: ‘syslog’
supervisor pid: 558 unicorn pid: 579
Shutting Down
run-parts: executing /etc/runit/3.d/01-nginx
ok: down: nginx: 1s, normally up
run-parts: executing /etc/runit/3.d/02-unicorn
(558) exiting
ok: down: unicorn: 0s, normally up
run-parts: executing /etc/runit/3.d/10-redis
ok: down: redis: 0s, normally up
run-parts: executing /etc/runit/3.d/99-postgres
ok: down: postgres: 0s, normally up
ok: down: nginx: 3s, normally up
ok: down: postgres: 1s, normally up
ok: down: redis: 2s, normally up
ok: down: cron: 0s, normally up
ok: down: unicorn: 2s, normally up
ok: down: rsyslog: 0s, normally up
run-parts: executing /etc/runit/1.d/00-ensure-links
run-parts: executing /etc/runit/1.d/00-fix-var-logs
run-parts: executing /etc/runit/1.d/01-cleanup-web-pids
run-parts: executing /etc/runit/1.d/anacron
run-parts: executing /etc/runit/1.d/cleanup-pids
Cleaning stale PID files
run-parts: executing /etc/runit/1.d/copy-env
run-parts: executing /etc/runit/1.d/letsencrypt
[Tue 14 Sep 2021 10:54:00 PM UTC] Domains not changed.
[Tue 14 Sep 2021 10:54:00 PM UTC] Skip, Next renewal time is: Tue Oct  5 00:05:09 UTC 2021
[Tue 14 Sep 2021 10:54:00 PM UTC] Add '--force' to force to renew.
[Tue 14 Sep 2021 10:54:00 PM UTC] Installing key to:/shared/ssl/lot.almost-dead.net.key
[Tue 14 Sep 2021 10:54:01 PM UTC] Installing full chain to:/shared/ssl/lot.almost-dead.net.cer
[Tue 14 Sep 2021 10:54:01 PM UTC] Run reload cmd: sv reload nginx
fail: nginx: runsv not running
[Tue 14 Sep 2021 10:54:01 PM UTC] Reload error for :
[Tue 14 Sep 2021 10:54:01 PM UTC] Domains not changed.
[Tue 14 Sep 2021 10:54:01 PM UTC] Skip, Next renewal time is: Mon Oct  4 00:06:04 UTC 2021
[Tue 14 Sep 2021 10:54:01 PM UTC] Add '--force' to force to renew.
[Tue 14 Sep 2021 10:54:01 PM UTC] Installing key to:/shared/ssl/lot.almost-dead.net_ecc.key
[Tue 14 Sep 2021 10:54:01 PM UTC] Installing full chain to:/shared/ssl/lot.almost-dead.net_ecc.cer
[Tue 14 Sep 2021 10:54:01 PM UTC] Run reload cmd: sv reload nginx
fail: nginx: runsv not running
[Tue 14 Sep 2021 10:54:01 PM UTC] Reload error for :
Started runsvdir, PID is 539
ok: run: redis: (pid 549) 0s
ok: run: postgres: (pid 555) 0s
chgrp: invalid group: ‘syslog’
supervisor pid: 551 unicorn pid: 579
$

Not sure why but could they be missing ?

I see them within the app…

root@adn-prod-app:/var/www/discourse# ls -l /shared/ssl/
total 24
-rw-r--r-- 1 root root 5950 Sep 14 22:54 lot.almost-dead.net.cer
-rw-r--r-- 1 root root 5329 Sep 14 22:54 lot.almost-dead.net_ecc.cer
-rw------- 1 root root  302 Sep 14 22:54 lot.almost-dead.net_ecc.key
-rw------- 1 root root 3243 Sep 14 22:54 lot.almost-dead.net.key

:facepalm: it appears my issue was that the VM came back up with a different IP address, and I needed to modify my A record to point to the new address.

Site is back up, thanks for listening!

1 Like