I had some issues with the upgrade: the first forum failed on the first attempt (via the dashboard), then failed again via a rebuild, but seemed to have worked on the second rebuild attempt, although I then had to rebuild an additional time. That reminded me that I needed to stop all Discourse instances when I did the upgrade with the PG12 update (there are three Discourse forums on this server with individual containers), and thus the following worked for the other two forums:
However, for some reason the first forum is no longer accessible, with Safari saying the server unexpectedly dropped the connection. A rebuild seems to go fine, but the site isn't accessible, even though I can enter the app and the Rails console and the database appears intact.
The only warnings I can see in the rebuild output that might be relevant:
168:M 31 Jan 2021 21:39:22.459 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
168:M 31 Jan 2021 21:39:22.459 # Server initialized
168:M 31 Jan 2021 21:39:22.459 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
168:M 31 Jan 2021 21:39:22.459 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled (set to 'madvise' or 'never').
168:M 31 Jan 2021 21:39:22.459 * Loading RDB produced by version 6.0.9
168:M 31 Jan 2021 21:39:22.459 * RDB age 21 seconds
168:M 31 Jan 2021 21:39:22.459 * RDB memory usage when created 4.03 Mb
168:M 31 Jan 2021 21:39:22.466 * DB loaded from disk: 0.006 seconds
168:M 31 Jan 2021 21:39:22.466 * Ready to accept connections
production.log:
Job exception: Error connecting to Redis on localhost:6379 (Errno::ENETUNREACH)
Error connecting to Redis on localhost:6379 (Errno::ENETUNREACH) subscribe failed, reconnecting in 1 second. Call stack /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/redis-4.2.5/lib/redis/client.rb:367:in `rescue in establish_connection'
Similar messages appear in unicorn.stderr.log and unicorn.stdout.log.
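One detail worth pausing on here (my observation, not something stated in the logs): the error is Errno::ENETUNREACH, not Errno::ECONNREFUSED. ENETUNREACH means the kernel could not route to the address at all, which would point at the container's networking rather than at Redis being down. A minimal stdlib Ruby sketch of the difference, assuming nothing is listening on loopback port 1:

```ruby
require "socket"

# Distinguish "service down" from "network broken". ECONNREFUSED means
# the packet reached the host and was rejected (nothing listening);
# ENETUNREACH means there was no route to the host in the first place.
def classify_connect(host, port)
  TCPSocket.new(host, port).close
  :open
rescue Errno::ECONNREFUSED
  :refused      # path is fine, service is not there
rescue Errno::ENETUNREACH
  :unreachable  # routing/interface problem, as in the log above
end

# On a healthy machine a closed loopback port is :refused, never
# :unreachable -- so seeing ENETUNREACH for localhost:6379 suggests
# the container's loopback or routing table is in a bad state.
classify_connect("127.0.0.1", 1)
```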
Entering the container and running redis-cli ping, I get a PONG back. Redis is running on the server (but not in the individual containers, though this has always been the case as far as I know).
Any ideas what might be going on?
(I've also rebooted the server, and created a new Let's Encrypt cert for this domain to be on the safe side, but it's still the same.)
That looks as though everything should be working… have you tried a different browser, or clearing your cache? If that doesn't help, could you post the output of:
Some things that could be the cause of the empty response error:
The server is behind a VPN and there is no access to the port.
If you have multiple Discourse instances on the same server, I assume there is a reverse proxy in front. Make sure that it points to the Discourse container (you may need to restart it).
There isn't enough space on the server (you can run df -hT /).
I would check the free disk space first (3).
Disk usage was showing at 31%, but I did a ./launcher cleanup anyway:
docker container ls
(To ensure all three forum containers are running)
./launcher cleanup
WARNING! This will remove all stopped containers.
Are you sure you want to continue? [y/N] y
Total reclaimed space: 0B
WARNING! This will remove all images without at least one container associated to them.
Are you sure you want to continue? [y/N] y
Deleted Images:
...
Total reclaimed space: 32.56GB
We use HAProxy, and I checked it (and restarted it) and it's up and running. (We also do the redirect from http to https via it, and that works fine for this domain as well, so I don't think the issue is there; plus it was working until this update.)
I can still enter the container and access the Rails console, and the DB is still there/connected within the container, so this is just extremely weird. Does anyone have any other ideas, or any other steps to troubleshoot this?
If you haven't been able to debug what is going on, an option may be to take a backup from the command line and restore it to a fresh site running on PG13. Alternatively, if you need your site back up and running, you can revert the version to PG12, move the existing shared/postgres_data_old directory back to shared/postgres_data, and rebuild. However, I'd recommend trying the backup/restore instead, as the issue doesn't seem related to the database upgrade itself.
If you have any other ideas to investigate this I'd be happy to try them, Michael. Luckily this forum being offline isn't a huge deal, as it was in read-only mode anyway (having been replaced by another forum).
If you're out of ideas then I'll go ahead and try restoring a backup, but if possible I'd like to troubleshoot this, as I'm interested in learning why it happened (as I guess you might be), so I'm definitely up for looking into this further if you are.
To be honest, it has made me a little nervous about converting some of my other forums to Discourse, and knowing what went wrong could be useful for us all.
It's a standard multiple-container install where each forum has its own app.yml, with a container set-up based on host: discourse/shared-site-name/standalone and host: discourse/shared-site-name/standalone/log/var-log (as per the questions I've asked and posts on this forum).
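For reference, a sketch of what each forum's volumes section would look like under that scheme. The host paths are exactly as described above, while the guest paths (/shared and /var/log) are the discourse_docker defaults, so treat this as an assumption rather than the actual config:

```yaml
volumes:
  - volume:
      host: discourse/shared-site-name/standalone
      guest: /shared
  - volume:
      host: discourse/shared-site-name/standalone/log/var-log
      guest: /var/log
```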
Entering each container and running psql (sudo -u postgres psql discourse), \l+ shows just one discourse database per container (and each is a different size), so I'm guessing these are independent Discourse instances.
Do you have a link to the "standard" way to run multiple independent Discourse forums on a server? I can check whether that's the same as what I've got here, although I'm fairly sure what I have is based on posts and guidance from the Discourse team.
Are you running nginx inside the container? The next thing I'd try is to follow where the requests are ending up. So I understand: you have HAProxy performing the SSL termination and then proxying requests into the respective containers?
As far as I'm aware the containers themselves are all "standard" (so I gather each is running nginx), and yes, HAProxy handles all of the SSL and directs requests to each container.
backend main_apache_sites
server server1 127.0.0.1:8080 cookie A check
cookie JSESSIONID prefix nocache
backend discourse_docker
server server2 127.0.0.1:8888 cookie A check
cookie JSESSIONID prefix nocache
backend discourse_docker_2
server server2 127.0.0.1:8889 cookie A check
cookie JSESSIONID prefix nocache
backend discourse_docker_3
server server2 127.0.0.1:8890 cookie A check
cookie JSESSIONID prefix nocache
backend letsencrypt-backend
server letsencrypt 127.0.0.1:54321
Where for some reason all of the Discourse backends had server2 on the second line. I changed these to server2, server3, etc. yesterday, but it hasn't made any difference (and it was working fine like this previously).
Are there any specific log files I could look at that might provide further clues? Perhaps Docker log files?
Yes, those are commented out:
templates:
- "templates/postgres.template.yml"
- "templates/redis.template.yml"
- "templates/web.template.yml"
- "templates/web.ratelimited.template.yml"
## Uncomment these two lines if you wish to add Lets Encrypt (https)
#- "templates/web.ssl.template.yml"
#- "templates/web.letsencrypt.ssl.template.yml"
The nginx logs inside the app containers should be able to confirm that the requests are making it to the application; can you check those? nginx in the container proxies requests to 127.0.0.1:3000, which is the unicorn process.
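If it helps, here is a hypothetical probe (my sketch, not an official Discourse tool) you could paste into a Ruby console inside the container to ask unicorn directly on the port nginx proxies to. A 200 here, combined with dropped connections at the browser, would put the fault somewhere in front of unicorn:

```ruby
require "net/http"

# Probe the unicorn upstream directly, bypassing nginx and HAProxy.
# Returns the HTTP status code as a string, or the error class name
# if the TCP connection itself fails (e.g. "Errno::ENETUNREACH").
def upstream_status(host = "127.0.0.1", port = 3000, path = "/")
  res = Net::HTTP.start(host, port, open_timeout: 2, read_timeout: 5) do |http|
    http.get(path)
  end
  res.code
rescue SystemCallError, Net::OpenTimeout => e
  e.class.name
end
```

Inside a healthy app container, calling upstream_status with the defaults should come back with a success or redirect code rather than an error class name.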
Looking in /var/log/nginx and /shared/log/rails, nothing really stands out; in fact none of the logs were updated today (the 4th) apart from /shared/log/rails/production.log, which just has a few jobs like this:
I also changed the port in HAProxy and got a server-not-found error as expected, then updated the container to the same port and it reverted to the same behaviour (so I think this rules out an HAProxy issue).
Are there any Docker logs to look at? Or can I save/export this container and send it to you so you can have a look? I guess you're wondering what went wrong just as much as I am.
None of the nginx logs have been touched today, although the last log on the 30th of Jan shows a 'limiting requests by zone "flood", client: my.ip.address, POST /mini-profiler-resources' type error.
Edit: not sure if this is any help, but running docker logs APP:
# docker logs f2
run-parts: executing /etc/runit/1.d/00-ensure-links
run-parts: executing /etc/runit/1.d/00-fix-var-logs
run-parts: executing /etc/runit/1.d/01-cleanup-web-pids
run-parts: executing /etc/runit/1.d/anacron
run-parts: executing /etc/runit/1.d/cleanup-pids
Cleaning stale PID files
run-parts: executing /etc/runit/1.d/copy-env
Started runsvdir, PID is 42
ok: run: redis: (pid 55) 0s
ok: run: postgres: (pid 54) 0s
chgrp: invalid group: 'syslog'
supervisor pid: 51 unicorn pid: 82
(51) Reopening logs
(51) Reopening logs
(51) Reopening logs
(51) Stopping Sidekiq
(51) Reloading unicorn (82)
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid... 22039
(51) Old pid is: 82 New pid is: 22039
(51) Stopping Sidekiq
(51) Reloading unicorn (22039)
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid... 23358
(51) Old pid is: 22039 New pid is: 23358
(51) Reopening logs
(51) Reopening logs
Examining the logs and reading back through your previous replies: the application is attempting to access Redis on localhost:6379 inside the container, and it looks as though Redis is starting fine too, yet for some reason it can't connect (puzzling). Though it is possible that these error messages were from when message_bus was trying to connect before Redis started, or after it stopped in the event of a restart.
You mentioned that Redis is running on the server but not in the individual containers. Could you expand on that?
With this config, Redis will run inside the container (as you can see in the docker logs output).
On another note, when you navigate to the URL of the site that isn't working, what appears in your nginx logs? error.log should be empty, and access.log should be filled with various HTTP requests. Just trying to narrow down at what point something is going wrong.
Sorry I mixed things up. Redis is in fact working in each container, verified by running this on the server itself, and then each of the three Discourse containers with the same output for each:
$ redis-cli ping
PONG
$ redis-server
# Creating Server TCP listening socket *:6379: bind: Address already in use (means it's already started)
$ redis-cli
127.0.0.1:6379> ping
PONG
127.0.0.1:6379> get mykey
(nil)
127.0.0.1:6379> set mykey somevalue
OK
127.0.0.1:6379> get mykey
"somevalue"
The same is true for all three (noteworthy: the first get mykey always returns nil), so it's safe to say Redis is up and running independently in all containers.
It's empty, and nothing has been written in that directory today:
drwxr-xr-x 2 www-data www-data 4096 Feb 4 21:26 .
drwxrwxr-x 9 root root 4096 Feb 2 08:03 ..
-rw-r--r-- 1 www-data www-data 0 Feb 3 07:38 access.log
-rw-r--r-- 1 www-data www-data 0 Feb 2 08:03 access.log.1
-rw-r--r-- 1 www-data www-data 294 Feb 1 09:43 access.log.2.gz
-rw-r--r-- 1 www-data www-data 37598 Jan 30 23:56 access.log.3.gz
-rw-r--r-- 1 www-data www-data 58059 Jan 30 07:36 access.log.4.gz
-rw-r--r-- 1 www-data www-data 55988 Jan 29 07:34 access.log.5.gz
-rw-r--r-- 1 www-data www-data 73964 Jan 28 07:49 access.log.6.gz
-rw-r--r-- 1 www-data www-data 78069 Jan 27 07:53 access.log.7.gz
-rw-r--r-- 1 www-data www-data 0 Feb 3 07:38 error.log
-rw-r--r-- 1 www-data www-data 0 Feb 2 08:03 error.log.1
-rw-r--r-- 1 www-data www-data 20 Feb 1 00:31 error.log.2.gz
-rw-r--r-- 1 www-data www-data 632 Jan 30 23:46 error.log.3.gz
-rw-r--r-- 1 www-data www-data 265 Jan 29 09:07 error.log.4.gz
-rw-r--r-- 1 www-data www-data 20 Jan 28 07:50 error.log.5.gz
-rw-r--r-- 1 www-data www-data 3107 Jan 28 07:41 error.log.6.gz
-rw-r--r-- 1 www-data www-data 20 Jan 26 07:53 error.log.7.gz
Checking the access logs for another container, it's fine, so it's just this one.
It seems like HAProxy is sending the request through but the container isn't able to handle or accept it. I wonder if there's anything that can be reset there? (Which I would have thought rebuilding the container would do anyway?)
IMAGE COMMAND CREATED STATUS PORTS
local_discourse/1 "/sbin/boot" 20 hours ago Up 20 hours 0.0.0.0:2225->22/tcp, 0.0.0.0:8892->80/tcp
local_discourse/2 "/sbin/boot" 4 days ago Up 4 days 0.0.0.0:2223->22/tcp, 0.0.0.0:8889->80/tcp
local_discourse/3 "/sbin/boot" 4 days ago Up 4 days 0.0.0.0:2224->22/tcp, 0.0.0.0:8890->80/tcp
My gut feeling is that it's to do with the failed attempt via the dashboard. Usually for PG/major updates the dashboard says you need to do a rebuild and that updating via the dashboard is disabled, but for some reason it didn't this time (maybe because I hadn't updated that forum in a while, hence thinking I ought to do it via the dashboard first), or it's possible it hadn't shut down or started properly before I did the rebuild.
In the HAProxy config, I can see the backends are configured to forward to ports 8888, 8889, and 8890:
However, the app containers are listening on 8892, 8889, 8890, which looks like a discrepancy for the discourse_docker backend. Is that something you've updated in the config since that was posted?
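As a side note, this kind of drift is easy to check mechanically. A throwaway Ruby sketch using the numbers quoted in this thread (backend ports from the HAProxy snippet, published ports from the docker container ls output):

```ruby
# Ports each HAProxy backend forwards to (from the config quoted earlier):
haproxy = {
  "discourse_docker"   => 8888,
  "discourse_docker_2" => 8889,
  "discourse_docker_3" => 8890,
}

# Host ports the containers actually publish (PORTS column of
# `docker container ls`):
published = [8892, 8889, 8890]

# Any backend pointing at a port no container publishes is stale.
stale = haproxy.reject { |_name, port| published.include?(port) }
stale.each { |name, port| puts "#{name} -> #{port} has no matching container" }
```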
Yep, the HAProxy ports correspond to the correct container ports. I'm pretty sure it's not related to this, as it was working fine; it's just after that upgrade/rebuild that this happened.
Entering the container and opening top stats, then going to the site, doesn't seem to make any difference either. In case it's any help, here's a screenshot: