Firewall issue with running multiple containers after upgrade

I had some issues with the upgrade: the first forum failed on the first attempt (via the dashboard), failed again via a rebuild, and seemed to work on the second rebuild attempt, although I then had to rebuild an additional time. That reminded me that I needed to stop all Discourse instances when I did the PG12 upgrade (there are three Discourse forums on this server, each in its own container), and stopping them all first worked for the other two forums.

However, for some reason the first forum is no longer accessible: Safari says the server unexpectedly dropped the connection. A rebuild seems to go fine, but the site is still unreachable, even though I can enter the app and the Rails console, and the database appears intact.

The only warnings I can see in the rebuild output that might be relevant:

168:M 31 Jan 2021 21:39:22.459 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
168:M 31 Jan 2021 21:39:22.459 # Server initialized
168:M 31 Jan 2021 21:39:22.459 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
168:M 31 Jan 2021 21:39:22.459 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled (set to 'madvise' or 'never').
168:M 31 Jan 2021 21:39:22.459 * Loading RDB produced by version 6.0.9
168:M 31 Jan 2021 21:39:22.459 * RDB age 21 seconds
168:M 31 Jan 2021 21:39:22.459 * RDB memory usage when created 4.03 Mb
168:M 31 Jan 2021 21:39:22.466 * DB loaded from disk: 0.006 seconds
168:M 31 Jan 2021 21:39:22.466 * Ready to accept connections
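
For what it's worth, the fixes those warnings suggest amount to roughly the following on the host - lifted straight from the messages above, and presumably unrelated to the actual problem:

# allow Redis's TCP backlog of 511 to take effect, and enable memory overcommit
sysctl -w net.core.somaxconn=512
sysctl -w vm.overcommit_memory=1
# disable Transparent Huge Pages (re-apply after reboot, e.g. via /etc/rc.local)
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled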

production.log:


Job exception: Error connecting to Redis on localhost:6379 (Errno::ENETUNREACH)

Error connecting to Redis on localhost:6379 (Errno::ENETUNREACH) subscribe failed, reconnecting in 1 second. Call stack /var/www/discourse/vendor/bundle/ruby/2.7.0/gems/redis-4.2.5/lib/redis/client.rb:367:in `rescue in establish_connection'

Similar messages appear in unicorn.stderr.log and unicorn.stdout.log.

Entering the container and running redis-cli ping, I get a PONG back. Redis is running on the server (but not in individual containers - though this has always been the case, as far as I know).
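
For reference, this is roughly how I'm checking (the path and container name app are placeholders for my set-up):

cd /var/discourse
./launcher enter app   # shell inside the running container
redis-cli ping         # returns PONG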

Any ideas what might be going on?

(I've also rebooted the server, and created a new letsencrypt cert for this domain to be on the safe side - but still the same.)
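
For reference, the cert was recreated with something like the following - certbot in standalone mode, answering on the port that the HAProxy letsencrypt backend forwards to (the exact plugin and port may differ per set-up):

certbot certonly --standalone --http-01-port 54321 -d metaruby.com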

That looks as though everything should be working… have you tried a different browser, or clearing your cache? If that doesn't help, could you post the output of:

curl -vv -o /dev/null <forum url>

I have tried multiple browsers but I get the same result, Michael. Here's the output of that command:

~$ curl -vv -o /dev/null https://metaruby.com
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 78.46.110.60...
* TCP_NODELAY set
* Connected to metaruby.com (78.46.110.60) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
} [226 bytes data]
* TLSv1.2 (IN), TLS handshake, Server hello (2):
{ [93 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [2473 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [333 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [70 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
{ [1 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: CN=metaruby.com
*  start date: Jan 31 03:33:05 2021 GMT
*  expire date: May  1 03:33:05 2021 GMT
*  subjectAltName: host "metaruby.com" matched cert's "metaruby.com"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
> GET / HTTP/1.1
> Host: metaruby.com
> User-Agent: curl/7.64.1
> Accept: */*
> 
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0* TLSv1.2 (IN), TLS alert, close notify (256):
{ [2 bytes data]
* Empty reply from server
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
* Connection #0 to host metaruby.com left intact
curl: (52) Empty reply from server
* Closing connection 0

Some things that could be the cause of the empty response error:

  1. The server is in a VPN and there is no access to the port.
  2. If you have multiple Discourse instances on the same server, I assume there is a reverse proxy in front. Make sure that it points to the Discourse container (you may need to restart it).
  3. There isn't enough disk space on the server (you can check with df -hT /).

I would check the free disk space (3) first.

Disk usage was showing at 31% but I did a ./launcher cleanup anyway:

docker container ls 
(To ensure all three forum containers are running)

./launcher cleanup

WARNING! This will remove all stopped containers.
Are you sure you want to continue? [y/N] y
Total reclaimed space: 0B
WARNING! This will remove all images without at least one container associated to them.
Are you sure you want to continue? [y/N] y
Deleted Images:
...
Total reclaimed space: 32.56GB

We use HAProxy; I checked it (and restarted it) and it is up and running. (We also do the redirect from HTTP to HTTPS via it, and that works fine for this domain as well, so I don't think the issue is there - plus it was working until this update.)
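
The HTTP-to-HTTPS redirect in HAProxy is just the standard rule - roughly the following, with the frontend name being whatever yours is called:

frontend http-in
  bind *:80
  # send all plain-HTTP requests to HTTPS
  redirect scheme https code 301 if !{ ssl_fc }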

I can still enter the container and access the Rails console, and the DB is still there and connected within the container - so this is just extremely weird. Does anyone have any other ideas or steps to troubleshoot this?

If you haven't been able to debug what is going on, one option is to take a backup from the command line and restore it to a fresh site running on PG13. Alternatively, if you need your site back up and running, you can revert the version to PG12, move the existing shared/postgres_data_old directory back to shared/postgres_data, and rebuild. However, I'd recommend trying the backup/restore instead, as the issue doesn't seem related to the database upgrade itself.
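
For reference, the command-line backup is roughly this (assuming your container is called app):

cd /var/discourse
./launcher enter app
discourse backup   # writes a tar.gz under /var/www/discourse/public/backups/default

You can then copy that file to the new site and restore it via the admin UI, or from the command line with discourse enable_restore followed by discourse restore <filename> inside the new container.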

You're a bit beyond a standard supported install here. :slight_smile:

Does each Discourse have its own postgres, or do you have one postgres for all three?

If you do have a single postgres/data container, then you want to STOP all of the Discourses before trying to upgrade postgres.
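
In other words, before rebuilding, something like this (the app names are placeholders for your three containers):

cd /var/discourse
./launcher stop app1
./launcher stop app2
./launcher stop app3
# ...then rebuild, and start them all again afterwards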

HAProxy doesn't have anything to do with Postgres, so I don't think that matters.

If you have any other ideas for investigating this I'd be happy to try them, Michael - luckily this forum being offline isn't a huge deal, as it was in read-only mode anyway (having been replaced by another forum).

If you're out of ideas then I'll go ahead and try restoring a backup, but if possible I'd like to troubleshoot this, as I'm interested in learning why it happened (as I guess you might be) - so I'm definitely up for looking into this further if you are.

Tbh it has made me a little nervous about converting some of my other forums to Discourse, and knowing what went wrong could be useful for us all.

It's a standard multiple-container install where each forum has its own app.yml, with a container set-up based on host: discourse/shared-site-name/standalone and host: discourse/shared-site-name/standalone/log/var-log (as per the questions I've asked and posts on this forum).

Entering each container and running psql (sudo -u postgres psql discourse) and \l+ shows just one discourse database per container (and each is a different size), so I'm guessing these are independent Discourse instances.

Do you have a link to the 'standard' way to run multiple independent Discourse forums on a server? I can check whether that's the same as what I've got here, although I'm fairly sure what I have is based on posts and guidance from the Discourse team.

Are you running nginx inside the container? The next thing I'd try is to follow where the requests are ending up. So that I understand: you have HAProxy performing the SSL termination and then proxying requests into the respective containers?

Ah, OK. So for each one you should do ./launcher rebuild YOUR-APP-NAME twice. I don't think you can do it from the web interface.
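
So, roughly this (the forum names are placeholders for your three apps):

cd /var/discourse
for app in forum1 forum2 forum3; do
  ./launcher rebuild $app
  ./launcher rebuild $app   # second rebuild, per the above
done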

And the container yml files all have the ssl and letsencrypt templates commented out (or removed)?

As far as I'm aware the containers themselves are all 'standard' (so I gather each is running nginx), and yes, HAProxy handles all of the SSL and directs requests to each container.

My set-up is per the write-up here: Set up Discourse on a server with existing Apache sites (with the SSL version of the HAProxy configuration here).

There was one issue with the HAProxy config:

backend main_apache_sites
  server server1 127.0.0.1:8080 cookie A check
  cookie JSESSIONID prefix nocache

backend discourse_docker
  server server2 127.0.0.1:8888 cookie A check
  cookie JSESSIONID prefix nocache

backend discourse_docker_2
  server server2 127.0.0.1:8889 cookie A check
  cookie JSESSIONID prefix nocache

backend discourse_docker_3
  server server2 127.0.0.1:8890 cookie A check
  cookie JSESSIONID prefix nocache

backend letsencrypt-backend
  server letsencrypt 127.0.0.1:54321

For some reason all of the discourse backends had server2 on the second line - I changed these to unique names (server2, server3, etc.) yesterday, but it hasn't made any difference (and it was working fine like this previously).

Are there any specific log files I could look at that might provide further clues? Perhaps Docker log files?

Yes, those are commented out:

templates:
  - "templates/postgres.template.yml"
  - "templates/redis.template.yml"
  - "templates/web.template.yml"
  - "templates/web.ratelimited.template.yml"
## Uncomment these two lines if you wish to add Lets Encrypt (https)
  #- "templates/web.ssl.template.yml"
  #- "templates/web.letsencrypt.ssl.template.yml"

The nginx logs inside the app containers should be able to confirm whether the requests are making it to the application - can you check those? nginx in the container proxies requests to 127.0.0.1:3000, which is the unicorn process.
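
For example, from the host (the container name is a placeholder):

cd /var/discourse
./launcher enter app
tail -f /var/log/nginx/access.log /var/log/nginx/error.log
# then request the site in a browser and watch whether entries appear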

Looking in /var/log/nginx and /shared/log/rails, nothing really stands out; in fact none of the logs were updated today (the 4th) apart from /shared/log/rails/production.log, which just has a few jobs like this:

Rails logs:

Nginx logs:

I also changed the port in HAProxy and got a server-not-found error as expected, then updated the container to the same port and it reverted to the same behaviour (so I think this rules out a HAProxy issue).

Are there any Docker logs to look at? Or can I save/export this container and send it to you so you can have a look? I guess you're wondering what went wrong just as much as I am :blush:

Actually, I just looked again (the above was from last night) and there are now some entries in:

unicorn.stderr.log

(Sorry, it won't let me copy the text)

None of the nginx logs have been touched today, although the last entry on the 30th of Jan shows a 7: limiting requests by zone "flood" client: my.ip.address, POST /mini-profiler-resources type error.

Edit: not sure if this is any help, but running docker logs APP:

For the forum that isn't working:

# docker logs metaruby
run-parts: executing /etc/runit/1.d/00-ensure-links
run-parts: executing /etc/runit/1.d/00-fix-var-logs
run-parts: executing /etc/runit/1.d/01-cleanup-web-pids
run-parts: executing /etc/runit/1.d/anacron
run-parts: executing /etc/runit/1.d/cleanup-pids
Cleaning stale PID files
run-parts: executing /etc/runit/1.d/copy-env
Started runsvdir, PID is 43
ok: run: redis: (pid 55) 0s
ok: run: postgres: (pid 56) 0s
chgrp: invalid group: 'syslog'
supervisor pid: 50 unicorn pid: 89

For forum 2 (working fine):

# docker logs f2
run-parts: executing /etc/runit/1.d/00-ensure-links
run-parts: executing /etc/runit/1.d/00-fix-var-logs
run-parts: executing /etc/runit/1.d/01-cleanup-web-pids
run-parts: executing /etc/runit/1.d/anacron
run-parts: executing /etc/runit/1.d/cleanup-pids
Cleaning stale PID files
run-parts: executing /etc/runit/1.d/copy-env
Started runsvdir, PID is 42
ok: run: redis: (pid 55) 0s
ok: run: postgres: (pid 54) 0s
chgrp: invalid group: 'syslog'
supervisor pid: 51 unicorn pid: 82
(51) Reopening logs
(51) Reopening logs
(51) Reopening logs
(51) Stopping Sidekiq
(51) Reloading unicorn (82)
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid... 22039
(51) Old pid is: 82 New pid is: 22039
(51) Stopping Sidekiq
(51) Reloading unicorn (22039)
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid...
(51) Waiting for new unicorn master pid... 23358
(51) Old pid is: 22039 New pid is: 23358
(51) Reopening logs
(51) Reopening logs

For forum 3 (working fine too):

# docker logs f3
run-parts: executing /etc/runit/1.d/00-ensure-links
run-parts: executing /etc/runit/1.d/00-fix-var-logs
run-parts: executing /etc/runit/1.d/01-cleanup-web-pids
run-parts: executing /etc/runit/1.d/anacron
run-parts: executing /etc/runit/1.d/cleanup-pids
Cleaning stale PID files
run-parts: executing /etc/runit/1.d/copy-env
Started runsvdir, PID is 42
ok: run: redis: (pid 54) 0s
chgrp: invalid group: 'syslog'
ok: run: postgres: (pid 55) 0s
supervisor pid: 56 unicorn pid: 88
(56) Reopening logs
(56) Reopening logs
(56) Reopening logs
(56) Reopening logs
(56) Reopening logs

Examining the logs and reading back through your previous replies: the application is attempting to access Redis on localhost:6379 inside the container, and it looks as though Redis is starting fine too, yet for some reason it can't connect (puzzling). Though it is possible that these error messages are from message_bus trying to connect before Redis starts, or after it stops in the event of a restart.

You mentioned that Redis is running on the server but not in individual containers - could you expand on that?

With this config, Redis will run inside the container (as you can see in the docker logs output).

On another note, when you navigate to the URL of the site that isn't working, what appears in your nginx logs? error.log should be empty, and access.log should fill with various HTTP requests. Just trying to narrow down at what point something goes wrong.

Sorry, I mixed things up. Redis is in fact working in each container, verified by running this on the server itself and then in each of the three Discourse containers, with the same output for each:

$ redis-cli ping
PONG
$ redis-server
# Creating Server TCP listening socket *:6379: bind: Address already in use (means it's already started)
$ redis-cli
127.0.0.1:6379> ping
PONG
127.0.0.1:6379> get mykey
(nil)
127.0.0.1:6379> set mykey somevalue
OK
127.0.0.1:6379> get mykey
"somevalue"

The same is true for all three (noteworthy: the first get mykey always returns nil), so it's safe to say Redis is up and running independently in all the containers.

It's empty, and nothing in that directory has been written to today:

drwxr-xr-x 2 www-data www-data  4096 Feb  4 21:26 .
drwxrwxr-x 9 root     root      4096 Feb  2 08:03 ..
-rw-r--r-- 1 www-data www-data     0 Feb  3 07:38 access.log
-rw-r--r-- 1 www-data www-data     0 Feb  2 08:03 access.log.1
-rw-r--r-- 1 www-data www-data   294 Feb  1 09:43 access.log.2.gz
-rw-r--r-- 1 www-data www-data 37598 Jan 30 23:56 access.log.3.gz
-rw-r--r-- 1 www-data www-data 58059 Jan 30 07:36 access.log.4.gz
-rw-r--r-- 1 www-data www-data 55988 Jan 29 07:34 access.log.5.gz
-rw-r--r-- 1 www-data www-data 73964 Jan 28 07:49 access.log.6.gz
-rw-r--r-- 1 www-data www-data 78069 Jan 27 07:53 access.log.7.gz
-rw-r--r-- 1 www-data www-data     0 Feb  3 07:38 error.log
-rw-r--r-- 1 www-data www-data     0 Feb  2 08:03 error.log.1
-rw-r--r-- 1 www-data www-data    20 Feb  1 00:31 error.log.2.gz
-rw-r--r-- 1 www-data www-data   632 Jan 30 23:46 error.log.3.gz
-rw-r--r-- 1 www-data www-data   265 Jan 29 09:07 error.log.4.gz
-rw-r--r-- 1 www-data www-data    20 Jan 28 07:50 error.log.5.gz
-rw-r--r-- 1 www-data www-data  3107 Jan 28 07:41 error.log.6.gz
-rw-r--r-- 1 www-data www-data    20 Jan 26 07:53 error.log.7.gz

Checking the access logs for another container, they're fine, so it's just this one.

It seems like HAProxy is sending the request through but the container isn't able to accept or handle it - I wonder if there's anything that can be reset there? (Which I would have thought rebuilding the container would do anyway?)

It does sound like that. Can you confirm which port bindings are present for each container when you run docker ps on the host?
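
Something like this will show just the relevant columns:

docker ps --format 'table {{.Names}}\t{{.Ports}}'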

Sure:

IMAGE                COMMAND        CREATED        STATUS        PORTS
local_discourse/1    "/sbin/boot"   20 hours ago   Up 20 hours   0.0.0.0:2225->22/tcp, 0.0.0.0:8892->80/tcp
local_discourse/2    "/sbin/boot"   4 days ago     Up 4 days     0.0.0.0:2223->22/tcp, 0.0.0.0:8889->80/tcp
local_discourse/3    "/sbin/boot"   4 days ago     Up 4 days     0.0.0.0:2224->22/tcp, 0.0.0.0:8890->80/tcp

My gut feeling is that it's to do with the failed attempt via the dashboard - usually for PG/major updates the dashboard says you need to do a rebuild and disables updating via the dashboard, but for some reason it didn't this time (maybe because I hadn't updated that forum in a while - hence thinking I ought to do it via the dashboard first). Or possibly it hadn't shut down or started properly before I did the rebuild :confused:

In the HAProxy config, I can see the backends are configured to forward to ports 8888, 8889, and 8890:

However, the app containers are listening on 8892, 8889, and 8890 - that looks like a discrepancy for the discourse_docker backend. Is that something you've updated in the config since that was posted?

Yep, the HAProxy ports correspond to the correct container ports :smiley: I'm pretty sure it's not related to this, as it was working fine - it was only after that upgrade/rebuild that this happened.

Entering the container and opening top, then going to the site, doesn't seem to make any difference either. In case it's any help, here's a screenshot:

If it's easier for you I'd be happy to 'save' the container and send it to you (is that even possible with Docker containers? haha!) :slight_smile:
