Discourse having momentary "downs" - How to get more info from the logs

Hey,

This morning some users warned me that our Discourse installation was apparently down, but by the time I checked it was up again. I can see multiple “offline”-like periods in my monitoring stack, but when I check the logs via launcher logs app they seem to be output without a datetime (:confused:), and only the most recent ones I’m guessing, with no real distinction between the sources of the logs, even though a Discourse installation is made up of several components (nginx, redis, psql, etc.).

Is there a way to have more accurate logs or, even better, get them exposed somehow so I could catch them with loki / promtail?

All charts cover the “last 5 days”, since I only added this monitoring that recently. This is what I can see in my monitoring:

Container Uptime:

From the discourse-prometheus plugin: lots of “holes”, which I suppose correspond to when Discourse is not responding, as they match the time users notified me this morning:

Zoomed in on the last 6 hours to clarify that this is not some “small” window, and of course is definitely not acceptable to the users:

Logs obtained with ./launcher logs app

root@vmi1229594:/var/discourse# ./launcher logs app
x86_64 arch detected.
run-parts: executing /etc/runit/1.d/00-ensure-links
run-parts: executing /etc/runit/1.d/00-fix-var-logs
run-parts: executing /etc/runit/1.d/01-cleanup-web-pids
run-parts: executing /etc/runit/1.d/anacron
run-parts: executing /etc/runit/1.d/cleanup-pids
Cleaning stale PID files
run-parts: executing /etc/runit/1.d/copy-env
Started runsvdir, PID is 41
ok: run: redis: (pid 53) 0s
ok: run: postgres: (pid 55) 0s
supervisor pid: 51 unicorn pid: 87
Shutting Down
run-parts: executing /etc/runit/3.d/01-nginx
ok: down: nginx: 0s, normally up
run-parts: executing /etc/runit/3.d/02-unicorn
(51) exiting
ok: down: unicorn: 1s, normally up
run-parts: executing /etc/runit/3.d/10-redis
ok: down: redis: 0s, normally up
run-parts: executing /etc/runit/3.d/99-postgres
ok: down: postgres: 0s, normally up
ok: down: nginx: 5s, normally up
ok: down: postgres: 0s, normally up
ok: down: redis: 3s, normally up
ok: down: unicorn: 5s, normally up
ok: down: cron: 1s, normally up
ok: down: rsyslog: 1s, normally up
run-parts: executing /etc/runit/1.d/00-ensure-links
run-parts: executing /etc/runit/1.d/00-fix-var-logs
run-parts: executing /etc/runit/1.d/01-cleanup-web-pids
run-parts: executing /etc/runit/1.d/anacron
run-parts: executing /etc/runit/1.d/cleanup-pids
Cleaning stale PID files
run-parts: executing /etc/runit/1.d/copy-env
Started runsvdir, PID is 35
ok: run: redis: (pid 49) 0s
ok: run: postgres: (pid 48) 0s
supervisor pid: 43 unicorn pid: 81
(43) Reopening logs
(43) Reopening logs
Shutting Down
run-parts: executing /etc/runit/3.d/01-nginx
ok: down: nginx: 0s, normally up
run-parts: executing /etc/runit/3.d/02-unicorn
(43) exiting
ok: down: unicorn: 0s, normally up
run-parts: executing /etc/runit/3.d/10-redis
ok: down: redis: 0s, normally up
run-parts: executing /etc/runit/3.d/99-postgres
timeout: run: postgres: (pid 48) 34983s, want down, got TERM
run-parts: /etc/runit/3.d/99-postgres exited with return code 1
ok: down: nginx: 10s, normally up
ok: down: redis: 8s, normally up
ok: down: unicorn: 10s, normally up
ok: down: cron: 0s, normally up
ok: down: rsyslog: 0s, normally up
kill: run: postgres: (pid 48) 34991s, want down, got TERM
run-parts: executing /etc/runit/1.d/00-ensure-links
run-parts: executing /etc/runit/1.d/00-fix-var-logs
run-parts: executing /etc/runit/1.d/01-cleanup-web-pids
run-parts: executing /etc/runit/1.d/anacron
run-parts: executing /etc/runit/1.d/cleanup-pids
Cleaning stale PID files
run-parts: executing /etc/runit/1.d/copy-env
Started runsvdir, PID is 35
fail: redis: runsv not running
ok: run: redis: (pid 53) 1s
ok: run: postgres: (pid 48) 1s
supervisor pid: 79 unicorn pid: 83
(79) Reopening logs
Shutting Down
run-parts: executing /etc/runit/3.d/01-nginx
ok: down: nginx: 0s, normally up
run-parts: executing /etc/runit/3.d/02-unicorn
(79) exiting
ok: down: unicorn: 0s, normally up
run-parts: executing /etc/runit/3.d/10-redis
ok: down: redis: 0s, normally up
run-parts: executing /etc/runit/3.d/99-postgres
ok: down: postgres: 0s, normally up
ok: down: nginx: 5s, normally up
ok: down: postgres: 0s, normally up
ok: down: redis: 3s, normally up
ok: down: unicorn: 5s, normally up
ok: down: cron: 1s, normally up
ok: down: rsyslog: 1s, normally up
run-parts: executing /etc/runit/1.d/00-ensure-links
run-parts: executing /etc/runit/1.d/00-fix-var-logs
run-parts: executing /etc/runit/1.d/01-cleanup-web-pids
run-parts: executing /etc/runit/1.d/anacron
run-parts: executing /etc/runit/1.d/cleanup-pids
Cleaning stale PID files
run-parts: executing /etc/runit/1.d/copy-env
Started runsvdir, PID is 34
ok: run: redis: (pid 49) 0s
ok: run: postgres: (pid 44) 0s
supervisor pid: 41 unicorn pid: 80
(41) Reopening logs
(41) Reopening logs
(41) Reopening logs
(41) Reopening logs
(41) Reopening logs
1 Like

For clarity, I’ll add that I know logs are available inside the container under /var/log, but nginx, postgres and redis really show nothing abnormal as far as I can tell.

nginx has lots of entries about flooding requests (and thus rate limiting) for the badge JSON of users coming from topics
postgresql simply records the queries executed (I guess? some of them are very long queries)
redis just shows it saving data regularly, as it should
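
(In case it helps anyone reading along: assuming the standard standalone install under /var/discourse, the same logs should also be readable from the host via the shared volume, and the Rails production log is usually where Discourse application errors end up. Paths below are the defaults and may differ on a customised setup.)

# Discourse application (Rails) log, where app-level errors usually land
tail -n 200 /var/discourse/shared/standalone/log/rails/production.log
# The container's /var/log (nginx, postgres, redis, syslog, ...) as mapped onto the host
ls /var/discourse/shared/standalone/log/var-log/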

In /var/log/messages I see some not-so-comforting messages, like:

[....]
May  1 07:46:50 vmi1229594-app logrotate: ALERT exited abnormally with [1]
May  2 07:56:45 vmi1229594-app logrotate: ALERT exited abnormally with [1]
May  2 23:20:57 vmi1229594-app rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="49" x-info="https://www.rsyslog.com"] start
May  3 07:35:27 vmi1229594-app logrotate: ALERT exited abnormally with [1]
May  4 07:38:08 vmi1229594-app logrotate: ALERT exited abnormally with [1]
May  5 07:40:32 vmi1229594-app logrotate: ALERT exited abnormally with [1]
May  6 08:01:40 vmi1229594-app logrotate: ALERT exited abnormally with [1]
May  7 07:38:45 vmi1229594-app logrotate: ALERT exited abnormally with [1]
May  8 07:37:31 vmi1229594-app logrotate: ALERT exited abnormally with [1]
May  9 07:35:21 vmi1229594-app logrotate: ALERT exited abnormally with [1]
May 10 07:53:08 vmi1229594-app logrotate: ALERT exited abnormally with [1]
May 11 07:43:09 vmi1229594-app logrotate: ALERT exited abnormally with [1]
May 12 07:56:06 vmi1229594-app logrotate: ALERT exited abnormally with [1]
May 12 15:15:30 vmi1229594-app rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="48" x-info="https://www.rsyslog.com"] start
May 12 15:59:28 vmi1229594-app rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="41" x-info="https://www.rsyslog.com"] start
May 12 17:20:56 vmi1229594-app rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="43" x-info="https://www.rsyslog.com"] start
May 12 18:55:45 vmi1229594-app rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="44" x-info="https://www.rsyslog.com"] start
May 12 19:13:36 vmi1229594-app rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="42" x-info="https://www.rsyslog.com"] start
May 12 21:04:24 vmi1229594-app rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="44" x-info="https://www.rsyslog.com"] start
May 12 22:15:46 vmi1229594-app rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="47" x-info="https://www.rsyslog.com"] start
May 13 07:43:36 vmi1229594-app logrotate: ALERT exited abnormally with [1]
May 13 20:07:44 vmi1229594-app rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="49" x-info="https://www.rsyslog.com"] start
May 14 07:46:22 vmi1229594-app logrotate: ALERT exited abnormally with [1]
May 14 22:05:18 vmi1229594-app rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="42" x-info="https://www.rsyslog.com"] start
May 14 22:16:04 vmi1229594-app rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="44" x-info="https://www.rsyslog.com"] start
May 14 22:43:03 vmi1229594-app rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="49" x-info="https://www.rsyslog.com"] start
May 14 23:00:09 discourse_app rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="49" x-info="https://www.rsyslog.com"] start
May 15 00:22:59 discourse_app rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="45" x-info="https://www.rsyslog.com"] start
May 15 00:56:17 discourse_app logrotate: ALERT exited abnormally with [1]
May 15 10:06:13 discourse_app rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="43" x-info="https://www.rsyslog.com"] start
May 16 07:55:46 discourse_app logrotate: ALERT exited abnormally with [1]
May 16 08:40:17 discourse_app rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="45" x-info="https://www.rsyslog.com"] start
May 17 07:58:24 discourse_app logrotate: ALERT exited abnormally with [1]
May 18 07:55:33 discourse_app logrotate: ALERT exited abnormally with [1]
May 19 07:48:14 discourse_app logrotate: ALERT exited abnormally with [1]
May 20 08:03:44 discourse_app logrotate: ALERT exited abnormally with [1]
May 21 07:40:15 discourse_app logrotate: ALERT exited abnormally with [1]
May 22 07:44:59 discourse_app logrotate: ALERT exited abnormally with [1]

And in syslog lots and lots of this:

[....]
May 20 05:17:01 discourse_app CRON[1001623]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 20 05:25:01 discourse_app CRON[1002727]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 20 05:35:01 discourse_app CRON[1004446]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 20 05:45:01 discourse_app CRON[1006215]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 20 05:55:01 discourse_app CRON[1007612]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 20 06:05:01 discourse_app CRON[1009398]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 20 06:15:01 discourse_app CRON[1011120]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 20 06:17:01 discourse_app CRON[1011400]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 20 06:25:01 discourse_app CRON[1012535]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 20 06:25:01 discourse_app CRON[1012537]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))
May 20 06:35:01 discourse_app CRON[1014375]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 20 06:45:01 discourse_app CRON[1016178]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 20 06:55:01 discourse_app CRON[1017757]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 20 07:05:02 discourse_app CRON[1019550]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 20 07:15:01 discourse_app CRON[1021373]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 20 07:17:01 discourse_app CRON[1021667]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 20 07:25:01 discourse_app CRON[1022811]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 20 07:30:01 discourse_app CRON[1023931]: (root) CMD (/usr/sbin/anacron -s >/dev/null)
May 20 07:30:01 discourse_app anacron[1023933]: Anacron 2.3 started on 2023-05-20
May 20 07:30:01 discourse_app anacron[1023933]: Will run job `cron.daily' in 5 min.
May 20 07:30:01 discourse_app anacron[1023933]: Jobs will be executed sequentially
May 20 07:35:01 discourse_app anacron[1023933]: Job `cron.daily' started
May 20 07:35:01 discourse_app anacron[1024646]: Updated timestamp for job `cron.daily' to 2023-05-20
May 20 07:35:01 discourse_app CRON[1024649]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 20 07:45:01 discourse_app CRON[1026439]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 20 07:55:01 discourse_app CRON[1027921]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 20 08:03:44 discourse_app logrotate: ALERT exited abnormally with [1]
May 20 08:03:44 discourse_app anacron[1023933]: Job `cron.daily' terminated (exit status: 1) (mailing output)
May 20 08:03:44 discourse_app anacron[1023933]: Can't find sendmail at /usr/sbin/sendmail, not mailing output
May 20 08:03:44 discourse_app anacron[1023933]: Normal exit (1 job run)
May 20 08:05:01 discourse_app CRON[1029819]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 20 08:15:01 discourse_app CRON[1031611]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 20 08:17:01 discourse_app CRON[1031886]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
[....]

Main question: what do those messages mean in the context of Discourse, and how do I get more info?
Side note: Is there a way to get these logs out so I can pick them up and centralize them?

launcher logs is simply a wrapper for docker logs, which can show timestamps with -t. The following should give you timestamped logs:

sudo docker logs -t --tail 1000 app
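
As for getting them out for Loki/Promtail: with Docker’s default json-file logging driver, the container’s stdout/stderr is written to a JSON file on the host, which promtail (or any log shipper) can tail. This is just a sketch of where to point it; app is the standard container name, and the promtail scrape config itself is omitted here:

sudo docker inspect --format '{{.LogPath}}' app   # host path of the container's JSON log file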

The lines you quoted from those two files are not related to Discourse; if you assumed so because discourse_app appears in some of them, keep in mind that it is a hostname, not a label indicating they are somehow related to the Discourse Docker setup.

Based on them, it looks like logrotate is failing, but you won’t see a detailed error unless you install a local mail setup. Assuming a Debian derivative, run apt install default-mta, wait for the next error, then check local mail with the mail command.
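
If you’d rather not wait for the next failure, a quick alternative sketch is to run logrotate by hand inside the container in debug/verbose mode, which usually prints the reason it errors out:

cd /var/discourse && ./launcher enter app   # open a shell inside the app container
logrotate -dv /etc/logrotate.conf           # dry-run: show what would be rotated and any errors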

I’d also check server health: available disk space, memory pressure, etc. Extreme memory pressure (i.e. paging) is a likely culprit for recurrent unresponsiveness.
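
A few standard commands cover most of that (run on the host; all stock tools):

df -h                            # free disk space per mount
free -h                          # RAM and swap usage
vmstat 5 5                       # watch the si/so columns for swapping over ~25 seconds
sudo docker stats --no-stream    # one-off per-container CPU/memory snapshot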

6 Likes

Personally I’d look through the relevant parts of the output of dmesg - one possibility being that the firewall (UFW) is blocking traffic.

In fact, some logging of connectivity might be useful, to distinguish between the case where the machine itself is unreachable and the case where Discourse is unresponsive.
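
A very rough sketch of what that could look like, run from a machine outside your network (forum.example.com is just a placeholder; /srv/status is Discourse’s built-in health-check route, and if the domain is proxied by Cloudflare you may want to probe the server IP directly as well):

# Log a timestamp, HTTP status/response time, and basic reachability every 30 seconds
while true; do
  http=$(curl -s -o /dev/null -m 10 -w '%{http_code} %{time_total}s' https://forum.example.com/srv/status)
  reach=$(ping -c1 -W2 forum.example.com > /dev/null 2>&1 && echo ok || echo unreachable)
  echo "$(date -Is) http=$http ping=$reach"
  sleep 30
done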

If you have any kind of CDN arrangement, check that too.

2 Likes

Thank you Leonardo, I’ve added postfix (the default for Ubuntu). We’ll see what comes out of it.

I have other monitoring in place and, to be honest, I don’t see any issue with memory or disk space.

Swap usage stays around 2GB of the 8GB available. The VM has 30GB of RAM. What really seems weird to me is how greedy Discourse is with it → Discourse Docker HW reserved/used (CPU, RAM, Disk) and how to manage it

I’m not experienced with dmesg, but what I can see is a plethora of [UFW BLOCK] messages from several different IPs; of course, with so many of them it’s hard to tell whether there is a pattern.

To give you an example:

[Tue May 23 09:32:21 2023] [UFW BLOCK] IN=eth0 OUT= MAC=MAC_ADDRESS_A SRC=IP_ADDRESS_A DST=SERVER_IP LEN=40 TOS=0x00 PREC=0x00 TTL=248 ID=54321 PROTO=TCP SPT=34909 DPT=40930 WINDOW=65535 RES=0x00 SYN URGP=0
[Tue May 23 09:32:22 2023] [UFW BLOCK] IN=eth0 OUT= MAC=MAC_ADDRESS_A SRC=IP_ADDRESS_A DST=SERVER_IP LEN=40 TOS=0x00 PREC=0x00 TTL=248 ID=54321 PROTO=TCP SPT=43093 DPT=40942 WINDOW=65535 RES=0x00 SYN URGP=0
[Tue May 23 09:32:29 2023] [UFW BLOCK] IN=eth0 OUT= MAC=MAC_ADDRESS_A SRC=IP_ADDRESS_B DST=SERVER_IP LEN=40 TOS=0x00 PREC=0x00 TTL=249 ID=57687 PROTO=TCP SPT=42801 DPT=3350 WINDOW=1024 RES=0x00 SYN URGP=0
[Tue May 23 09:32:35 2023] [UFW BLOCK] IN=eth0 OUT= MAC=MAC_ADDRESS_A SRC=IP_ADDRESS_C DST=SERVER_IP LEN=40 TOS=0x00 PREC=0x00 TTL=54 ID=61548 PROTO=TCP SPT=21721 DPT=23 WINDOW=43065 RES=0x00 SYN URGP=0
[Tue May 23 09:32:59 2023] [UFW BLOCK] IN=eth0 OUT= MAC=MAC_ADDRESS_A SRC=IP_ADDRESS_D DST=SERVER_IP LEN=44 TOS=0x00 PREC=0x00 TTL=114 ID=0 PROTO=TCP SPT=50293 DPT=1023 WINDOW=29200 RES=0x00 SYN URGP=0

Identifiers are anonymised, but identical placeholders refer to the same original value.

We do use Cloudflare but just as an SSL/domain provider and cache. Unfortunately I’m not in charge of that so before digging further in that direction I’d like to exhaust other possibilities.

I’ve added an uptime check via blackbox exporter that points to the domain, to see if any downtime is detected.

Yes, similar for me. The only ones we care about, I think, are to ports 80 and 443. Perhaps try
dmesg | egrep "DPT=80 |DPT=443 " | egrep PROTO=TCP
and it’s quite likely there’ll be nothing there. But if there is, that may mean the firewall is blocking access to discourse.
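
If you also want a rough idea of which destination ports are being probed overall, a quick (unpolished) summary like this might help:

dmesg | grep -o 'DPT=[0-9]*' | sort | uniq -c | sort -rn | head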

Yep, nothing for me either.

Anyway, judging by the monitoring, it seems like the container is up the whole time, and the home page as well; it’s literally the app not responding to requests, and the “No data” periods in those charts confirm that theory, to be honest. What that means is that the home page would probably still load (it’s the categories page for us) since it’s cached by Cloudflare, but browsing beyond it simply ends in timed-out requests.

Last 24h: [screenshot of the uptime check]

I’ll try to add a new uptime check on the /latest page, which is most probably not cacheable since its content changes almost constantly.
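
To double-check what Cloudflare is actually doing with those pages, the response headers should tell you; a quick sketch (replace forum.example.com with the real domain, and look at cf-cache-status, age and cache-control):

curl -sI https://forum.example.com/ | grep -iE 'cf-cache-status|^age|cache-control'
curl -sI https://forum.example.com/latest | grep -iE 'cf-cache-status|^age|cache-control'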

CloudFlare shouldn’t be caching any of your pages. Discourse is an app, not a website.

1 Like

These outages are around an hour or more, and clustered around midnight - does that match your observations?