Hi!
I check yesterday and the site was fine, I woke up this morning and its down with the 502 Bad Gateway error. Heres the monitor.log monitor.txt (9.4 MB)
whoaā¦ why so much java ? looks like my jitsi server once
It also seems youāre using your host as a mail server. I finally separated mine from the app, this way it is easier to restore eventually (and keeping it tidy, as close as possible to a standard prod install). A very small instance for the mail with a floating IP might also be easier to monitor for reputation and marginally expensive.
OK, letās seeā¦ first we grep for average:
06:24:13 up 3 days, 8:13, 0 users, load average: 0.01, 0.05, 0.08
06:34:33 up 3 days, 8:23, 0 users, load average: 12.89, 11.68, 6.35
So something blew up, in terms of number of runnable processes, a bit before 06:34.
A slightly wider grep and we see that the memory usage hasnāt really shifted:
06:24:13 up 3 days, 8:13, 0 users, load average: 0.01, 0.05, 0.08
total used free shared buff/cache available
Mem: 8167548 3513172 3241404 371940 1412972 4512016
Swap: 10485756 0 10485756
--
06:34:33 up 3 days, 8:23, 0 users, load average: 12.89, 11.68, 6.35
total used free shared buff/cache available
Mem: 8167548 3110612 1351844 373268 3705092 4812672
Swap: 10485756 0 10485756
In fact if anything memory usage went down, so perhaps something died:
06:13:53 up 3 days, 8:02, 0 users, load average: 0.04, 0.16, 0.15
total used free shared buff/cache available
Mem: 8167548 3505540 3259744 371932 1402264 4519680
06:24:13 up 3 days, 8:13, 0 users, load average: 0.01, 0.05, 0.08
total used free shared buff/cache available
Mem: 8167548 3513172 3241404 371940 1412972 4512016
06:34:33 up 3 days, 8:23, 0 users, load average: 12.89, 11.68, 6.35
total used free shared buff/cache available
Mem: 8167548 3110612 1351844 373268 3705092 4812672
06:44:53 up 3 days, 8:33, 0 users, load average: 13.64, 13.25, 9.89
total used free shared buff/cache available
Mem: 8167548 3093212 3618984 373276 1455352 4840140
06:55:13 up 3 days, 8:44, 0 users, load average: 11.62, 12.70, 11.42
total used free shared buff/cache available
Mem: 8167548 3140580 3366628 373352 1660340 4792440
Looks like the unicorns restarted:
Thu Aug 5 06:24:13 CEST 2021
06:24:13 up 3 days, 8:13, 0 users, load average: 0.01, 0.05, 0.08
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 30389 0.0 0.0 2160 1148 ? Ss Aug03 0:00 \_ runsv unicorn
x 13329 0.0 0.0 15132 3712 ? S Aug04 0:26 \_ /bin/bash config/unicorn_launcher -E production -c config/unicorn.conf.rb
x 13365 0.0 3.0 542036 251584 ? Sl Aug04 0:23 \_ unicorn master -E production -c config/unicorn.conf.rb
x 13846 0.6 4.5 9129356 371852 ? SNl Aug04 4:19 | \_ sidekiq 6.2.1 discourse [0 of 5 busy]
x 13858 0.0 3.9 8971616 323264 ? Sl Aug04 0:18 | \_ unicorn worker[0] -E production -c config/unicorn.conf.rb
x 13870 0.0 4.0 8975712 330660 ? Sl Aug04 0:20 | \_ unicorn worker[1] -E production -c config/unicorn.conf.rb
x 13889 0.0 4.1 8979808 337180 ? Sl Aug04 0:20 | \_ unicorn worker[2] -E production -c config/unicorn.conf.rb
x 29760 0.0 0.0 13708 2220 ? S 06:24 0:00 \_ sleep 1
Thu Aug 5 06:34:33 CEST 2021
06:34:33 up 3 days, 8:23, 0 users, load average: 12.89, 11.68, 6.35
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 30389 0.0 0.0 2160 1148 ? Ss Aug03 0:00 \_ runsv unicorn
x 4638 0.4 0.0 15132 3784 ? S 06:32 0:00 \_ /bin/bash config/unicorn_launcher -E production -c config/unicorn.conf.rb
x 4642 11.1 2.8 439636 233240 ? Sl 06:32 0:11 \_ unicorn master -E production -c config/unicorn.conf.rb
x 4822 13.1 3.7 8877408 309844 ? Sl 06:33 0:10 | \_ unicorn worker[1] -E production -c config/unicorn.conf.rb
x 5025 15.6 3.7 8873312 310264 ? Sl 06:33 0:10 | \_ unicorn worker[0] -E production -c config/unicorn.conf.rb
x 5065 20.6 3.9 8922504 320740 ? SNl 06:33 0:11 | \_ sidekiq 6.2.1 discourse [0 of 5 busy]
x 5639 0.0 0.0 13708 2264 ? S 06:34 0:00 \_ sleep 1
And it looks like perhaps apache is having trouble starting:
Thu Aug 5 06:24:13 CEST 2021
06:24:13 up 3 days, 8:13, 0 users, load average: 0.01, 0.05, 0.08
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 13730 0.0 0.1 225016 15840 ? Ss Aug02 0:17 /usr/sbin/apache2 -k start
www-data 10232 0.0 0.1 1352372 10972 ? Sl Aug03 0:00 \_ /usr/sbin/apache2 -k start
www-data 27998 0.0 0.1 228880 9608 ? S Aug04 0:01 \_ /usr/sbin/apache2 -k start
www-data 28000 0.0 0.3 3036536 27796 ? Sl Aug04 0:53 \_ /usr/sbin/apache2 -k start
www-data 28002 0.0 0.3 2904772 26524 ? Sl Aug04 0:50 \_ /usr/sbin/apache2 -k start
www-data 28185 0.0 0.1 762472 10952 ? Sl Aug04 0:00 \_ /usr/sbin/apache2 -k start
Thu Aug 5 06:34:33 CEST 2021
06:34:33 up 3 days, 8:23, 0 users, load average: 12.89, 11.68, 6.35
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 13730 0.0 0.1 225024 15856 ? Ss Aug02 0:17 /usr/sbin/apache2 -k start
www-data 10232 0.0 0.1 1352372 10972 ? Sl Aug03 0:00 \_ /usr/sbin/apache2 -k start
www-data 28185 0.0 0.1 762472 10952 ? Sl Aug04 0:00 \_ /usr/sbin/apache2 -k start
www-data 30794 0.0 0.1 228888 9620 ? S 06:25 0:00 \_ /usr/sbin/apache2 -k start
www-data 31490 0.0 0.1 344564 10852 ? Sl 06:25 0:00 \_ /usr/sbin/apache2 -k start
www-data 32095 0.0 0.1 500228 10888 ? Sl 06:25 0:00 \_ /usr/sbin/apache2 -k start
www-data 32215 83.2 0.1 369152 10876 ? Sl 06:25 7:26 \_ /usr/sbin/apache2 -k start
www-data 32284 0.0 0.1 229400 9936 ? S 06:25 0:00 \_ /usr/sbin/apache2 -k start
www-data 32382 187 0.1 410136 10916 ? Sl 06:25 16:31 \_ /usr/sbin/apache2 -k start
www-data 32410 0.0 0.1 229400 9936 ? S 06:25 0:00 \_ /usr/sbin/apache2 -k start
www-data 32545 0.0 0.1 229400 9936 ? S 06:25 0:00 \_ /usr/sbin/apache2 -k start
www-data 2106 0.0 0.1 3032348 11396 ? Sl 06:27 0:00 \_ /usr/sbin/apache2 -k start
www-data 2240 0.0 0.1 2901012 11416 ? Sl 06:27 0:00 \_ /usr/sbin/apache2 -k start
www-data 2241 0.0 0.1 2901184 14220 ? Sl 06:27 0:00 \_ /usr/sbin/apache2 -k start
Thu Aug 5 06:44:53 CEST 2021
06:44:53 up 3 days, 8:33, 0 users, load average: 13.64, 13.25, 9.89
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 13730 0.0 0.1 225024 15856 ? Ss Aug02 0:17 /usr/sbin/apache2 -k start
www-data 10232 0.0 0.1 1352372 10972 ? Sl Aug03 0:00 \_ /usr/sbin/apache2 -k start
www-data 28185 0.0 0.1 762472 10952 ? Sl Aug04 0:00 \_ /usr/sbin/apache2 -k start
www-data 30794 0.0 0.1 228888 9620 ? S 06:25 0:00 \_ /usr/sbin/apache2 -k start
www-data 31490 0.0 0.1 344564 10852 ? Sl 06:25 0:00 \_ /usr/sbin/apache2 -k start
www-data 32095 0.0 0.1 500228 10888 ? Sl 06:25 0:00 \_ /usr/sbin/apache2 -k start
www-data 32215 82.1 0.1 369152 10876 ? Sl 06:25 15:50 \_ /usr/sbin/apache2 -k start
www-data 32284 0.0 0.1 229400 9936 ? S 06:25 0:00 \_ /usr/sbin/apache2 -k start
www-data 32382 188 0.1 410136 10916 ? Sl 06:25 36:10 \_ /usr/sbin/apache2 -k start
www-data 32410 0.0 0.1 229400 9936 ? S 06:25 0:00 \_ /usr/sbin/apache2 -k start
www-data 32545 0.0 0.1 229400 9936 ? S 06:25 0:00 \_ /usr/sbin/apache2 -k start
www-data 2106 0.0 0.1 3032348 11396 ? Sl 06:27 0:00 \_ /usr/sbin/apache2 -k start
www-data 2240 0.0 0.1 2901012 11416 ? Sl 06:27 0:00 \_ /usr/sbin/apache2 -k start
www-data 2241 0.0 0.1 2901184 14432 ? Sl 06:27 0:00 \_ /usr/sbin/apache2 -k start
Notice how there are lots of forked apache processes and one of them is burning 83% CPU and another burning 188%. Something is amiss there.
But, Iām not sure what to do about this finding.
Might also be worth seeing if the kernel had reason to kill anything recently:
dmesg -T | egrep -i kille
I get nothing from running that command
I think its cause I had a friend switch me to MPM Event from MPM prefork. Apache has been using a lot more since then but no sites have crashed or messed up since discourse. I just changed back to MPM prefork. Maybe thatll help?
Ive found out the java is apparently Varnish cache. I opened the tree for the process and the base was Varnish which makes a lot more sense than java
MPM Event allows Apache to run multi threaded, and thereby serving more visitors. It has to be paired with a separate PHP service, now commonly PHP-FPM. Prefork will restrict Apacheās and PHPās performance if your server is busy. If Event is using too much, it is most likely a poorly configured FPM over Event, but an unoptimized MPM Event could be a problem, too.
Prefork works fine for servers without a lot of simultaneous traffic, but the standard for Apache is Event/FPM nowadays.
I think thatās a good thingā¦
Ok thank you for the help
I changed back to MPM prefork and apache hasnt crashed in 3 days. Im hoping this is it and its all fixed. Thanks for all the help
This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.