Discourse gateway error after reboot

nap · July 1, 2020, 07:52

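My server runs on a virtual machine hosted by a major cloud provider.
I installed Discourse on it successfully, and it has been running fine for the past month.
Today I decided to return the VM's specs to their original configuration (*) and rebooted the system. After boot, everything else on the server ran fine, but when I tried to access the Discourse forum I got a 502 Bad Gateway error. I assumed the Docker instance had not started automatically, so I SSHed into the server and ran ./launcher start app, but it reported insufficient free space (only 5GB remaining). I then ran df -h, which showed that I actually had 14GB free. So I ran ./launcher start app again, and this time got a warning that Docker was about to download some content and that I should wait. After a while it printed Nothing to do, your container has already started! However, accessing the forum still returned a 502 Bad Gateway error.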

After consulting this forum, I decided to run ./launcher rebuild app, which produced the following PostgreSQL-related errors:

    user@host:[16:48]:/var/discourse# ./launcher rebuild app
    Ensuring launcher is up to date
    Fetching origin
    Launcher is up-to-date
    Stopping old container
    + /usr/bin/docker stop -t 60 app
    app
    cd /pups && git pull && /pups/bin/pups --stdin
    Already up to date.
    I, [2020-07-01T07:19:42.821347 #1]  INFO -- : Loading --stdin
    I, [2020-07-01T07:19:42.831806 #1]  INFO -- : > locale-gen $LANG && update-locale
    I, [2020-07-01T07:19:42.879007 #1]  INFO -- : Generating locales (this might take a while)...
    Generation complete.
    
    I, [2020-07-01T07:19:42.879431 #1]  INFO -- : > mkdir -p /shared/postgres_run
    I, [2020-07-01T07:19:42.885054 #1]  INFO -- :
    I, [2020-07-01T07:19:42.885734 #1]  INFO -- : > chown postgres:postgres /shared/postgres_run
    I, [2020-07-01T07:19:42.891655 #1]  INFO -- :
    I, [2020-07-01T07:19:42.892269 #1]  INFO -- : > chmod 775 /shared/postgres_run
    I, [2020-07-01T07:19:42.898103 #1]  INFO -- :
    I, [2020-07-01T07:19:42.898942 #1]  INFO -- : > rm -fr /var/run/postgresql
    I, [2020-07-01T07:19:42.905607 #1]  INFO -- :
    I, [2020-07-01T07:19:42.906463 #1]  INFO -- : > ln -s /shared/postgres_run /var/run/postgresql
    I, [2020-07-01T07:19:42.912617 #1]  INFO -- :
    I, [2020-07-01T07:19:42.913233 #1]  INFO -- : > socat /dev/null UNIX-CONNECT:/shared/postgres_run/.s.PGSQL.5432 || exit 0 && echo postgres already running stop container ; exit 1
    2020/07/01 07:19:42 socat[26] E connect(6, AF=1 "/shared/postgres_run/.s.PGSQL.5432", 36): No such file or directory
    I, [2020-07-01T07:19:42.925688 #1]  INFO -- :
    I, [2020-07-01T07:19:42.926081 #1]  INFO -- : > rm -fr /shared/postgres_run/.s*
    I, [2020-07-01T07:19:42.931174 #1]  INFO -- :
    I, [2020-07-01T07:19:42.931649 #1]  INFO -- : > rm -fr /shared/postgres_run/*.pid
    I, [2020-07-01T07:19:42.938152 #1]  INFO -- :
    I, [2020-07-01T07:19:42.938850 #1]  INFO -- : > mkdir -p /shared/postgres_run/12-main.pg_stat_tmp
    I, [2020-07-01T07:19:42.943575 #1]  INFO -- :
    I, [2020-07-01T07:19:42.944331 #1]  INFO -- : > chown postgres:postgres /shared/postgres_run/12-main.pg_stat_tmp
    I, [2020-07-01T07:19:42.949159 #1]  INFO -- :
    I, [2020-07-01T07:19:42.961190 #1]  INFO -- : File > /etc/service/postgres/run  chmod: +x  chown:
    I, [2020-07-01T07:19:42.973345 #1]  INFO -- : File > /etc/service/postgres/log/run  chmod: +x  chown:
    I, [2020-07-01T07:19:42.983929 #1]  INFO -- : File > /etc/runit/3.d/99-postgres  chmod: +x  chown:
    I, [2020-07-01T07:19:42.994843 #1]  INFO -- : File > /root/upgrade_postgres  chmod: +x  chown:
    I, [2020-07-01T07:19:42.995487 #1]  INFO -- : > chown -R root /var/lib/postgresql/12/main
    I, [2020-07-01T07:19:44.012812 #1]  INFO -- :
    I, [2020-07-01T07:19:44.013656 #1]  INFO -- : > [ ! -e /shared/postgres_data ] && install -d -m 0755 -o postgres -g postgres /shared/postgres_data && sudo -E -u postgres /usr/lib/postgresql/12/bin/initdb -D /shared/postgres_data || exit 0
    I, [2020-07-01T07:19:44.019545 #1]  INFO -- :
    I, [2020-07-01T07:19:44.019872 #1]  INFO -- : > chown -R postgres:postgres /shared/postgres_data
    I, [2020-07-01T07:19:44.064432 #1]  INFO -- :
    I, [2020-07-01T07:19:44.065186 #1]  INFO -- : > chown -R postgres:postgres /var/run/postgresql
    I, [2020-07-01T07:19:44.071385 #1]  INFO -- :
    I, [2020-07-01T07:19:44.072196 #1]  INFO -- : > /root/upgrade_postgres
    I, [2020-07-01T07:19:44.084004 #1]  INFO -- :
    I, [2020-07-01T07:19:44.084662 #1]  INFO -- : > rm /root/upgrade_postgres
    I, [2020-07-01T07:19:44.090399 #1]  INFO -- :
    I, [2020-07-01T07:19:44.092280 #1]  INFO -- : Replacing data_directory = '/var/lib/postgresql/12/main' with data_directory = '/shared/postgres_data' in /etc/postgresql/12/main/postgresql.conf
    I, [2020-07-01T07:19:44.093969 #1]  INFO -- : Replacing (?-mix:#?listen_addresses *=.*) with listen_addresses = '*' in /etc/postgresql/12/main/postgresql.conf
    I, [2020-07-01T07:19:44.095204 #1]  INFO -- : Replacing (?-mix:#?synchronous_commit *=.*) with synchronous_commit = $db_synchronous_commit in /etc/postgresql/12/main/postgresql.conf
    I, [2020-07-01T07:19:44.095937 #1]  INFO -- : Replacing (?-mix:#?shared_buffers *=.*) with shared_buffers = $db_shared_buffers in /etc/postgresql/12/main/postgresql.conf
    I, [2020-07-01T07:19:44.096695 #1]  INFO -- : Replacing (?-mix:#?work_mem *=.*) with work_mem = $db_work_mem in /etc/postgresql/12/main/postgresql.conf
    I, [2020-07-01T07:19:44.097554 #1]  INFO -- : Replacing (?-mix:#?default_text_search_config *=.*) with default_text_search_config = '$db_default_text_search_config' in /etc/postgresql/12/main/postgresql.conf
    I, [2020-07-01T07:19:44.101971 #1]  INFO -- : > install -d -m 0755 -o postgres -g postgres /shared/postgres_backup
    I, [2020-07-01T07:19:44.112672 #1]  INFO -- :
    I, [2020-07-01T07:19:44.113831 #1]  INFO -- : Replacing (?-mix:#?max_wal_senders *=.*) with max_wal_senders = $db_max_wal_senders in /etc/postgresql/12/main/postgresql.conf
    I, [2020-07-01T07:19:44.114973 #1]  INFO -- : Replacing (?-mix:#?wal_level *=.*) with wal_level = $db_wal_level in /etc/postgresql/12/main/postgresql.conf
    I, [2020-07-01T07:19:44.116047 #1]  INFO -- : Replacing (?-mix:#?checkpoint_segments *=.*) with checkpoint_segments = $db_checkpoint_segments in /etc/postgresql/12/main/postgresql.conf
    I, [2020-07-01T07:19:44.117033 #1]  INFO -- : Replacing (?-mix:#?logging_collector *=.*) with logging_collector = $db_logging_collector in /etc/postgresql/12/main/postgresql.conf
    I, [2020-07-01T07:19:44.118051 #1]  INFO -- : Replacing (?-mix:#?log_min_duration_statement *=.*) with log_min_duration_statement = $db_log_min_duration_statement in /etc/postgresql/12/main/postgresql.conf
    I, [2020-07-01T07:19:44.119352 #1]  INFO -- : Replacing (?-mix:^#local +replication +postgres +peer$) with local replication postgres  peer in /etc/postgresql/12/main/pg_hba.conf
    I, [2020-07-01T07:19:44.120299 #1]  INFO -- : Replacing (?-mix:^host.*all.*all.*127.*$) with host all all 0.0.0.0/0 md5 in /etc/postgresql/12/main/pg_hba.conf
    I, [2020-07-01T07:19:44.121038 #1]  INFO -- : > HOME=/var/lib/postgresql USER=postgres exec chpst -u postgres:postgres:ssl-cert -U postgres:postgres:ssl-cert /usr/lib/postgresql/12/bin/postmaster -D /etc/postgresql/12/main
    I, [2020-07-01T07:19:44.126334 #1]  INFO -- : > sleep 5
    2020-07-01 07:19:44.157 UTC [49] LOG:  starting PostgreSQL 12.2 (Debian 12.2-2.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
    2020-07-01 07:19:44.158 UTC [49] LOG:  listening on IPv4 address "0.0.0.0", port 5432
    2020-07-01 07:19:44.158 UTC [49] LOG:  listening on IPv6 address "::", port 5432
    2020-07-01 07:19:44.161 UTC [49] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
    2020-07-01 07:19:44.162 UTC [49] FATAL:  could not map anonymous shared memory: Cannot allocate memory
    2020-07-01 07:19:44.162 UTC [49] HINT:  This error usually means that PostgreSQL's request for a shared memory segment exceeded available memory, swap space, or huge pages. To reduce the request size (currently 4423172096 bytes), reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections.
    2020-07-01 07:19:44.162 UTC [49] LOG:  database system is shut down
    I, [2020-07-01T07:19:49.141762 #1]  INFO -- :
    I, [2020-07-01T07:19:49.142221 #1]  INFO -- : > su postgres -c 'createdb discourse' || true
    createdb: error: could not connect to database template1: could not connect to server: No such file or directory
        Is the server running locally and accepting
        connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
    I, [2020-07-01T07:19:49.227852 #1]  INFO -- :
    I, [2020-07-01T07:19:49.228226 #1]  INFO -- : > su postgres -c 'psql discourse -c "create user discourse;"' || true
    psql: error: could not connect to server: could not connect to server: No such file or directory
        Is the server running locally and accepting
        connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
    I, [2020-07-01T07:19:49.330487 #1]  INFO -- :
    I, [2020-07-01T07:19:49.330822 #1]  INFO -- : > su postgres -c 'psql discourse -c "grant all privileges on database discourse to discourse;"' || true
    psql: error: could not connect to server: could not connect to server: No such file or directory
        Is the server running locally and accepting
        connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
    I, [2020-07-01T07:19:49.425970 #1]  INFO -- :
    I, [2020-07-01T07:19:49.426356 #1]  INFO -- : > su postgres -c 'psql discourse -c "alter schema public owner to discourse;"'
    psql: error: could not connect to server: could not connect to server: No such file or directory
        Is the server running locally and accepting
        connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
    I, [2020-07-01T07:19:49.506638 #1]  INFO -- :
    I, [2020-07-01T07:19:49.507202 #1]  INFO -- : Terminating async processes
    
    
    FAILED
    --------------------
    Pups::ExecError: su postgres -c 'psql discourse -c "alter schema public owner to discourse;"' failed with return #<Process::Status: pid 75 exit 2>
    Location of failure: /pups/lib/pups/exec_command.rb:112:in `spawn'
    exec failed with the params "su postgres -c 'psql $db_name -c \"alter schema public owner to $db_user;\"'"
    eb41679f76cd749ccd8c84a7543365d093619b80df6fc6750b9349fb63565fa1
    ** FAILED TO BOOTSTRAP ** please scroll up and look for earlier error messages, there may be more than one.
    ./discourse-doctor may help diagnose the problem.
    user@host:[17:19]:/var/discourse#

Strangely, despite the errors above, running ./launcher start app reports no error at all:

    starting up existing container
    + /usr/bin/docker start app
    app

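With the instance running, I tried entering the container with ./launcher enter app. The tools available inside the container seem very limited to me (yes, I am a nano user who likes to set up aliases such as ll). I could not find the physical path of the folders inside the Docker instance (I wanted to download them with an FTP client).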

In /var/log/nginx/error.log, I see the following error entry every time I refresh the browser:

    2020/07/01 07:44:16 [error] 646#646: *3 connect() failed (111: Connection refused) while connecting to upstream, client: xxx.xx.0.1, server: _, request: "GET / HTTP/1.1", upstream: "http://127.0.0.1:3000/", host: "discourse.myDomain.com"
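
For readers following along, here is a minimal Python sketch of how that log entry can be read. On Linux, errno 111 is ECONNREFUSED, which generally means nothing was accepting connections at the upstream address, so nginx answers with 502. The helper name parse_upstream_error and the regular expression are hypothetical and written only for this one line shape; they are not part of nginx or Discourse.

```python
import re

# Hypothetical pattern for the single log-line shape quoted in this thread.
UPSTREAM_ERROR = re.compile(
    r'connect\(\) failed \((?P<errno>\d+): (?P<reason>[^)]+)\)'
    r'.*upstream: "(?P<upstream>[^"]+)"'
)

def parse_upstream_error(line):
    """Return errno, reason and upstream URL from an nginx connect() error line."""
    match = UPSTREAM_ERROR.search(line)
    if match is None:
        return None
    return {
        "errno": int(match.group("errno")),
        "reason": match.group("reason"),
        "upstream": match.group("upstream"),
    }

line = ('2020/07/01 07:44:16 [error] 646#646: *3 connect() failed '
        '(111: Connection refused) while connecting to upstream, '
        'client: xxx.xx.0.1, server: _, request: "GET / HTTP/1.1", '
        'upstream: "http://127.0.0.1:3000/", host: "discourse.myDomain.com"')

result = parse_upstream_error(line)
print(result["errno"], result["reason"], result["upstream"])
# 111 Connection refused http://127.0.0.1:3000/
```

Read this way, the entry suggests the web application behind port 3000 never came up, which would fit the failed PostgreSQL start in the rebuild log.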

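What could be causing my problem? Why did PostgreSQL suddenly stop working?

(*) A week after installing Discourse, I upgraded the server with more CPU and memory in order to run a video conference I hosted. Once the conference was over, I went back to the normal configuration. Note that I never changed the disk size while changing the specs.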

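neounix · July 1, 2020, 08:51

That is because the rebuild of your current container failed, and what you started is the previous version of the app. This is normal behavior. Generally, if a rebuild does not succeed, the original container is not deleted and the original image remains available.

As for your PG problem, you will need to give the team more details about your app and container configuration to get the best support.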

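nap · July 1, 2020, 09:32

@neounix: Thank you.

This is my first time setting up a Discourse forum, and I am not sure where to start or what to watch for. I am running an almost vanilla installation with no plugins or other modifications. I defined a few variables in app.yml, and I use my existing Apache2 daemon as a reverse proxy, forwarding Discourse traffic through a separate virtual host to the local port I configured.

Could you spell out which information would help? Are there resources that could help me troubleshoot the current problem?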

neounix · July 1, 2020, 10:00

nap:

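    2020-07-01 07:19:44.161 UTC [49] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
    2020-07-01 07:19:44.162 UTC [49] FATAL:  could not map anonymous shared memory: Cannot allocate memory
    2020-07-01 07:19:44.162 UTC [49] HINT:  This error usually means that PostgreSQL's request for a shared memory segment exceeded available memory, swap space, or huge pages. To reduce the request size (currently 4423172096 bytes), reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections.
    2020-07-01 07:19:44.162 UTC [49] LOG:  database system is shut down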

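The core error is in the log output you posted above:

    2020-07-01 07:19:44.162 UTC [49] FATAL:  could not map anonymous shared memory: Cannot allocate memory

    2020-07-01 07:19:44.162 UTC [49] HINT:  This error usually means that PostgreSQL's request for a shared memory segment exceeded available memory, swap space, or huge pages. To reduce the request size (currently 4423172096 bytes), reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections.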

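nap · July 1, 2020, 10:31

I saw that error, but I have not changed anything in app.yml.
Where can I reduce shared_buffers or max_connections? They are not in app.yml.
app.yml has only one such parameter, db_shared_buffers, and it has always been set to the default value "4096MB" (both before and after I increased the server's memory).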
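
As a rough illustration of how db_shared_buffers relates to the number in the HINT, the Python sketch below converts the "4096MB" setting to bytes and subtracts it from the 4423172096 bytes PostgreSQL asked for. It assumes PostgreSQL's memory units are multiples of 1024 (kB, MB, GB, TB), and the helper setting_to_bytes is a hypothetical name introduced for this example only.

```python
import re

# PostgreSQL memory units are powers of 1024.
UNITS = {"kB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

def setting_to_bytes(value):
    """Convert a PostgreSQL-style memory setting such as '4096MB' to bytes."""
    match = re.fullmatch(r"(\d+)\s*(kB|MB|GB|TB)", value.strip())
    if match is None:
        raise ValueError(f"unrecognised memory setting: {value!r}")
    return int(match.group(1)) * UNITS[match.group(2)]

REQUESTED_BYTES = 4423172096                 # from the HINT line in the rebuild log
shared_buffers = setting_to_bytes("4096MB")  # db_shared_buffers as reported in app.yml

overhead_mib = (REQUESTED_BYTES - shared_buffers) / 1024**2
print(shared_buffers)          # 4294967296
print(round(overhead_mib, 2))  # 122.27
```

Under these assumptions, nearly all of the request is the shared_buffers value itself, with roughly 122 MiB on top for other shared structures, so db_shared_buffers in app.yml looks like the setting that governs the size PostgreSQL failed to allocate.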

neounix · July 1, 2020, 10:36

You might consider posting memory-related statistics.

For example, on Linux:

    $ free -m
                  total        used        free      shared  buff/cache   available
    Mem:          64299       12955        9678         361       41664       50265
    Swap:          7807          69        7738

For Docker statistics, please post the output of:

    docker stats

and so on.

The error is related to running out of memory.

nap · July 1, 2020, 10:44

Memory statistics for the server:

              total        used        free      shared  buff/cache   available
Mem:           3951        2236         414          86        1299        1308
Swap:           511         415          96

Memory statistics after running enter app:

              total        used        free      shared  buff/cache   available
Mem:           3951        2363         321          86        1266        1215
Swap:           511         415          96

Running docker stats > output.txt produced the following:

        CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT    MEM %               NET I/O             BLOCK I/O           PIDS
       ca4c5f37894c        app                 15.86%              6.48MiB / 3.859GiB   0.16%               20.3kB / 12.6kB     0B / 0B             25
        CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT    MEM %               NET I/O             BLOCK I/O           PIDS
       ca4c5f37894c        app                 15.86%              6.48MiB / 3.859GiB   0.16%               20.3kB / 12.6kB     0B / 0B             25
        CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
       ca4c5f37894c        app                 2.83%               6.539MiB / 3.859GiB   0.17%               20.3kB / 12.6kB     0B / 0B             25
        CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
       ca4c5f37894c        app                 2.83%               6.539MiB / 3.859GiB   0.17%               20.3kB / 12.6kB     0B / 0B             25
        CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
       ca4c5f37894c        app                 3.30%               6.477MiB / 3.859GiB   0.16%               20.3kB / 12.6kB     0B / 0B             25
        CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
       ca4c5f37894c        app                 3.30%               6.477MiB / 3.859GiB   0.16%               20.3kB / 12.6kB     0B / 0B             25
        CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
       ca4c5f37894c        app                 2.45%               6.535MiB / 3.859GiB   0.17%               20.3kB / 12.6kB     0B / 0B             25
        CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
       ca4c5f37894c        app                 2.45%               6.535MiB / 3.859GiB   0.17%               20.3kB / 12.6kB     0B / 0B             25
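
To put the free -m figures next to the size in the PostgreSQL HINT, here is a small Python sketch. It parses the first free -m output posted above (values are in MiB) and compares it with the 4423172096-byte request from the rebuild log. The function parse_free_m is a hypothetical helper written for this example, not an existing tool.

```python
FREE_OUTPUT = """\
              total        used        free      shared  buff/cache   available
Mem:           3951        2236         414          86        1299        1308
Swap:           511         415          96
"""

REQUESTED_BYTES = 4423172096  # from the PostgreSQL HINT in the rebuild log

def parse_free_m(text):
    """Return {'Mem': {...}, 'Swap': {...}} with MiB values from `free -m` output."""
    lines = text.strip().splitlines()
    columns = lines[0].split()
    stats = {}
    for line in lines[1:]:
        label, *values = line.split()
        # zip stops at the shorter sequence, so Swap only gets total/used/free.
        stats[label.rstrip(":")] = dict(zip(columns, map(int, values)))
    return stats

stats = parse_free_m(FREE_OUTPUT)
requested_mib = REQUESTED_BYTES / 1024**2

print(round(requested_mib))                    # 4218
print(stats["Mem"]["total"])                   # 3951
print(stats["Mem"]["available"])               # 1308
print(requested_mib > stats["Mem"]["total"])   # True
```

On these numbers, the shared memory request alone is larger than the machine's total RAM, before counting anything else running on the server.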

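neounix · July 1, 2020, 10:55

Hi @nap

You can reclaim a lot of memory by stopping and removing all old app containers.

For example:

    docker stop <container_id>
    docker rm <container_id>

Assuming they are not currently in use?

If they are all in use, then you should consider increasing this server's memory beyond 4GB, perhaps to 8GB.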

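nap · July 1, 2020, 11:01

I stopped the app with ./launcher stop app and then re-ran docker stats. No containers were listed.
Unfortunately, adding memory means paying more. What is frustrating right now is that it ran fine on 4GB of memory last month.

nap · July 1, 2020, 11:02

Right now I cannot even rebuild, which should not take that much memory.

With no containers running, the memory statistics are: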

              total        used        free      shared  buff/cache   available
Mem:           3951        2207         169          91        1574        1332
Swap:           511         446          65

I have a few interesting directories in ./var/lib/docker/overlay2/:

e3e6cdfcc62c2e0b68ec91efxxxxx6c69212c95b5070f7b6b84e97edcb473ea2
64a04d1b97a18f51a5fdc536xxxxxf9473de0c2ccd1a2cc0d62e830164b5f2d8
355303c6af7bebff1163195c5xxxxx8fd1de6333e39adbcb573c7365673b6c85

Can I delete these?

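neounix · July 1, 2020, 11:06

OK.

I see. I was busy with another task and did not notice that your output showed statistics for the same container rather than for several containers.

What does free -m show now that your container is not running?

I think 4GB of memory is certainly enough for a single container.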

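neounix · July 1, 2020, 11:09

No.

Do not delete those Docker files.

Based on the error message, the problem is with your Discourse PG 12 configuration. I am not sure how to fix it, because I do not think tuning the PG 12 configuration file for Discourse is supported.

The senior experts in the community, especially the professional hosting team, will have better advice than I do.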

nap · July 1, 2020, 11:48

Do you mean this involves files inside the Docker configuration? Would modifying them by hand cause problems when the container starts or updates?

neounix · July 1, 2020, 13:28

@nap

If you search Google for the error message above (in quotes), you will find many discussions directly related to this PostgreSQL error message.

Hope this helps.

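pfaffman · July 1, 2020, 16:54

After that, did you ~~re-run ./discourse-setup or~~ manually change the memory settings in app.yml? What are db_shared_buffers, unicorn_workers, and db_work_mem set to?

You are running behind a reverse proxy, though, which complicates things. It is not yet clear whether the reverse proxy is causing the problem, but it does add complexity.

Do you have multiple partitions? Could the partition where Docker creates its images be full?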

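nap · July 2, 2020, 02:25

@pfaffman: Thanks for taking a look.

No, all I did was add a series of variable definitions related to the site name and tag usage.

db_shared_buffers is "4096MB"
unicorn_workers is 8
db_work_mem is commented out

I have a 40G main partition (14GB free), 512MB of swap, and an 8G partition for backups (not mounted).

It looks like I have solved the problem. At first I tried reducing the buffers to 2GB and the workers to 4, but got the same error. I then reduced the buffers to 1GB, after which the rebuild succeeded, and the forum is now back up.

Thanks, everyone!!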
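
For anyone sizing db_shared_buffers after a similar downgrade, the Python sketch below works through the arithmetic for this server. It assumes the commonly cited starting point of about 25% of total RAM for shared_buffers; that figure is not stated anywhere in this thread, so treat it as an assumption to verify for your own setup. The function names are hypothetical.

```python
def suggested_shared_buffers_mib(total_mib, fraction=0.25):
    """Rule-of-thumb shared_buffers size in MiB (assumed 25% of total RAM)."""
    return int(total_mib * fraction)

def share_percent(value_mib, total_mib):
    """How large a setting is relative to total RAM, as a rounded percentage."""
    return round(100 * value_mib / total_mib)

TOTAL_MIB = 3951  # "total" column of free -m, as posted earlier in the thread

print(suggested_shared_buffers_mib(TOTAL_MIB))  # 987

# Values tried in this thread: 4096MB and 2GB failed, 1GB worked.
for tried_mib in (4096, 2048, 1024):
    print(tried_mib, share_percent(tried_mib, TOTAL_MIB))
# 4096 104
# 2048 52
# 1024 26
```

Under that assumption, the 1GB value that finally worked sits close to the 25% mark, while the original 4096MB setting exceeded the RAM left after the downgrade.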
