打开某些用户资料时出现502错误

volas · 2024 年5 月 24 日 21:01

服务器在使用某些随机用户的 summary.json 时遇到问题，并返回 502 代码。可能是在上次更新后开始的，但不确定。

我不熟悉 discourse 堆栈，在查找日志方面有困难。已编辑以隐藏用户名和 IP，时间可能有所不同，但大多数情况下是相同的个人资料页面请求重复多次。

来自应用程序 /var/log/nginx/error.log：

[error] 560#560: *416 upstream prematurely closed connection while reading response header from upstream, client: *.*.*.*, server: _, request: "GET /u/user/summary.json HTTP/2.0", upstream: "http://127.0.0.1:3000/u/user/summary.json", host: "forum.taucetistation.org", referrer: "https://forum.taucetistation.org/u/user"

/shared/log/rails/production_errors.log 为空，而 /shared/log/rails/production.log 只包含

Started GET "/u/user/summary.json" for *.*.*.* at 2024-05-24 20:26:56 +0000
Processing by UsersController#summary as JSON
  Parameters: {"username"=>"user"}

对于类似的请求没有“Completed 200 OK”。

独角兽中似乎没有相关的日志（只是启动时与 redis 的连接错误，但最终似乎成功了，并且不相关）。

但是，来自 /var/log/postgres/current 的一个请求持续时间很长，这很可疑：

2024-05-24 20:49:12.727 UTC [2919] discourse@discourse LOG:  duration: 95288.368 ms  execute <unnamed>: SELECT replies.user_id, COUNT(*) FROM "posts" INNER JOIN "topics" "topics_posts" ON "topics_posts"."deleted_at" IS NULL AND "topics_posts"."id" = "posts"."topic_id" JOIN posts replies ON posts.topic_id = replies.topic_id AND posts.reply_to_post_number = replies.post_number JOIN topics ON replies.topic_id = topics.id AND topics.archetype <> 'private_message' AND replies.post_type IN (1) WHERE "posts"."deleted_at" IS NULL AND (posts.post_type IN (1,4)) AND "topics"."deleted_at" IS NULL AND (topics.archetype <> 'private_message') AND "topics"."visible" = TRUE AND (topics.category_id IS NULL OR topics.category_id IN (SELECT id FROM categories WHERE NOT read_restricted OR id IN (4,8,20,46,55,60,62,67))) AND "posts"."user_id" = 5318 AND (replies.user_id <> posts.user_id) GROUP BY "replies"."user_id" ORDER BY COUNT(*) DESC LIMIT 6
2024-05-24 20:49:12.728 UTC [2919] discourse@discourse LOG:  could not send data to client: Broken pipe
2024-05-24 20:49:12.729 UTC [2919] discourse@discourse FATAL:  connection to client lost

volas · 2024 年5 月 24 日 21:17

似乎每个活跃用户配置文件都存在问题。有时需要很长时间，有时因为缓存很快。\n\n有时会超时并抛出 502 错误。\n\n我们只将此 Dropped from DO 用于论坛。\n\n\n不多，但在过去对我们来说已经足够了。

Arkshine · 2024 年5 月 24 日 21:29

我看到 summary.json 加载需要 20 到 30 秒。

另外，控制台中出现这些错误：

有时确实会出现 502 错误：

查看安全模式，似乎与插件/主题组件无关。

我看到您在使用 Cloudflare。您能尝试禁用任何优化设置吗？

volas · 2024 年5 月 24 日 21:35

已禁用 CF 代理。

但这看起来不像前端问题，更像是后端繁重的数据库请求

volas · 2024 年5 月 24 日 21:38

我们的插件列表

hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://github.com/discourse/docker_manager.git
          - git clone https://github.com/discourse/discourse-patreon.git
          - git clone https://github.com/discourse/discourse-solved.git
          - git clone https://github.com/discourse/discourse-staff-notes.git
          - git clone https://github.com/discourse/discourse-spoiler-alert.git
          - git clone https://github.com/discourse/discourse-bbcode.git
          - git clone https://github.com/discourse/discourse-signatures.git
          - git clone https://github.com/discourse/discourse-github
          - git clone https://github.com/discourse/discourse-whos-online.git
          - git clone https://github.com/discourse/discourse-chat-integration.git
          - git clone https://github.com/discourse/discourse-assign
          - git clone https://github.com/discourse/discourse-question-answer.git

Arkshine · 2024 年5 月 24 日 22:10

我不是专家，但如今 Discourse 往往需要更多内存（至少 2 GB，取决于你的设置）。你配置了交换空间吗？

你在 /logs 路由中看到任何相关信息吗？

从插件来看，我第一眼并没有发现有问题的插件。

希望有人能提供更好的见解。

volas · 2024 年5 月 25 日 02:16

是的，4gb 交换空间。/logs 中只有一些警告，例如弃用通知或关于 geolite 的信息。

2024-05-24 20:49:12.727 UTC [2919] discourse@discourse LOG:  duration: 95288.368 ms  execute <unnamed>: SELECT replies.user_id, COUNT(*) FROM "posts" INNER JOIN "topics" "topics_posts" ON "topics_posts"."deleted_at" IS NULL AND "topics_posts"."id" = "posts"."topic_id" JOIN posts replies ON posts.topic_id = replies.topic_id AND posts.reply_to_post_number = replies.post_number JOIN topics ON replies.topic_id = topics.id AND topics.archetype <> 'private_message' AND replies.post_type IN (1) WHERE "posts"."deleted_at" IS NULL AND (posts.post_type IN (1,4)) AND "topics"."deleted_at" IS NULL AND (topics.archetype <> 'private_message') AND "topics"."visible" = TRUE AND (topics.category_id IS NULL OR topics.category_id IN (SELECT id FROM categories WHERE NOT read_restricted OR id IN (4,8,20,46,55,60,62,67))) AND "posts"."user_id" = 5318 AND (replies.user_id <> posts.user_id) GROUP BY "replies"."user_id" ORDER BY COUNT(*) DESC LIMIT 6

据我理解，这是“回复最多”的请求，并且它最近在 PERF: Improve performance of `most_replied_to_users` by janzenisaac · Pull Request #26373 · discourse/discourse · GitHub 中得到了更改，所以也许这个补丁影响了我们。

我运行了 EXPLAIN ANALYZE，希望能有所帮助：

 Limit  (cost=2552.39..2552.39 rows=1 width=12) (actual time=94412.810..94412.824 rows=6 loops=1)
   ->  Sort  (cost=2552.39..2552.39 rows=1 width=12) (actual time=94412.804..94412.808 rows=6 loops=1)
         Sort Key: (count(*)) DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  GroupAggregate  (cost=2552.36..2552.38 rows=1 width=12) (actual time=94412.361..94412.743 rows=160 loops=1)
               Group Key: replies.user_id
               ->  Sort  (cost=2552.36..2552.36 rows=1 width=4) (actual time=94412.323..94412.405 rows=729 loops=1)
                     Sort Key: replies.user_id
                     Sort Method: quicksort  Memory: 59kB
                     ->  Nested Loop  (cost=12.24..2552.35 rows=1 width=4) (actual time=0.279..94409.497 rows=729 loops=1)
                           Join Filter: (topics_posts.id = topics.id)
                           ->  Nested Loop  (cost=1.13..2507.59 rows=4 width=16) (actual time=0.164..94354.548 rows=746 loops=1)
                                 Join Filter: ((replies.user_id <> posts.user_id) AND (posts.reply_to_post_number = replies.post_number))
                                 Rows Removed by Join Filter: 15395502
                                 ->  Nested Loop  (cost=0.71..1419.07 rows=131 width=16) (actual time=0.113..240.052 rows=1399 loops=1)
                                       ->  Index Scan using idx_posts_user_id_deleted_at on posts  (cost=0.42..561.22 rows=135 width=12) (actual time=0.077..59.712 rows=1408 loops=1)
                                             Index Cond: (user_id = 5318)
                                             Filter: (post_type = ANY ('{1,4}'::integer[]))
                                       ->  Index Only Scan using index_topics_on_id_and_deleted_at on topics topics_posts  (cost=0.29..6.35 rows=1 width=4) (actual time=0.123..0.123 rows=1 loops=1408)
                                             Index Cond: ((id = posts.topic_id) AND (deleted_at IS NULL))
                                             Heap Fetches: 1394
                                 ->  Index Scan using index_posts_on_topic_id_and_reply_to_post_number on posts replies  (cost=0.42..7.77 rows=36 width=12) (actual time=0.098..65.083 rows=11005 loops=1399)
                                       Index Cond: (topic_id = topics_posts.id)
                                       Filter: (post_type = 1)
                                       Rows Removed by Filter: 4
                           ->  Index Scan using topics_pkey on topics  (cost=11.11..11.18 rows=1 width=4) (actual time=0.065..0.065 rows=1 loops=746)
                                 Index Cond: (id = replies.topic_id)
                                 Filter: ((deleted_at IS NULL) AND visible AND ((archetype)::text <> 'private_message'::text) AND ((archetype)::text <> 'private_message'::text) AND ((category_id IS NULL) OR (hashed SubPlan 1)))
                                 Rows Removed by Filter: 0
                                 SubPlan 1
                                   ->  Seq Scan on categories  (cost=0.00..10.74 rows=31 width=4) (actual time=0.012..0.066 rows=37 loops=1)
                                         Filter: ((NOT read_restricted) OR (id = ANY ('{4,8,20,46,55,60,62,67}'::integer[])))
 Planning Time: 3.895 ms
 Execution Time: 94416.415 ms
(34 rows)

Ed_S · 2024 年5 月 25 日 06:39

在执行此操作时，在服务器上（通过 SSH 连接）运行 vmstat 5 可能会很有用，以便查看分页可能发生了什么。（可以使用 top，但在这里用处不大）

pfaffman · 2024 年5 月 25 日 07:47

你有多少内存？你的数据库有多大？你可能只需要更多的内存。

volas · 2024 年5 月 25 日 15:24

看起来 PostgreSQL 会进行一些优化/缓存，所以有时它不会花费太长时间，因此我需要寻找新的用户来测试请求。

  procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st gu
 0  0 287852  75220  17660 587788    6    8  4685   244  496    9  6  3 88  3  0  0
 0  0 287852  74048  17668 587836    0    0     0    18  209  465  1  2 96  0  0  0
 0  1 287852  82496  13452 582360    2   13 39937    38 4186 7339  8 16 14 62  0  0
 0  1 288620  77924   1020 601604    2  138 64178   166 5352 8263  9 22  0 68  0  0
 1  0 290156  80472    968 599840    1  313 37518    42 3612 5160 22 37  0 41  0  0
 1  0 289900  67588   3720 609896    4   29 13800    75 2062 2635 43 49  0  8  0  0
 1  0 291180  76816   2368 606924    0  245 30042    27 3414 5477 34 39  0 27  0  0
 2  0 292716  67184   1696 622528    4  318 44820    18 4152 6198 23 30  0 47  0  0
 1  0 293228  75100   3008 614144    0  132 10030     4 1803 1986 39 49  0 11  0  0
 1  0 293484  76020   2860 613444    1   34  3331    10 1197  767 44 54  0  2  0  0
 1  0 294508  66296   2868 623376    2  218  4994    24 1389 1206 40 54  0  6  0  0
 1  0 295020  75432   2892 614384    0  120  1304    58 1161  728 43 55  0  2  0  0
 1  0 295020  73084   2904 615984    0    0   314    11 1060  448 43 57  0  0  0  0
 1  0 295020  68864   3148 619388    6   33  8209    63  663 1190 10 13 74  3  0  0

       Name       |  Owner   | Encoding | Locale Provider |   Collate   |    Ctype    | ICU Locale | ICU Rules |   Access privileges    |  Size   | Tablespace | Description
----------------+----------+----------+-----------------+-------------+-------------+------------+-----------+------------------------+---------+------------+-------------
 discourse      | postgres | UTF8     | libc            | ru_RU.UTF-8 | ru_RU.UTF-8 |            |           | =Tc/postgres          +| 9493 MB | pg_default |

2GB 内存 + 4GB 交换空间，我们正在使用 DigitalOcean 的这个（旧 CPU）droplet

volas · 2024 年5 月 25 日 16:29

我想尝试用来自 PERF: Improve performance of `most_replied_to_users` (#26373) · discourse/discourse@db10dd5 · GitHub 的已撤销更改来测试它，但尽管我在 discourse 容器中编辑了 app/models/user_summary.rb，我仍然看到个人资料页面发出了带有三个 JOIN 的请求。

我需要清理某个地方的缓存吗？或者有更好的方法吗？

更新：容器重启有帮助，撤销更改后似乎运行得更好了！个人资料页面加载更快，没有 502 错误。

Arkshine · 2024 年5 月 25 日 16:59

我确认；现在速度非常快，而且我没有看到任何错误。

Ed_S · 2024 年5 月 26 日 18:05

感谢您提供 vmstat 输出。对我来说，这表明内存没有被过度分配——Linux 没有进行交换——但数据库未能分配到足够的 RAM——它正在进行大量的磁盘 I/O。

所以我认为这种情况需要更多的 RAM，而添加交换空间帮助不大。

所以您也撤销了更改，并且发现性能恢复正常了？

Arkshine · 2024 年5 月 26 日 18:41

不是我，Alexander，在他上面的帖子中。我想确认一下，因为我之前在他的论坛上进行过测试。是的，撤销该提交后，性能问题消失了。

volas · 2024 年5 月 26 日 19:16

也许也是 CPU 问题。但除了 discourse 在运行一些维护任务时，我们以前几乎从未遇到过性能问题。只是现在遇到了个人资料页面问题，所以我才考虑最近的更新。

但我应该说，在尝试回滚之前，我还调整了 app.yml：

出于某种原因，我们设置了 db_shared_buffers: \"128MB\"，这比推荐值少四倍，可能是当初我第一次设置论坛时遗留下来的。我已经注释掉了这一行，这样 discourse 可以根据主机 RAM 自动设置。
我还取消注释了 db_work_mem: \"40MB\"

但我进行了正确的重建并首先进行了测试，但这并没有帮助我们。只有在回滚之后情况才有所改善。

刚才再次在未回滚的情况下进行了测试，以防万一，我仍然在个人资料页面上看到了 502 错误。

如果需要，我可以进行更多有/无补丁的测试。

Ed_S · 2024 年5 月 26 日 19:23

彻底的测试！

也许是，但看看花费在等待上的时间，我认为这主要是关于 I/O。一旦 I/O 瓶颈得到解决，vmstat 就可以显示 CPU 是否已满。

pfaffman · 2024 年5 月 26 日 23:12

它不会。您可以运行 discourse-setup，它会进行设置（您可以在第一条消息后按 Ctrl+C 退出），或者手动输入正确的数字。

增加此值通常很有帮助，但前提是您有足够的内存。

听起来您最初的问题可能是源查询存在问题，但您确实需要设置 db_work_mem。据我所知，注释掉它并不是一个解决方案。

Ed_S · 2024 年5 月 27 日 05:16

我认为，这里的结论是，已接受的拉取请求旨在提高速度，但实际上却使内存最少的网站运行得更慢，甚至导致其损坏？

我们能否撤销该拉取请求，直到我们了解情况？

我们能否将此主题的类别更改为“错误”？

volas · 2024 年5 月 27 日 06:09

感谢你注意到这一点，我被评论误导了。感觉手动将其设置为 512MB 有所帮助，但个人资料页面的性能仍然不够好，有时需要半分钟才能加载。

话题		回复	浏览量
Discourse installation has been getting slower and slower and slower Self-hosting server-resources	37	1668	2023 年5 月 15 日
Users reporting lots of 502 errors when attempting to post due to "max consecutive replies" check Bug	37	2902	2019 年1 月 18 日
Slow Profile Loads with 100GB+ database Feature	43	2948	2021 年1 月 1 日
Long loading times for user summary page with slow database Self-hosting unsupported-install	19	1808	2023 年7 月 13 日
Bursts of 502 Service Unavailable, pointers to debug Self-hosting	16	1644	2024 年6 月 8 日

打开某些用户资料时出现502错误

相关话题