502 error while opening some user profiles

The server has trouble with summary.json for some seemingly random users and returns a 502. It probably started after the last update, but I’m not sure.

I’m not familiar with the Discourse stack and have some trouble finding the right logs. I’ve edited the snippets below to hide the username and IP; the timestamps vary, but it’s mostly the same profile page request repeated several times.

From /var/log/nginx/error.log in the app container:

[error] 560#560: *416 upstream prematurely closed connection while reading response header from upstream, client: *.*.*.*, server: _, request: "GET /u/user/summary.json HTTP/2.0", upstream: "http://127.0.0.1:3000/u/user/summary.json", host: "forum.taucetistation.org", referrer: "https://forum.taucetistation.org/u/user"

/shared/log/rails/production_errors.log is empty, and /shared/log/rails/production.log contains only

Started GET "/u/user/summary.json" for *.*.*.* at 2024-05-24 20:26:56 +0000
Processing by UsersController#summary as JSON
  Parameters: {"username"=>"user"}

with no “Completed 200 OK” line for these requests.

It seems there are no related entries in the unicorn logs (just some Redis connection errors at startup, which seem to succeed in the end and don’t look related).
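In case it helps anyone reproduce this, here is roughly how I’m watching the logs (from the standard /var/discourse Docker install, using the paths above):

cd /var/discourse
./launcher enter app                        # drop into the running app container
tail -f /var/log/nginx/error.log \
        /shared/log/rails/production.log \
        /var/log/postgres/current           # follow all three logs while opening a profile page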

But there is one suspicious query with a very long duration in /var/log/postgres/current:

2024-05-24 20:49:12.727 UTC [2919] discourse@discourse LOG:  duration: 95288.368 ms  execute <unnamed>: SELECT replies.user_id, COUNT(*) FROM "posts" INNER JOIN "topics" "topics_posts" ON "topics_posts"."deleted_at" IS NULL AND "topics_posts"."id" = "posts"."topic_id" JOIN posts replies ON posts.topic_id = replies.topic_id AND posts.reply_to_post_number = replies.post_number JOIN topics ON replies.topic_id = topics.id AND topics.archetype <> 'private_message' AND replies.post_type IN (1) WHERE "posts"."deleted_at" IS NULL AND (posts.post_type IN (1,4)) AND "topics"."deleted_at" IS NULL AND (topics.archetype <> 'private_message') AND "topics"."visible" = TRUE AND (topics.category_id IS NULL OR topics.category_id IN (SELECT id FROM categories WHERE NOT read_restricted OR id IN (4,8,20,46,55,60,62,67))) AND "posts"."user_id" = 5318 AND (replies.user_id <> posts.user_id) GROUP BY "replies"."user_id" ORDER BY COUNT(*) DESC LIMIT 6
2024-05-24 20:49:12.728 UTC [2919] discourse@discourse LOG:  could not send data to client: Broken pipe
2024-05-24 20:49:12.729 UTC [2919] discourse@discourse FATAL:  connection to client lost
1 Like

It seems to be a problem on every active user’s profile. Sometimes the request just takes a long time, sometimes it’s quick, apparently because of some cache.

And sometimes it simply runs out of time and throws a 502 error.

We are using this droplet from DigitalOcean just for the forum.

Not much, but it was enough for us in the past.

I can see summary.json taking between 20 and 30 seconds to load.

There are also errors in the browser console.

And sometimes a 502, indeed.

Looking at it in safe mode, it doesn’t look like this is related to plugins or theme components.

I see you’re behind Cloudflare. Can you try to disable any optimizations?

1 Like

I’ve disabled the Cloudflare proxy.

But it doesn’t look like a frontend problem; it seems more like a heavy DB query on the backend.

1 Like

Our plugin list:

hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://github.com/discourse/docker_manager.git
          - git clone https://github.com/discourse/discourse-patreon.git
          - git clone https://github.com/discourse/discourse-solved.git
          - git clone https://github.com/discourse/discourse-staff-notes.git
          - git clone https://github.com/discourse/discourse-spoiler-alert.git
          - git clone https://github.com/discourse/discourse-bbcode.git
          - git clone https://github.com/discourse/discourse-signatures.git
          - git clone https://github.com/discourse/discourse-github
          - git clone https://github.com/discourse/discourse-whos-online.git
          - git clone https://github.com/discourse/discourse-chat-integration.git
          - git clone https://github.com/discourse/discourse-assign
          - git clone https://github.com/discourse/discourse-question-answer.git

I’m not an expert, but Discourse tends to require more RAM nowadays, depending on your setup (at least 2 GB). Do you have swap configured?

Do you see any relevant information in the /logs route?

Plugin-wise, I don’t really see faulty plugins at first glance. :thinking:

Hopefully, someone will be able to provide better insights here.

Yep, 4 GB of swap. And in /logs there are just some warnings like deprecation notices or something about GeoLite.

2024-05-24 20:49:12.727 UTC [2919] discourse@discourse LOG:  duration: 95288.368 ms  execute <unnamed>: SELECT replies.user_id, COUNT(*) FROM "posts" INNER JOIN "topics" "topics_posts" ON "topics_posts"."deleted_at" IS NULL AND "topics_posts"."id" = "posts"."topic_id" JOIN posts replies ON posts.topic_id = replies.topic_id AND posts.reply_to_post_number = replies.post_number JOIN topics ON replies.topic_id = topics.id AND topics.archetype <> 'private_message' AND replies.post_type IN (1) WHERE "posts"."deleted_at" IS NULL AND (posts.post_type IN (1,4)) AND "topics"."deleted_at" IS NULL AND (topics.archetype <> 'private_message') AND "topics"."visible" = TRUE AND (topics.category_id IS NULL OR topics.category_id IN (SELECT id FROM categories WHERE NOT read_restricted OR id IN (4,8,20,46,55,60,62,67))) AND "posts"."user_id" = 5318 AND (replies.user_id <> posts.user_id) GROUP BY "replies"."user_id" ORDER BY COUNT(*) DESC LIMIT 6

As I understand it, this is the query for “Most Replied To”, and it was recently changed in PERF: Improve performance of `most_replied_to_users` by janzenisaac · Pull Request #26373 · discourse/discourse · GitHub, so maybe this patch is affecting us.
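For reference, the query can be reproduced by hand against the database (roughly; inside the app container via ./launcher enter app, with user id 5318 simply taken from the log above):

su postgres -c 'psql discourse'      # open a psql session on the discourse database
# at the psql prompt, paste the query from the log, prefixed with EXPLAIN ANALYZE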

I ran EXPLAIN ANALYZE in case it helps:

 Limit  (cost=2552.39..2552.39 rows=1 width=12) (actual time=94412.810..94412.824 rows=6 loops=1)
   ->  Sort  (cost=2552.39..2552.39 rows=1 width=12) (actual time=94412.804..94412.808 rows=6 loops=1)
         Sort Key: (count(*)) DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  GroupAggregate  (cost=2552.36..2552.38 rows=1 width=12) (actual time=94412.361..94412.743 rows=160 loops=1)
               Group Key: replies.user_id
               ->  Sort  (cost=2552.36..2552.36 rows=1 width=4) (actual time=94412.323..94412.405 rows=729 loops=1)
                     Sort Key: replies.user_id
                     Sort Method: quicksort  Memory: 59kB
                     ->  Nested Loop  (cost=12.24..2552.35 rows=1 width=4) (actual time=0.279..94409.497 rows=729 loops=1)
                           Join Filter: (topics_posts.id = topics.id)
                           ->  Nested Loop  (cost=1.13..2507.59 rows=4 width=16) (actual time=0.164..94354.548 rows=746 loops=1)
                                 Join Filter: ((replies.user_id <> posts.user_id) AND (posts.reply_to_post_number = replies.post_number))
                                 Rows Removed by Join Filter: 15395502
                                 ->  Nested Loop  (cost=0.71..1419.07 rows=131 width=16) (actual time=0.113..240.052 rows=1399 loops=1)
                                       ->  Index Scan using idx_posts_user_id_deleted_at on posts  (cost=0.42..561.22 rows=135 width=12) (actual time=0.077..59.712 rows=1408 loops=1)
                                             Index Cond: (user_id = 5318)
                                             Filter: (post_type = ANY ('{1,4}'::integer[]))
                                       ->  Index Only Scan using index_topics_on_id_and_deleted_at on topics topics_posts  (cost=0.29..6.35 rows=1 width=4) (actual time=0.123..0.123 rows=1 loops=1408)
                                             Index Cond: ((id = posts.topic_id) AND (deleted_at IS NULL))
                                             Heap Fetches: 1394
                                 ->  Index Scan using index_posts_on_topic_id_and_reply_to_post_number on posts replies  (cost=0.42..7.77 rows=36 width=12) (actual time=0.098..65.083 rows=11005 loops=1399)
                                       Index Cond: (topic_id = topics_posts.id)
                                       Filter: (post_type = 1)
                                       Rows Removed by Filter: 4
                           ->  Index Scan using topics_pkey on topics  (cost=11.11..11.18 rows=1 width=4) (actual time=0.065..0.065 rows=1 loops=746)
                                 Index Cond: (id = replies.topic_id)
                                 Filter: ((deleted_at IS NULL) AND visible AND ((archetype)::text <> 'private_message'::text) AND ((archetype)::text <> 'private_message'::text) AND ((category_id IS NULL) OR (hashed SubPlan 1)))
                                 Rows Removed by Filter: 0
                                 SubPlan 1
                                   ->  Seq Scan on categories  (cost=0.00..10.74 rows=31 width=4) (actual time=0.012..0.066 rows=37 loops=1)
                                         Filter: ((NOT read_restricted) OR (id = ANY ('{4,8,20,46,55,60,62,67}'::integer[])))
 Planning Time: 3.895 ms
 Execution Time: 94416.415 ms
(34 rows)

It might be handy to have
vmstat 5
running on the server (in an SSH session) while you do this, to see what might be going on with paging. (top can be used, but it’s much less useful here.)
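Something like this in a second terminal; the columns to watch are si/so (swap in and out) and wa (CPU time spent waiting on I/O):

vmstat 5      # one line of statistics every 5 seconds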

3 Likes

How much RAM do you have? How big is your database? It’s likely that you just need more RAM.

2 Likes

It seems like PostgreSQL does some caching, so sometimes the query doesn’t take too long, and I need to look for fresh user profiles to test the requests.

procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st gu
 0  0 287852  75220  17660 587788    6    8  4685   244  496    9  6  3 88  3  0  0
 0  0 287852  74048  17668 587836    0    0     0    18  209  465  1  2 96  0  0  0
 0  1 287852  82496  13452 582360    2   13 39937    38 4186 7339  8 16 14 62  0  0
 0  1 288620  77924   1020 601604    2  138 64178   166 5352 8263  9 22  0 68  0  0
 1  0 290156  80472    968 599840    1  313 37518    42 3612 5160 22 37  0 41  0  0
 1  0 289900  67588   3720 609896    4   29 13800    75 2062 2635 43 49  0  8  0  0
 1  0 291180  76816   2368 606924    0  245 30042    27 3414 5477 34 39  0 27  0  0
 2  0 292716  67184   1696 622528    4  318 44820    18 4152 6198 23 30  0 47  0  0
 1  0 293228  75100   3008 614144    0  132 10030     4 1803 1986 39 49  0 11  0  0
 1  0 293484  76020   2860 613444    1   34  3331    10 1197  767 44 54  0  2  0  0
 1  0 294508  66296   2868 623376    2  218  4994    24 1389 1206 40 54  0  6  0  0
 1  0 295020  75432   2892 614384    0  120  1304    58 1161  728 43 55  0  2  0  0
 1  0 295020  73084   2904 615984    0    0   314    11 1060  448 43 57  0  0  0  0
 1  0 295020  68864   3148 619388    6   33  8209    63  663 1190 10 13 74  3  0  0
   Name    |  Owner   | Encoding | Locale Provider |   Collate   |    Ctype    | ICU Locale | ICU Rules |   Access privileges    |  Size   | Tablespace | Description 
-----------+----------+----------+-----------------+-------------+-------------+------------+-----------+------------------------+---------+------------+-------------
 discourse | postgres | UTF8     | libc            | ru_RU.UTF-8 | ru_RU.UTF-8 |            |           | =Tc/postgres          +| 9493 MB | pg_default | 
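For reference, the database size above is from psql inside the app container, something like:

su postgres -c 'psql -c "\l+ discourse"'                                             # list the database with its on-disk size
su postgres -c "psql -c \"SELECT pg_size_pretty(pg_database_size('discourse'));\""   # or query the size directly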

2 GB RAM + 4 GB swap; we’re using this (legacy CPU) droplet from DigitalOcean.

I want to test with the changes from PERF: Improve performance of `most_replied_to_users` (#26373) · discourse/discourse@db10dd5 · GitHub reverted, but despite editing app/models/user_summary.rb inside the discourse container, I still see the profile page issuing the query with three JOINs.

Do I need to clear a cache somewhere? Or is there a better way to do it?

UPD: Restarting the container helped, and it seems to work better now after the revert! Profile pages load faster and there are no 502 errors.
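For anyone who wants to try the same, the steps were roughly as follows (the edit is temporary and will be lost on the next rebuild or upgrade):

cd /var/discourse && ./launcher enter app    # from the host, drop into the app container
cd /var/www/discourse                        # the Discourse checkout inside the container
vi app/models/user_summary.rb                # revert the most_replied_to_users change by hand
exit                                         # leave the container
cd /var/discourse && ./launcher restart app  # restart so the running workers pick up the change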

3 Likes

I confirm; this is blazingly fast now, and I don’t see any errors.

2 Likes

Thanks for the vmstat output. To me, that says memory is not overcommitted (Linux is not swapping much), but the database hasn’t been able to allocate enough RAM, so it’s doing lots of disk I/O.

So I think this would be a case of more RAM needed, and adding swap wouldn’t much help.

So you too have reverted the change and found performance returned to normal?

1 Like

Not me; Alexander did, in his post above. I just wanted to confirm it, since I was testing on his forum earlier. And yes, the performance issue is gone after reverting that commit.

2 Likes

Maybe it’s also a CPU problem. But except when Discourse was running maintenance jobs, we almost never had performance problems before. It’s only happening now with profile pages, which is why I suspected the last update.

But I should say that before trying the revert I also tweaked app.yml (rebuild sketch below):

  1. For some reason we had db_shared_buffers: "128MB", four times less than recommended, probably a leftover from when I first set up the forum. I commented it out so Discourse can set it up itself based on host RAM.
  2. I also uncommented db_work_mem: "40MB".
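Both settings live in containers/app.yml and, as I understand it, only take effect after a rebuild, roughly:

cd /var/discourse
nano containers/app.yml          # adjust db_shared_buffers / db_work_mem
./launcher rebuild app           # full rebuild so the new PostgreSQL settings are applied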

But I did a proper rebuild and tested this first, and it didn’t help us. Things only got better after the revert.

I’ve just now tested again without the revert, just in case, and I can see 502s on profile pages again.

If needed, I can do more tests with and without the patch.

1 Like

Good thorough testing!

Maybe, but looking at the time spent in wait (the wa column), I think it’s mostly about I/O. Once the I/O bottleneck is sorted out, vmstat can show whether the CPU is maxed out.

2 Likes

It won’t. You can run ./discourse-setup and it will do it (you can Ctrl-C after the first message), or enter the correct number yourself.

It’s often helpful to increase this, but only if you have enough RAM.

It sounds like your original issue might be a problem with the query in the source, but you definitely want to set db_shared_buffers explicitly. Commenting it out isn’t a solution (last I knew).
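Something like this, assuming the standard /var/discourse install:

cd /var/discourse
./discourse-setup          # it recalculates memory-based defaults such as db_shared_buffers; you can Ctrl-C after the first message
# or set db_shared_buffers yourself in containers/app.yml and then:
./launcher rebuild app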

2 Likes

The conclusion here, I think, is that the accepted pull request, intended to make things faster, has in fact made things slower, or even broken things, on sites with minimal RAM?

Can we get that pull request reverted until this is understood?

Can we change the category of this topic to bug?

1 Like

Thanks for noticing this; I was misled by the comment then. It feels like setting db_shared_buffers to 512MB manually helped a little, but performance on profile pages is still not good enough, and they sometimes take half a minute to load.

1 Like