Timeouts on large, old forum

Hi there

I’ve recently migrated a large, old forum over to Discourse (900,000 posts over 10 years).

Everything has migrated properly and for the most part everything is working as expected.

However, in the last few days people started to get promoted to TL2 and found that they couldn’t log in.

When I navigated to the TL2 group and selected “View Activity”, the forum would process for 30 seconds and then throw an “unknown error” or “502 error” on the group’s posts.json.

The unicorn logs showed that the process was being killed because it was running for more than 30 seconds.

Individual users within the TL2 group could have their “Activity” viewed, but I found a couple of users who had over 30,000 posts each and their activity page threw the same 502 error on posts.json

It seems that the issue is one of:

A. There’s something special about TL2 that causes this error to be thrown
B. There literally are too many posts to process within the 30 second limit
C. Somewhere, one of these users’ posts is causing an issue (although it seems to be limited to the TL2 group only)

What are some suggestions on how I should resolve this?

Should I try to increase the unicorn worker’s timeout? If so, how?
Should I prune posts? That would suck :frowning:

I appreciate any help that could be given

We’re going to need server perf info, disk performance, memory stats, db size info, etc. A large database needs fast disks and lots of memory.

We had a severe perf bug on this route that was fixed a few months ago because a customer was impacted. Are you running the latest Discourse?

Yes, latest

(As of a day ago)

I’ll compile the server stats and reply shortly

Also make sure your user email is set as the developer in app.yml. This will let you see the page timings in the upper right corner on page load.

Using that try to load those slow pages (if activity for tl2 never loads maybe try tl3) and share the contents of the timings with us.

Latest build.
It only started happening about 5 days ago when I ran ./launcher rebuild app to get my SSL certs to refresh, so it’s on the latest tests-passed build.

top - 10:07:52 up 20 days,  4:23,  1 user,  load average: 1.00, 1.08, 1.13
Tasks: 197 total,   3 running, 194 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.9 us,  4.9 sy,  0.2 ni, 87.3 id,  0.5 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem :  2323192 total,    53596 free,  1462364 used,   807232 buff/cache
KiB Swap:  2097148 total,  1423420 free,   673728 used.   418652 avail Mem 

Server specs

RAM: 2560MB
CPU: 2 cores, Xeon 4114 @ 2.20 GHz
SSD: 40GB

DB stats:

table_name                  | row_estimate | size      
----------------------------------------------------
posts                       | 966206       | 1895 MB   
post_search_data            | 1015948      | 916 MB    
post_timings                | 1337287      | 86 MB     
user_actions                | 1026939      | 77 MB     
permalinks                  | 947149       | 74 MB     
post_custom_fields          | 917466       | 73 MB     
post_stats                  | 917491       | 66 MB     
topic_users                 | 507792       | 48 MB     
topics                      | 41478        | 40 MB     
user_auth_token_logs        | 68343        | 35 MB 
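
As a rough sanity check on the memory question, the sizes above can be totalled and compared with the RAM on the box. This is a minimal Python sketch that uses only the figures quoted in this post; the names `TABLE_SIZES_MB`, `RAM_MB`, `SWAP_USED_KIB` and `summarize` are made up for illustration, and MB and MiB are treated as interchangeable.

```python
# Figures copied from the post above; all names here are illustrative only.
TABLE_SIZES_MB = {
    "posts": 1895,
    "post_search_data": 916,
    "post_timings": 86,
    "user_actions": 77,
    "permalinks": 74,
    "post_custom_fields": 73,
    "post_stats": 66,
    "topic_users": 48,
    "topics": 40,
    "user_auth_token_logs": 35,
}
RAM_MB = 2560            # from "Server specs"
SWAP_USED_KIB = 673728   # from the "KiB Swap" line of the top output


def summarize(table_sizes_mb, ram_mb, swap_used_kib):
    """Total the listed tables, compare with RAM, and convert swap used to MB."""
    total_mb = sum(table_sizes_mb.values())
    return {
        "tables_total_mb": total_mb,
        "over_ram_mb": total_mb - ram_mb,
        "swap_used_mb": round(swap_used_kib / 1024),
    }


summary = summarize(TABLE_SIZES_MB, RAM_MB, SWAP_USED_KIB)
print(summary)
# {'tables_total_mb': 3310, 'over_ram_mb': 750, 'swap_used_mb': 658}
```

On these numbers the ten tables listed come to roughly 3.3 GB against 2.5 GB of RAM, with about 658 MB already in swap. That is consistent with memory pressure, but it does not show which query is slow.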

@Falco the page load time on “slow” pages is small, only about 180ms.

The “activity” tab will load for 60 seconds before getting a 502 error.
If I drill into specific users it works, except for those users who have a huge number of posts, where this “502 error” is thrown.

Again, the Unicorn logs show that the process reaches the max time and is killed.

I can understand it hitting this limit on a page that is trying to load stats for a user with a lot of posts, but I don’t understand why the same issue occurs when users try to log in.

Quick update:

After some more testing I found that the login issue was being caused by the https redirect breaking.
I’ve applied a 302 to redirect users to https and that appears to have resolved the issue.

As expected, this was unrelated to the 502 issue on the TL2 group’s activity.

That is automatic when you tick the force_https site setting. If you hadn’t, there would have been a pending warning in the admin dashboard telling you so.

Please provide the requested timings, and try the activity of other groups.

After bumping the RAM to 4GB and extending the request timeout to 60s, most users have had this issue resolved. However, the few users who have more than 20,000 posts still get 504 errors about 50% of the time when they try to post. I’d love to run a SQL trace to determine which query is running; perhaps it’s trying to update the user’s post count or something?

Clearly it’s a resource issue; however, I can’t justify spending more on resources.

My next thought is to prune posts. With over 1 million posts going back 10 years, perhaps some can go.

Is there an API call to clear out posts older than a given date?
e.g. DELETE FROM posts WHERE created_at <= '2014-12-31'

  • This is specific to a certain set of users with lots of posts? It does not happen for, say, a new user you create?

  • Are you on the very latest version of Discourse, e.g. 2.3 beta 5?

  • Are you running any third party plugins? If so, disable those and rebuild.

Thanks Jeff.

Yes, it affects users who have a large number of posts. They all have 30,000 plus.

I’ve disabled all 3rd party plugins and tried to get it down to the bare minimum (docker manager, adsense and data explorer). After rebuilding, I now seem to be able to get a bit further when drilling into the profile of one of the 30k+ users. Before, it would crash with a 502 error when simply trying to look at activity.

I am hoping that this is a sign that their experience posting will be improved too. I’ve asked them to carry out tests and I’ll get back to you.
