If you’re interested in having data to guide how you diagnose things going forward, you can install the Prometheus exporter plugin for Discourse.
Just a brief update:
- Installed a new 2 container environment on 2 VPS servers (web_only, data).
- Surprisingly (for me), the web_only server is being exhausted, while the data server is relatively lightly loaded. Both are running a 4x vCore / 8GB RAM UpCloud.com plan.
- Upgraded the web_only to a 6x vCore / 16GB RAM UpCloud.com plan. Increased Unicorns to 18.
Still, we are hitting various 429 limiters. The “system under high load” mode did not kick in, though.
The hockey season has been ruined by COVID, and they are now playing a few random games without an audience. Since we do have hosting credits with UpCloud.com, we are pushing to improve the experience using what we have. Now running the 6x vCore / 16GB for web_only and 4x vCore / 8GB for data, with unicorns at 18.
We once again disabled the ratelimiter…
DISCOURSE_MAX_REQS_PER_IP_MODE: none
…which helps, but we still get 429s from polls, which produce the long delay/freeze for the end user. We are going to continue tweaking by increasing DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS.
But before we do that, a question to @sam / staff:
Is there an environment variable to increase the threshold for the extreme load (read-only mode) limiter, or can it be disabled completely?
This should not be needed, we would love to host you so we can get to the bottom of why this keeps tripping you even though you have such low traffic.
Perhaps so, but we would like to be slightly less protective of the server, as the naturally occurring activity spikes are very short and generally stabilize within a minute or so. So adjusting the thresholds just a little bit higher might improve the UX while waiting for the move.
The games have been scarce (thanks to COVID), so we have had very few opportunities to measure and tinker with this.
What we found out is that even with our improved hardware resources (6+4 vCores and 16+8GB RAM), even a modestly active crowd is able to produce 429 client freezes. We saw this with the U20 WC games, which attracted about 50% of our regular game audience for the chats.
Through measuring and trial and error, we have settled on the following tweaks:
DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS: 0.4
DISCOURSE_MAX_REQS_PER_IP_PER_MINUTE: 400
DISCOURSE_MAX_REQS_PER_IP_PER_10_SECONDS: 100
This seems to eliminate 80% of the 429’s, thus enabling a relatively smooth experience for a majority of users.
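To get a feel for what the two per-IP limits mean in practice, here is a minimal sketch that replays a list of request timestamps against a 100-per-10-seconds and a 400-per-minute window and counts how many requests would be rejected. It is only an arithmetic illustration: the class and function names are made up for this example, and the real limiter in Discourse may use different window semantics.

```python
from collections import deque


class SlidingWindowLimit:
    """Allow at most `max_requests` within any `window_seconds` span."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def would_allow(self, now):
        # Drop timestamps that have fallen out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) < self.max_requests

    def record(self, now):
        self.timestamps.append(now)


def make_limits():
    # Values taken from the settings above; the limiter itself is hypothetical.
    return [SlidingWindowLimit(100, 10), SlidingWindowLimit(400, 60)]


def count_rejected(request_times, limits):
    """Return how many requests from one IP would be answered with a 429."""
    rejected = 0
    for now in sorted(request_times):
        if all(limit.would_allow(now) for limit in limits):
            for limit in limits:
                limit.record(now)
        else:
            rejected += 1
    return rejected


# 15 requests per second for 10 seconds from a single IP.
burst = [i / 15 for i in range(150)]
print(count_rejected(burst, make_limits()))  # prints 50
```

A steady 5 requests per second for a minute stays under both limits in this model, so only short, dense bursts get rejected.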
The next step would have been buying a different kind of hardware resources, either using dedicated boxes for single-threaded speed or switching to a VPS provider that offers plans with a gazillion vCores. For us, however, the next step is to work with the Discourse hosting team, as @sam hinted earlier.
Hopefully these tweaks might be useful for @iceman, @alec or anyone else. Be sure to keep an eye on the CPU usage and queuing. Also, what I learned from this exercise is that 2 containers are way better than one - tweaks can be applied with near zero downtime, and hardware resources exploited more granularly.
I am still interested in any new tweaks or findings that might help to improve the performance/UX for fast paced discussions driven by real world events.
We just landed a big performance improvement for sites with active posting and many users yesterday, it should help a lot on your site.
Very good, we’ll have a look and potentially test this.
Well, every game is an individual case. Now, with the COVID situation (empty arena) and a near-random game schedule, the behavior of the audience is impossible to predict or compare to historical data.
Based on this single game, I can’t say that this change brought us significant improvement.
The 1st period was calm and fine, but events during the second caused a spike in messages and an increase in lurkers. About 60% of our people said they experienced freezes.
In the two server setup, the web_only is the only one reporting high CPU usage and load average.
The extreme load / read-only mode was not triggered, which is good, as it is the most painful UX. Overall, the audience has quickly learned to visit the index page and come back to resume the discussion - which generates more server load. If only the end user could somehow be informed that he is being throttled, he would be more likely to actually wait a minute.
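As a rough illustration of the "tell the user he is being throttled" idea, the sketch below turns a 429 response into a short notice. It assumes the server sends the standard HTTP `Retry-After` header as a number of seconds, and falls back to one minute otherwise. The function name and the message wording are hypothetical, not something Discourse provides.

```python
def throttle_notice(status_code, headers, default_wait=60):
    """Return a user-facing notice for a 429 response, or None otherwise."""
    if status_code != 429:
        return None
    raw = headers.get("Retry-After", "")
    try:
        # Retry-After may also be an HTTP date; this sketch only handles seconds.
        wait = max(0, int(raw))
    except ValueError:
        wait = default_wait
    return f"The server is busy and is throttling you. Retrying in {wait} seconds."


print(throttle_notice(429, {"Retry-After": "30"}))
print(throttle_notice(429, {}))
```

The point of the design is simply that a visible countdown gives the user a reason to wait instead of reloading the index page and adding load.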
Progress report from the private conversations: the experience was improved by setting DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS to 4, and we are planning some core changes to improve the message bus ratelimiting behavior.
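The sketch below shows how a queue-time threshold of this kind would trade off against rejected polls. It assumes, going by the setting's name, that a message bus poll is rejected once it has waited in the request queue longer than the configured number of seconds. The queue waits are invented sample values, not measurements from this site, and the helper names are hypothetical.

```python
def should_reject_poll(queue_seconds, threshold_seconds):
    """Return True when a poll waited in the queue longer than the threshold."""
    return queue_seconds > threshold_seconds


def rejected_share(queue_waits, threshold_seconds):
    """Fraction of polls that would be rejected at the given threshold."""
    rejected = sum(1 for w in queue_waits if should_reject_poll(w, threshold_seconds))
    return rejected / len(queue_waits)


# Hypothetical queue waits (in seconds) during an activity spike.
sample_waits = [0.05, 0.2, 0.35, 0.6, 1.5, 3.0, 5.0]

for threshold in (0.4, 4):
    print(threshold, round(rejected_share(sample_waits, threshold), 2))
```

Under this model a higher threshold rejects fewer polls, but the polls that are let through still have to be served, which matches the observation in this thread that these tweaks also increase system load.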
Since we experience some similarities with @ljpp’s situation, although to a significantly lesser extent (almost exclusively in the 5 minutes or so around the time when the matches finish), I would like to know if there are any tweaks one can make to the threshold where the high load message kicks in and users start being “kicked out” of the topic… because it’s always around one single topic, the match topic.
That, and the 502 error message (purely an nginx message) that we experience even more rarely in the same context. I suspect there are probably some configs in nginx that might benefit from some tuning, and I know it’s not your job, but I’m all ears for good suggestions.
Please clarify - are you experiencing freezes (topic not updated for new posts) or are you getting extreme load error messages?
There are tweaks in this thread that provide some improvements for the freezing, but they also increase the system load, so you are more likely to get extreme load scenarios.
We sometimes experience topic freezing in the situations I’ve reported, but when that happens the system also shows warnings of extreme load. So I can’t tell you which is which.
We don’t mind extreme load as long as it doesn’t kick people from topics or interrupt update for new posts. We would actually prefer in that case to have it slowly loading stuff (the wheel could spin for 15 seconds for each user to read/post and we would prefer that to freezing or user being kicked out).
I have to agree. The extreme load UX is confusing for the end user.
- How many concurrent users do you have?
- What kind of hardware?
- Link to your forum stats?
As we are now on the CDCK SaaS platform, I can only observe this from the UX point of view.
We have had some good heat in the games during the last couple of weeks. The “freezes” have pretty much disappeared with the platform change, but there is some fluctuation in the way the topic gets updated, which may still be confusing to some. But the audience has mostly (90%) stopped complaining and is focusing on the games, which is a good sign.
There is, however, a scenario which I can reproduce with fairly high (again 90%) confidence. The platform has occasional issues in resuming the session when the game topic is in a background tab (Android) or under a locked screen. When I get back to the busy topic, usually due to an interesting event in the game, the topic view is sometimes not updated. I can see user avatars blinking at the bottom of the topic, but no posts are appearing. One needs to refresh the browser to fully recover.
The repro pattern is not the easiest, as you need:
- A busy topic
- Some good action in the game → more heat to the topic
- Keep the topic under a locked screen or in a background browser tab
We suffer from that too.
Another thing is, when jumping to the first unread post, it can repeat this behaviour a few times (going to the same “unread post” a few times, although the first unread post position should have changed in each occasion).
- I jump to the first unread post
- scroll and read the 100 unread posts
- then go to another topic or homepage…
- after a minute or so, there are like 30 new unread posts, but when I click on the icon, I’m thrown once again back to position 1 (meaning 130 posts backwards and not just the 30 new unread ones).
But, once more, it only happens in very very busy topics during some minutes at the greatest peak of refresh and posting by every user all in the same topic at the same time. Kind of annoying but not a dealbreaker so far.
I would consider that a success.
Can you provide a repro here on meta? Probably not since it requires a large number of active users idling in the same topic at the same time?
My current thinking is we should build a live chat feature and instantiate it just-in-time, when you have…
- lots of users
- in the same topic
- at the same time
then, and only then, instantiate a live chat box overlay and strongly push users into using that instead of replies, maybe even disable the ability to reply to the topic with
Hey, it looks like what you really wanted was a chatroom… here it is, have fun!
Yeah, I know what you mean, but it’s so limited to those occasions that I guess it’s not worth the effort. We usually have matches like that once or twice a week, and it’s mostly in the 5-minute period right after the match finishes. But I’ve actually thought about it several times (that it would be nice to have a temporary chatroom function, or something to switch to, for the 90-minute period of a football match).
Still, I’ll try to repro one of these days by recording the screen for a while.
Our instance has been showing some 429s, as the playoff games have started. @staff should be able to see some in the last 3.5 hours of our logs, and more are expected when the deciding goal is scored (the game is going to a second OT as I type this).
In any case, if you are still logging and tracing this, there are not many opportunities left, as the finals and the following off-season are getting close.
I just wanted to add my name to the thread here so I can follow this. We are a new gymnastics forum. We experienced the above along with “freezing” last night during US Olympic Trials. Here is the thread…
We had 4 unicorns last night.
I resized the server to 4 Intel vCPUs & 8 GB memory at Digital Ocean and did…
We are expecting much higher traffic during the Olympics. What else can we do to optimize the server for “chat like” traffic during the competition?