With these settings the UX was better. Yes, there were several “chokes” and a bunch of 429s were recorded in my Chrome inspector, but CPU load was low. Then again, it was a rather calm home game (many active members were on-site, not chatting).
I can’t name the exact dials to turn, but from my rather subjective experience:

- The protection feature is still overprotective of the server. Perhaps a slightly higher server stress level could be allowed.
- When the client backs off, the delay is too long from a UX perspective. The game goes on and a lot can happen in a minute, so the chat drifts out of sync, with people referring to different events of the game. (This adds to the problem of the varying delays between real time, cable TV, IPTV, the ~20 s Chromecast buffer, etc.)
- The user only sees that the chat has stalled, but gets no indication that the site is still online and active. He is then more likely to refresh the page, or do other things that add to the already high load.
Just to rule things out, I upgraded the server to 8 vCores and 32 GB RAM, set the db buffers to 16 GB and Unicorns to 16, and reverted the other tweaks to defaults.
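For reference, those two knobs live in a standard standalone app.yml roughly as in the sketch below (names per the stock container template; the values are simply the ones above, not a recommendation):

```yaml
params:
  ## Postgres shared buffers; this box has 32 GB RAM, half of it given to the DB
  db_shared_buffers: "16GB"

env:
  ## Number of Unicorn web workers serving requests
  UNICORN_WORKERS: 16
```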
Unfortunately the upgrade did not do much. Rapid discussions are constantly freezing, even with only modest activity.
The performance is miserable nowadays. I guess I need to start looking at Prometheus etc. I am 95% certain that the performance of the software has seriously regressed since v2.3.
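If I end up going that route, the usual way to collect those metrics seems to be the official discourse-prometheus plugin, added under the app.yml hooks like any other plugin; a rough sketch (the docker_manager line is the stock default that is normally already there):

```yaml
hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://github.com/discourse/docker_manager.git
          - git clone https://github.com/discourse/discourse-prometheus.git
```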
Brother @Iceman’s comment was mostly neglected in September. He reported that the chokes happen no matter what hardware he throws at the problem.
I suspect you may be hitting a Redis bottleneck, but as I have said many times, we can only be sure if you collect those statistics. Without them we may as well use astrology.
If my suspicion is right, it would also explain why throwing more slow cores and RAM at the problem makes no difference: since Redis is single-threaded, you can only scale it by getting higher-performance cores.
We will release a new image with the final release of 2.6 by the end of the month, and it comes with Redis 6 and new app.yml variables to put those to good use. Let me know if you wanna test that earlier, I can give you instructions for that.
Yet again, we have no customers reporting this behavior (out of thousands, and many much busier than your site), so further discussion at this point is basically useless – we have no visibility into whatever odd configuration situation or hardware performance strangeness you may have over there.
In the future hopefully that will change and we will have better visibility into the actual problem.
So if Redis is the bottleneck, how would you scale horizontally?
It still puzzles me what has changed since last season. I can’t see that much organic growth, or any increase in game chat popularity. Still, our capacity to serve has dropped dramatically, and the site is choking even during the calmest games.
Until you can collect metrics on your historic instance of Discourse and compare them to the metrics you collect on your current install, while keeping the exact same hardware, this will remain a mystery.
The whole difference could be that your VPS provider shifted you from one physical machine to another, that you acquired a noisy neighbour, or that your host is now averaging 17 instead of 13 co-hosted services per machine.
Please do not speculate about pushing the issue onto the VPS provider. UpCloud is one of the best on the market, and they have checked their end for anything out of the ordinary. They also advertise on our site, and it is not very good PR to have the site stuttering.
But there is no historical data, and TBH I was not paying that much attention as everything just worked, until the first exhibition games took place in August. Of course the behavioral patterns of humans have changed thanks to COVID, and who knows what else. I can’t see it in the metrics of our site or server, though.
But this is excellent testing material. I just provided @riking with some screenshots of what happens when the server overload protection kicks in. I guess you guys don’t see it that often.
Note that nobody is disagreeing with you – we’re just pointing out that a doctor can only do so much to diagnose a patient when the doctor is limited to seeing the patient through a video camera on the internet…
Just wanted to say this was exactly what I experienced when I first set up my site (so it’s not unique to your site).
Here’s a thread I made about it at the time:
This is what caused me to step up through the different CPU/memory options outlined here.
Unfortunately, I have not had a chance to properly swap from DigitalOcean to Hetzner as I described (I started a new job), but I will as soon as I get a chance this month.
Whether the end user was kicked out of the thread or remained in it (with the logged-out message) did seem to depend on load: more users were sent to the site index right after a goal was scored.
I don’t have enough technical knowledge to be helpful, but I felt it might help to know that a sports site with similar chat-like activity peaks runs into a similar issue. In my case (a smaller and younger site) it was resolved by further upgrading the server.
Installed a new two-container environment (web_only and data) on two VPS servers.
Surprisingly (to me), the web_only server is the one getting exhausted, while the data server is relatively lightly loaded. Both were running a 4 vCore / 8 GB RAM UpCloud.com plan.
Upgraded web_only to a 6 vCore / 16 GB RAM UpCloud.com plan and increased Unicorns to 18; a sketch of the split is below.
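For anyone replicating the split, the relevant part of our web_only.yml looks roughly like this (the data-server address is a placeholder; Postgres and Redis both live in the data container):

```yaml
## web_only.yml on the 6 vCore / 16 GB box
env:
  UNICORN_WORKERS: 18
  ## Point the web container at the data server (placeholder private IP)
  DISCOURSE_DB_HOST: 10.0.0.2
  DISCOURSE_REDIS_HOST: 10.0.0.2
```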
Still, we are hitting various 429 limiters, although the “system under extreme load” mode did not kick in.
The hockey season has been ruined by COVID, and they are now playing a few scattered games without an audience. Since we have hosting credits with UpCloud.com, we are pushing to improve the experience with what we have: currently the 6 vCore / 16 GB plan for web_only and the 4 vCore / 8 GB plan for data, with Unicorns at 18.
We once again disabled the rate limiter…
DISCOURSE_MAX_REQS_PER_IP_MODE: none
…which helps, but we still get 429s from the message-bus POLL requests, and those produce the long delay/freeze for the end user. We are going to continue tweaking by increasing DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS.
But before we do that, a question to @sam / staff:
Is there an environment variable to raise the threshold for the “extreme load / read-only mode” limiter, or can it be disabled completely?
Perhaps so, but we would like to be slightly less protective of the server, as the naturally occurring activity spikes are very short and generally stabilize within a minute or so. Nudging the thresholds just a little higher might improve the UX while we wait for the move.
The games have been scarce (thanks to COVID), so we have had very few opportunities to measure and tinker with this.
What we found out is that even with our improved hardware (6+4 vCores and 16+8 GB RAM), a modestly active crowd can produce 429 client freezes. We saw this during the U20 WC games, which attracted roughly 50% of our regular game-chat audience.
Through measurement and trial and error, we have settled on the following tweaks:
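Roughly, in app.yml env terms, the knobs are the ones already discussed above; the values below are illustrative rather than a recipe, and should be tuned against your own CPU and queueing metrics:

```yaml
env:
  ## Disable the per-IP request limiter entirely
  DISCOURSE_MAX_REQS_PER_IP_MODE: none
  ## Let message-bus polls queue longer before being rejected with a 429
  ## (the default is a small fraction of a second; this value is illustrative)
  DISCOURSE_REJECT_MESSAGE_BUS_QUEUE_SECONDS: 2
  ## Web workers sized for the 6 vCore web_only box
  UNICORN_WORKERS: 18
```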
This seems to eliminate about 80% of the 429s, enabling a relatively smooth experience for the majority of users.
The next step would have been buying a different kind of hardware, either dedicated boxes for single-threaded speed or a VPS provider that offers plans with a gazillion vCores. For us, however, the next step is to work with the Discourse hosting team, as @sam hinted earlier.
Hopefully these tweaks are useful for @iceman, @alec, or anyone else; just be sure to keep an eye on CPU usage and request queueing. What I also learned from this exercise is that two containers are much better than one: tweaks can be applied with near-zero downtime, and hardware resources can be allocated more granularly.
I am still interested in any new tweaks or findings that might help improve the performance and UX of fast-paced discussions driven by real-world events.