Real-time updating of topics freezes under high activity

@sam

With these settings the UX was better. Yes, there were several “chokes” and a bunch of 429s were recorded in my Chrome inspector. CPU load was low. But then again, it was a rather calm home game (many active members were on-site, not chatting).

I can’t name the exact dials to turn, but from my rather subjective experience:

  • The throttling feature is still overprotective of server load. Perhaps a slightly higher level of server stress could be allowed.
  • When the client backs off, the delay is too long from a UX perspective. The game goes on and a lot can happen in a minute. The chat goes out of sync, with people referring to different events in the game. (This adds to the existing problem of different time delays between real-time vs cable TV vs IPTV vs a 20-second Chromecast buffer, etc.)
  • The user only sees that the chat has stalled and receives no indication that the site is still online and active. He is then more likely to refresh the page, or do other things that add to the already high load.

Just to rule things out, I upgraded the server to 8 vCores and 32GB RAM. I set the db buffers to 16GB and the Unicorn workers to 16, and put the other tweaks back to defaults.
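Roughly, the relevant bits of app.yml looked like this (a sketch from memory of the standard settings, not a copy of my actual file):

```yaml
# containers/app.yml (excerpt)
params:
  db_shared_buffers: "16GB"   # PostgreSQL shared buffers, half of the 32GB RAM

env:
  UNICORN_WORKERS: 16         # web server worker processes, two per vCore here
```

Applied with the usual `./launcher rebuild app`.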

Unfortunately, the upgrade did not help much. Rapid discussions are constantly freezing, even with moderate activity.

The performance is miserable nowadays. I guess I need to start looking at Prometheus etc. I am 95% certain that the performance of the software has seriously regressed since v2.3.
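If I go down the Prometheus route, my understanding is that the usual approach is the official discourse-prometheus plugin, cloned in the hooks section of app.yml. A sketch (untested on my install; the docker_manager line is what a stock app.yml already has):

```yaml
# containers/app.yml (excerpt)
hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://github.com/discourse/docker_manager.git
          - git clone https://github.com/discourse/discourse-prometheus.git  # metrics exporter
```

After a rebuild the plugin should expose a metrics endpoint for Prometheus to scrape.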

Brother @Iceman’s comment was mostly ignored back in September. He reports that the chokes happen no matter what hardware he throws at the problem?

I suspect you may be hitting a Redis bottleneck, but as I have said many times, we can only be sure if you collect those statistics. Without them we may as well use astrology.

If my suspicion is right, it would also explain why throwing more slow cores and RAM at the problem makes no difference: since Redis is single-threaded, you can only scale it by getting higher-performance cores.

We will release a new image with the final release of 2.6 by the end of the month, and it comes with Redis 6 and new app.yml variables to put it to good use. Let me know if you want to test it earlier; I can give you instructions for that.

3 Likes

Just noticed this on a closed topic. @codinghorror - that is incorrect. What the end user actually gets in a high load situation:

  1. A notification that he is logged out
  2. He is brought to the site index page
  3. The index page has the banner notification of high load

The user is not really logged out though. Usually when one taps back into the active topic, the site will operate as usual.

Yet again, we have no customers reporting this behavior (out of thousands, and many much busier than your site), so further discussion at this point is basically useless – we have no visibility into whatever odd configuration situation or hardware performance strangeness you may have over there.

In the future hopefully that will change and we will have better visibility into the actual problem.

I was only reporting what the actual UI/UX is when the high-load situation happens. Nothing else.

The behavior should be that they are kept on the topic page and shown a logged-out view, not brought to the home page.

You are most likely right. It is Redis. The new base image improves things, but now we are exceeding the server’s capabilities.

Possibly, but that is not how it works in reality. Just reproduced it a minute ago.

1 Like

Well, at least that has a known solution: :moneybag:

3 Likes

Solution: Make leaner and meaner code :wink:

So if Redis is the bottleneck, how would you scale horizontally?

It still puzzles me what has changed since last season. I can’t see that much organic growth, or any increase in game-chat popularity. Still, our capacity to serve has dropped dramatically and is choking even in the calmest games.

Until you can collect metrics on your historical instance of Discourse and then compare them to the metrics you collect on your current install, while keeping exactly the same hardware, this will remain a mystery.

The whole difference could be that your VPS provider shifted you from one physical machine to another, or that you acquired a noisy neighbour, or that your VPS is now co-hosted with an average of 17 rather than 13 other services per machine.

1 Like

Please do not speculate about pushing the issue onto the VPS provider. UpCloud is one of the best on the market, and they have checked their end for anything out of the ordinary. They advertise on our site, and it is not very good PR to have the site stuttering :smiley:

But there is no historical data, and TBH I was not paying that much attention because everything just worked, until the first exhibition games took place in August. Of course, human behavioral patterns have changed thanks to COVID, and who knows what else. I can’t see any of it in the metrics of our site or server, though. :man_shrugging:

But this is excellent testing material. I just provided @riking with some screenshots of what happens when the server overload kicks in. I guess you guys don’t see it that often.

1 Like

Note that nobody is disagreeing with you – we’re just pointing out that a doctor can only do so much to diagnose a patient when the doctor is limited to seeing the patient through a video camera on the internet… :movie_camera:

2 Likes