Real-time updating of topics freezes under high activity

The next game is due tomorrow. I have removed our own hacks and we are trying with these settings.

Also, as a total long shot, I have increased db_shared_buffers from 4GB (25%) to 6GB (37.5%). I also uncommented the db_work_mem 40MB line in app.yml (this is, by the way, a very vaguely documented option, even though it is presented to the admin as some sort of opportunity for improvement).
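For reference, both of those knobs live under the `params:` section of the standard standalone app.yml; a minimal sketch of the change described above, assuming the stock template (the values are simply the ones mentioned, not a recommendation):

```yaml
params:
  ## PostgreSQL shared buffers; the template comments suggest roughly 25% of RAM.
  ## Bumped here from the default 4GB to 6GB as a long shot.
  db_shared_buffers: "6GB"

  ## Per-connection sort/hash memory; uncommented from the template default.
  db_work_mem: "40MB"
```

Both require a container rebuild to take effect.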

I no longer expect to find a solution to the problem, only better damage control – a set of parameters with the least negative UX impact for the end users. In the meantime, I’ll have to look into the possibilities for further increasing our hosting resources.

2 Likes

Question to @sam & other developers.

How does the ever-growing size of the database impact this use case, where users hammer a single topic for a couple of hours?

I had a look at historical game chat activity and noticed that we had games with huge statistics back in 2017, when our server had a fraction of the resources we have today. We had games where post counts reached 1600 messages by 165 users, and nobody had any complaints about the performance. Now we can’t serve half of that, with a much more powerful server.

You might try upping it to 80MB, perhaps instead of the other change.

1 Like

This is one point we are actively working on all the time.

When Discourse was new, almost all sites had a brand new database, so the database could easily fit in memory. Now, a few years later, some sites have databases of over 100GB with RAM that is not even a tenth of that.

One upcoming update in the next few weeks is the PostgreSQL 13 upgrade, which will reduce the largest object size by half.

Other than that, step 0 in debugging your performance issues is gathering data with the Prometheus exporter plugin for Discourse, so we are not flying blind.
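For anyone following along, the plugin is installed like any other Discourse plugin: add its git clone line to the hooks section of app.yml and rebuild the container. A minimal sketch, assuming the standard discourse-prometheus repository:

```yaml
hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://github.com/discourse/docker_manager.git   # already present in the stock template
          - git clone https://github.com/discourse/discourse-prometheus.git
```

After a `./launcher rebuild app`, the exporter exposes metrics that a Prometheus server (and Grafana on top of it) can scrape.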

8 Likes

@sam

With these settings, UX was better. Yes, there were several “chokes” and a bunch of 429s were recorded in my Chrome inspector. CPU load was low. But then again, it was a rather calm home game (many active members were on-site, not chatting).

I can’t name the dials to turn, but from my rather subjective experience:

  • The protection feature is still overprotective of server load. Perhaps a slightly higher level of server stress could be allowed.
  • When the client backs off, the delay is too long from a UX perspective. The game goes on and a lot can happen in a minute. The chat goes out of sync, with people referring to different events in the game. (This adds to the problem of varying time delays between real-time vs cable TV vs IPTV vs the 20-second Chromecast buffer, etc.)
  • The user only sees that the chat has stalled, but receives no indication that the site is still online and active. He is then more likely to refresh the page, or do other things that add to the high load.

Just to rule things out, I upgraded the server to 8 vCores and 32GB RAM. I set db_shared_buffers to 16GB and Unicorns to 16, and reverted the other tweaks back to defaults.
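For completeness, those two settings map roughly onto the following in app.yml, assuming the standard standalone template (the values are simply the ones described above, not recommendations):

```yaml
params:
  ## PostgreSQL shared buffers, set here to half of the new 32GB of RAM.
  db_shared_buffers: "16GB"

env:
  ## Number of Unicorn web workers serving requests.
  UNICORN_WORKERS: 16
```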

Unfortunately the upgrade did not do much. Rapid discussions are constantly freezing, even with moderate activity.

The performance is miserable nowadays. I guess I need to start looking at Prometheus etc. I am 95% certain that the performance of the software has seriously regressed since v2.3.

Brother @Iceman’s comment was mostly overlooked in September. He reports that the chokes happen no matter what hardware he throws at the problem.

I suspect you may be hitting a Redis bottleneck, but as I said many times we can only be sure if you collect those statistics. Without it we may as well use astrology.

If my suspicion is right, it also explains why throwing more slow cores and RAM at the problem makes no difference: since Redis is single-threaded, you can only scale by getting higher-performance cores.

We will release a new image with the final release of 2.6 by the end of the month, and it comes with Redis 6 and new app.yml variables to put it to good use. Let me know if you want to test it earlier; I can give you instructions for that.

5 Likes

Just noticed this on a closed topic. @codinghorror - that is incorrect. What the end user actually gets in a high load situation:

  1. A notification that he is logged out
  2. He is brought to the site index page
  3. The index page has the banner notification of high load

The user is not really logged out, though. When one taps back into the active topic, the site usually operates as normal.

Yet again, we have no customers reporting this behavior (out of thousands, and many much busier than your site), so further discussion at this point is basically useless – we have no visibility into whatever odd configuration situation or hardware performance strangeness you may have over there.

In the future hopefully that will change and we will have better visibility into the actual problem.

I was only reporting what the actual UI/UX is when the high-load situation happens. Nothing else.

The behavior should be that they are kept on the topic page and shown a logged-out view, not brought to the home page.

You are most likely right. It is Redis. The new base image improves things, but now we are exceeding the server’s capabilities.

Possibly, but that is not how it works in reality. Just reproduced it a minute ago.

1 Like

Well, at least that has a known solution: :moneybag:

3 Likes

Solution: Make leaner and meaner code :wink:

So if Redis is the bottleneck, how would you scale horizontally?

It still puzzles me what has changed since last season. I can’t see that much organic growth, or an increase in game chat popularity. Still, our capacity to serve has dropped dramatically, and the site chokes even in the calmest games.

Until you can collect metrics on your historic instance of Discourse and compare them to the metrics you collect on your current install, while maintaining the exact same hardware, this will remain a mystery.

The whole difference could be that your VPS provider shifted you from one physical machine to another, or that you acquired a noisy neighbour, or that your VPS host is now running 17 instead of 13 co-hosted services per machine.

1 Like

Please do not speculate about pushing the issue onto the VPS provider. UpCloud is one of the best on the market, and they have checked their end for anything out of the ordinary. They advertise on our site, and it is not very good PR to have the site stuttering :smiley:

But there is no historical data, and TBH I was not paying that much attention as everything just worked, until the first exhibition games took place in August. Of course the behavioral patterns of humans have changed thanks to COVID, and who knows what else. I can’t see it in the metrics of our site or server, though. :man_shrugging:

But this is excellent testing material. I just provided @riking with some screenshots of what happens when the server overload kicks in. I guess you guys don’t see it that often.

1 Like

Note that nobody is disagreeing with you – we’re just pointing out that a doctor can only do so much to diagnose a patient when the doctor is limited to seeing the patient through a video camera on the internet… :movie_camera:

3 Likes

Just wanted to say this was exactly what I experienced when I first set up my site (so it’s not unique to your site).

Here’s a thread I made about it at the time:

This is what caused me to jump up through the different CPU/memory options outlined here.

Unfortunately, I have not had a chance to properly swap from DigitalOcean to Hetzner as I described (I started a new job), but I will do so as soon as I get a chance this month.

Whether the end-user experience was being kicked out of the thread or remaining in the thread (with the logged-out message) did seem to depend on load (more users were sent to the site index after a goal was scored).

I don’t have enough technical knowledge to be helpful, but I felt it might help to know that a sports site with similar peaks of chat-like behaviour runs into a similar issue. Mine (a smaller and younger site) was resolved by further upgrading the server, though.

1 Like

If you’re interested in having data to make decisions about how to diagnose things going forward, you can install Prometheus exporter plugin for Discourse.

Just a brief update:

  • Installed a new two-container environment on two VPS servers (web_only, data).
  • Surprisingly (to me), the web_only server is being exhausted, while the data server is relatively lightly loaded. Both were running a 4x vCore / 8GB RAM UpCloud.com plan.
  • Upgraded the web_only server to a 6x vCore / 16GB RAM UpCloud.com plan and increased Unicorns to 18.

Still, we are hitting various 429 limiters. The “system under high load” mode did not kick in, though.
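If those 429s are coming from Discourse’s own rate limiting rather than genuine resource exhaustion, the limits can be relaxed via environment variables in the web_only container’s yml. A rough sketch, assuming the variable names used by the standard Discourse docker templates (the values are illustrative only, not tuned recommendations):

```yaml
env:
  ## Assumed names from the stock web.ratelimited / discourse templates;
  ## values below are illustrative, not recommendations.
  DISCOURSE_MAX_REQS_PER_IP_MODE: warn+block
  DISCOURSE_MAX_REQS_PER_IP_PER_MINUTE: 200
  DISCOURSE_MAX_REQS_PER_IP_PER_10_SECONDS: 50
  DISCOURSE_MAX_USER_API_REQS_PER_MINUTE: 20
```

Whether these limiters are actually the ones firing should be visible in the container logs before loosening anything.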