Intermittent performance issues

Just wondering if anything has changed in the last few updates that might impact performance?

Last week I noticed a change in page fetching and in a few other areas.

At times the loading animation between pages seemed longer than usual (at other times pages load blazingly fast). I’m also noticing a delay when clicking to see who’s liked a post (again, intermittently). Sometimes when liking posts I’m told there has been an error. I’ve also experienced a delay when editing and then submitting posts. All of these were rare occurrences previously.

Whenever I experience (or suspect) any kind of issue on any site, I generally restart my router/modem and things usually rectify themselves. This time, however, I’ve done that several times to no avail, and I’m not experiencing any issues on other sites.

We did move to a new server just over a month ago, but the only real difference there is that we’re now using the recommended Overlay2 storage driver instead of DeviceMapper (although I never experienced any problems with DeviceMapper and always found it very snappy). This server is actually more powerful than the last, with 64GB of ECC RAM and 2x 512GB NVMe SSDs in a RAID array (the previous server had standard hard drives, less RAM, and non-ECC memory).
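As a quick sanity check (my addition, not from the original post), the active storage driver can be confirmed with `docker info --format '{{.Driver}}'`; parsing the plain-text `docker info` output works too. A sample line is embedded below so the snippet runs standalone:

```shell
# Confirm which storage driver Docker is using. On a live host:
#   docker info --format '{{.Driver}}'
# Here a sample line from `docker info` is embedded so this is self-contained.
echo ' Storage Driver: overlay2' | awk -F': ' '/Storage Driver/ {print $2}'
```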

The only other thing I can think of is that I installed the Who's Online plugin (discourse-whos-online) and later removed it, but I don’t think that would have any impact?

If it helps, here are some top stats:

top - 19:29:33 up 32 days, 18:31,  1 user,  load average: 0.55, 0.42, 0.50
Tasks: 284 total,   1 running, 283 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.4 us,  2.8 sy,  0.0 ni, 91.1 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
KiB Mem : 65587284 total, 17147544 free,  6181188 used, 42258552 buff/cache
KiB Swap: 33521660 total, 33292028 free,   229632 used. 52035096 avail Mem 

I’ve also had a ping running in the background; after over 3,000 pings, they are mostly in the 30-40ms range, with none exceeding 41ms except for a single outlier (236.372 ms):

64 bytes from IP: icmp_seq=2897 ttl=50 time=40.367 ms
64 bytes from IP: icmp_seq=2898 ttl=50 time=38.607 ms
64 bytes from IP: icmp_seq=2899 ttl=50 time=40.869 ms
64 bytes from IP: icmp_seq=2900 ttl=50 time=40.537 ms
64 bytes from IP: icmp_seq=2901 ttl=50 time=38.994 ms
64 bytes from IP: icmp_seq=2902 ttl=50 time=40.600 ms
64 bytes from IP: icmp_seq=2903 ttl=50 time=40.662 ms
64 bytes from IP: icmp_seq=2904 ttl=50 time=38.771 ms
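A quick way (my addition, not part of the original report) to summarize a long ping run instead of eyeballing it; the sample `time=` values below are taken from the output above so the pipeline is self-contained:

```shell
# Summarize ping latencies (ms). On a live run you would pipe the ping
# output in; sample lines from above are embedded here for illustration.
printf '%s\n' 'time=40.367 ms' 'time=38.607 ms' 'time=40.869 ms' 'time=40.537 ms' |
grep -o 'time=[0-9.]*' | cut -d= -f2 |
awk '{s+=$1; if(min==""||$1<min)min=$1; if($1>max)max=$1}
     END {printf "min/avg/max = %.3f/%.3f/%.3f ms\n", min, s/NR, max}'
```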

Just for comparison, pinging meta.discourse.org I get the following (though most are around 150ms, which is what I’d expect since I’m based in the UK):

64 bytes from 54.215.176.112: icmp_seq=1284 ttl=226 time=162.893 ms
64 bytes from 54.215.176.112: icmp_seq=1285 ttl=226 time=390.069 ms
64 bytes from 54.215.176.112: icmp_seq=1286 ttl=226 time=154.971 ms
64 bytes from 54.215.176.112: icmp_seq=1287 ttl=226 time=230.228 ms
64 bytes from 54.215.176.112: icmp_seq=1288 ttl=226 time=453.995 ms
64 bytes from 54.215.176.112: icmp_seq=1289 ttl=226 time=154.921 ms
64 bytes from 54.215.176.112: icmp_seq=1290 ttl=226 time=290.748 ms
64 bytes from 54.215.176.112: icmp_seq=1291 ttl=226 time=210.910 ms
64 bytes from 54.215.176.112: icmp_seq=1292 ttl=226 time=154.862 ms
64 bytes from 54.215.176.112: icmp_seq=1293 ttl=226 time=356.475 ms
64 bytes from 54.215.176.112: icmp_seq=1294 ttl=226 time=275.453 ms
64 bytes from 54.215.176.112: icmp_seq=1295 ttl=226 time=155.395 ms
64 bytes from 54.215.176.112: icmp_seq=1296 ttl=226 time=420.115 ms
64 bytes from 54.215.176.112: icmp_seq=1297 ttl=226 time=339.067 ms
64 bytes from 54.215.176.112: icmp_seq=1298 ttl=226 time=154.849 ms
64 bytes from 54.215.176.112: icmp_seq=1299 ttl=226 time=154.835 ms
64 bytes from 54.215.176.112: icmp_seq=1300 ttl=226 time=403.250 ms
64 bytes from 54.215.176.112: icmp_seq=1301 ttl=226 time=154.982 ms

Any ideas on how to troubleshoot this or carry out further tests?
(I’m going to try a rebuild of the instance later, and if that doesn’t help, try rebooting the server)

If this is dedicated infrastructure, the first thing I would recommend is setting the unicorn count to roughly 1.5x the number of cores, and giving Postgres more memory.
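As a rough sketch of that sizing rule (the core count here is assumed for illustration, not taken from the thread):

```shell
# Unicorn workers ≈ 1.5 × CPU cores, per the advice above.
cores=8                         # on a live box: cores=$(nproc)
workers=$(( cores * 3 / 2 ))
echo "$workers"
```

On a live box, `cores=$(nproc)` would replace the hardcoded value.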

I would also recommend running miniprofiler so you can quickly work out whether this is network time or application time.

[screenshot of the miniprofiler timing overlay]

If that number shows 7000ms x1, your server is choking. If it shows 200ms x1 on one of the erratically slow pages, you have a network problem.


Thanks for the reply Sam :slightly_smiling_face:

Re the miniprofiler: the numbers are all low, less than 200ms. One such slow page took 3 seconds to load yet showed only 74ms in the profiler! I’ve just connected to my phone’s 4G hotspot and it ‘seems’ better, but the sites are a lot quieter at this time of night (it’s midnight here). I will try it again tomorrow during the day.

Re the infrastructure: it’s a dedicated server, however there are other sites on it (with HAProxy in front). Shall I go ahead and increase the unicorn count to 12 anyway? (It’s currently at 8.) The other sites on this server are not hugely busy.

Re Postgres: do I need to tune the Postgres on the server, or the one in the forum’s container? If so, any tips on what to adjust? As you can see from top, there is over 50GB of idle RAM, so plenty to play with :smile:

Yeah this is not our department then :pleading_face:

Sounds like something is going on between your computer and the Docker container running Discourse.


I think it’s my ISP. Sending a PM gave me a ‘lost connection’ error, and I’ve been on my phone’s 4G hotspot for a while now with no problems at all.

Thanks for your help Sam and sorry it was a false alarm :relaxed:

I’d still be interested in tuning for performance though. Have you written any blog posts or guides here on the forum that might have some info? I have 50GB of RAM idle and that seems like an awful waste :joy:

If your database is already sitting entirely in RAM and there is a unicorn per CPU, there’s not much more we can do with all these resources.

You can try running discourse-doctor to see if it has any suggestions, I guess.


I ran discourse-doctor on install, and it suggested the 8 workers and RAM settings. Here are the settings it generated:

UNICORN_WORKERS: 8
db_shared_buffers: "4096MB"
#db_work_mem: "40MB"

Shall I stick with that, or is it worth experimenting with anything? Uncomment the last one, perhaps?

I’m curious to see if we can make it even snappier :heart_eyes: :blush:
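For reference, uncommenting that last setting in the env section of app.yml would look something like this (the values are the ones discourse-doctor generated, shown as an illustrative sketch, not a tuned recommendation):

```yaml
# app.yml env section – illustrative, not a tuned recommendation
UNICORN_WORKERS: 8
db_shared_buffers: "4096MB"
db_work_mem: "40MB"   # previously commented out
```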


My general advice here is that as long as you are measuring, go wild and experiment :wink:


Hi Sam, both my ISP and datacenter have said they are not aware of any network issues, and earlier today I got a 429 Too Many Requests page a couple of times. So I think it might be related to this: Troubleshooting a 429 (rate limit).

Looking at my app.yml I had the following:

run:
  - replace:
      filename: "/etc/nginx/conf.d/discourse.conf"
      from: /^add_header Strict-Transport-Security 'max-age=31536000';$/
      to: |
        add_header Strict-Transport-Security 'max-age=31536000';

        # Server IP
        # set_real_ip_from my.ip.add.ress;
        # real_ip_header CF-Connecting-IP;

…with the last two lines commented out. I have uncommented set_real_ip_from my.ip.add.ress, put the server’s IP address in there, and rebuilt. I have left the second line, real_ip_header CF-Connecting-IP, commented out because I do not use Cloudflare. Is this correct? Do I need to do anything else?
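For context, a hedged sketch of what the resulting block in discourse.conf might look like after such a rebuild (the IP is still the placeholder from the template above; `real_ip_header X-Forwarded-For` is what nginx’s realip module would typically use behind a generic proxy like HAProxy, whereas `CF-Connecting-IP` is Cloudflare-specific):

```nginx
add_header Strict-Transport-Security 'max-age=31536000';

# Server IP – trust the front proxy and read the client address from a
# forwarded header instead of the socket peer (ngx_http_realip_module)
set_real_ip_from my.ip.add.ress;    # placeholder: the proxy/server IP
# real_ip_header X-Forwarded-For;   # candidate when HAProxy is in front
```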

In the other thread you and @mpalmer discussed investigating further - can you remember the outcome of that?

Sorry to bother you again.
