As far as I know, anonymous requests get cached results from Redis, while authenticated requests do not. But the load test results with authenticated requests are lower than I expected.
Is there anything wrong with my load test steps? Or is there a setting I missed?
If the PostgreSQL CPU is reaching 100%, then there’s your bottleneck. You’ll presumably need a bigger database instance, or you’ll need to dig into what exactly the database is doing that’s so slow (via slow query logs, or otherwise).
Yes, the slow query is a join of the tables topic and topic_user_ids. When a user has created a lot of topics, showing the latest page is slow. I will try to tune it by using a better PostgreSQL instance and adjusting work_mem and shared_buffers. Thanks @mpalmer and @Falco
I dunno how your load test works, but in real use, when a lot of users hit your site, the content is already there and PostgreSQL has had time to run VACUUM everywhere, so the stats are up to date and the planner’s plans are optimal. If you are creating thousands of topics/posts and reading at the same time, you may need more aggressive (or even manual) VACUUM triggers.
Also, if the disks on the Azure PostgreSQL instance are fast, you may need to lower random_page_cost to a value close to seq_page_cost.
Disclaimer: I may have a PostgreSQL T-shirt that I use to sleep sometimes.
You may want to consider having your load test use a pool of users that create posts, which might better simulate real world traffic and database state.
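A minimal sketch of that idea in Python, for anyone building their own test plan. It only constructs a request plan and sends nothing. The base URL, API key, user names, the 10% logged-in ratio and the 20% write share are all placeholder assumptions of mine, and the `Api-Key` / `Api-Username` headers assume you are using an admin API key that is allowed to act as any user.

```python
import itertools
import random
from collections import Counter

# Hypothetical values: replace with your own site, key, and test accounts.
BASE_URL = "https://discourse.example.com"
API_KEY = "replace-with-an-all-users-api-key"
USER_POOL = [f"loadtest_user_{i}" for i in range(1, 51)]


def build_plan(n_requests, logged_in_ratio, write_share=0.2, seed=0):
    """Return a list of (method, url, headers, kind) tuples for a load generator."""
    rng = random.Random(seed)           # seeded so a run can be repeated
    users = itertools.cycle(USER_POOL)  # spread activity across the whole pool
    plan = []
    for _ in range(n_requests):
        if rng.random() < logged_in_ratio:
            headers = {"Api-Key": API_KEY, "Api-Username": next(users)}
            if rng.random() < write_share:
                # Logged-in users occasionally create posts.
                plan.append(("POST", f"{BASE_URL}/posts.json", headers, "write"))
            else:
                plan.append(("GET", f"{BASE_URL}/latest.json", headers, "read"))
        else:
            # Anonymous request: no credentials attached.
            plan.append(("GET", f"{BASE_URL}/latest.json", {}, "anon"))
    return plan


plan = build_plan(1000, logged_in_ratio=0.1)
print(Counter(kind for *_, kind in plan))
```

Feeding a plan like this to whatever tool actually sends the requests means reads and writes come from many accounts rather than one, which should leave the database in a state closer to what real traffic produces.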
The usual traffic breakdown is a very high percentage of anon traffic. I just checked boingboing and for the last 30 days, 400k logged in pageviews vs 4.3m anon, so about 10%. Whoops, that was wrong, not sure where I was looking there.
Last 30 days, all traffic: 4.7 million pageviews, of which 351,000 were from logged in users. So 7.5% of traffic is logged in users.
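To spell out the arithmetic behind that percentage, here is a quick check in Python. The two counts are the ones quoted above; the variable names are just mine.

```python
# Pageview counts quoted above for the last 30 days (names are illustrative).
total_pageviews = 4_700_000
logged_in_pageviews = 351_000

logged_in_share = logged_in_pageviews / total_pageviews * 100
anon_share = 100 - logged_in_share

print(f"logged in: {logged_in_share:.1f}%")
print(f"anonymous: {anon_share:.1f}%")
```

So a load test that sends mostly authenticated requests is exercising a path that, on a site like this one, accounts for well under a tenth of pageviews.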
Is this supposed to be read as “I have an unhealthy obsession with PostgreSQL”, or is it supposed to mean “The sum total of my PostgreSQL knowledge is a shirt that I got at a con”?
Note, for anyone trying to do their own Discourse load testing like Simon, not only do you need to edit the MAX_ADMIN_API_REQS, you also need to comment out web.ratelimited.template.yml.
Otherwise, you’ll probably hit the nginx rate limits on requests per IP address. If you see response code 429 (Too Many Requests) even after increasing MAX_ADMIN_API_REQS, this is why.
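If your load tool gives you the raw status codes, a small tally like this Python sketch makes rate limiting obvious before you start reading throughput numbers. The function name and the sample data are made up for illustration and are not real results.

```python
from collections import Counter


def summarize_statuses(status_codes):
    """Count responses by HTTP status and return the share that were 429s."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    share_429 = counts.get(429, 0) / total if total else 0.0
    return counts, share_429


# Made-up sample: 90 successful responses and 10 rate-limited ones.
sample = [200] * 90 + [429] * 10
counts, share_429 = summarize_statuses(sample)
if share_429 > 0:
    print(f"{share_429:.0%} of requests returned 429; check the rate limits first")
```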
Also note that YAML key/value pairs use “:” rather than “=”, so if you copy and paste Simon’s DISCOURSE_MAX_ADMIN_API_REQS_PER_KEY_PER_MINUTE=10000 you’ll get a YAML parsing error about a colon…
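As an illustration of the difference, this small Python helper (hypothetical, not part of Discourse) rewrites a shell-style assignment into the `KEY: value` form YAML expects. The two-space indent assumes the line goes under a mapping such as `env:` in app.yml; adjust it to match your file.

```python
def env_to_yaml(assignment, indent=2):
    """Turn 'KEY=value' into an indented 'KEY: value' line for a YAML mapping."""
    key, sep, value = assignment.partition("=")
    if not sep or not key.strip():
        raise ValueError(f"expected KEY=value, got {assignment!r}")
    return f"{' ' * indent}{key.strip()}: {value.strip()}"


line = env_to_yaml("DISCOURSE_MAX_ADMIN_API_REQS_PER_KEY_PER_MINUTE=10000")
print(line)
```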
Yes, you must rebuild the container to activate this change.
Note that this applies to any app.yml changes, since the file is only read once, when the container is built.