I am doing a load test for Discourse.
Here is my scenario:
- Take Discourse Meta as an example. I append an api_key (generated from the admin portal) as a query string, so the URL I test is: https://meta.discourse.org/latest?api_key=XXXXX
- I remove the API rate limit by disabling rate.limit.template in app.yml and setting DISCOURSE_MAX_ADMIN_API_REQS_PER_KEY_PER_MINUTE=10000
- I run the URL with multiple agents.
Here is the result:
And the PostgreSQL CPU reaches 100%.
This is the result without the api_key in the query string:
As far as I know, anonymous requests get a cached result from Redis, while authenticated requests do not. But the throughput of the load test with authenticated requests is lower than I expected.
Is there anything wrong with my load test steps? Or is there a setting I missed?
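For readers who want to reproduce a scenario like the one above, here is a minimal sketch of "run the URL with multiple agents" using only the Python standard library. It is not the tool used in this thread; the function names `fetch_status` and `run_load_test`, the agent counts, and the example URL are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fetch_status(url, timeout=10.0):
    """GET the URL and return the HTTP status code (None on network errors)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses such as 429 arrive here; keep the code for counting.
        return err.code
    except OSError:
        # Connection failures and timeouts.
        return None


def run_load_test(url, agents, requests_per_agent, fetch=fetch_status):
    """Send agents * requests_per_agent GETs using `agents` worker threads.

    Returns (Counter of status codes, requests per second).
    """
    total = agents * requests_per_agent
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=agents) as pool:
        statuses = list(pool.map(fetch, [url] * total))
    elapsed = time.perf_counter() - started
    rate = total / elapsed if elapsed > 0 else float("inf")
    return Counter(statuses), rate


if __name__ == "__main__":
    # Hypothetical target: point this at your own test instance only.
    target = "https://discourse.example.com/latest?api_key=XXXXX"
    counts, rate = run_load_test(target, agents=20, requests_per_agent=50)
    print(counts, f"{rate:.1f} req/s")
```

Counting status codes rather than just timing requests makes it easy to spot when the test is measuring rate limiting (for example, many 429 responses) instead of actual page rendering.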
Can you expand a bit on the test setup: how many unicorns, how pg is set up, and so on…
- Front end: Azure Web Service for Container (Standard S3, 4 cores, 7 GB RAM), 10 instances, each with 8 unicorn workers.
- Cache: Azure Redis Cache Standard
- SQL: Azure PostgreSQL service (preview), Standard, 800 Compute Units. That’s roughly an 8-core VM.
To give more information, I found that the results vary between api_keys.
- api_key A belongs to the admin user, which was used to create the 1.5 million topics of load test data.
- api_key B belongs to a normal user, which has zero usage.
With api_key A the load test runs very slowly, as in the original post, while with api_key B it runs much faster.
Here is the throughput from api_key B:
I suspect there is a slow query in a user-related operation that causes the issue.
If the PostgreSQL CPU is reaching 100%, then there’s your bottleneck. You’ll presumably need a bigger database instance, or dig into what exactly the database is doing that’s so slow (via slow query logs, or otherwise).
Also, you may need to tune PostgreSQL to use your memory instead of the disk for sort operations and such.
shared_buffers are the principal ones.
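As a rough illustration of the "slow query logs" suggestion, the sketch below totals time per statement from PostgreSQL log lines. It assumes `log_min_duration_statement` is enabled so the server writes lines of the form `duration: <n> ms  statement: <sql>`; whether a managed service exposes those logs is a separate question. The function name `slowest_statements` and the sample log lines are made up for the example.

```python
import re
from collections import defaultdict

# Matches the "duration: 123.456 ms  statement: SELECT ..." part of a log line.
DURATION_RE = re.compile(
    r"duration: (?P<ms>\d+(?:\.\d+)?) ms\s+statement: (?P<sql>.*)"
)


def slowest_statements(log_lines, top=5):
    """Return up to `top` (statement, total_ms) pairs, slowest first."""
    totals = defaultdict(float)
    for line in log_lines:
        match = DURATION_RE.search(line)
        if match:
            totals[match.group("sql").strip()] += float(match.group("ms"))
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    # Made-up sample lines, not taken from the system discussed here.
    sample = [
        "LOG:  duration: 1200.0 ms  statement: SELECT * FROM topics LIMIT 30",
        "LOG:  duration: 15.5 ms  statement: SELECT 1",
        "LOG:  checkpoint starting: time",
    ]
    for sql, total_ms in slowest_statements(sample):
        print(f"{total_ms:10.1f} ms  {sql}")
```

Grouping by exact statement text is deliberately naive: queries that differ only in their parameters end up in separate buckets.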
Yes, the slow query is a join operation on table topic_user_ids. When a user has created a lot of topics, showing the latest page is slow. I will try to tune it by using a better PostgreSQL instance and shared_buffers. Thanks @mpalmer and @Falco
I dunno how your load test works, but in real use, when a lot of users hit your site, the content is already there and PostgreSQL has had time to run the VACUUM process everywhere, so stats are up to date and the planner’s plans are optimal. If you are creating thousands of topics/posts and reading at the same time, you may need more aggressive (or even manual)
Also, if the disks on the Azure PostgreSQL are fast you may need to lower random_page_cost to a similar value of
Disclaimer: I may have a PostgreSQL T-shirt that I use to sleep sometimes.
I am not doing read/write at the same time. The load test reads from 1.5 million topics while no write operation occurs.
Does fast mean SSD? Then yes, it’s SSD.
You may want to consider having your load test use a pool of users that create posts, which might better simulate real world traffic and database state.
The usual traffic breakdown is a very high percentage of anon traffic. I just checked boingboing and for the last 30 days, 400k logged in pageviews vs 4.3m anon, so about 10%. Whoops, that was wrong, not sure where I was looking there.
Last 30 days all traffic, 4.7 million pageviews, of which 351,000 were logged in users. So 7.5% of traffic is logged in users.
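The corrected percentage can be checked with a couple of lines of arithmetic. The variable names are just for illustration; the two pageview figures are the ones quoted above.

```python
total_pageviews = 4_700_000   # all traffic, last 30 days
logged_in_pageviews = 351_000  # pageviews from logged-in users

share = logged_in_pageviews / total_pageviews
anon_pageviews = total_pageviews - logged_in_pageviews

print(f"logged in: {share:.1%}")         # logged in: 7.5%
print(f"anonymous: {anon_pageviews:,}")  # anonymous: 4,349,000
```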
Is this supposed to be read as “I have an unhealthy obsession with PostgreSQL”, or is it supposed to mean “The sum total of my PostgreSQL knowledge is a shirt that I got at a con”?
Quite similar: in our scenario, 10% of traffic comes from logged in users.
Our tests are showing around 3x performance gains on some slow queries when running PostgreSQL 10, thanks to the new parallel scans.
Do you know how close Azure is on the PostgreSQL 10 upgrade?
Query 1 PG 9.5: 1258.871 ms
Query 1 PG 10: 293.879 ms
Query 2 PG 9.5: 15532.489 ms
Query 2 PG 10: 4889.282 ms
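To put the "around 3x" figure next to the raw numbers, here is a small sketch that computes the speedup ratio for each query from the timings listed above. The helper name `speedup` is an assumption for the example.

```python
# (PG 9.5 ms, PG 10 ms) for each query, copied from the timings above.
timings_ms = {
    "Query 1": (1258.871, 293.879),
    "Query 2": (15532.489, 4889.282),
}


def speedup(before_ms, after_ms):
    """How many times faster the 'after' run is compared to 'before'."""
    return before_ms / after_ms


for name, (pg95_ms, pg10_ms) in timings_ms.items():
    print(f"{name}: {speedup(pg95_ms, pg10_ms):.2f}x faster on PG 10")
```

This works out to roughly 4.3x for Query 1 and 3.2x for Query 2.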
Azure PostgreSQL is not GA yet. I don’t know if an upgrade to 10 is on their roadmap.
Curious, where is Postgres at on Azure these days? What is the most powerful instance you can get hardware wise?
For now, it’s a server with 32 cores. A 64-core machine should be available in the coming months.
Note, for anyone trying to do their own Discourse load testing like Simon, not only do you need to edit the MAX_ADMIN_API_REQS, you also need to comment out
# - "templates/web.ratelimited.template.yml"
Otherwise, you’ll probably hit the nginx rate limits on requests per IP address. If you see response code 429 (Too Many Requests) even after increasing the MAX ADMIN API REQS… this is why.
Also note that YAML variables use “:”, so if you copy and paste Simon’s DISCOURSE_MAX_ADMIN_API_REQS_PER_KEY_PER_MINUTE=10000 you’ll get a YAML parsing error about a colon… Write it as DISCOURSE_MAX_ADMIN_API_REQS_PER_KEY_PER_MINUTE: 10000 instead.
Hi Ryan, do I need to rebuild Discourse for this change to become active?
Yes, you must do a rebuild of the container to activate this change.
Note that applies to any app.yml changes since they’re only read once, while the container is