Load test with Discourse


(Simon Wu) #1

Hi,

I am doing a load test for Discourse.
Here is my scenario:

  1. Take Discourse Meta for example. I append an api_key (generated from the admin portal) as a query-string parameter, so the URL I test is: https://meta.discourse.org/latest?api_key=XXXXX
  2. I remove the API rate limit by disabling the rate-limit template in app.yml and setting DISCOURSE_MAX_ADMIN_API_REQS_PER_KEY_PER_MINUTE=10000.
  3. Hit the URL with multiple agents.
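The steps above can be sketched as a minimal driver (a sketch only: the URL and key are placeholders, and the agent/request counts are arbitrary; a real test would use a proper load tool):

```python
import concurrent.futures
import urllib.parse
import urllib.request


def with_api_key(url: str, api_key: str) -> str:
    """Step 1: append api_key as a query-string parameter."""
    parts = urllib.parse.urlsplit(url)
    query = urllib.parse.parse_qsl(parts.query)
    query.append(("api_key", api_key))
    return urllib.parse.urlunsplit(parts._replace(query=urllib.parse.urlencode(query)))


def fetch(url: str) -> int:
    """Issue one GET and return the HTTP status code."""
    with urllib.request.urlopen(url) as resp:
        return resp.status


def run_load_test(base_url: str, api_key: str, agents: int = 20, total: int = 200):
    """Step 3: hit the URL from multiple agents (threads here)."""
    url = with_api_key(base_url, api_key)
    with concurrent.futures.ThreadPoolExecutor(max_workers=agents) as pool:
        statuses = list(pool.map(fetch, [url] * total))
    return statuses.count(200), total
```

Calling run_load_test("https://meta.discourse.org/latest", "XXXXX") would report how many of the requests came back 200 OK.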

Here is the result:


And the PostgreSQL CPU reaches 100%.

This is the result without the api_key as a query string:

As far as I know, anonymous requests get cached results from Redis, while authenticated requests do not. But the throughput of the load test with authenticated requests is lower than I expected.

Is there anything wrong with my load-test steps? Or a setting I missed?


(Sam Saffron) #2

Can you expand a bit on the test setup: how many unicorns, how is pg set up, and so on…


(Simon Wu) #3
  1. Front end: Azure Web Service for Containers (Standard S3, 4 cores, 7 GB RAM), 10 instances, each with 8 unicorn workers.
  2. Cache: Azure Redis Cache, Standard tier.
  3. SQL: Azure PostgreSQL service (preview), Standard, 800 Compute Units; roughly an 8-core VM.

To give more information, I found that the results vary across different api_keys.

  1. api_key A belongs to the admin user, which was used to create the 1.5 million topics of load-test data.
  2. api_key B belongs to a normal user with zero usage.

Load tests with api_key A run very slowly, as in the original post, while api_key B runs much faster.
Here is the throughput from api_key B:

I suspect a slow query in some user-related operation is causing the issue.
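One way to confirm a suspicion like this is to EXPLAIN the candidate query. This is only a sketch: the table and column names below are guessed from a stock Discourse schema, and the real query behind /latest is more complex.

```sql
-- Hypothetical shape of the suspect join; run against the test database.
EXPLAIN (ANALYZE, BUFFERS)
SELECT t.id
FROM topics t
LEFT JOIN topic_users tu ON tu.topic_id = t.id AND tu.user_id = 123
ORDER BY t.bumped_at DESC
LIMIT 30;
```

The ANALYZE output shows which node dominates the runtime, and BUFFERS shows whether it is hitting disk.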



(Matt Palmer) #4

If the PostgreSQL CPU is reaching 100%, then there’s your bottleneck. You’ll presumably need a bigger database instance, or to dig into what exactly the database is doing that’s so slow (via slow query logs, or otherwise).
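One common way to surface those queries is to log slow statements. A sketch only: the 500 ms threshold is an arbitrary example, and a managed service like Azure’s usually exposes this setting through its portal rather than ALTER SYSTEM.

```sql
-- Log every statement that takes longer than 500 ms.
ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();
```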


(Rafael dos Santos Silva) #5

Also, you may need to tune PostgreSQL to use your memory instead of the disk for sort operations and such.

The settings work_mem and shared_buffers are the principal ones.
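A sketch of what that tuning might look like (the values are illustrative, not recommendations, and on a managed service you would set these through the portal instead):

```sql
-- Illustrative values only: size these to your instance's RAM.
-- shared_buffers is commonly ~25% of RAM and needs a server restart;
-- work_mem is allocated per sort/hash operation, per backend.
ALTER SYSTEM SET shared_buffers = '2GB';
ALTER SYSTEM SET work_mem = '32MB';
SELECT pg_reload_conf();
```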


(Simon Wu) #6

Yes, the slow query is a join between the topic and topic_user_ids tables. When a user has created a lot of topics, showing the latest page becomes slow. I will try to tune it with a better PostgreSQL instance plus work_mem and shared_buffers. Thanks @mpalmer and @Falco


(Rafael dos Santos Silva) #7

I dunno how your load test works, but in real use, when a lot of users hit your site, the content is already there and PostgreSQL has had time to run VACUUM everywhere, so statistics are up to date and the planner’s plans are optimal. If you are creating thousands of topics/posts and reading at the same time, you may need more aggressive (or even manual) VACUUM triggers.

Also, if the disks on Azure PostgreSQL are fast, you may need to lower random_page_cost to a value close to seq_page_cost.
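A sketch of both suggestions (illustrative values; the stock defaults are random_page_cost = 4.0 and seq_page_cost = 1.0, and `topics` is an assumed table name):

```sql
-- On SSD-backed storage, random reads cost about the same as sequential ones.
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();

-- After bulk-loading test data, refresh table statistics manually:
VACUUM ANALYZE topics;
```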

Disclaimer: I may have a PostgreSQL T-shirt that I use to sleep sometimes.


(Simon Wu) #8

I am not doing reads and writes at the same time: the load test reads from the 1.5 million topics, with no write operations.

Does fast mean SSD? Then yes, it’s SSD :smile:


(Dave McClure) #9

You may want to consider having your load test use a pool of users that create posts, which might better simulate real world traffic and database state.
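A minimal way to sketch such a user pool is to cycle requests through several api_keys (hypothetical keys; in a real run each key would belong to a distinct user, and write traffic would also POST new content with those keys):

```python
import itertools


def url_pool(base_url: str, api_keys):
    """Yield request URLs that cycle through a pool of per-user api_keys."""
    for key in itertools.cycle(api_keys):
        yield f"{base_url}?api_key={key}"
```

Feeding these URLs to the load-test agents spreads the traffic across users instead of hammering one account.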


(Jeff Atwood) #10

The usual traffic breakdown is a very high anon traffic percentage. I just checked boingboing, and for the last 30 days it was 400k logged-in pageviews vs 4.3m anon, so about 10%. Whoops, that was wrong; not sure where I was looking there :confounded:

Last 30 days, all traffic: 4.7 million pageviews, of which 351,000 were from logged-in users. So 7.5% of traffic is logged-in users.


(Michael Howell) #12

Is this supposed to be read as “I have an unhealthy obsession with PostgreSQL”, or is it supposed to mean “The sum total of my PostgreSQL knowledge is a shirt that I got at a con”?


(Matt Palmer) #13

old-el-paso-girl-why-not-both


(Simon Wu) #14

Quite similar: in our scenario, 10% of traffic is logged-in users.


(Rafael dos Santos Silva) #15

Hey @SimonWu,

Our tests are showing around 3x performance gains on some slow queries when running PostgreSQL 10, thanks to the new parallel scans.

Do you know how close Azure is on the PostgreSQL 10 upgrade?

Some examples:

Query 1 PG 9.5: 1258.871 ms
Query 1 PG 10:   293.879 ms

Query 2 PG 9.5: 15532.489 ms
Query 2 PG 10:   4889.282 ms

(Simon Wu) #16

The Azure PostgreSQL service is not GA yet. I don’t know if an upgrade to 10 is on their roadmap.


(Sam Saffron) #17

Curious, where is Postgres at on Azure these days? What is the most powerful instance you can get, hardware-wise?


(Simon Wu) #18

For now, the largest is a server with 32 cores. 64-core machines should arrive in the coming months.


(Ryan Erwin) #19

Note, for anyone trying to do their own Discourse load testing like Simon: not only do you need to raise DISCOURSE_MAX_ADMIN_API_REQS_PER_KEY_PER_MINUTE, you also need to comment out web.ratelimited.template.yml.

Something like:

/var/discourse/containers/app.yml

templates:
  - "templates/postgres.template.yml"
  - "templates/redis.template.yml"
  - "templates/web.template.yml"
  # - "templates/web.ratelimited.template.yml"

...
env:
  LANG: en_US.UTF-8
  DISCOURSE_MAX_ADMIN_API_REQS_PER_KEY_PER_MINUTE: 10000
...

Otherwise, you’ll probably hit the nginx per-IP rate limits. If you see response code 429 (Too Many Requests) even after increasing the max admin API requests setting… this is why.

Also note that YAML assignments use “:”, so if you copy and paste Simon’s DISCOURSE_MAX_ADMIN_API_REQS_PER_KEY_PER_MINUTE=10000 verbatim, you’ll get a YAML parsing error about a colon…