Load test with Discourse


(Simon Wu) #1

Hi,

I am doing a load test for Discourse.
Here is my scenario:

  1. Take Discourse Meta for example. I append an api_key (generated from the admin portal) as a query-string parameter, so the URL I test is: https://meta.discourse.org/latest?api_key=XXXXX
  2. I remove the API rate limit by disabling the rate-limit template in app.yml and setting DISCOURSE_MAX_ADMIN_API_REQS_PER_KEY_PER_MINUTE=10000.
  3. Hit the URL with multiple agents.
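The steps above can be sketched as a minimal driver (a sketch only: the URL and key are placeholders, and the agent/request counts are arbitrary; a real test would use a proper load tool):

```python
import concurrent.futures
import urllib.parse
import urllib.request


def with_api_key(url: str, api_key: str) -> str:
    """Step 1: append api_key as a query-string parameter."""
    parts = urllib.parse.urlsplit(url)
    query = urllib.parse.parse_qsl(parts.query)
    query.append(("api_key", api_key))
    return urllib.parse.urlunsplit(parts._replace(query=urllib.parse.urlencode(query)))


def fetch(url: str) -> int:
    """Issue one GET and return the HTTP status code."""
    with urllib.request.urlopen(url) as resp:
        return resp.status


def run_load_test(base_url: str, api_key: str, agents: int = 20, total: int = 200):
    """Step 3: hit the URL from multiple agents (threads here)."""
    url = with_api_key(base_url, api_key)
    with concurrent.futures.ThreadPoolExecutor(max_workers=agents) as pool:
        statuses = list(pool.map(fetch, [url] * total))
    return statuses.count(200), total
```

Calling run_load_test("https://meta.discourse.org/latest", "XXXXX") would report how many of the requests came back 200 OK.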

Here is the result:


And the PostgreSQL CPU reaches 100%.

This is the result without the api_key as a query string:

As far as I know, anonymous requests get cached results from Redis, while authenticated requests do not. But the throughput of the load test with authenticated requests is lower than I expected.

Is there anything wrong with my load-test steps? Or a setting I missed?


(Sam Saffron) #2

Can you expand a bit on the test setup: how many unicorns, how is pg set up, and so on…


(Simon Wu) #3
  1. Front end: Azure Web Service for Containers (Standard S3, 4 cores, 7 GB RAM), 10 instances, each with 8 unicorn workers.
  2. Cache: Azure Redis Cache, Standard tier.
  3. SQL: Azure PostgreSQL service (preview), Standard, 800 Compute Units; roughly an 8-core VM.

To give more information, I found that the results vary across different api_keys.

  1. api_key A belongs to the admin user, which was used to create the 1.5 million topics of load-test data.
  2. api_key B belongs to a normal user with zero usage.

Load tests with api_key A run very slowly, as in the original post, while api_key B runs much faster.
Here is the throughput from api_key B:

I suspect a slow query in some user-related operation is causing the issue.
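One way to confirm a suspicion like this is to EXPLAIN the candidate query. This is only a sketch: the table and column names below are guessed from a stock Discourse schema, and the real query behind /latest is more complex.

```sql
-- Hypothetical shape of the suspect join; run against the test database.
EXPLAIN (ANALYZE, BUFFERS)
SELECT t.id
FROM topics t
LEFT JOIN topic_users tu ON tu.topic_id = t.id AND tu.user_id = 123
ORDER BY t.bumped_at DESC
LIMIT 30;
```

The ANALYZE output shows which node dominates the runtime, and BUFFERS shows whether it is hitting disk.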



(Matt Palmer) #4

If the PostgreSQL CPU is reaching 100%, then there’s your bottleneck. You’ll presumably need a bigger database instance, or to dig into what exactly the database is doing that’s so slow (via slow query logs, or otherwise).
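One common way to surface those queries is to log slow statements. A sketch only: the 500 ms threshold is an arbitrary example, and a managed service like Azure’s usually exposes this setting through its portal rather than ALTER SYSTEM.

```sql
-- Log every statement that takes longer than 500 ms.
ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();
```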


(Rafael dos Santos Silva) #5

Also, you may need to tune PostgreSQL to use your memory instead of the disk for sort operations and such.

The settings work_mem and shared_buffers are the principal ones.
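A sketch of what that tuning might look like (the values are illustrative, not recommendations, and on a managed service you would set these through the portal instead):

```sql
-- Illustrative values only: size these to your instance's RAM.
-- shared_buffers is commonly ~25% of RAM and needs a server restart;
-- work_mem is allocated per sort/hash operation, per backend.
ALTER SYSTEM SET shared_buffers = '2GB';
ALTER SYSTEM SET work_mem = '32MB';
SELECT pg_reload_conf();
```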


(Simon Wu) #6

Yes, the slow query is a join between the topic and topic_user_ids tables. When a user has created a lot of topics, showing the latest page becomes slow. I will try to tune it with a better PostgreSQL instance plus work_mem and shared_buffers. Thanks @mpalmer and @Falco


(Rafael dos Santos Silva) #7

I dunno how your load test works, but in real use, when a lot of users hit your site, the content is already there and PostgreSQL has had time to run VACUUM everywhere, so statistics are up to date and the planner’s plans are optimal. If you are creating thousands of topics/posts and reading at the same time, you may need more aggressive (or even manual) VACUUM triggers.

Also, if the disks on Azure PostgreSQL are fast, you may need to lower random_page_cost to a value close to seq_page_cost.
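A sketch of both suggestions (illustrative values; the stock defaults are random_page_cost = 4.0 and seq_page_cost = 1.0, and `topics` is an assumed table name):

```sql
-- On SSD-backed storage, random reads cost about the same as sequential ones.
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();

-- After bulk-loading test data, refresh table statistics manually:
VACUUM ANALYZE topics;
```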

Disclaimer: I may have a PostgreSQL T-shirt that I use to sleep sometimes.


(Simon Wu) #8

I am not doing reads and writes at the same time: the load test reads from the 1.5 million topics, with no write operations.

Does fast mean SSD? Then yes, it’s SSD :smile:


(Dave McClure) #9

You may want to consider having your load test use a pool of users that create posts, which might better simulate real world traffic and database state.
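A minimal way to sketch such a user pool is to cycle requests through several api_keys (hypothetical keys; in a real run each key would belong to a distinct user, and write traffic would also POST new content with those keys):

```python
import itertools


def url_pool(base_url: str, api_keys):
    """Yield request URLs that cycle through a pool of per-user api_keys."""
    for key in itertools.cycle(api_keys):
        yield f"{base_url}?api_key={key}"
```

Feeding these URLs to the load-test agents spreads the traffic across users instead of hammering one account.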


(Jeff Atwood) #10

The usual traffic breakdown is a very high anon traffic percentage. I just checked boingboing, and for the last 30 days it was 400k logged-in pageviews vs 4.3m anon, so about 10%. Whoops, that was wrong; not sure where I was looking there :confounded:

Last 30 days, all traffic: 4.7 million pageviews, of which 351,000 were from logged-in users. So 7.5% of traffic is logged-in users.


(Michael Howell) #12

Is this supposed to be read as “I have an unhealthy obsession with PostgreSQL”, or is it supposed to mean “The sum total of my PostgreSQL knowledge is a shirt that I got at a con”?


(Matt Palmer) #13

old-el-paso-girl-why-not-both


(Simon Wu) #14

Quite similar: in our scenario, 10% of traffic is logged-in users.


(Rafael dos Santos Silva) #15

Hey @SimonWu,

Our tests are showing around 3x performance gains on some slow queries when running PostgreSQL 10, thanks to the new parallel scans.

Do you know how close Azure is on the PostgreSQL 10 upgrade?

Some examples:

Query 1 PG 9.5: 1258.871 ms
Query 1 PG 10:   293.879 ms

Query 2 PG 9.5: 15532.489 ms
Query 2 PG 10:   4889.282 ms

(Simon Wu) #16

The Azure PostgreSQL service is not GA yet. I don’t know if an upgrade to 10 is on their roadmap.


(Sam Saffron) #17

Curious, where is Postgres at on Azure these days? What is the most powerful instance you can get, hardware-wise?


(Simon Wu) #18

For now, the largest is a server with 32 cores. 64-core machines should arrive in the coming months.


(Ryan Erwin) #19

Note, for anyone trying to do their own Discourse load testing like Simon: not only do you need to raise DISCOURSE_MAX_ADMIN_API_REQS_PER_KEY_PER_MINUTE, you also need to comment out web.ratelimited.template.yml.

Something like:

/var/discourse/containers/app.yml

templates:
  - "templates/postgres.template.yml"
  - "templates/redis.template.yml"
  - "templates/web.template.yml"
  # - "templates/web.ratelimited.template.yml"

...
env:
  LANG: en_US.UTF-8
  DISCOURSE_MAX_ADMIN_API_REQS_PER_KEY_PER_MINUTE: 10000
...

Otherwise, you’ll probably hit the nginx per-IP rate limits. If you see response code 429 (Too Many Requests) even after increasing the max admin API requests setting… this is why.

Also note that YAML assignments use “:”, so if you copy and paste Simon’s DISCOURSE_MAX_ADMIN_API_REQS_PER_KEY_PER_MINUTE=10000 verbatim, you’ll get a YAML parsing error about a colon…