Analyze Discourse performance using NGINX logs

:warning: this is stale and out-of-date historic information

Our official Docker container ships with a custom NGINX config that logs very rich information that can be used to debug server performance issues.

Create a report

To run it (you will need to be on a very recent install, you must rebuild to update NGINX config):

cd /var/discourse
./launcher enter app
cd /var/www/discourse/plugins/discourse-nginx-performance-report/script
ruby nginx_analyze.rb

Report contents

The report provides the following:

Overview of request counts for the duration of the log

Total Requests: 11105 ( MessageBus: 6757 )

We can see that we processed 11105 requests total, 6757 were consumed by message_bus. To learn more about message bus see this post or search meta

Top 30 IPs by server load

Which IPs are responsible for the largest amount of CPU time?

IP Address       Duration
----------       --------
100.43.90.12        16.71
10.66.0.21           7.45
10.66.0.22           7.31
... etc ... 

So in the above we can see that Yandex crawling is causing a fair amount of load. As well as a couple of private IP addresses.

Top 30 Users by Server Load

Top 30 users by Server Load

Username       Duration Reqs   Routes                                                                          
--------       -------- ----   ------                                                                          

[Anonymous]     3746.81 132094 topics/show(1110.60) user_avatars/show(814.98) 
                               user_avatars/show_letter(713.23)  
                               forums/status(418.98) static/cdn_asset(206.63)                                   

alep             232.11   1723 list/latest(232.11)                                                              

sam               36.82    518 topics/show(14.12) list/latest(6.86)
                               topics/timings(4.48) posts/create(1.69)     
                               draft/update(1.09)                                                               


codinghorror      33.05    618 topics/wordpress(11.91) topics/show(11.14)
                               topics/timings(3.33) post_actions/create(1.88) 
                               posts/create(1.75)    

You can see counts of requests, total times and a breakdown by route cost.

For example … in this report @alep hit list/latest our front page 1723 times resulting in 232 seconds of processing time.

Top 30 routes by Server Load

Route                       Duration
-----                       --------
topics/show                    59.24
user_avatars/show_letter       21.07
user_avatars/show              14.92
forums/status                  14.77
list/latest                     4.94
topics/timings                  4.10
static/cdn_asset                3.46
list/category_latest            0.98
topics/feed                     0.97

This information is particularly useful for quite a few reasons and can be used for diagnostics.

Routes at the top of the list are ones that have the most impact on global performance so it makes sense for us to focus on improving them (or cutting down calls to them) first.

Top 30 URLs by server load

Url                                                            Duration
---                                                            --------
GET /srv/status HTTP/1.1                                         14.77
HEAD /t/what-do-user-trust-levels-do/4924 HTTP/1.1                5.34
POST /topics/timings HTTP/1.1                                     4.10
GET /t/what-do-user-trust-levels-do/4924 HTTP/1.1                 2.68
GET / HTTP/1.1                                                    2.16       

This shows us which URL are responsible for the most amount of load on the server. Perhaps there is a topic that has performance issues? Perhaps there is a page that is visited too often by bots?

Top 30 not found URLs (404)

Url                                                                      Count
---                                                                      -----
GET //assets-meta-cdck-prod-meta.s3.dualstack.us-west-1.amazonaws.com/original/3X/e/1/e1fc4c5714ea45bc473feea787021d39f7133d22.png HTTP/1.1                                9
GET //assets-meta-cdck-prod-meta.s3.dualstack.us-west-1.amazonaws.com/original/3X/7/2/72d2687d1d00c52c546727efbbc6d726f6e153de.png HTTP/1.1                                2
GET /node/add HTTP/1.1                                                                                                                               1

We can use this report to find broken links, in some cases it may make sense to place a redirect in place to help recover the traffic.

Automatic daily performance reports

You can also enable automatic daily performance reports on your site by checking:

daily performance report in site settings.

This will give you daily reports delivered to a single discourse topic (under staff category)

You can shift it to another category and watch it and then you get the reports via email. I would strongly recommend not moving the topic to a public category as you will leak out ips and sensitive info.

These reports also enable you to discuss the results in a topic more easily

How it works

To achieve rich diagnostics, Discourse adds 2 additional headers to all requests in application_controller.rb

 def set_current_user_for_logs
    if current_user
       Logster.add_to_env(request.env,"username",current_user.username)
       response.headers["X-Discourse-Username"] = current_user.username
    end
    response.headers["X-Discourse-Route"] = "#{controller_name}/#{action_name}"
 end

Furthermore we define a custom NGINX log format:

log_format log_discourse '[$time_local] $remote_addr "$request" "$http_user_agent" "$sent_http_x_discourse_route" $status $bytes_sent "$http_referer" $upstream_response_time $request_time "$sent_http_x_discourse_username"';

This means that we track: User Agent, Duration, Status, Rails Duration, Referrer and more for every request.

Out of the box NGINX does not log the time it spends in the “upstream” proxy, this is added to our logs using $upstream_response_time

Given this rich information I took a stab at creating a “reporting tool” that analyzes the logs and gives you some interesting reports, the tool is early in development, but already useful: https://github.com/discourse/discourse/blob/master/script/nginx_analyze.rb

27 Likes