Analyzing Discourse Performance using NGINX logs


(Sam Saffron) #1

Our official Docker container ships with a custom NGINX config that logs very rich information that can be used to debug server performance issues.

To achieve rich diagnostics, Discourse add 2 additional headers to all requests in application_controller.rb

 def set_current_user_for_logs
    if current_user
       Logster.add_to_env(request.env,"username",current_user.username)
       response.headers["X-Discourse-Username"] = current_user.username
    end
    response.headers["X-Discourse-Route"] = "#{controller_name}/#{action_name}"
 end

Furthermore we define a custom NGINX log format:

log_format log_discourse '[$time_local] $remote_addr "$request" "$http_user_agent" "$sent_http_x_discourse_route" $status $bytes_sent "$http_referer" $upstream_response_time $request_time "$sent_http_x_discourse_username"';

This means that we track: User Agent, Duration, Status, Rails Duration, Referer and more for every request.

Out of the box NGINX does not log the time it spends in the “upstream” proxy, this is added to our logs using $upstream_response_time

Given this rich information I took a stab at creating a “reporting tool” that analyzes the logs and gives you some interesting reports, the tool is early in development, but already useful: https://github.com/discourse/discourse/blob/master/script/nginx_analyze.rb

To run it (you will need to be on a very recent install, you must rebuild to update NGINX config):

cd /var/discourse
./launcher enter app
cd /var/www/discourse/plugins/discourse-nginx-performance-report/script
ruby nginx_analyze.rb

The report provides the following:

###Overview of request counts for the duration of the log

Total Requests: 11105 ( MessageBus: 6757 )

We can see that we processed 11105 requests total, 6757 were consumed by message_bus. To learn more about message bus see this post or search meta

###Top 30 IPs by server load

Which IPs are responsible for the largest amount of CPU time?

IP Address       Duration
----------       --------
100.43.90.12        16.71
10.66.0.21           7.45
10.66.0.22           7.31
... etc ... 

So in the above we can see that Yandex crawling is causing a fair amount of load. As well as a couple of private IP addresses.

###Top 30 Users by Server Load

Top 30 users by Server Load

Username       Duration Reqs   Routes                                                                          
--------       -------- ----   ------                                                                          

[Anonymous]     3746.81 132094 topics/show(1110.60) user_avatars/show(814.98) 
                               user_avatars/show_letter(713.23)  
                               forums/status(418.98) static/cdn_asset(206.63)                                   

alep             232.11   1723 list/latest(232.11)                                                              

sam               36.82    518 topics/show(14.12) list/latest(6.86)
                               topics/timings(4.48) posts/create(1.69)     
                               draft/update(1.09)                                                               


codinghorror      33.05    618 topics/wordpress(11.91) topics/show(11.14)
                               topics/timings(3.33) post_actions/create(1.88) 
                               posts/create(1.75)    

You can see counts of requests, total times and a breakdown by route cost.

For example … in this report @alep hit list/latest our front page 1723 times resulting in 232 seconds of processing time.

###Top 30 routes by Server Load

Route                       Duration
-----                       --------
topics/show                    59.24
user_avatars/show_letter       21.07
user_avatars/show              14.92
forums/status                  14.77
list/latest                     4.94
topics/timings                  4.10
static/cdn_asset                3.46
list/category_latest            0.98
topics/feed                     0.97

This information is particularly useful for quite a few reasons and can be used for diagnostics.

Routes at the top of the list are ones that have the most impact on global performance so it makes sense for us to focus on improving them (or cutting down calls to them) first.

Top 30 urls by server load

Url                                                            Duration
---                                                            --------
GET /srv/status HTTP/1.1                                         14.77
HEAD /t/what-do-user-trust-levels-do/4924 HTTP/1.1                5.34
POST /topics/timings HTTP/1.1                                     4.10
GET /t/what-do-user-trust-levels-do/4924 HTTP/1.1                 2.68
GET / HTTP/1.1                                                    2.16       

This shows us which URL are responsible for the most amount of load on the server. Perhaps there is a topic that has performance issues? Perhaps there is a page that is visited too often by bots?

Top 30 not found urls (404)

Url                                                                      Count
---                                                                      -----
GET //discourse-meta.s3-us-west-1.amazonaws.com/original/3X/e/1/e1fc4c5714ea45bc473feea787021d39f7133d22.png HTTP/1.1                                9
GET //discourse-meta.s3-us-west-1.amazonaws.com/original/3X/7/2/72d2687d1d00c52c546727efbbc6d726f6e153de.png HTTP/1.1                                2
GET /node/add HTTP/1.1                                                                                                                               1

We can use this report to find broken links, in some cases it may make sense to place a redirect in place to help recover the traffic.


Google Chrome extension for discourse notifications
Image preview while posting through mobile?
User Profile Views - dramatic increase in count
Discourse unavailable with high load average
(Sam Saffron) #2

I just made this even awesomer, you can enable automatic daily performance reports on your site by checking:

daily performance report in site settings.

This will give you daily reports delivered to a single discourse topic (under staff category)

You can shift it to another category and watch it and then you get the reports via email. I would strongly recommend not moving the topic to a public category as you will leak out ips and sensitive info.

There is enormous amount of power in have a place to talk about this stuff.


(Daniel) #3

I enabled daily performance report but this was the output in a topic in the staff category overnight:

Report is only available in latest image, please run:

cd /var/discourse
./launcher rebuild app

I did a git pull; ./launcher rebuild app yesterday and can confirm /var/www/discourse/script/nginx_analyze.rb exists and works.

This version is installed: v1.4.0.beta4 +19

Any ideas?


(Kane York) #4

Has the report ran for you now? It’s been a day.


(Daniel) #5

Nope - just more replies by system with:

Report is only available in latest image, please run:

cd /var/discourse
./launcher rebuild app

(Sam Saffron) #6

Hmm, is there anything custom about your setup?


#7

Since this is reported in a normal reply, would it be possible to wrap the lines sooner to avoid horizontal scrollbars?


(Lee_Ars) #8

Same thing with me—every day it’s “Report is only available in latest image,” in spite of having done a launcher rebuild app that morning. And same as with @danwad—actually running nginx_analyze.rb from within the container works without error.


(Sam Saffron) #9

Can you run from the command line and see what happens?

You just need to run “ruby /var/www/discourse/script/nginx_analyze.rb” after entering container.

Is there anything in logs?


(Lee_Ars) #10

@sam Yep, runs fine, produces plenty of output.


(Sam Saffron) #11

what are permissions on /var/log/nginx/


(Lee_Ars) #12

Looks like this:

drwxr-x---  2 www-data www-data   4096 Jul  8 07:35 nginx/

(Sam Saffron) #13

on the files as well ?

any error in logs when running the job?

what does this do?

rails c
> Jobs:: DailyPerformanceReport.new.execute(nil)

Daily perf reports asking to rebuild container on new install
(Lee_Ars) #14

Looks like the contents of the directory are set variously:

root@frylock-app:/# ll /var/log/nginx/
total 8128
drwxr-x---  2 www-data www-data    4096 Jul  8 07:35 ./
drwxrwxr-x 11 root     syslog      4096 Jul  7 07:35 ../
-rw-r-----  1 www-data www-data 3144191 Jul  8 23:16 access.log
-rw-r-----  1 www-data www-data 5033222 Jul  8 07:34 access.log.1
-rw-r--r--  1 root     root      117436 Jul  7 07:34 access.log.2.gz
-rw-r-----  1 www-data www-data    1168 Jul  8 17:55 error.log
-rw-r--r--  1 root     root         802 Jul  6 17:49 error.log.1

No errors are being logged at the /logs URL, unfortunately.

However, running that command directly at the rails console gives a permission denied error, so that’s probably our issue!

root@frylock-app:/# rails c
[1] pry(main)> Jobs:: DailyPerformanceReport.new.execute(nil)
/var/www/discourse/script/nginx_analyze.rb:121:in `initialize': Permission denied - /var/log/nginx/access.log (Errno::EACCES)
    from /var/www/discourse/script/nginx_analyze.rb:121:in `open'
    from /var/www/discourse/script/nginx_analyze.rb:121:in `block in analyze'
    from /var/www/discourse/script/nginx_analyze.rb:120:in `each'
    from /var/www/discourse/script/nginx_analyze.rb:120:in `analyze'
    from /var/www/discourse/script/nginx_analyze.rb:54:in `analyze'
    from /var/www/discourse/script/nginx_analyze.rb:163:in `<main>'
=> #<Post id: 12887, user_id: -1, topic_id: 1020, post_number: 8, raw: "Report is only available in latest image, please r...", cooked: "<p>Report is only available in latest image, pleas...", created_at: "2015-07-08 23:17:27", updated_at: "2015-07-08 23:17:27", reply_to_post_number: nil, reply_count: 0, quote_count: 0, deleted_at: nil, off_topic_count: 0, like_count: 0, incoming_link_count: 0, bookmark_count: 0, avg_time: nil, score: nil, reads: 0, post_type: 1, vote_count: 0, sort_order: 8, last_editor_id: -1, hidden: false, hidden_reason_id: nil, notify_moderators_count: 0, spam_count: 0, illegal_count: 0, inappropriate_count: 0, last_version_at: "2015-07-08 23:17:27", user_deleted: false, reply_to_user_id: nil, percent_rank: 1.0, notify_user_count: 0, like_score: 0, deleted_by_id: nil, edit_reason: nil, word_count: 16, version: 1, cook_method: 1, wiki: false, baked_at: "2015-07-08 23:17:27", baked_version: 1, hidden_at: nil, self_edits: 0, reply_quoted: false, via_email: false, raw_email: nil, public_version: 1>

I could chmod access.log so everyone has read access—that might do it.

edit - Well, chmodding it (along with the nginx subdirectory) to 644 didn’t do it (same error). Should I go more permissive?


(Sam Saffron) #15

Something is odd:

Why is this not running?

Also, it should work now according to perms (try again from /sidekiq/schedular)

cause of this:

(chpst -u discourse:www-data)

Are you on an old image? (fixed to an old image)


(Lee_Ars) #16

I last updated yesterday AM, but I’ll run launcher rebuild app right now and try again. One moment.

Completed the rebuild, same error when executing from the rails console.

/var/www/discourse/script/nginx_analyze.rb:121:in `initialize': Permission denied - /var/log/nginx/access.log (Errno::EACCES)

The update also reset the permission changes I made (which I’d expect to happen).

Confirmed that my web.template.yml lines 36-42 and the LD_PRELOAD line both match yours exactly.

Tried triggering the job from the Sidekiq scheduler tab and got the same error :frowning:

edit -

I don’t know if this is entirely permission related. I just did a chmod -R 777 /var/log/nginx to just blow everythign wide open, and I get the exact same error in the rails console when executing the job:

/var/www/discourse/script/nginx_analyze.rb:121:in `initialize': Permission denied - /var/log/nginx/access.log (Errno::EACCES)

edit again:
chpst -u discourse:www-data touch /var/log/nginx/ works. But chpst -u discourse:www-data touch /var/log/nginx/access.log yields access denied. What the fuck seriously.


(Sam Saffron) #17

hmm … can you reset permissions recursively on that directory?


(Lee_Ars) #18

Just nuked them all with another launcher app rebuild, so everything’s reset now (and I’m super-duper updated!).


(Lee_Ars) #19

FYI @sam, still getting the same Permission denied message today after doing a launcher rebuild app.


(Lee_Ars) #20

Sadly, still happening even this morning after an update to 1.4.0.beta5.