Discourse unable to get through the calculate average timings job in a low-resource environment


(Alex Garnett) #1

I originally made this thread about the issue I’m having, but it wound up being sidetracked into IP header forwarding from the host nginx to the container (which was also a problem in my environment that I was happy to get resolved, though it seems it was only exacerbating the issue, not the root cause):

Basically, I’m running Discourse on minimum spec (1GB of memory plus 2GB of swap), and about once a day the site would go down or very nearly go down, chugging hard. If I looked at top while this was happening, I’d notice that a bit of swap had been released (200-300M or so) and, inevitably, a single postmaster process running for about two minutes, after which it would be killed and things would gradually return to normal.

I’ve just noticed that this happens when the container nginx logs rotate (i.e., access.log.1 is created), which seems like a big clue. Is your internal messaging queue (n.b. I don’t really know how Discourse works at all internally) somehow not handling this gracefully?


(Jay Pfaffman) #2

What version of Docker are you using? Or try doing a

./launcher rebuild app

Also, if you’re running stuff other than Discourse on the server, you probably just need more than 1GB of RAM.


(Alex Garnett) #3

This has been happening for at least the last couple of months of Discourse releases (during which I’ve been pretty good about updating).

Docker version 17.05.0-ce, build 89658be

And I don’t think it’s a memory issue – the system basically never dips into its second GB of swap as it is (I only added that second GB recently, trying to troubleshoot this issue).


(Matt Palmer) #4

Correlating this with nginx log rotation is an interesting clue… There is a process inside Discourse that creates a daily post in the “Website performance reports” topic in the “Staff” category, based on nginx log data, and IIRC it runs after log rotation. Given the disk I/O required to read the day’s log data and compress the previously rotated logs, plus the memory needed to generate the report (which on a memory-constrained system means swapping), I’m not surprised your system is “chugging” for a few minutes.

You can disable the report generation with the “daily performance report” site setting; untick that, and see if things get any better.
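If you’d rather do it from the rails console, something like this should work (a rough sketch; I’m assuming the setting’s internal name is daily_performance_report, so double-check the exact key):

# From the host: cd /var/discourse && ./launcher enter app, then run `rails c`.
# Assumes the "daily performance report" setting's internal name is
# daily_performance_report; verify the key if this raises NoMethodError.
SiteSetting.daily_performance_report          # current value
SiteSetting.daily_performance_report = false  # untick it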


(Sam Saffron) #5

That is disabled by default; also, it’s scheduled independently.

Log rotation is heavy on CPU and IO though, as you point out.

I am still extremely confused though, because you were seeing 429s, which means you hit a rate limit, and I am totally unclear on whether you fixed the IP address issue you had.


(Alex Garnett) #6

I did! Thanks for your help:


(Sam Saffron) #7

Let’s just put this out there… how fast is this server of yours:

dd if=/dev/zero of=/root/testfile bs=1G count=1 oflag=direct

Be sure to delete the test file once done.

Also, how fast is the CPU:

cat /proc/cpuinfo | grep 'model name' | uniq

(Alex Garnett) #8

Can do, but it’s just a DigitalOcean 1CPU/1GB instance, so it shouldn’t be too terrible or too exotic:

root@selectbutton:~# dd if=/dev/zero of=/root/testfile bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 28.0073 s, 38.3 MB/s
root@selectbutton:~# cat /proc/cpuinfo | grep 'model name' | uniq
model name      : Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz

(Sam Saffron) #9

This is somewhat confusing: this is the pg process. Are you 100% sure this is not a daily scheduled job causing it, rather than log rotation?

Try triggering the daily jobs here (also, look at the history there; are any jobs taking huge amounts of time?):

https://your-site.com/sidekiq/scheduler

Also, anything in:

https://your-site.com/logs


(Alex Garnett) #10

Oh, I didn’t realize that was Postgres; I thought it was a broker.

That looks like a likely candidate, based on when it last ran and the absence of a duration. Nothing else is reporting a >2m runtime (the closest is “EnsureDbConsistency” at 56s).


(Matt Palmer) #11

cough

module Jobs
  class CalculateAvgTime < Jobs::Scheduled
    every 1.day

    # PERF: these calculations can become exceedingly expensive
    #  they run a huge geometric mean and are hard to optimise
    #  defer to only run once a day
    def execute(args)
      # Update the average times
      Post.calculate_avg_time(2.days.ago)
      Topic.calculate_avg_time(2.days.ago)
    end
  end
end
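If you want to confirm this is the job biting you, you can run it by hand from the rails console and time it – a rough sketch (it will put the same load on Postgres, so do it at a quiet time):

# Inside the container (./launcher enter app), then `rails c`:
require 'benchmark'

# Run the scheduled job directly; args are unused by this job, so an empty hash is fine.
elapsed = Benchmark.realtime { Jobs::CalculateAvgTime.new.execute({}) }
puts "CalculateAvgTime took #{elapsed.round(1)}s"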

(Alex Garnett) #12

OK, so this is basically expected behaviour in a limited-resource environment? It still seems to block other aspects of the app while it’s running – as far as I can tell, the Ruby processes can’t get any CPU time during those couple of minutes – but if it’s not a bug as such, it makes sense that I was only seeing the 429s during this job’s run: that was when people kept refreshing the site while IP header forwarding wasn’t properly in place.

Can I at least reschedule that to run at 3am?


(Sam Saffron) #13

The query already only looks at topics that were bumped in the last 2 days. Do you have any giant topics that keep on getting bumped?

(Note this includes GIGANTIC private messages, which I think we can possibly remove from the query.)


(Alex Garnett) #14

Yup. That’s inherent to how this forum works…


(Sam Saffron) #15

Please explain.

Gigantic public topics OR gigantic private messages. How many of them do you have?


(Alex Garnett) #16

Probably about 20 topics that are >500 replies long at any given time? No PMs that huge, AFAIK.


(Sam Saffron) #17

Well, you are in a bit of a pickle here: there is no way of disabling the schedule short of writing a plugin, and your pg process does not appear to have enough grunt to run through the query it needs to.
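For what it’s worth, the plugin route is only a few lines – a rough, untested sketch that simply no-ops the job rather than removing it from the schedule:

# plugin.rb of a hypothetical local plugin (untested sketch)
# name: disable-calculate-avg-time
# about: no-ops the expensive CalculateAvgTime job on low-resource installs
# version: 0.1

after_initialize do
  # Reopen the job class and make execute a no-op; the scheduler still fires,
  # but the expensive avg-time calculations are skipped.
  class ::Jobs::CalculateAvgTime
    def execute(args)
      # intentionally empty
    end
  end
end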

Is an extra $10 a month out of the question here, considering that is about the cost of one beer?


(Alex Garnett) #18

I think you’ve been in San Francisco too long if you think that’s one beer :slight_smile:

Edit for posterity: I just checked and you’re in Sydney, which makes even more sense. Australians should know not to use beer-price analogies!

But OK, point taken: if that’s as far as we go with this, that’s as far as we go. I’ve been putting off bumping the server because it hasn’t really seemed like we need it, but hopefully post volume will catch up with me and force my hand on this one. Obviously I’d recommend trying to schedule the heavier jobs overnight upstream if possible, but I understand if it’s a wontfix.


(Sam Saffron) #19

A trick you can use is to trigger it manually at night; it will then not trigger again for another 24 hours. (But the scheduler always adds some jitter, so over time the timing skews.)
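Something like this from the host’s crontab could automate that (a rough sketch only – the docker exec invocation, paths, and environment handling are assumptions about a standard discourse_docker install, not a tested recipe):

# run_avg_time.rb -- drop this somewhere visible inside the container
# (e.g. the shared volume) and have cron on the host run it nightly, e.g.:
#
#   0 3 * * * docker exec app su discourse -c 'cd /var/www/discourse && RAILS_ENV=production bundle exec rails runner /shared/run_avg_time.rb'
#
# (The crontab line, container name, and paths above are assumptions; adjust to taste.)
Jobs::CalculateAvgTime.new.execute({})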

Extra memory is what your PG server wants, and once you are no longer struggling, you can even give the planner more memory if needed.

Sunny Sydney :sun_with_face: :beach_umbrella:, but like all hipsters, I prefer my beers fancy.


(Alex Garnett) #20

Any chance I can call it via cron once a month to jury-rig this?