Admin dashboard won’t load after upgrade to v2.2.0.beta3 +36

We recently updated to “v2.2.0.beta3 +36 tests-passed” and have started getting /admin/dashboard/general.json Server 502 errors.

On a quick search it looks very similar to this one:

…which was solved by @joffreyjaffeux but no solution given in the closed topic.

Is there anything diagnostic wise I can gather to help out on this, or any recent changes in the area meaning I should just try updating to a more recent?

Thanks for any help. This just seems isolated to the /admin otherwise we’re doing great.

Nothing of note in the logs, just the 502 from the admin .json, which looks to be a timeout:

/var/discourse/shared/standalone/log/rails# more unicorn.stderr.log
I, [2018-10-16T05:13:36.016595 #80] INFO -- : master done reopening logs
I, [2018-10-16T05:13:36.051096 #206] INFO -- : worker=0 done reopening logs
I, [2018-10-16T05:13:36.071959 #253] INFO -- : worker=2 done reopening logs
I, [2018-10-16T05:13:36.079130 #225] INFO -- : worker=1 done reopening logs
I, [2018-10-16T05:13:36.082894 #1264] INFO -- : worker=3 done reopening logs
E, [2018-10-16T17:02:00.830572 #80] ERROR -- : worker=1 PID:225 timeout (31s > 3
0s), killing
I, [2018-10-16T17:02:08.596101 #26765] INFO -- : worker=1 ready
E, [2018-10-16T17:10:47.306347 #80] ERROR -- : worker=2 PID:253 timeout (31s > 3
0s), killing
From https://github.com/discourse/discourse
e3c6dd2..b23ebf1 tests-passed -> origin/tests-passed
e3c6dd2..b23ebf1 master -> origin/master
 * [new branch] svg-icons -> origin/svg-icons
I, [2018-10-16T17:10:54.901712 #27709] INFO -- : worker=2 ready
D, [2018-10-16T17:11:17.368681 #80] DEBUG -- : waiting 16.0s after suspend/hiber
nation
E, [2018-10-16T18:51:14.634343 #80] ERROR -- : worker=3 PID:1264 timeout (31s > 
30s), killing
I, [2018-10-16T18:51:22.141804 #3833] INFO -- : worker=3 ready
D, [2018-10-16T18:51:44.697230 #80] DEBUG -- : waiting 16.0s after suspend/hiber
nation
E, [2018-10-16T19:07:10.412676 #80] ERROR -- : worker=0 PID:206 timeout (31s > 3
0s), killing
I, [2018-10-16T19:07:17.781331 #5188] INFO -- : worker=0 ready
E, [2018-10-16T19:16:05.922215 #80] ERROR -- : worker=2 PID:27709 timeout (31s >
 30s), killing
I, [2018-10-16T19:16:13.599249 #6114] INFO -- : worker=2 ready
E, [2018-10-16T19:20:28.217211 #80] ERROR -- : worker=2 PID:6114 timeout (31s > 
30s), killing
I, [2018-10-16T19:20:35.408446 #6574] INFO -- : worker=2 ready
D, [2018-10-16T19:20:58.285287 #80] DEBUG -- : waiting 16.0s after suspend/hiber
nation

Resources seems all ok?

>df -h
Filesystem                 Size  Used Avail Use% Mounted on
udev                       2.0G  4.0K  2.0G   1% /dev
tmpfs                      396M  408K  395M   1% /run
/dev/disk/by-label/DOROOT   79G   15G   60G  20% /
none                       4.0K     0  4.0K   0% /sys/fs/cgroup
none                       5.0M     0  5.0M   0% /run/lock
none                       2.0G  1.7M  2.0G   1% /run/shm
none                       100M     0  100M   0% /run/user
none                        79G   15G   60G  20% /var/lib/docker/aufs/mnt/b834156bb92ed60ad3d0683540e279fb90f980abb7625247e43db83f7a3cb640
shm                        512M  8.0K  512M   1% /var/lib/docker/containers/1ccedc3b8178735b0e091bbc1f42bbfdc6ba03b2676e81a21f210ff178c12d70/shm

>free -h
             total       used       free     shared    buffers     cached
Mem:          3.9G       3.7G       193M       1.0G        50M       1.5G
-/+ buffers/cache:       2.2G       1.7G
Swap:         2.0G        29M       2.0G

All of these paths work ok, it’s just the main /admin page (plus the moderation page is ok too):

admin/reports/signups
admin/reports/topics
admin/reports/posts
admin/reports/dau_by_mau
admin/reports/daily_engaged_users
admin/reports/new_contributors
admin/reports/top_referred_topics
admin/reports/trending_search

It seems like a pure timing issue, in that we up’d the timeout in

config/unicorn.conf.rb

from 30 to 60 seconds temporarily and now we can see the admin data ok.

As we’ve been running a few years, is there something we can optimize in our set-up to avoid the data taking a while to fetch? Our next stop is to pg analyze to see if a query is causing this, i.e. a missing index or something.

2 Likes

Something for @joffreyjaffeux to look at maybe?

1 Like

Hi,

nothing in /logs ? Don’t have much hope about this as charts are supposed to be ultra resilient now. So it’s probably coming from something else like disk-space, backups, or dashboard problems.

2 Likes

Hi,

Nothing in log/rails/production_error.log and just the timeout (as put above in this topic) in unicorn.stderr.log regarding the timeout.

It’s like the dashboard now takes about 35 seconds or so, so unicorn is killing it before it’s done. It only started happening on the recent update.

As put in the posts above, resources all look ok. A 2 vCPU DigitalOcean instance with 4GB memory, db_shared_buffers: “1024MB” and UNICORN_WORKERS: 4

Only other thing I thought of trying is db_work_mem has been left as default, so we could up that - do you think it would help the dashboard queries?

The forum performance is generally great, no issues, it’s just this dashboard page since the update this week.

I could update again, but thought it best to keep this config if it helped you track anything down.

I was meaning in « www.example.com/logs » but if you are sure you didnt miss anything ok.

Yep, they are good, in that no entries when causing this to 502.

Thanks Jeff, Joffrey has reached out via PM and we’re setting up some diagnostics to see what’s up. Cheers.

2 Likes

Throwing my hat into the ring as I am seeing the exact same issue since /admin/upgrade was performed… Still happens in Safe Mode with everything disabled.

1 Like

@joffreyjaffeux - let me know what you need from me to see what’s going on

1 Like

What plugins do you have? Remove all but the official plugin, then do a rebuild.

cd /var/discourse
./launcher rebuild app
2 Likes

I have all official except for Who’s Online.

We reverted just back to official plugins last week only but with the same result (no who’s online plugin etc).

I’m awaiting for @joffreyjaffeux to ping us back and haven’t wanted to change our environment before he can take a look, as I know recreating these things can sometimes be difficult. If I try a bunch of things then I might fix it but then we lose the opportunity for a common solution to make it back in the product.

Because our workaround is just a longer unicorn timeout then there isn’t a great urgency to try things to fix. My best guess would be there is some log or DB table that needs to be pruned by an update, and the queries for the dashboard just take a long time to complete (more than 30 seconds, so the /admin gets killed by Unicorn timeout). Either than or some sort of Postgres buffer tuning that #meta has that the stand-alone installs don’t have configured.

What/where did you tune for unicorn? I can make the adjustment until @joffreyjaffeux is able to figure this out.

There’s an ENV for it here:

container/app.yml

env:
UNICORN_TIMEOUT: 60

https://github.com/discourse/discourse/blob/master/config/unicorn.conf.rb#L36

3 Likes

Hmm yeah that didn’t work for me…

Our issues might be different then? Just to be more explicit on steps:

1 - Edit containers/app.yml via ssh, placing a new UNICORN_TIMEOUT: 60 line under the env section

2 - ./launcher rebuild app

Yeah that’s what I did…

env:
  ## How many concurrent web requests are supported?
  ## With 2GB we recommend 3-4 workers, with 1GB only 2
  UNICORN_WORKERS: 4
  UNICORN_TIMEOUT: 60

I have some big work to finish for tomorrow. Will have a look at this right after.

7 Likes