Hi, I’ve completed a migration from phpBB3 to Discourse. For several hours the new install performed flawlessly, but now the server keeps falling over and returning 500 errors. Each time I have to reboot the server and then rebuild the app to access the site again. The last few times I’ve gone through this process, the site has stayed up for just seconds before falling over again.
The sidekiq queue has around 1,350,200 items to process. Could this be causing the server to fail? It’s running on a 4 GB / 2-core DO droplet that averages around 60% RAM usage when the site’s up.
Looking at the logs I see the following error has been logged 8 times:
You will have to let sidekiq do its job. It can take quite a while to empty the queue, and you might see some problems until the larger jobs get a chance to complete for the first time. After an import that’s to be expected. Rebooting won’t do you much good. The only thing that really helps is more and faster CPUs. Then you could tweak the number of sidekiq workers… otherwise you’ll need lots of patience.
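If you do go down that route, the sidekiq process count is set in the container config. A minimal sketch, assuming a standard Docker install where containers/app.yml holds the env settings (the value here is illustrative, not a recommendation):

# In containers/app.yml, under the env: section:
#   UNICORN_SIDEKIQS: 2
# Then rebuild for the change to take effect:
cd /var/discourse
./launcher rebuild app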
Thanks @gerhard (again!). The problem I’m facing, though, is that the site’s down until I run the reboot / rebuild. But then it goes down almost immediately, so sidekiq can’t run?
Let it run a couple of uninterrupted hours (e.g. overnight). The site should start working again after some jobs have finished in the background. I’ve seen the same behavior multiple times on some of the slower droplets.
Job exception: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
Would this be related or am I staring at yet another issue? There’s plenty of memory available:
So I’ve let it run overnight and the number of queued jobs has increased from 1,362,215 to 1,397,195. And again, as soon as I reboot / rebuild to be able to view the sidekiq dashboard, within seconds the site falls over and returns 500 errors.
Can the sidekiq worker run if the site is returning 500 errors?
Hmm, I would have expected it to start working again. Something is not right…
Sidekiq should be working in the background even though the server returns errors. An increasing number of enqueued jobs could be normal, as long as the count of processed jobs increased overnight as well. Did it?
You could try taking a look at Sidekiq in the rails console:
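Something along these lines, using the standard Sidekiq API (a sketch; the queue name may vary):

./launcher enter app
rails c
stats = Sidekiq::Stats.new
stats.enqueued     # jobs waiting across all queues
stats.processed    # lifetime processed counter - should keep climbing
stats.failed       # lifetime failed counter
Sidekiq::Queue.new("default").size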
The processed jobs aren’t increasing whilst the server is down. Each time I do a reboot / rebuild, the processed jobs increase for the 25-30 seconds the server is up, then they stop. When I do another reboot / rebuild cycle, the number of processed jobs is the same as when the server crashed.
When I run
./launcher enter app
rails c
I see the same MISCONF Redis error:
# rails c
bundler: failed to load command: pry (/var/www/discourse/vendor/bundle/ruby/2.4.0/bin/pry)
Redis::CommandError: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
Would exporting the Discourse database, building another fresh install, and then importing the data be a sensible route?
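For reference, the MISCONF error means Redis’s background save (bgsave) is failing, which is typically caused by the fork not getting enough memory or by a full disk. A sketch of the usual checks, assuming redis-cli is available inside the container:

./launcher enter app
redis-cli INFO persistence    # look for rdb_last_bgsave_status:err
exit
df -h                         # rule out a full disk
dmesg | grep -i oom           # look for the kernel's OOM killer
# Temporary workaround only, as it risks data loss on a crash:
# redis-cli CONFIG SET stop-writes-on-bgsave-error no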
@gerhard thank you again for your assistance! Upgrading the RAM has got the site running - it feels like such an obvious solution! I was relying on the DO dashboard, which was telling me the server’s RAM was only at 60% capacity. Lesson learned there!
It’s been running for 30 minutes without issue and sidekiq has worked through 20k of the 1.5m queued items!
This seems extreme now! We’re running an 8 GB / 4-core DO droplet that’s getting no traffic, and the site keeps falling over. Is this likely to be down to the sidekiq queue? The queue is currently at 3.1 million records and growing; it’s processed 600k in the last 36 hours, with all our outages.
# free -m
              total        used        free      shared  buff/cache   available
Mem:           7983        4728         132        1622        3122        1338
Swap:          2047          30        2017
We’re running with UNICORN_WORKERS set to 8.
At its current processing rate of 10 queued items a second, provided the server stays up, we’re looking at a further 86 hours. We imported around 220,000 posts; do these numbers/times seem normal?
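(The 86-hour figure is just the queue size divided by the processing rate:)

# 3,100,000 queued jobs at ~10 jobs/second, in hours:
echo $((3100000 / 10 / 3600))    # => 86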
I’m not familiar with the topic_notification_level_changed job. Is that the one that’s keeping your Unicorn workers busy? 10 jobs/second is very slow; I only experienced that when rebaking posts with a lot of graphics in them that needed resizing.
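One way to check is to see which job classes dominate the queue from the rails console. A sketch that samples the first 1,000 jobs rather than walking all 3 million (queue name assumed to be "default"):

./launcher enter app
rails c
counts = Hash.new(0)
Sidekiq::Queue.new("default").first(1000).each { |job| counts[job.klass] += 1 }
counts.sort_by { |_, n| -n }.first(5)   # the five most common job classes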