Locks up while running Sidekiq after migration

I’ve been trying to import my vBulletin forum into Discourse for a while now. I’ve been following this guide: Importing from vBulletin 4.

I’m running it on my MacBook using the Vagrant image, as described at https://github.com/discourse/discourse/blob/master/docs/VAGRANT.md.

I’ve managed to finish the main database import after about 56 hours (~750K posts, ~10K users), but I’m running into trouble with the Sidekiq task. It has queued up 2 million jobs, and while it starts off fine (CPU usage of the host going to 400%), it seems to just lock up after a while: host CPU usage sits at exactly 100%, and the Vagrant image becomes unresponsive. I can’t exit the main Discourse process, can’t exit Sidekiq, and htop (which I also had running in screen) was just as unresponsive.

I’ve run it again, this time without starting the main Discourse process and with htop in the foreground. It has locked up again; this is the last thing I could see in htop:

Nothing major, just a lot of processes running at low speeds, not even high CPU usage, but it’s still frozen.

Is this an issue in the software somewhere, or is this a known problem when running in Vagrant? Should I just get the whole thing set up locally? Or are there tricks to making this work?

Some technical details: VirtualBox gets access to 8 CPUs (autodetected by Vagrant; I have 4 physical CPU cores, 8 with hyperthreading) and 4 GB of memory, on a late 2013 MacBook Pro. I’ve tweaked the database and Sidekiq settings so that the DB has a pool of 25 and Sidekiq a concurrency of 10 (the -c argument).
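For reference, the Sidekiq side of that tweak is just the concurrency flag (this assumes you start Sidekiq by hand from the Discourse checkout; the pool value lives in config/database.yml):

# 10 worker threads; the DB pool (25) should comfortably exceed this
bundle exec sidekiq -c 10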

It looks like it’s in a deadlock of sorts.

I’m running the HEAD commit of the main Discourse repo, commit b963307 from about a week ago.

Any configuration tweaks I can do to help fix this?

Does it start going again if you kill and restart it?

Yeah, it seems to pick up where it left off, but after about half an hour, I think, it just stops again.

OOOH! OOOH! I think I know this one!

Check /var/docker/shared/standalone/log/rails/*.log for mentions of Redis::TimeoutError. If they’re in there, you’ve probably hit this doozy of a Docker bug. The workaround is to restart Docker (service docker restart) and the problem seems to go away.
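Something like this will tell you quickly whether you’ve hit it (assuming the standard Docker-based install, so the paths above apply; run as root):

# list any Rails logs that mention the Redis timeout
grep -l Redis::TimeoutError /var/docker/shared/standalone/log/rails/*.log
# if any files show up, restart Docker and the problem should clear
service docker restart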


@mpalmer I did see that one pass by, but I’m not running the thing in Docker at the moment, but in VirtualBox (using the Vagrant developer setup). It might still be the same issue, I guess; where’s the log file when you’re in development mode? I did see a development.log file, but it only has some SQL queries in it, nothing about Redis or anything like that.

I did just go and find the Redis log though, and behold!

1527:M 24 Dec 07:14:55.453 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
1527:M 24 Dec 07:14:55.457 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted.
1527:M 24 Dec 07:14:55.457 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.

Looks like Redis is trying to raise the file descriptor limit itself, but isn’t allowed to. Running out of file descriptors would explain a lock-up. However, cat /proc/sys/fs/file-max returns 811535, which should be plenty — though that’s the system-wide limit, while the 4096 Redis reports is the per-process ulimit -n (a sketch of raising that follows after the log below). Also, I think this might just be logged at startup; regular Redis logging while it’s doing work looks like this:

9624:C 22 Dec 14:22:27.188 * RDB: 98 MB of memory used by copy-on-write
1292:M 22 Dec 14:22:27.246 * Background saving terminated with success
1292:M 22 Dec 14:23:28.021 * 10000 changes in 60 seconds. Saving...
1292:M 22 Dec 14:23:28.023 * Background saving started by pid 9821
9821:C 22 Dec 14:23:33.794 * DB saved on disk
9821:C 22 Dec 14:23:33.794 * RDB: 98 MB of memory used by copy-on-write
1292:M 22 Dec 14:23:33.831 * Background saving terminated with success
1292:M 22 Dec 14:24:34.016 * 10000 changes in 60 seconds. Saving...
1292:M 22 Dec 14:24:34.018 * Background saving started by pid 10033
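For completeness, raising that per-process limit would look roughly like this (how Redis is launched in the Vagrant box determines where the setting actually needs to go, and the username is whatever user Redis runs as):

# check the current per-process limit in the shell that launches Redis
ulimit -n
# raise it for that session before starting Redis
ulimit -n 10032
# or persist it in /etc/security/limits.conf for the Redis user:
#   redis soft nofile 10032
#   redis hard nofile 10032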

Also when starting up Redis gives these warnings:

1527:M 24 Dec 07:14:55.458 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1527:M 24 Dec 07:14:55.458 # Server started, Redis version 3.0.7
1527:M 24 Dec 07:14:55.458 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
1527:M 24 Dec 07:14:55.458 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.

For the first warning, the TCP backlog setting, I’ve added sysctl -w net.core.somaxconn=65535 to /etc/rc.local, as described at https://github.com/docker-library/redis/issues/35.
I’ve also added vm.overcommit_memory = 1 to /etc/sysctl.conf and disabled transparent huge pages; the warnings on startup have disappeared. Let’s see what happens.
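Put together, the three fixes amount to this (run as root inside the VM; the /etc/sysctl.conf and /etc/rc.local entries are just there so the settings survive a reboot):

# TCP backlog (from the Redis/Docker issue linked above)
sysctl -w net.core.somaxconn=65535
# allow background saves under low-memory conditions
sysctl -w vm.overcommit_memory=1
# disable transparent huge pages, as the Redis warning suggests
echo never > /sys/kernel/mm/transparent_hugepage/enabled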

So far so good; it’s run for an hour and a half now and it’s still going. VirtualBox CPU usage on the host is 200-400%, and it seems to be doing a couple dozen tasks a minute. Not very fast (considering it has 2 million of them to get through), but hey, it works for now.

Nope; it did last longer this time, but it still ran into the deadlock situation after approximately 2 hours. I’ll just install and run it natively, then. I may add that as advice to the Vagrant documentation, in case anyone else tries to do this.