Migration using the API

(Danny Goodall) #1

Hi, tl;dr I am seeing strange 10-20 minute pauses from an AWS instance running Discourse 1.7.4 on Ubuntu when trying to migrate from MVCForum using the Python API.

I’m evaluating Discourse to see if it is a suitable replacement for a heavily modified version of MVCForum.

I did first try to use the MVC Forum to Discourse SQL migration scripts by Michael Rätzel, but I ran into a problem in the F# code. (It is also not a perfect solution as it doesn’t know about the extensions we’ve made). Instead I decided to write a migration script in Python.

I’ve configured a Discourse instance using a Bitnami image of Discourse (1.7.4) on Ubuntu on AWS. I’m not precious about AWS / Bitnami and am happy to try any config / host that might be better suited. I also imagine that the image configuration is not tuned to the actual size/shape of the AWS host I push it to.

I wonder if it would be better for me to do the migration on a local instance, back it up and then restore to a cloud-hosted instance. Comments?

Anyway, I’m running into problems whereby the Amazon instance effectively stalls for 15-20 minutes. It becomes unresponsive to API requests as well as SSH sessions.

Amazon monitoring says the instance stays ‘healthy’, it seems to use no CPU cycles during this time and the connection from SSH isn’t lost - it’s just unresponsive.

The AWS instance is a t2.medium which supports bursty use of CPU, so I figured it might be being throttled by the instance itself. However, AWS claims that I will get constant (albeit slower) performance when the bursty CPU credit has been used up so this is not consistent with what I’m seeing. The t2.medium is 2 x vCPU and 3.75GB.

I know nothing about Ruby / Rails but am a part-time hacker in Python so I’m trying to use the pydiscourse module to manage the test migration.

What I ‘know’

  • I thought I might be being throttled by Discourse, but as I say the SSH sessions also fail to respond so it appears to be something at the OS level.
  • It seems to be related to sidekiq because if I try to visit the /sidekiq admin URL the unresponsiveness will follow shortly afterwards.
  • Sidekiq is running and stays running throughout.
  • I’m not seeing anything consistent in the /logs view. I did see a number of “too many open files” errors, but they didn’t coincide with the pauses, and I’ve increased the max files in Ubuntu with the same result.
  • I am running sar on the Ubuntu box to monitor resources. On average, when I see a ‘pause’ I have 1GB of free memory (sar -r), an approximate peak of 350 context switches per second (sar -w) and an approximate average of 40% CPU idle (sar -u).
  • I am seeing a lot of build up in sidekiq queues - jobs that are queued but don’t seem to be executing - but again nothing appears in /logs.
  • My email integration is configured to use sparkpost.com - and that shows that emails are being sent and received from the discourse instance and I am way below the daily/monthly limits of my eval plan.

Configuration changes I’ve made to allow me to use the API for migration

  • I have overridden a number of settings, largely to remove/increase the limits on making changes through the API - these can be seen below.

Things I’ve tried, but still fail to fix the problem

  • I’ve implemented rate limiting in my code so that I am only making 60-80 requests per minute and will then pause until I have more capacity.
  • I realised that each user I created was, by default, receiving an email for every reply to a post they had started. This was creating a great deal of email traffic, but it seemed to be being delivered OK. I’ve since turned this off for every user but the problem still persists.
  • In desperation I have deleted queues in sidekiq - I recognise that this may have caused more problems, but this was only after the issue was already happening.
  • As mentioned above, I’ve increased the open file limit in Ubuntu with no joy.
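For reference, the kind of client-side rate limiting mentioned above can be sketched roughly as follows. This is a minimal illustration, not the actual migration code: the class name `SlidingWindowLimiter` and its parameters are made up for this example, and the 60-per-minute default is just taken from the figures quoted in the list.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls calls in any rolling window of `period` seconds.

    Hypothetical helper for illustration; clock and sleep are injectable
    so the behaviour can be checked without real waiting.
    """

    def __init__(self, max_calls=60, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._stamps = deque()  # times at which recent calls were allowed

    def wait(self):
        """Block until another call is allowed, then record it."""
        while True:
            now = self._clock()
            # Forget calls that have dropped out of the rolling window.
            while self._stamps and now - self._stamps[0] >= self.period:
                self._stamps.popleft()
            if len(self._stamps) < self.max_calls:
                self._stamps.append(now)
                return
            # Window is full: sleep until the oldest call expires, then recheck.
            self._sleep(self.period - (now - self._stamps[0]))
```

The idea is to call `limiter.wait()` immediately before each API request, so the script never sends more than `max_calls` requests in any `period`-second window.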

This screenshot shows system activity leading up to a pause

This screenshot shows my /logs entries

Any advice gratefully received.

(David Taylor) #2

You’re probably better off following the official install guide rather than using the bitnami setup, that way you’ll be able to get more useful help here.

The symptoms you’re describing sound very similar to what I had when running a migration on AWS (ssh locks up, no response from server for about 15-20 mins).

Do you have swap enabled?
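One way to check this from the box itself, sketched in Python since that is what the migration script uses. It assumes a Linux host exposing `/proc/meminfo`; the function names are made up for this example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: kB} dictionary."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_summary(meminfo):
    """Report whether swap is configured and how much of it is in use."""
    total = meminfo.get("SwapTotal", 0)
    free = meminfo.get("SwapFree", 0)
    return {"enabled": total > 0, "total_kb": total, "used_kb": total - free}


if __name__ == "__main__":
    # Linux-only: read the live values from the kernel.
    with open("/proc/meminfo") as f:
        print(swap_summary(parse_meminfo(f.read())))
```

A `SwapTotal` of 0 means no swap is configured at all.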

(Danny Goodall) #3

Thanks for taking the time to respond, David.

I did consider swap when I originally tried Discourse on a 1CPU 2GB instance and saw the same problem. I assumed I was running out of memory and needed more.

I then tried the Medium instance with 3.75GB RAM, but when the problem re-appeared, I discounted swapping when there was apparently no contention for memory (~2GB free).

Is my understanding of Ubuntu swapping accurate - i.e. it only swaps if it needs to (insufficient physical memory)?

That said, I think your suggestion of using a Digital Ocean server and the official install guide makes a lot of sense.

I’ll try that and report back.

Cheers, Dan.

(Danny Goodall) #4

@David_Taylor, can I just ask, when you say “following the official install guide” - would that be the same as taking a pre-packaged DigitalOcean Discourse Droplet? I ask as DigitalOcean appears to be Discourse.org’s recommended hosting partner.

Or should I simply take a raw DigitalOcean Ubuntu distribution and follow the instructions here:

Thanks again, Dan.

(David Taylor) #5

This is the recommended install guide, it really does take less than 30 mins:

Personally I’d choose to follow that rather than using DigitalOcean’s prepackaged droplet.

(Rafael dos Santos Silva) #6

For migrations, we usually run a dev local instance to run it, and not a production install.

After the migration is done, you backup your local dev instance, upload to a production install and restore.

(Gerhard Schlager) #7

Yes. It is likely that your PC has a better CPU with higher single-core speed than the AWS instance.

A few thoughts:

  • Set up your Discourse instance with the official setup guide as @David_Taylor already mentioned.
  • Disable Sidekiq during the migration. You don’t need it during the import, especially when you’re using your PC.
  • Using the API is probably a lot slower than importing using Ruby code (which isn’t fast either) like all the existing import scripts do. At least take a look at the base_importer. It already does lots of important stuff like disabling emails…
  • I’m not sure if API requests (even the ones with full access) are rate limited.

(Matt Palmer) #8

If SSH is locking up, this isn’t anything related to Discourse specifically, but rather a problem with the machine and/or network. Excess swapping is a possibility, but I’d put it in the “unlikely” bucket because 15-20 minutes of absolutely no activity is maxtreme craziness for swap. It sounds more like network problems to me. A constant ping (or, better still, mtr) directed at the machine would be interesting to see.
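If ping or mtr isn’t handy, a rough stand-in can be sketched in Python. Note that this is not ICMP ping: it times TCP connects to the SSH port instead (an assumption made here because raw ICMP sockets usually need root), and the function names and defaults are invented for this example.

```python
import socket
import time


def probe(host, port=22, timeout=2.0):
    """Return the TCP connect time in seconds, or None if the connect fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:  # covers timeouts, refusals and unreachable hosts
        return None


def longest_outage(results, interval):
    """Longest run of consecutive failed probes, expressed in seconds."""
    longest = current = 0
    for rtt in results:
        current = current + 1 if rtt is None else 0
        longest = max(longest, current)
    return longest * interval


def monitor(host, count=60, interval=1.0):
    """Probe the host repeatedly, printing a timestamped line per attempt."""
    results = []
    for _ in range(count):
        rtt = probe(host)
        status = "no response" if rtt is None else f"{rtt * 1000:.1f} ms"
        print(time.strftime("%H:%M:%S"), status)
        results.append(rtt)
        time.sleep(interval)
    return results
```

Leaving `monitor(...)` running from another machine during a migration would show whether the stalls line up with the network path going quiet, as opposed to the box merely being slow.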

(Danny Goodall) #9

Thanks for the tips @gerhard, and excuse my ignorance but when you say “Disable Sidekiq” do you mean simply stopping the process at the OS level, or is there a Discourse setting that controls that?

Also, with Sidekiq disabled does that mean that when I re-enable it I will have lots of queued jobs that will need to complete before the forum is fully functional?

For anyone else that finds this thread, these are the base_importer settings that @gerhard was referring to…

      # Site settings the base importer overrides while an import runs.
      def get_site_settings_for_import
        {
          email_domains_blacklist: '',
          min_topic_title_length: 1,
          min_post_length: 1,
          min_first_post_length: 1,
          min_private_message_post_length: 1,
          min_private_message_title_length: 1,
          allow_duplicate_topic_titles: true,
          disable_emails: true,
          authorized_extensions: '*'
        }
      end

I’d already stumbled across most of them, but disable_emails was very helpful.

Cheers, Dan.

(Gerhard Schlager) #10

Yes, I meant stopping the Sidekiq process. It will start processing all the queued jobs when you enable it again. All the basic stuff like sending emails will work right away, but the postprocessing of imported posts (downloading embedded images, creating oneboxes,…) can take a long time.

(Danny Goodall) #11

Just wanted to close the loop on this for posterity.

Per @David_Taylor’s suggestions, I started a Digital Ocean instance of Discourse using the official installation / Docker and haven’t had any of the long pauses I experienced with the Bitnami image on Amazon Web Services.

HOWEVER, I did initially receive a lot of rate-limiting errors which I overcame by following the instructions in this post (with the following caveat).

I am left wondering whether the AWS instance was rate-limiting too, but instead of returning an error it simply sat there waiting?
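For anyone who would rather handle rate-limit errors in the script than change server settings, a retry-with-backoff wrapper is one option. This is only a sketch under assumptions: `RateLimited` is a hypothetical exception that your own wrapper would raise when the server answers HTTP 429, and it is not something pydiscourse is known to provide.

```python
import time


class RateLimited(Exception):
    """Hypothetical error: raise this when the server answers HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds suggested by the server, if any


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on RateLimited, wait and retry with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            if err.retry_after is not None:
                delay = err.retry_after  # prefer the server's hint
            else:
                delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
            sleep(delay)
```

Each API call would then be wrapped as `call_with_backoff(lambda: ...)`, so a rate-limit response slows the script down instead of aborting the run.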

Anyway, I don’t have the time or inclination to run up the AWS instance to test it as the Digital Ocean instance is running beautifully.

Thanks for the help!