Hi, tl;dr I am seeing strange 10-20 minute pauses from an AWS instance running Discourse 1.7.4 on Ubuntu when trying to migrate from MVCForum using the Python API.
I’m evaluating Discourse to see if it is a suitable replacement for a heavily modified version of MVCForum.
I first tried the MVCForum-to-Discourse SQL migration scripts by Michael Rätzel, but I ran into a problem in the F# code. (It’s also not a perfect fit, as it doesn’t know about the extensions we’ve made.) Instead I decided to write a migration script in Python.
I’ve configured a Discourse instance using a Bitnami image of Discourse (1.7.4) on Ubuntu on AWS. I’m not precious about AWS / Bitnami and am happy to try any config / host that might be better suited. I also imagine that the image configuration is not tuned to the actual size/shape of the AWS host I push it to.
I wonder if it would be better for me to do the migration on a local instance, back it up, and then restore it to a cloud-hosted instance. Comments?
Anyway, I’m running into problems whereby the Amazon instance effectively stalls for 15-20 minutes. It becomes unresponsive to API requests as well as SSH sessions.
Amazon monitoring says the instance stays ‘healthy’; it seems to use no CPU cycles during this time, and the SSH connection isn’t lost - it’s just unresponsive.
The AWS instance is a t2.medium, which supports bursty use of CPU, so I figured it might be being throttled by the instance itself. However, AWS claims that I will get constant (albeit slower) performance once the CPU burst credits have been used up, so this is not consistent with what I’m seeing. The t2.medium has 2 vCPUs and 4GB of RAM.
I know nothing about Ruby / Rails but am a part-time hacker in Python so I’m trying to use the pydiscourse module to manage the test migration.
What I ‘know’
- I thought I might be being throttled by Discourse, but as I say, the SSH sessions also fail to respond, so it appears to be something at the OS level.
- It seems to be related to Sidekiq: if I visit the /sidekiq admin URL, the unresponsiveness follows shortly afterwards.
- Sidekiq is running and stays running throughout.
- I’m not seeing anything consistent in the /logs view. I did see a number of “too many open files” errors, but they didn’t coincide with the pauses, and I’ve increased the max open files limit in Ubuntu with the same result.
- I am running sar on the Ubuntu box to monitor resources. On average, when I see a ‘pause’ I have about 1GB of free memory (sar -r), a peak of roughly 350 context switches per second (sar -w), and around 40% CPU idle (sar -u).
- I am seeing a lot of build-up in the Sidekiq queues - jobs that are queued but don’t seem to be executing - but again, nothing appears in /logs.
- My email integration is configured to use sparkpost.com, which shows that emails are being sent and received by the Discourse instance, and I am way below the daily/monthly limits of my eval plan.
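On the “too many open files” point above: for anyone wanting to reproduce the change, I raised the limit via the usual Ubuntu knobs - roughly the fragment below (the 65536 value is illustrative, not necessarily what your workload needs; `ulimit -n` in a fresh session confirms it took effect).

```
# /etc/security/limits.conf (illustrative values)
*    soft    nofile    65536
*    hard    nofile    65536
```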
Configuration changes I’ve made to allow me to use the API for migration
- I have overridden a number of settings, largely to remove/increase the limits on making changes through the API - these can be seen below.
Things I’ve tried, but still fail to fix the problem
- I’ve implemented rate limiting in my code so that I am only making 60-80 requests per minute and will then pause until I have more capacity.
- I realised that each user I created was, by default, receiving email for every post made in reply to a post they started. This was creating a great deal of email traffic, but it seemed to be being delivered OK. I’ve since turned this off for every user, but the problem still persists.
- In desperation I have deleted queues in Sidekiq - I recognise this may have caused more problems, but I only did it after the issue was already happening.
- As mentioned above, I’ve increased the open file limit in Ubuntu with no joy.
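For reference, the client-side throttle I mentioned is essentially a sliding-window limiter like the sketch below (this is a simplified version, not my exact code; the injectable clock and sleep are only there so it can be exercised without real waiting, and the 60-per-minute default matches the rate I described):

```python
import time
from collections import deque

class RateLimiter:
    """Allow at most max_calls calls in any `period`-second window; block otherwise."""

    def __init__(self, max_calls=60, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock          # injectable so the sketch is testable
        self.sleep = sleep
        self.calls = deque()        # timestamps of recent calls

    def wait(self):
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call leaves the window, then re-check.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            while self.calls and now - self.calls[0] >= self.period:
                self.calls.popleft()
        self.calls.append(now)
```

Every pydiscourse request in my script is preceded by a `limiter.wait()`, e.g. `limiter.wait()` then `client.create_post(...)`, which is what keeps me in the 60-80 requests/minute range.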
This screenshot shows system activity leading up to a pause
This screenshot shows my /logs entries
Any advice gratefully received.