If it helps anyone else, we’ve found that our hardware, a dual-proc DigitalOcean droplet with 4GB, seems to fall behind with the default ‘rebake old posts’ setting of 80. We’ve been going for a few years and have a lot of image-intensive posts, so YMMV.
Our Sidekiq queue was clear of PostProcess jobs about 12 hours ago, but it’s crept back up to about 4,500 enqueued this morning, causing delays in processing new posts.
I’ve set the rebake_old_posts_count value to 20 and it seems to be slowly making headway now. The setting can be found in the admin site settings.
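In case it helps, the same value can also be set from a rails console inside the app container; a quick sketch, assuming a standard Discourse install:

```ruby
# From the host, enter the container and open a console:
#   ./launcher enter app
#   rails c
# Then lower the batch size so each periodic run enqueues fewer rebakes:
SiteSetting.rebake_old_posts_count = 20
```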
@sam even though this is an unusual rebake, being especially CPU intensive (and I imagine involving some S3 back-and-forth as well), is there a way this rebake value could be backed off when Sidekiq is this backlogged? I’m not asking for anything too complex an algorithm, just something that could automatically halve it if the system is increasingly getting behind?
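Something along these lines is roughly what I have in mind, just a sketch using Sidekiq’s queue API; the “low” queue name, the threshold and the halving are placeholders, not a concrete proposal:

```ruby
require "sidekiq/api"

# Hypothetical back-off: shrink the rebake batch while the low-priority
# queue is backed up, and fall back to the configured value once it clears.
def adaptive_rebake_count(configured_count, backlog_threshold: 1_000)
  backlog = Sidekiq::Queue.new("low").size
  return configured_count if backlog < backlog_threshold

  [configured_count / 2, 1].max # halve it, but never go below 1
end

# e.g. adaptive_rebake_count(SiteSetting.rebake_old_posts_count)
```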
I know you guys are (rightfully) hosting-tech agnostic with Discourse, but for people hosting images on AWS S3 the ideal would be moving the image processing to the AWS Lambda side. I’ve done that on Rails systems when it was JavaScript and ImageMagick (moving it away from background work on the Rails side) and it works pretty well, since S3 can trigger a function on ObjectCreated entirely on the AWS side, build a hierarchy of resized images, etc.
Theoretically it’s become even easier now with official AWS Lambda Ruby support. You might be able to use the same image processing code on the Lambda side. The downside would be the AWS config API cruft to set up the trigger events etc.
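To illustrate the shape of it, here’s a minimal Ruby Lambda handler sketch for an S3 ObjectCreated trigger; the key prefixes and sizes are made up, and it assumes the mini_magick gem plus an ImageMagick binary are packaged with the function:

```ruby
require "cgi"
require "aws-sdk-s3"
require "mini_magick"

S3 = Aws::S3::Client.new
SIZES = [1024, 512, 256] # hypothetical target widths

# Entry point, configured as e.g. `handler.resize` on the function.
def resize(event:, context:)
  event["Records"].each do |record|
    bucket = record["s3"]["bucket"]["name"]
    key    = CGI.unescape(record["s3"]["object"]["key"]) # event keys are URL-encoded
    next if key.start_with?("resized/") # don't re-trigger on our own output

    original = "/tmp/#{File.basename(key)}"
    S3.get_object(bucket: bucket, key: key, response_target: original)

    SIZES.each do |width|
      image = MiniMagick::Image.open(original)
      image.resize "#{width}x#{width}>" # shrink only, preserve aspect ratio
      S3.put_object(bucket: bucket,
                    key: "resized/#{width}/#{key}",
                    body: File.open(image.path, "rb"))
    end
  end
end
```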
Not that great an idea, but I thought I’d throw it out there for long-term chewing.
Yeah, it is an idea I have thought about in the past, but we would need a custom image for this to line up all the right versions of ImageMagick and so on. We may get there one day, but odds are this is going to be in the “mega enterprise” type of setup.
In theory you can run an AWS AutoScalingGroup that runs the Discourse image with 0 unicorns and some sidekiqs, and autoscale based on Sidekiq queue size.
We almost did this for a customer, but managed it another, cheaper way.
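For anyone curious what the scaling signal could look like, a rough sketch that publishes the Sidekiq backlog as a CloudWatch custom metric (the namespace and metric names are made up); the AutoScalingGroup would then scale worker instances against that metric:

```ruby
require "sidekiq/api"
require "aws-sdk-cloudwatch"

cloudwatch = Aws::CloudWatch::Client.new

# Run this periodically (cron, or a tiny job of its own) so the
# AutoScalingGroup has a metric to scale worker instances against.
backlog = %w[default low].sum { |q| Sidekiq::Queue.new(q).size }

cloudwatch.put_metric_data(
  namespace: "Discourse/Sidekiq",   # hypothetical namespace
  metric_data: [{
    metric_name: "QueueBacklog",    # hypothetical metric name
    value: backlog,
    unit: "Count"
  }]
)
```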
Running stuff with the same base image, instead of a custom Lambda setup (one that adheres to the Lambda function size restriction), allows us to have only one code path to keep working, instead of two different ways of doing the same thing to debug, update, keep in sync, etc.
…and it fitted what we needed, which was async processing of uploaded or edited images, with the processing happening entirely on the AWS side. It allowed us to go browser → S3 direct, and then by naming convention (or object path, in S3 terms) we could rely on the optimized/resized versions produced by convert without burdening the Rails-side queues with the work.
I think what you do with Discourse needs to cover many more feature points, and I agree with you that duplicating that side to support those with, or without, AWS would be a dev pain, and probably not worth it. I just like the general approach, as it is one of the few genuine cases where a Lambda function platform could be useful for surge-scale work.
Yep, the tradeoff is exactly this. We would be building a new path for 0.01% of the sites that would increase dev complexity by 100%.
Something we may pursue in custom plugins.
I was toying with a project for this, but for videos.
Transcoding those is out of reach for most VPSs out there, and offloading it is a more acceptable trade-off for a much better user experience, with multiple resolutions, codecs, etc.
We ended up using Elastic Transcoder for this, with some Lambdas triggered from S3 to feed the encoding pipeline with jobs. We used SNS notifications on encoding completion to remove the original unencoded blob, etc. It works pretty well, and ended up being the simplest thing that worked.
We tried various approaches, with pure Lambda and FFmpeg, and then a small autoscaled EC2 image, and in the end discovered we were pretty much rebuilding Elastic Transcoder anyway. As is usual with AWS, it’s not great value, but the dev experience worked well enough.
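Roughly, the glue Lambda looked like the sketch below; the pipeline and preset IDs are placeholders, and the SNS-triggered cleanup function is a separate handler not shown:

```ruby
require "cgi"
require "aws-sdk-elastictranscoder"

TRANSCODER  = Aws::ElasticTranscoder::Client.new
PIPELINE_ID = ENV["PIPELINE_ID"]        # placeholder, set on the function
PRESETS = {
  "720p" => ENV["PRESET_720P"],         # placeholder Elastic Transcoder preset ID
  "480p" => ENV["PRESET_480P"]          # placeholder Elastic Transcoder preset ID
}

# Triggered by S3 ObjectCreated on the raw uploads prefix.
def handler(event:, context:)
  event["Records"].each do |record|
    key = CGI.unescape(record["s3"]["object"]["key"])

    TRANSCODER.create_job(
      pipeline_id: PIPELINE_ID,
      input: { key: key },
      outputs: PRESETS.map do |label, preset_id|
        { key: "encoded/#{label}/#{key}", preset_id: preset_id }
      end
    )
  end
end
```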
Why can’t new posts jump the queue here, so they get processed first? It seems odd that background image rebuilds of old posts would pre-empt a new post someone just made.
Basically, real new posts someone just made get inserted at the head of the queue, while rebake requests for old existing posts get inserted at the back of the queue?
At the moment the rebake happens in the “low priority” queue. New posts happen in the “normal” queue.
However, if we cannot keep up we can starve the “low” queue, which also includes the jobs “notify mailing list”, “pull hotlinked images”, “create avatar thumbnails”, “anonymize user”, “send user emails”, and “update username for username renames”.
We can introduce a “super low priority” queue purely for rebakes, but it seems rather complicated to me.
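For context, the worker-side part of that would look something like this (class and queue names are illustrative, not our actual job classes); most of the complication is in how we would weight and drain that extra queue across all our processes:

```ruby
require "sidekiq"

# Illustrative only, not Discourse's actual job class.
class RebakeOldPost
  include Sidekiq::Worker
  # Park these in their own bottom-priority queue; the Sidekiq processes
  # would list "ultra_low" last / with the smallest weight.
  sidekiq_options queue: "ultra_low", retry: false

  def perform(post_id)
    post = Post.find_by(id: post_id)
    post&.rebake!
  end
end
```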
Conceptually it does seem there is a reasonable need for a “we don’t even care if this ever gets done, really” queue priority. Basically a priority below low, like “nice to have”.
Because we do care about the other stuff, but these rebakes? If they fail to happen it really does not even matter, does it? I mean we could start over, do it next month, etc.