My journey into a massive posts rebake job

I’m continuing this conversation from ‘Rebuild HTML for entire topic’, as my experiments are going in quite a different direction and I thought there might be value in sharing my thoughts and results as I go along.

My situation is the following: we’re on the brink of launching a newly migrated forum with over 4M posts. These will require a rebake when we switch to the final domain, and the posts also need processing to make sure images are embedded correctly, among other things.

My concerns are:

  • Rebaking is not a fast process. I’ve tweaked our 16GB/6-core server, but can’t seem to get much faster than 2-3 posts/second, meaning the entire rebake will take well over 20 days.
  • Rebaking starts with the oldest posts; I’d prefer to start with the most recent ones to give our community the best possible experience (assuming that the newest posts will get the most traffic).
  • There’s no way to ‘resume’ the process where it left off, and I have reason to suspect I’ll need to rebuild at least once during the next 20 days.
  • Rebake jobs go into the default Sidekiq queue, and I’m concerned that this will create huge delays for regular processing jobs.

So far, I’ve done the following: after digging around in the code and getting some assistance from the staff here, I’ve hacked lib/tasks/posts.rake to:

  • Work in chronological reverse order, starting at the most recent posts.
  • Ignore private messages - I want to prioritise public topics first.
  • Output the current post/topic ID so I can easily add to the where clause of my query to resume processing at another post number.

Here’s my code:

def rebake_posts(opts = {})
  puts "NEW Rebaking post markdown for '#{RailsMultisite::ConnectionManagement.current_db}'"

  disable_edit_notifications = SiteSetting.disable_edit_notifications
  SiteSetting.disable_edit_notifications = true

  total = Post.count
  rebaked = 0

  ordered_post_ids = Post.joins(:topic)
    .where('topics.archetype' => Archetype.default)
    .order('posts.id DESC')
    .pluck(:id)

  # Pass `false` so in_groups_of doesn't pad the last group with nils
  ordered_post_ids.in_groups_of(1000, false).each do |post_ids|
    posts = Post.order(created_at: :desc).where(id: post_ids)
    posts.each do |post|
      rebake_post(post, opts)
      print_status(rebaked += 1, total)
      puts " > rebaking post id #{post.id} for topic id #{post.topic_id}"
    end
  end

  SiteSetting.disable_edit_notifications = disable_edit_notifications

  puts "", "#{rebaked} posts done!", "-" * 50
end
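
Since the ids are processed in descending order, resuming from the last printed post id is just a matter of constraining the query, e.g. adding .where('posts.id < ?', last_seen_id). The same idea as a pure-Ruby illustration (the helper name is hypothetical, not part of the rake task):

```ruby
# With ids processed in descending order, restarting after an interruption
# only needs the last id that was printed: everything still to do is below it.
def remaining_ids(ordered_ids_desc, last_processed_id)
  ordered_ids_desc.select { |id| id < last_processed_id }
end

remaining_ids([50, 40, 30, 20, 10], 30) # => [20, 10]
```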

Next up: I’m figuring out how to create these jobs in the low priority queue. Hints would be most welcome :slight_smile:

10 Likes

Now that I’ve started my first large test, I’ve noticed that the job processing has made several huge ‘steps’ in speed. I suspect this may have to do with a large number of my attached images having been moved to the tombstone - this is another ongoing project.

1 Like

This sounds like an improvement. Perhaps submit a PR.

And it may make sense to do something such that you don’t have to rebake and un-tombstone.

The recover_from_tombstone script is a bit problematic - I’ve discovered several issues with it. I’ll report on those later.

3 Likes

Yes, this is very dumb; however, Rails / ActiveRecord appears to have no concept of descending ID order when iterating through records.

Yes I learned that too :slight_smile: With the help of your team I figured out how to work around it though. I’m not sure this is a smart or even fast way of doing it, but it works for me.

1 Like

Next issue: our new site will already go live while the posts:rebake job is running. Will having a large number of jobs in the default queue slow down regular site processes, and should I try to have posts:rebake start its jobs in the low priority queue instead? Or is this automatically handled?

So far, it seems that the queue a job will be created in is a property of the job’s class; I’m not sure I could influence this in some way from within the posts.rake script?

If not, I’ll throttle the creation of new jobs to make sure the queue isn’t filling up.
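
One way to throttle without touching the job classes would be to check the queue depth between batches and pause when it gets too deep; Sidekiq exposes the depth via Sidekiq::Queue.new("default").size (in sidekiq/api). A minimal sketch with the decision logic kept as a pure helper - the threshold is an arbitrary example, not a recommendation:

```ruby
# Pure decision helper: should we pause enqueueing, given the current backlog?
# In the rake task, queue_size would come from Sidekiq::Queue.new("default").size
# (requires 'sidekiq/api'); 10_000 is an arbitrary example threshold.
def should_throttle?(queue_size, max_size = 10_000)
  queue_size > max_size
end

# In the rebake loop, between batches of 1000:
#   sleep 30 while should_throttle?(Sidekiq::Queue.new("default").size)
```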

I think there’s also a ‘version’ column on the posts table that you can null out to cause gradual rebaking, too. I think it does 100 posts every time the job triggers.

4 Likes

Does that version rebake task go in newest posts first order @sam?

Yes it does, changed that a while back:

https://github.com/discourse/discourse/blob/142571bba010eedbdfc1452d42beccc72389c373/app/models/post.rb#L480-L504

Limit is still 100 @riking but can be configured per:

https://github.com/discourse/discourse/blob/b87205831bf3d6c6226f628bc91dfa6d04534630/config/site_settings.yml#L1169-L1171
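
From reading that code, the selection amounts to roughly the following - a simplified pure-Ruby sketch of the logic, not the actual ActiveRecord query, and rebake_candidates is a made-up name:

```ruby
# Simplified sketch: pick up to `limit` posts whose baked_version is missing
# or older than the current version, newest (highest id) first.
def rebake_candidates(posts, current_version, limit)
  posts
    .select { |p| p[:baked_version].nil? || p[:baked_version] < current_version }
    .sort_by { |p| -p[:id] }
    .first(limit)
end

posts = [
  { id: 1, baked_version: 1 },
  { id: 2, baked_version: nil },
  { id: 3, baked_version: 2 },
]
rebake_candidates(posts, 2, 100).map { |p| p[:id] } # => [2, 1]
```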

3 Likes

So rather than running rake posts:rebake, one should instead do Post.update_all('baked_version = NULL') and all posts will be rebaked in batches according to rebake_old_posts_count?

3 Likes

We should normalize the rake task to go in descending ID order as well @techapj. Unless this is super hard, many hours of work, or something?

1 Like

Agree, but it is a bit tricky because we would need to carry a big list of ids in memory. I wonder if we should amend it so the rake task is resumable?

Have rake posts:rebake reset the version and just work through old posts using calls to rebake_old.

And add rake posts:rebake:resume that simply resumes an interrupted rebake.

Downside here is that posts:rebake would unconditionally cause posts to rebake at some point in time even if the task is interrupted, but this may not matter.
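
A checkpoint file would be one simple way to make the task resumable - a sketch, where the checkpoint path and helper names are made up for illustration:

```ruby
require 'tmpdir'

# Persist the last-processed post id so an interrupted run can pick up
# where it left off. (Sketch; the checkpoint location is arbitrary.)
CHECKPOINT = File.join(Dir.tmpdir, "rebake_checkpoint")

def save_checkpoint(post_id)
  File.write(CHECKPOINT, post_id.to_s)
end

def last_checkpoint
  File.exist?(CHECKPOINT) ? File.read(CHECKPOINT).to_i : nil
end
```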

3 Likes

Is carrying a list of integer IDs in memory really that expensive?
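
A quick back-of-envelope: a Ruby array stores one machine word per element, and small integers are immediate values (no separate object allocation), so even 20 million ids need only on the order of 160 MB of array storage on a 64-bit build:

```ruby
# Rough estimate: one 8-byte slot per id in the array on a 64-bit Ruby;
# small Integers are immediate values, so there's no extra per-object cost.
ids = 20_000_000
bytes_per_slot = 8
mb = ids * bytes_per_slot / (1024.0**2)
puts "~#{mb.round} MB" # prints "~153 MB"
```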

1 Like

We can probably live with it, to be honest … that retains the tasks working exactly as they do today (in reverse order). Though something in me wants these tasks to be resumable, because if you are working through 20 million posts this can take many hours, and if it breaks halfway through it can be very frustrating to start from scratch.

5 Likes

Maybe V1 can be the simple version with a comment

# TODO: make this resumable because carrying around 20 million ids in memory is not a great idea long term

6 Likes

Done via:

https://github.com/discourse/discourse/commit/adb93716ca7776d6f8bbf8f2680ede45fb267b4e

4 Likes

I’ve used a script that was resumable at the topic level by using the custom fields. Here’s one that skips private messages (since my import had a LOT of them and they weren’t a priority):

Topic.includes(:_custom_fields).where(archetype: Archetype.default).find_each do |t|
  unless t.custom_fields["import_rebake"].present?
    t.posts.select(:id).find_each do |post|
      Jobs.enqueue(:process_post, {post_id: post.id, bypass_bump: true, cook: true})
    end
    t.custom_fields["import_rebake"] = Time.zone.now
    t.save
  end
end

(This filled up Sidekiq’s default queue, so it’s not useful if you want to launch your site before the rebakes are completed.)

After they’re all done, all the TopicCustomField records with name “import_rebake” can be deleted.

6 Likes

Yes, and @bartv would be able to get his “rebuild for just one topic” by doing:

Post.where(topic_id: 1234).update_all('baked_version = NULL')

4 Likes

What’s the frequency of these new batches, and how can you monitor the progress?

2 Likes