My journey into a massive posts rebake job

(Bart) #7

Next issue: our new site will already go live while the posts:rebake job is running. Will having a large number of jobs in the default queue slow down regular site processes, and should I try to have posts:rebake start its jobs in the low priority queue instead? Or is this automatically handled?

So far, it seems that the queue a job is created in is a property of the job’s class, so I’m not sure I could influence this from within the posts.rake script.

If not, I’ll throttle the creation of new jobs to make sure the queue isn’t filling up.
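A minimal sketch of the throttling idea, written in plain Ruby so it runs without Discourse. The `queue_size` and `enqueue` lambdas are hypothetical stand-ins: in a real script they would wrap whatever your job backend offers for reading the queue length and enqueuing a job. `max_pending` and `pause` are illustrative values, not Discourse settings.

```ruby
# Enqueue one job per post id, but wait whenever the queue is too full.
# queue_size and enqueue are injected callables (assumptions, not Discourse APIs).
def throttled_enqueue(post_ids, queue_size:, enqueue:, max_pending: 1000, pause: 5)
  waits = 0
  post_ids.each do |id|
    # Back off until workers have drained the queue below the threshold.
    while queue_size.call >= max_pending
      waits += 1
      sleep(pause)
    end
    enqueue.call(id)
  end
  waits
end

# Tiny in-memory simulation: checking a full queue "processes" one job.
queue = []
size_with_drain = lambda do
  size = queue.size
  queue.shift if size >= 3 # pretend a worker finishes a job while we wait
  size
end

waits = throttled_enqueue((1..10).to_a,
                          queue_size: size_with_drain,
                          enqueue: ->(id) { queue << id },
                          max_pending: 3,
                          pause: 0)
puts "enqueued with #{waits} pauses, #{queue.size} jobs still pending"
```

The point of the design is that the producer never lets the backlog grow past `max_pending`, so regular site jobs are never stuck behind more than that many rebake jobs.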

0 Likes

(Kane York) #8

I think there’s also a ‘version’ column on the posts table that you can null out to cause gradual rebaking, too. I think it does 100 posts every time the job triggers.

4 Likes

(Jeff Atwood) #9

Does that version rebake task go in newest posts first order @sam?

0 Likes

(Sam Saffron) #10

Yes it does, changed that a while back:

Limit is still 100 @riking but can be configured per:

3 Likes

(Jay Pfaffman) #11

So rather than running rake posts:rebake, one should instead run Post.update_all('baked_version = NULL'), and all posts will then be rebaked in batches according to rebake_old_posts_count?

2 Likes

(Jeff Atwood) #12

We should normalize the rake task to go in descending ID order as well @techapj. Unless this is super hard, many hours of work, or something?

1 Like

(Sam Saffron) #13

Agree, but it is a bit tricky because we would need to carry a big list of ids in memory. I wonder if we should amend it so the rake task is resumable?

Have rake posts:rebake reset version and just work through old posts using calls to rebake_old

And add rake posts:rebake:resume that simply resumes an interrupted rebake.

Downside here is that posts:rebake would unconditionally cause posts to rebake at some point in time even if the task is interrupted, but this may not matter.

2 Likes

(Jeff Atwood) #14

Is carrying a list of integer IDs in memory really that expensive?

1 Like

(Sam Saffron) #15

We can probably live with it, to be honest … that keeps the tasks working exactly as they do today (in reverse order). Though something in me wants these tasks to be resumable, because if you are working through 20 million posts this can take many hours, and if it breaks halfway through it can be very frustrating to start from scratch.
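For a rough sense of the memory cost being discussed: on 64-bit MRI Ruby, small integers are stored directly in an array's 8-byte slots, so a back-of-the-envelope estimate is 8 bytes per id. This ignores array growth overhead and anything ActiveRecord allocates while loading the ids, so treat it as a lower bound rather than a measurement.

```ruby
BYTES_PER_SLOT = 8 # assumption: 64-bit MRI, ids small enough to be immediate values

# Approximate size in bytes of an Array holding `count` integer ids.
def id_list_bytes(count)
  count * BYTES_PER_SLOT
end

bytes = id_list_bytes(20_000_000)
puts format("%.1f MiB", bytes / (1024.0 * 1024.0))
```

Under that assumption, 20 million ids come to roughly 160 MB, which is noticeable but usually tolerable for a one-off maintenance task.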

5 Likes

(Jeff Atwood) #16

Maybe V1 can be the simple version with a comment

// TODO: make this resumable because carrying around 20 million ids in memory is not a great idea long term

6 Likes

(Arpit Jalan) #17

Done via:

4 Likes

(Neil Lalonde) #18

I’ve used a script that was resumable at the topic level by using the custom fields. Here’s one that skips private messages (since my import had a LOT of them and they weren’t a priority):

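# Walk regular topics only (Archetype.default excludes private messages).
Topic.includes(:_custom_fields).where(archetype: Archetype.default).find_each do |t|
  # Skip topics already handled by an earlier, interrupted run.
  unless t.custom_fields["import_rebake"].present?
    t.posts.select(:id).find_each do |post|
      Jobs.enqueue(:process_post, {post_id: post.id, bypass_bump: true, cook: true})
    end
    # Mark the topic as done so a rerun resumes from the next unmarked topic.
    t.custom_fields["import_rebake"] = Time.zone.now
    t.save
  end
end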

(This filled up Sidekiq’s default queue, so it’s not useful if you want to launch your site before the rebakes are completed.)

After they’re all done, all the TopicCustomField records with name “import_rebake” can be deleted.

5 Likes

(Kane York) #19

Yes, and @bartv would be able to get his “rebuild for just one topic” by doing:

Post.where(topic_id: 1234).update_all('baked_version = NULL')

4 Likes

(Bart) #20

What’s the frequency of these new batches, and how can you monitor the progress?

2 Likes

(Arpit Jalan) #21

This is now done via:

https://github.com/discourse/discourse/commit/536cef86f4d0a3526d33fd3feb54f03bead7fdd4

We no longer carry post ids in memory and the rebake task can be resumed by running posts:rebake_uncooked_posts.

One caveat here is that the resume task will not rebake posts in reverse order (i.e. the sort order will be id ascending).
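The resumable design can be illustrated without a database. In this sketch each post is a plain Struct with a `baked_version` field; resetting it to `nil` marks the post as needing a rebake, and the worker only looks at posts whose marker is still `nil`, in ascending id order. `FakePost`, `CURRENT_VERSION`, and `rebake_uncooked` are hypothetical names for illustration, not the code in the commit above.

```ruby
FakePost = Struct.new(:id, :baked_version)
CURRENT_VERSION = 2 # illustrative value

# Rebake posts whose marker is nil, lowest id first, stopping after `limit`
# posts to simulate an interruption. Returns the ids processed in this run.
def rebake_uncooked(posts, limit: nil)
  pending = posts.select { |p| p.baked_version.nil? }.sort_by(&:id)
  pending = pending.first(limit) if limit
  pending.each { |p| p.baked_version = CURRENT_VERSION } # stand-in for the real rebake
  pending.map(&:id)
end

posts = (1..5).map { |id| FakePost.new(id, 1) }
posts.each { |p| p.baked_version = nil }      # the "reset version" step

first_run  = rebake_uncooked(posts, limit: 2) # interrupted after two posts
second_run = rebake_uncooked(posts)           # resume: only unfinished posts remain
puts "first run: #{first_run.inspect}, resumed run: #{second_run.inspect}"
```

Because the "still to do" state lives on each record, nothing has to be kept in memory between runs, which is what makes the resume step cheap.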

6 Likes

(Clay Heaton) #22

So @techAPJ, if I need to trigger a rebake of every post on a Discourse install, is @pfaffman’s method the proper one to use?

0 Likes

(Arpit Jalan) #23

If you need to rebake all posts instantly then run bundle exec rake posts:rebake.

Post.update_all("baked_version = NULL") will rebake 100 posts (by default) every 15 minutes.
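To compare the two options, it helps to estimate how long the gradual route takes. This is a rough calculation, assuming the defaults stated above (100 posts per run, one run every 15 minutes) and ignoring any posts created or edited in the meantime:

```ruby
# Estimated hours until `post_count` posts have been rebaked in the background.
def gradual_rebake_hours(post_count, batch_size: 100, interval_minutes: 15)
  runs = (post_count.to_f / batch_size).ceil
  runs * interval_minutes / 60.0
end

puts gradual_rebake_hours(10_000)     # small forum
puts gradual_rebake_hours(20_000_000) # the 20 million post case mentioned earlier
```

With these defaults, 10,000 posts take about a day (25 hours), while 20 million would take 50,000 hours, so on a large site the gradual route is only practical if the batch size is raised.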

4 Likes

(Clay Heaton) #24

Thanks, Arpit.

FYI, I encountered some performance issues with that approach, so I went with this, which alleviated the problem and resulted in the same outcome:

Post.in_batches.update_all('baked_version = NULL')

5 Likes

#26

@techAPJ I have a dummy question. Where do you run this command? After entering the app?

It tells me

bash: syntax error near unexpected token ''baked_version = NULL''

0 Likes

(Arpit Jalan) #27

./launcher enter app
rails c
Post.in_batches.update_all('baked_version = NULL')

6 Likes