My journey through a large post rebake project

I'm continuing the conversation that started from "Rebake the HTML of an entire topic", since my experiments are heading in a completely different direction, and I think it may be valuable to share my thoughts and results along the way.

Here's my situation: we're about to launch a migrated forum with over 4 million posts. When we switch over to the final domain, these posts will need to be rebaked, and they also need processing to make sure images are embedded correctly, and so on.

My concerns:

  • Rebaking is not a fast process. I've tuned our 16 GB / 6-core server's configuration, but I can't seem to push it past 2-3 posts per second, which means the full rebake would take more than 20 days.
  • Rebaking starts from the oldest posts, while I'd rather start from the newest ones to give our community the best possible experience (assuming the newest posts will get the most traffic).
  • There is no way to "resume" the process where it was interrupted, and I have good reason to suspect it will need to be restarted at least once during those 20 days.
  • The rebake jobs go into the default Sidekiq queue, and I'm worried this will cause huge delays for regular processing jobs.

What I've done so far: after digging into the code and getting some help from the staff here, I modified lib/tasks/posts.rake so that it:

  • Works in reverse chronological order, starting from the newest posts.
  • Skips private messages, since I want to prioritize public topics.
  • Prints the current post/topic ID, so I can easily point the query's where clause at another post number to resume processing.

Here's my code:

def rebake_posts(opts = {})
  puts "NEW Rebaking post markdown for '#{RailsMultisite::ConnectionManagement.current_db}'"

  disable_edit_notifications = SiteSetting.disable_edit_notifications
  SiteSetting.disable_edit_notifications = true

  total = Post.count
  rebaked = 0

  # Collect ids newest-first, skipping private messages (non-default archetypes)
  ordered_post_ids = Post.joins(:topic)
    .where('topics.archetype' => Archetype.default)
    .order('posts.id DESC')
    .pluck(:id)

  # Pass false as the second argument so the last group isn't padded with nils
  ordered_post_ids.in_groups_of(1000, false).each do |post_ids|
    posts = Post.order(created_at: :desc).where(id: post_ids)
    posts.each do |post|
      rebake_post(post, opts)
      print_status(rebaked += 1, total)
      puts " > Rebaking post id #{post.id}, topic id #{post.topic_id}"
    end
  end

  SiteSetting.disable_edit_notifications = disable_edit_notifications

  puts "", "#{rebaked} posts done!", "-" * 50
end

Next step: I'm looking into how to create these jobs in a low-priority queue. Any tips are most welcome :slight_smile:

Now that I've started my first large test, I've noticed that job processing has made several huge 'steps' in speed. I suspect this may have to do with a large number of my attached images having been moved to the tombstone; that's another ongoing project.

This sounds like an improvement. Perhaps submit a PR.

And it may make sense to do something such that you don't have to rebake and un-tombstone.

The recover_from_tombstone script is a bit problematic - I’ve discovered several issues with it. I’ll report on those later.

Yes, this is very dumb, but Rails / ActiveRecord apparently has no concept of descending ID order when iterating through records (find_each and find_in_batches always walk the primary key in ascending order).

Yes I learned that too :slight_smile: With the help of your team I figured out how to work around it though. I’m not sure this is a smart or even fast way of doing it, but it works for me.

Next issue: our new site will already go live while the posts:rebake job is running. Will having a large number of jobs in the default queue slow down regular site processes, and should I try to have posts:rebake start its jobs in the low priority queue instead? Or is this automatically handled?

So far, it seems that the queue a job will be created in is a property of the job's class; I'm not sure whether I could influence this in some way from within the posts.rake script?

If not, I’ll throttle the creation of new jobs to make sure the queue isn’t filling up.
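One way to throttle, sketched under my own assumptions (the MAX_BACKLOG threshold and function names are made up for illustration): pause enqueueing while the default queue is deep. In a real Discourse install the depth check would be Sidekiq's `Sidekiq::Queue.new("default").size`; here it is passed in as a lambda so the sketch runs standalone.

```ruby
# Hypothetical throttle: back off whenever the queue backlog is too deep.
MAX_BACKLOG = 10_000  # assumed threshold, tune for your workers

def enqueue_with_throttle(post_ids, queue_size:)
  enqueued = 0
  post_ids.each_slice(1000) do |batch|
    # Wait until the workers have drained the queue below the threshold.
    sleep 1 while queue_size.call > MAX_BACKLOG
    # In Discourse this would be:
    #   batch.each { |id| Jobs.enqueue(:process_post, post_id: id) }
    enqueued += batch.size
  end
  enqueued
end

# Simulated queue depths: busy at first, then drained by the workers.
sizes = [12_000, 9_000, 4_000]
puts enqueue_with_throttle((1..2500).to_a, queue_size: -> { sizes.shift || 0 })
# => 2500
```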

I think there’s also a ‘version’ column on the posts table that you can null out to cause gradual rebaking, too. I think it does 100 posts every time the job triggers.

Does that version rebake task go in newest posts first order @sam?

Yes it does, changed that a while back:

Limit is still 100 @riking but can be configured per:

So rather than running rake posts:rebake, one should instead do Post.all.update_all('baked_version = NULL') and all posts will be rebaked in batches according to rebake_old_posts_count?

We should normalize the rake task to go in descending ID order as well @techapj. Unless this is super hard, many hours of work, or something?

Agree, but it is a bit tricky cause we would need to carry a big list of ids in memory. I wonder if we should amend it so the rake task is resumable?

Have rake posts:rebake reset version and just work through old posts using calls to rebake_old

And add rake posts:rebake:resume that simply resumes an interrupted rebake.

Downside here is that posts:rebake would unconditionally cause posts to rebake at some point in time even if the task is interrupted, but this may not matter.
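As an illustration of the resumable idea (a sketch under my own assumptions, not the design discussed above): checkpoint the last processed id to disk so an interrupted run can continue where it stopped. The file name and loop body are hypothetical; in the rake task the body would actually rebake the post.

```ruby
# Hypothetical resumable loop: since we work in descending id order,
# any id at or above the last checkpoint was already handled.
PROGRESS_FILE = "rebake_progress.txt"

def resume_point
  File.exist?(PROGRESS_FILE) ? File.read(PROGRESS_FILE).to_i : nil
end

def rebake_descending(post_ids)
  last_done = resume_point
  post_ids.each do |id|
    next if last_done && id >= last_done  # handled by a previous run
    # In Discourse: look up Post.find(id) and rebake it here.
    File.write(PROGRESS_FILE, id.to_s)    # checkpoint after each post
  end
end

rebake_descending([5, 4, 3, 2, 1])
puts File.read(PROGRESS_FILE)  # => 1
File.delete(PROGRESS_FILE)
```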

Is carrying a list of integer IDs in memory really that expensive?

we can probably live with it to be honest … that retains the tasks working exactly as they do today (in reverse order). Though something in me wants these tasks to be resumable cause if you are working through 20 million posts this can take many hours and if it breaks half way through it can be very frustrating to start from scratch.
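For a rough sense of scale (an editorial back-of-envelope, not a figure from the thread): on a 64-bit Ruby build each Array slot is one 8-byte VALUE, and small Integers are stored immediately in the slot rather than heap-allocated, so 20 million ids cost on the order of 160 MB before any other overhead.

```ruby
# Back-of-envelope: 20 million ids, 8 bytes per Array slot on 64-bit Ruby.
ids_count = 20_000_000
bytes = ids_count * 8
puts "~#{(bytes / 1024.0 / 1024.0).round} MB"  # => ~153 MB
```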

Maybe V1 can be the simple version with a comment

# TODO: make this resumable because carrying around 20 million ids in memory is not a great idea long term

Done via:

I’ve used a script that was resumable at the topic level by using the custom fields. Here’s one that skips private messages (since my import had a LOT of them and they weren’t a priority):

Topic.includes(:_custom_fields).where(archetype: Archetype.default).find_each do |t|
  unless t.custom_fields["import_rebake"].present?
    # Enqueue a rebake job for every post in this topic
    t.posts.select(:id).find_each do |post|
      Jobs.enqueue(:process_post, { post_id: post.id, bypass_bump: true, cook: true })
    end
    # Mark the topic so an interrupted run skips it next time
    t.custom_fields["import_rebake"] = Time.zone.now
    t.save
  end
end

(This filled up Sidekiq’s default queue, so it’s not useful if you want to launch your site before the rebakes are completed.)

After they’re all done, all the TopicCustomField records with name “import_rebake” can be deleted.

Yes, and @bartv would be able to get his “rebuild for just one topic” by doing:

Post.where(topic_id: 1234).update_all('baked_version = NULL')

What’s the frequency of these new batches, and how can you monitor the progress?