My journey into a massive posts rebake job

bartv · April 8, 2018, 12:25 PM

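This conversation continues from 'Rebuild HTML for entire topic'. My experiments are heading in a rather different direction, so I thought it would be worthwhile to share my thoughts and results as I go.

Here's my situation: we're about to launch a newly migrated forum with more than 4 million posts. When we switch to the final domain these posts will need a rebake, and some processing is also needed to make sure images are embedded correctly.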

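My concerns are:

Rebaking is not a fast process. I've tuned a 16 GB / 6-core server, but I can't get above 2 to 3 posts per second, so a full rebake is expected to take 20+ days.
The rebake starts with the oldest posts, but to give the community the best experience I'd like to start with the newest ones (assuming the most recent posts get the most traffic).
There is no way to resume from where the process left off, and I have reason to suspect we'll need to restart the rebake at least once in the next 20 days.
The rebake jobs go into the default Sidekiq queue, and I'm concerned this will cause significant delays for regular processing jobs.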
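As a rough check on the first concern, the expected duration follows directly from the post count and the throughput. This is a minimal sketch that assumes a constant rate and no downtime; the helper name `rebake_days` is made up for illustration.

```ruby
SECONDS_PER_DAY = 86_400.0

# Estimated wall-clock days to rebake total_posts at a steady posts_per_second.
def rebake_days(total_posts, posts_per_second)
  total_posts.to_f / posts_per_second / SECONDS_PER_DAY
end

puts rebake_days(4_000_000, 2).round(1) # slower end of the measured rate
puts rebake_days(4_000_000, 3).round(1) # faster end of the measured rate
```

At 2 posts per second this comes to roughly 23 days, and at 3 posts per second to roughly 15 days, so the "20+ days" figure corresponds to the slower end of the measured range.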

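So far I've done the following: after digging through the code and getting help from staff here, I hacked lib/tasks/posts.rake to change these things:

Work in reverse chronological order, starting with the newest posts.
Ignore private messages (to prioritize public topics first).
Print the current post/topic ID, so that I can add it to the query's where clause and resume from a different post number.

Here's my code: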

def rebake_posts(opts = {})
  puts "NEW Rebaking post markdown for '#{RailsMultisite::ConnectionManagement.current_db}'"

  disable_edit_notifications = SiteSetting.disable_edit_notifications
  SiteSetting.disable_edit_notifications = true

  # Newest first, public topics only (private messages are skipped).
  # To resume after an interruption, add e.g. .where("posts.id < ?", last_printed_id).
  ordered_post_ids = Post.joins(:topic)
    .where('topics.archetype' => Archetype.default)
    .order("posts.id DESC")
    .pluck("posts.id")

  # Count only the posts that will actually be processed.
  total = ordered_post_ids.size
  rebaked = 0

  # each_slice does not pad the last batch with nil (in_groups_of does by default).
  ordered_post_ids.each_slice(1000) do |post_ids|
    # Order by id inside the batch too, so the printed id is a valid resume point.
    Post.where(id: post_ids).order(id: :desc).each do |post|
      rebake_post(post, opts)
      print_status(rebaked += 1, total)
      puts " > rebaking post id #{post.id} for topic id #{post.topic_id}"
    end
  end

  SiteSetting.disable_edit_notifications = disable_edit_notifications

  puts "", "#{rebaked} posts done!", "-" * 50
end
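The batching and resume idea in the task above can be tried outside Rails. The following is only a sketch in plain Ruby: `descending_batches` and `resume_below` are hypothetical names, and the array of integers stands in for the plucked post ids.

```ruby
# Split ids into newest-first batches, optionally skipping everything at or
# above resume_below (the last id printed before an interruption).
def descending_batches(ids, batch_size:, resume_below: nil)
  selected = resume_below ? ids.select { |id| id < resume_below } : ids
  selected.sort.reverse.each_slice(batch_size).to_a
end

p descending_batches((1..7).to_a, batch_size: 3)
p descending_batches((1..7).to_a, batch_size: 3, resume_below: 5)
```

Note that passing the last printed id as `resume_below` skips that post itself, so if the interruption may have happened mid-post, resuming one id higher is the safer choice.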

Next step: I'm looking into how to create these jobs in the low priority queue. Hints are very welcome.

bartv · April 8, 2018, 12:43 PM

Now that I've started my first large test, I've noticed that job processing has made several huge 'steps' in speed. I suspect this is because a large number of my attached images have been moved to the tombstone, which is another ongoing project.

pfaffman · April 8, 2018, 1:12 PM

This sounds like an improvement. Perhaps submit a PR.

And it may make sense to do something so that you don't have to rebake and un-tombstone.

bartv · April 8, 2018, 4:19 PM

The recover_from_tombstone script is a bit problematic - I’ve discovered several issues with it. I’ll report on those later.

codinghorror · April 8, 2018, 6:26 PM

Yes this is very dumb, however it appears Rails / ActiveRecord has no concept of descending ID order when iterating through records, apparently.

bartv · April 8, 2018, 6:35 PM

Yes, I learned that too. With the help of your team I figured out how to work around it, though. I'm not sure this is a smart or even fast way of doing it, but it works for me.

bartv · April 8, 2018, 7:46 PM

Next issue: our new site will already go live while the posts:rebake job is running. Will having a large number of jobs in the default queue slow down regular site processes, and should I try to have posts:rebake start its jobs in the low priority queue instead? Or is this automatically handled?

So far, it seems that the queue a job is created in is a property of the job's class, so I'm not sure I could influence this from within the posts.rake script.

If not, I’ll throttle the creation of new jobs to make sure the queue isn’t filling up.
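The throttling fallback could look roughly like this. It is a sketch, not Discourse code: `enqueue_throttled` and its three callables are hypothetical hooks. In a real console the backlog might be read with something like `Sidekiq::Queue.new("default").size` and the job created with `Jobs.enqueue`, but both are assumptions here.

```ruby
# Enqueue one job per post id, pausing whenever the backlog reaches max_backlog.
# backlog: callable returning the current queue length (hypothetical hook)
# enqueue: callable that creates one job for a post id (hypothetical hook)
# pause:   callable invoked while waiting for workers to catch up
def enqueue_throttled(post_ids, max_backlog:, backlog:, enqueue:, pause: -> { sleep 1 })
  post_ids.each do |post_id|
    pause.call while backlog.call >= max_backlog
    enqueue.call(post_id)
  end
end

# Simulated run: an array plays the queue, and "pausing" lets a worker finish one job.
queue = []
finished = []
enqueue_throttled(
  (1..10).to_a,
  max_backlog: 3,
  backlog: -> { queue.size },
  enqueue: ->(id) { queue << id },
  pause: -> { finished << queue.shift }
)
puts "finished=#{finished.size} still queued=#{queue.size}"
```

Keeping the backlog small means regular jobs never wait behind more than `max_backlog` rebake jobs, at the cost of the script having to stay running for the whole rebake.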

riking · April 8, 2018, 8:24 PM

I think there's also a 'baked_version' column on the posts table that you can null out to cause gradual rebaking, too. I think it does 100 posts every time the job triggers.

codinghorror · April 8, 2018, 11:22 PM

Does that version rebake task go in newest posts first order @sam?

sam · April 8, 2018, 11:25 PM

Yes it does, changed that a while back:

github.com/discourse/discourse

app/models/post.rb

142571bba


      
def self.rebake_old(limit)
  problems = []
  Post.where('baked_version IS NULL OR baked_version < ?', BAKED_VERSION)
    .order('id desc')
    .limit(limit).pluck(:id).each do |id|
    begin
      post = Post.find(id)
      post.rebake!
    rescue => e
      problems << { post: post, ex: e }

      attempts = post.custom_fields["rebake_attempts"].to_i

      if attempts > 3
        post.update_columns(baked_version: BAKED_VERSION)
        Discourse.warn_exception(e, message: "Can not rebake post# #{p.id} after 3 attempts, giving up")
      else
        post.custom_fields["rebake_attempts"] = attempts + 1
        post.save_custom_fields
      end

This file has been truncated.

Limit is still 100 @riking but can be configured per:

github.com/discourse/discourse

config/site_settings.yml

b87205831


      
rebake_old_posts_count:
  default: 100
  min: 1
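To make the selection rule in rebake_old concrete, here is a plain-Ruby simulation of it. Everything in it is a stand-in: `FakePost`, the `BAKED_VERSION` value of 2, and the helpers `next_batch_ids` and `runs_needed` are hypothetical and only mirror the where/order/limit chain quoted above.

```ruby
BAKED_VERSION = 2 # hypothetical value; the real constant lives in Discourse
FakePost = Struct.new(:id, :baked_version)

# A post needs rebaking when it was never baked or was baked by an older version.
def stale?(post)
  post.baked_version.nil? || post.baked_version < BAKED_VERSION
end

# Ids one run would pick up: stale posts, newest id first, capped at limit.
def next_batch_ids(posts, limit)
  posts.select { |post| stale?(post) }.map(&:id).sort.reverse.first(limit)
end

# Number of runs needed to clear the current backlog at the given limit.
def runs_needed(posts, limit)
  (posts.count { |post| stale?(post) } / limit.to_f).ceil
end

posts = [
  FakePost.new(1, nil), FakePost.new(2, 2), FakePost.new(3, 1),
  FakePost.new(4, nil), FakePost.new(5, 2)
]
p next_batch_ids(posts, 2)
p runs_needed(posts, 2)
```

By the same arithmetic, 4 million stale posts at the default limit of 100 would take 40,000 runs of the job. How often the job fires is not stated in this thread.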

pfaffman · April 9, 2018, 12:54 AM

So rather than running rake posts:rebake, one should instead do Post.update_all('baked_version = NULL') and all posts will be rebaked in batches according to rebake_old_posts_count?

codinghorror · April 9, 2018, 7:06 AM

We should normalize the rake task to go in descending ID order as well @techapj. Unless this is super hard, many hours of work, or something?

sam · April 9, 2018, 7:17 AM

Agree, but it is a bit tricky cause we would need to carry a big list of ids in memory. I wonder if we should amend it so the rake task is resumable?

Have rake posts:rebake reset version and just work through old posts using calls to rebake_old

And add rake posts:rebake:resume that simply resumes an interrupted rebake.

Downside here is that posts:rebake would unconditionally cause posts to rebake at some point in time even if the task is interrupted, but this may not matter.

codinghorror · April 9, 2018, 7:37 AM

Is carrying a list of integer IDs in memory really that expensive?

sam · April 9, 2018, 7:40 AM

we can probably live with it to be honest … that retains the tasks working exactly as they do today (in reverse order). Though something in me wants these tasks to be resumable cause if you are working through 20 million posts this can take many hours and if it breaks half way through it can be very frustrating to start from scratch.
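For a sense of scale on the memory question, here is a back-of-the-envelope estimate. It assumes a 64-bit Ruby where each array slot takes 8 bytes and small integers are stored inline; it ignores array growth overhead and any temporary objects created while loading the ids. The helper name is made up.

```ruby
BYTES_PER_SLOT = 8 # assumption: 64-bit Ruby, integer ids stored inline in the array

# Approximate size in megabytes of an array holding id_count integer ids.
def id_list_megabytes(id_count)
  id_count * BYTES_PER_SLOT / 1_000_000.0
end

puts id_list_megabytes(20_000_000) # the 20 million posts mentioned above
puts id_list_megabytes(4_000_000)  # the 4 million posts from the first post
```

Under these assumptions the list itself is on the order of 160 MB for 20 million ids and 32 MB for 4 million.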

codinghorror · April 9, 2018, 7:40 AM

Maybe V1 can be the simple version with a comment

// TODO: make this resumable because carrying around 20 million ids in memory is not a great idea long term

techAPJ · April 9, 2018, 6:53 PM

Done via:

https://github.com/discourse/discourse/commit/adb93716ca7776d6f8bbf8f2680ede45fb267b4e

neil · April 9, 2018, 7:10 PM

I’ve used a script that was resumable at the topic level by using the custom fields. Here’s one that skips private messages (since my import had a LOT of them and they weren’t a priority):

Topic.includes(:_custom_fields).where(archetype: Archetype.default).find_each do |t|
  unless t.custom_fields["import_rebake"].present?
    t.posts.select(:id).find_each do |post|
      Jobs.enqueue(:process_post, {post_id: post.id, bypass_bump: true, cook: true})
    end
    t.custom_fields["import_rebake"] = Time.zone.now
    t.save
  end
end

(This filled up Sidekiq’s default queue, so it’s not useful if you want to launch your site before the rebakes are completed.)

After they’re all done, all the TopicCustomField records with name “import_rebake” can be deleted.
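The marker technique in that script can be illustrated without Rails. In this sketch a Set plays the role of the "import_rebake" custom field, and `rebake_topics` is a hypothetical helper; it is meant to show the resume behaviour, not to replace the script above.

```ruby
require "set"

# Process topics in order, skipping any topic already marked as done.
# topics: hash of topic id => array of post ids
# done:   set of finished topic ids (stands in for the custom field marker)
def rebake_topics(topics, done, &rebake)
  topics.each do |topic_id, post_ids|
    next if done.include?(topic_id)
    post_ids.each { |post_id| rebake.call(post_id) }
    done << topic_id # mark only after every post in the topic was handled
  end
end

topics = { 10 => [1, 2], 20 => [3, 4], 30 => [5] }
done = Set.new
seen = []

begin
  # First run is interrupted while working on post 4.
  rebake_topics(topics, done) { |id| raise "interrupted" if id == 4; seen << id }
rescue RuntimeError
  puts "stopped after topics #{done.to_a.inspect}"
end

# Second run picks up at the first unmarked topic.
rebake_topics(topics, done) { |id| seen << id }
p seen
```

Because the marker is written per topic, a topic that was interrupted midway is redone from its first post on resume. That costs some duplicate work but keeps the bookkeeping to one record per topic.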

riking · April 9, 2018, 7:45 PM

Yes, and @bartv would be able to get his “rebuild for just one topic” by doing:

Post.where(topic_id: 1234).update_all('baked_version = NULL')

bartv · April 9, 2018, 7:57 PM

What’s the frequency of these new batches, and how can you monitor the progress?
