My journey through a massive post-rebaking job

I'm continuing this conversation from 'Rebuild HTML for an entire topic', since my experiments are taking a rather different direction and I thought it might be valuable to share my thoughts and results as I go.

My situation is this: we're about to launch a newly migrated forum with over 4 million posts. These will require a rebake when we switch to the final domain, and the posts need to be processed to make sure images are embedded correctly, etc.

My concerns are:

  • Rebaking is not a fast process. I've tuned our 16 GB / 6-core server, but I can't get past 2-3 posts/second, which means the full rebake will take well over 20 days.
  • Rebaking starts with the oldest posts; I'd rather start with the newest ones to give our community the best possible experience (assuming newer posts will get more traffic).
  • There is no way to 'resume' the process where it left off, and I have reason to suspect I'll need to rebuild at least once during the next 20 days.
  • The rebake jobs go into Sidekiq's default queue, and I'm worried this will cause huge delays for the regular processing jobs.

So far I've done the following: after digging through the code and getting help from the staff here, I've modified lib/tasks/posts.rake to:

  • Work in reverse chronological order, starting with the most recent posts.
  • Skip private messages: I want to prioritize public topics first.
  • Print the current post/topic ID so I can easily add it to my query's where clause and resume processing at another post number.

Here's my code:

def rebake_posts(opts = {})
  puts "NEW Rebaking post markdown for '#{RailsMultisite::ConnectionManagement.current_db}'"

  disable_edit_notifications = SiteSetting.disable_edit_notifications
  SiteSetting.disable_edit_notifications = true

  total = Post.count
  rebaked = 0

  # Public topics only, newest posts first
  ordered_post_ids = Post.joins(:topic)
    .where('topics.archetype' => Archetype.default)
    .order('posts.id DESC')
    .pluck(:id)

  ordered_post_ids.in_groups_of(1000, false).each do |post_ids|
    posts = Post.order(created_at: :desc).where(id: post_ids)
    posts.each do |post|
      rebake_post(post, opts)
      print_status(rebaked += 1, total)
      puts " > rebaking post id #{post.id} for topic id #{post.topic_id}"
    end
  end

  SiteSetting.disable_edit_notifications = disable_edit_notifications

  puts "", "#{rebaked} posts rebaked!", "-" * 50
end

Next up: I'm figuring out how to create these jobs in the low-priority queue. Any hints would be most welcome :slight_smile:


Now that I've started my first large test, I've noticed that job processing has made several huge 'steps' in speed. I suspect this may have to do with a large number of my attached images having been moved to the tombstone - that's another ongoing project.


This sounds like an improvement. Perhaps submit a PR.

And it may make sense to do something such that you don't have to rebake and un-tombstone.

The recover_from_tombstone script is a bit problematic - I’ve discovered several issues with it. I’ll report on those later.


Yes, this is very dumb, but it appears Rails / ActiveRecord has no concept of descending ID order when iterating through records (find_each always walks ascending primary key).

Yes I learned that too :slight_smile: With the help of your team I figured out how to work around it though. I’m not sure this is a smart or even fast way of doing it, but it works for me.


Next issue: our new site will already go live while the posts:rebake job is running. Will having a large number of jobs in the default queue slow down regular site processes, and should I try to have posts:rebake start its jobs in the low priority queue instead? Or is this automatically handled?

So far, it seems that the queue a job is created in is a property of the job's class; I'm not sure whether I can influence this from within the posts.rake script.

If not, I’ll throttle the creation of new jobs to make sure the queue isn’t filling up.
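A throttle along those lines can be a small helper that polls the queue depth and backs off. A minimal sketch, assuming a 10,000-job threshold and 30-second interval (both arbitrary); in Discourse the obvious probe to pass in would be `-> { Sidekiq::Queue.new("default").size }` (from "sidekiq/api"):

```ruby
# Sketch: pause enqueueing whenever the queue is too deep. `queue_size` is any
# callable returning the current depth; the sleeper is injectable for testing.
def wait_for_queue_to_drain(queue_size, max_size: 10_000, interval: 30, sleeper: method(:sleep))
  while queue_size.call > max_size
    sleeper.call(interval) # back off before checking the depth again
  end
end
```

Calling this between batches would make the enqueue loop self-throttling instead of dumping every job into Sidekiq at once.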

I think there’s also a ‘version’ column on the posts table that you can null out to cause gradual rebaking, too. I think it does 100 posts every time the job triggers.


Does that version rebake task go in newest-posts-first order, @sam?

Yes it does, changed that a while back:

Limit is still 100 @riking but can be configured per:


So rather than running rake posts:rebake, one should instead do Post.update_all('baked_version = NULL') and all posts will be rebaked in batches according to rebake_old_posts_count?


We should normalize the rake task to go in descending ID order as well @techapj. Unless this is super hard, many hours of work, or something?


Agree, but it is a bit tricky cause we would need to carry a big list of ids in memory. I wonder if we should amend it so the rake task is resumable?

Have rake posts:rebake reset version and just work through old posts using calls to rebake_old

And add rake posts:rebake:resume that simply resumes an interrupted rebake.

Downside here is that posts:rebake would unconditionally cause posts to rebake at some point in time even if the task is interrupted, but this may not matter.
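As a separate illustration of resumability, an explicit checkpoint can also work at the rake level: process ids in descending order and persist the last completed id after each batch. The file layout and helper name here are hypothetical, not Discourse code:

```ruby
# Hypothetical checkpoint-based resume: iterate ids descending, record the
# last id completed after each batch, and skip past it on restart.
def rebake_in_batches(ids, progress_file, batch_size: 1000)
  if File.exist?(progress_file)
    last_done = File.read(progress_file).to_i
    ids = ids.select { |id| id < last_done } # descending: ids >= last_done are already done
  end
  ids.each_slice(batch_size) do |batch|
    batch.each { |id| yield id }             # e.g. rebake the post with this id
    File.write(progress_file, batch.last.to_s)
  end
end
```

The checkpoint only moves once a whole batch is done, so an interruption re-does at most one batch rather than the whole run.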


Is carrying a list of integer IDs in memory really that expensive?
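To put a rough number on it: on 64-bit Ruby, array slots are one machine word each and small Integers are stored inline, so a back-of-envelope estimate for 20 million ids is modest next to a 16 GB server:

```ruby
# Back-of-envelope: each array slot is one 8-byte word, and small Integers
# are immediate values (no separate heap object), so 20 million ids is about:
ids = 20_000_000
bytes = ids * 8
puts "~#{(bytes / 1024.0 / 1024.0).round} MB" # prints "~153 MB"
```

That ignores allocator overhead and any temporary copies ActiveRecord makes while plucking, but the order of magnitude holds.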


we can probably live with it to be honest … that retains the tasks working exactly as they do today (in reverse order). Though something in me wants these tasks to be resumable cause if you are working through 20 million posts this can take many hours and if it breaks half way through it can be very frustrating to start from scratch.


Maybe V1 can be the simple version with a comment

# TODO: make this resumable because carrying around 20 million ids in memory is not a great idea long term


Done via:


I’ve used a script that was resumable at the topic level by using the custom fields. Here’s one that skips private messages (since my import had a LOT of them and they weren’t a priority):

Topic.includes(:_custom_fields).where(archetype: Archetype.default).find_each do |t|
  unless t.custom_fields["import_rebake"].present?
    t.posts.select(:id).find_each do |post|
      Jobs.enqueue(:process_post, {post_id: post.id, bypass_bump: true, cook: true})
    end
    t.custom_fields["import_rebake"] = Time.zone.now
    t.save
  end
end

(This filled up Sidekiq’s default queue, so it’s not useful if you want to launch your site before the rebakes are completed.)

After they’re all done, all the TopicCustomField records with name “import_rebake” can be deleted.


Yes, and @bartv would be able to get his “rebuild for just one topic” by doing:

Post.where(topic_id: 1234).update_all('baked_version = NULL')

What’s the frequency of these new batches, and how can you monitor the progress?
