Rebake All Posts Optimization

We have slightly under 30 million posts in our posts table. This makes a rebake operation (needed after importing from different forum software and making markdown modifications) rather untenable. I'm looking at how to optimize this.

This particular call is taking hours to complete on my current test system:

Post.update_all('baked_version = NULL')

What exactly does this field do? It's an integer, not a boolean, so is it really necessary to reset it on all posts before any rebake operation? As I said, this single UPDATE takes hours to run because of the size of our posts table. I have slightly modified the rebake code to instead reset the field on each post right before that post is rebaked. Is there a better way?
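
One alternative I'm considering (completely untested; the 100_000 chunk size is an arbitrary pick) is resetting the field in bounded ID-range chunks instead of one giant UPDATE over all ~30 million rows:

batch  = 100_000
max_id = Post.maximum(:id)

(0..max_id).step(batch) do |i|
  # Exclusive Ruby range, so Rails generates "id >= i AND id < i + batch"
  Post.where(id: i...(i + batch)).update_all(baked_version: nil)
end

Each statement then touches only a bounded slice of the table, which should keep lock times and transaction sizes down.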

Also, the biggest performance hit appears to be the use of OFFSET in the batching code. This line causes huge issues:

Post.order(id: :desc).offset(i).limit(batch).each do |post|

When you get into the millions of posts, this becomes a problem: OFFSET requires the database to walk the index over every row up to and including the current offset, discarding everything before the requested window, before it can return the result set. By the time you are a few million posts in, a single batch query takes several minutes, and the number of batch queries is the table size divided by the batch size (hardcoded to 1000 by default).
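
For illustration, this is roughly the SQL that line generates (shown via to_sql with example values; the deleted_at filter comes from the model's default scope, if I'm reading it right, and the plans below confirm it):

Post.order(id: :desc).offset(3_690_000).limit(1000).to_sql
# => SELECT "posts".* FROM "posts" WHERE "posts"."deleted_at" IS NULL
#    ORDER BY "posts"."id" DESC LIMIT 1000 OFFSET 3690000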

I realize the delay will vary based on the hardware configuration as well, but I think this can be improved so that throwing more hardware at the issue doesn’t need to be the answer.

Instead, I have modified this code to use a BETWEEN condition on the range of IDs being processed. My understanding is that this doesn't change the rebake logic at all; it just turns the SQL into a much faster query.

In fact, in my case, I am seeing a three-orders-of-magnitude (roughly 1000×) improvement from this change:

(0..total.abs).step(batch) do |i|
  Post.order(id: :desc).where('id BETWEEN ? AND ?', i, i + batch - 1).each do |post|

Here i is the first ID of the current batch, and since BETWEEN is inclusive on both ends, I subtract one from the upper bound so the boundary ID isn't picked up again by the next batch. This should cover every post exactly once, but it's possible I have a slight logic error somewhere…
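
If it helps anyone reason about the boundaries, the same batching can be written with an exclusive Ruby range, which Rails turns into id >= i AND id < i + batch, so there's no minus-one to carry around (untested, rebake call elided):

(0..total.abs).step(batch) do |i|
  Post.order(id: :desc).where(id: i...(i + batch)).each do |post|
    # rebake the post here, as in the original rake task
  end
end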

Here is the difference in processing time between the original and the new version of the underlying query:

Original:

                                                                         QUERY PLAN                                                          
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=12541237.13..12544635.84 rows=1000 width=783) (actual time=289375.555..289456.445 rows=1000 loops=1)
   ->  Index Scan Backward using posts_pkey on posts  (cost=0.56..100746846.76 rows=29642680 width=783) (actual time=0.038..288798.236 rows=3691000 loops=1)
         Filter: (deleted_at IS NULL)
 Planning Time: 0.175 ms
 JIT:
   Functions: 6
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 80.019 ms, Inlining 47.933 ms, Optimization 261.731 ms, Emission 106.263 ms, Total 495.947 ms
 Execution Time: 289538.294 ms
(9 rows)

Notice the row count in the index scan's actual-time section: 3,691,000 rows were scanned just to return 1,000 of them. This is nearly a five-minute query.

Here is the new version using BETWEEN:

                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan Backward using posts_pkey on posts  (cost=0.56..4137.79 rows=1100 width=783) (actual time=1.956..273.467 rows=1001 loops=1)
   Index Cond: ((id >= 3690000) AND (id <= 3691000))
   Filter: (deleted_at IS NULL)
 Planning Time: 26.421 ms
 Execution Time: 274.035 ms
(5 rows)

Only the roughly 1,000 rows of interest are examined and returned (the plan shows 1,001 rows because this test query used an inclusive upper bound of 3691000 rather than 3690999). It runs in 274 ms instead of 289,000 ms.

This is a nice improvement, but it would be even better if the process could be parallelized. That's likely a bit above my expertise, but I might be able to devise some way to run multiple rebake processes simultaneously, using the post ID modulo the number of workers to segment the posts table across them.
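
Something like this is the shape I have in mind (completely untested; WORKERS and WORKER_INDEX are made-up environment variables standing in for however the N processes get launched):

# Each of N worker processes claims only the posts whose id falls in its
# modulo class, so no two workers ever touch the same post.
workers = ENV.fetch('WORKERS', '4').to_i
index   = ENV.fetch('WORKER_INDEX', '0').to_i

Post.where('id % ? = ?', workers, index).find_each(batch_size: 1000) do |post|
  # rebake the post here, as the rake task does
end

find_each batches on the primary key (WHERE id > last_seen ... LIMIT n) rather than OFFSET, so each worker keeps the cheap keyset-style pagination.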

Feedback is welcome. Can this be improved further?



You can just issue a post.rebake! for every post that needs a rebake.
No need to mess with baked_version.
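
i.e. something along these lines (a sketch, not tested):

Post.find_each { |post| post.rebake! }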

So that piece of code can be removed from the rake task? Why would it be there if it isn’t needed?

As far as I recall, baked_version is there to cater for rebaking when the Discourse rebake logic itself has changed (in contrast to when the raw content of the post has changed).
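
If I'm reading the model right, the idea is roughly this (Post::BAKED_VERSION is, I believe, the constant Discourse bumps when the cooking pipeline changes):

# Posts baked with an older pipeline version get picked up for rebaking.
Post.where('baked_version IS NULL OR baked_version < ?', Post::BAKED_VERSION)
    .find_each { |post| post.rebake! }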
