Performance improvements on long topics?

There was another spot that has been fixed and should resolve the following problem:

What is the deal with performance as the size of a topic grows?

The first load performance of a topic degrades as a topic gets larger and is due to the fact that we have to send down every single post id when we load a topic. This would naturally raise a red flag for anyone reading this for the first time but after reading through the code base, I’ve concluded that this is a trade off that we’re making in order to be able to have clean and accurate implementation for the topic timeline.

Screenshot%20from%202018-07-11%2017-26-45

If we look at the two numbers on our topic timeline and translate them into their respective DB queries, we would get the following:

SELECT posts.*
FROM (
  SELECT posts.*, ROW_NUMBER() OVER () AS row_number 
  FROM posts 
  WHERE posts.deleted_at IS NULL 
  AND posts.topic_id = 1 
  AND posts.post_type in (1, 2, ,3, 4)
) AS posts
SELECT COUNT(*) 
FROM posts 
WHERE posts.deleted_at IS NULL 
AND posts.topic_id = 1 AND posts.post_type in (1, 2, ,3, 4)

For the first query, we basically need the DB to fetch all the posts for the given filters before we can figure out the “index” of each post in the stream.

The second query is a count which naturally becomes expensive as the topic grows. A counter cache would not work here because the count changes based on what filters are applied to the topic.

In addition, having the entire stream of post ids present on the client side makes the following features straight forward to implement:

  1. Jumping to a certain index in the stream,
  2. Selecting all posts below a certain index
  3. Fetching excepts while scrolling through a stream.

Attempts were made to re-implement (1) and (3) above to work without the stream of post ids but that either made the feature inaccurate or complicated the client side code so much that I felt it wasn’t worth the trade off

Stopping the bleeding on MEGATOPICs (> 10_000 posts)

Megatopics are expensive to load for three main reasons:

  1. To calculate all the gaps within a topic, a query is run to fetch all the posts.id for posts regardless of whether the posts have been deleted or not.
  2. To generate the stream of all post ids, a query is run to fetch all the posts.id for posts given a set of filters.
  3. First load time suffers because the client has to download the stream of post ids which has a length that is greater than 10_000.

Our plan to stop the bleeding here is to drop/approximate certain features on Megatopics:

  1. The ability to display the closest date for a given index is dropped on megatopics
    Screenshot%20from%202018-07-12%2008-03-19

  2. Gaps are not supported on megatopics

  3. Loading excerpts while scrolling through the timeline is not supported

    Screenshot%20from%202018-07-12%2008-06-56

  4. The numbers on the timeline becomes an approximation. Instead of the index of the post in the set of possible results, we use Post#post_number of the post. Instead of a count of all the possible results, we use the Topic#highest_post_number.

  5. Jumping to an index on the timeline becomes jumping to the closest post number on the timeline

This may seem like taking a step backwards but do note that MEGATOPICs are quite rate in the wild. We’ve mainly seen them appear on sites with imported content since normal sites would hit the SiteSetting#auto_close_topics_post_count guard that is in place. For more information on why we think Megatopics are bad, you can read @codinghorror’s analysis about it.

9 Likes