How does post tracking work in Discourse

Discourse tracks read time for every post users see on the screen. This system has evolved over the years and I find I often need to refer back to the code to figure out how this works and why it exists.

This post covers the painful technical details of the current implementation.

How the Discourse client tracks timing

Post timing tracking is implemented in screen-track.js.es6. This module is responsible for tracking how long a post has been on the screen and how long the topic has been on screen.

When a topic page “scrolls” it informs the screen tracker which posts it has in view AND which of these posts have been read. Posts that are only partially visible still count as “in view”.

Screen track then fires a “tick” every second that decides what data needs to be sent to the server.

The screen tracker keeps track of three lists:

a. A list of (post/time spent reading post) that has not been sent to server
b. A list of posts that we know were read
c. A list of posts that we know are on the screen right now

At the start of a tick (every second), if we have posts in (a), we send them to the server when either of these holds:

  • SiteSetting flush_timing_secs (default 60 secs) has passed since the last time we sent data to the server.

  • Any of the posts are “unread” by the user, in which case we send the entire list right away.

At the end of a tick, if Discourse has focus:

  • If we have any “posts on the screen”, we log “1 tick” of time for each post.

If at any point we leave the topic (navigate to another place in Discourse):

  • We send everything that is “in flight” in (a) to the server right away.
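To make the tick rules above concrete, here is a minimal Python sketch of the three lists and the flush decision. It is a simulation for illustration only, not the code in screen-track.js.es6: the class name ScreenTrackSketch, its method names, and the `sent` list standing in for the server are all made up. It also assumes that a post counts as “read” once its timing has been sent, and it leaves out the limits entirely.

```python
from dataclasses import dataclass, field

TICK_MS = 1000  # one tick per second


@dataclass
class ScreenTrackSketch:
    """Hypothetical model of the tick loop; not the real screen tracker."""

    flush_timing_secs: int = 60                    # default named in the post
    pending: dict = field(default_factory=dict)    # (a) post -> unsent ms
    read: set = field(default_factory=set)         # (b) posts known to be read
    on_screen: set = field(default_factory=set)    # (c) posts in view right now
    ms_since_flush: int = 0
    sent: list = field(default_factory=list)       # stand-in for the server

    def scrolled(self, on_screen, read):
        """The topic page reports what is in view and what has been read."""
        self.on_screen = set(on_screen)
        self.read |= set(read)

    def flush(self):
        """Send list (a) as one batch and restart the flush timer."""
        if self.pending:
            self.sent.append(dict(self.pending))
            # Assumption: once a timing is sent, the post counts as read.
            self.read |= set(self.pending)
            self.pending.clear()
        self.ms_since_flush = 0

    def tick(self, has_focus=True):
        self.ms_since_flush += TICK_MS
        # Start of tick: decide whether list (a) should go to the server.
        if self.pending:
            overdue = self.ms_since_flush >= self.flush_timing_secs * 1000
            has_unread = any(post not in self.read for post in self.pending)
            if overdue or has_unread:
                self.flush()
        # End of tick: log one tick for every post in view, if focused.
        if has_focus:
            for post in self.on_screen:
                self.pending[post] = self.pending.get(post, 0) + TICK_MS

    def leave_topic(self):
        """Navigating away sends whatever is still in flight."""
        self.flush()
```

In this sketch an unread post is sent on the tick after it first accumulates time, while posts that are already read wait for the flush_timing_secs timer or for leave_topic.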

Limits

  • Each time you look at a topic we will log a maximum of 6 minutes reading time per post (this will reset if you navigate away and back to the post)

  • If 3 minutes pass and you have not scrolled at all, we disable this subsystem until stuff scrolls again

  • We will log timing for up to 5 topics for anonymous users (these timings are converted to rows in the post_timings table when the user signs up)
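The first two limits can be sketched as two small Python helpers. The constant and function names here are illustrative, and whether the real client compares with “greater than” or “greater than or equal” is an assumption; only the 6 minute, 3 minute and 5 topic figures come from the list above.

```python
MAX_TRACKING_MS = 6 * 60 * 1000          # per-post cap for one topic visit
PAUSE_UNLESS_SCROLLED_MS = 3 * 60 * 1000  # idle time before tracking pauses
ANON_MAX_TOPICS = 5                       # topics tracked for anonymous users


def loggable_ms(total_ms, tick_ms=1000, cap_ms=MAX_TRACKING_MS):
    """How much of this tick can still be logged for a post under the cap."""
    return max(0, min(tick_ms, cap_ms - total_ms))


def tracking_paused(ms_since_scroll, limit_ms=PAUSE_UNLESS_SCROLLED_MS):
    """True once the user has gone the full idle window without scrolling."""
    return ms_since_scroll >= limit_ms
```

For example, a post that already has 359500 ms logged in this visit can take only 500 ms more, and after that every tick contributes nothing until the user leaves and returns.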

Key observations

  1. Even though the post_timings table tracks time down to the millisecond, we have between 0 and 1000ms of “unlogged” time per post, depending on when the tick fires.

  2. Each “session” of looking at a topic can log up to 6 minutes of read time per post. There is no upper limit on total read time per post: a user who keeps returning to a topic can accumulate days of read time on a single post.

What do we do with this data?

The most critical piece of information we use is “did user X read post Y”. This determines unread counts on topics and tons of other critical data.

Beyond this binary use, we use the time logged in post_timings to calculate avg_time for a post.

Average time for a post is calculated as the exponential of the average of the natural logs of the individual times (aka the geometric mean).

So for example, say two users read the same post:

sam: 10 seconds (10000 ms)
jane: 1 hour (3600000 ms)

avg_time = exp((log(3600000) + log(10000)) / 2)
=~ exp((15.0964 + 9.2103) / 2)
=~ 189737 ms
=~ 190 seconds
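The same calculation as a small Python function, in case you want to check numbers yourself. This is a sketch of the formula only, not the SQL Discourse runs; the function name and the choice to skip non-positive timings are assumptions. Computed without rounding the logarithms, the example comes out at about 189737 ms.

```python
import math


def geometric_mean_ms(timings_ms):
    """Geometric mean of read times in ms: exp of the mean of natural logs."""
    positive = [t for t in timings_ms if t > 0]  # log() needs positive input
    if not positive:
        return 0.0
    return math.exp(sum(math.log(t) for t in positive) / len(positive))


example = geometric_mean_ms([10_000, 3_600_000])  # 10 seconds and 1 hour
```

The point of the geometric mean is that one very long read does not dominate: the plain arithmetic mean of the same two timings would be 1805000 ms, roughly 30 minutes.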

This avg_time is then used in score calculator as a component for “post score”.

Score = 5 * reply_count + 15 * like_score + 5 * incoming_link_count + 2 * bookmark_count + 0.05 * avg_time + 0.2 * reads

So in the case above, roughly 190 seconds of average read time translates to about 9.5 points (0.05 * 190). That is well under one like (15 points), or about 47 reads.
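Here is the weighted sum written out in Python, to make it easy to see how much each component contributes. The weights are the ones in the formula above; the dictionary keys, the function name and the sample numbers are hypothetical, and avg_time is assumed to be in seconds.

```python
# Weights copied from the formula above; key names are illustrative.
SCORE_WEIGHTS = {
    "reply_count": 5,
    "like_score": 15,
    "incoming_link_count": 5,
    "bookmark_count": 2,
    "avg_time": 0.05,  # assumed to be in seconds
    "reads": 0.2,
}


def post_score(stats):
    """Weighted sum of a post's stats; missing stats count as zero."""
    return sum(weight * stats.get(name, 0) for name, weight in SCORE_WEIGHTS.items())


# Made-up post: 2 replies, like_score 3, 1 incoming link, 4 bookmarks,
# 40 seconds average read time, 100 reads.
sample = {
    "reply_count": 2,
    "like_score": 3,
    "incoming_link_count": 1,
    "bookmark_count": 4,
    "avg_time": 40,
    "reads": 100,
}
sample_score = post_score(sample)  # 10 + 45 + 5 + 8 + 2 + 20
```

In this made-up example the likes account for half of the score, while 40 seconds of average read time adds only 2 points.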

“Post score” is then used by “best of” to figure out what the best posts in a topic are.

Isn’t this read data the backing store for these numbers, total read time?

Yes, that is the “topic read time”. I will update the OP to explain it; I only touched on it very lightly.

The topic_users table has a column called total_msecs_viewed. This number is updated independently of post_timings, in the same controller action. We cannot “rebuild” that number from post timings because we have no idea about overlapping times.

The “topic timing” piece has no 6 minute limit like post timing does. The number is flushed with the same post timing batch according to the same rules.

I think I was so focused on talking about post tracking that I missed out on explaining the topic tracking part.

Whoa… we do? Do we really need to do this? It feels unnecessary?

What do you mean by “overlapping times”?

I have not tested this recently, but yes the code is all there to do it. I guess it means you can get to TL1 a bit faster.

Say you know my timings:

Post 1: 10 seconds
Post 2: 12 seconds
Post 3: 17 seconds

The time I spent reading the topic can be anywhere between 17 seconds and 39 seconds.

So we cannot use the data in the post_timings table to figure out what the number in topic_users should be, and we are forced to track that number independently.
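The range is easy to compute: the lower bound is the longest single post time (all posts were on screen together), and the upper bound is the sum (the posts were read strictly one after another). A tiny Python sketch, with a made-up function name:

```python
def topic_time_bounds(post_seconds):
    """Return (min, max) possible topic read time given per-post read times."""
    if not post_seconds:
        return (0, 0)
    # Fully overlapping reads give the max; no overlap at all gives the sum.
    return (max(post_seconds), sum(post_seconds))


bounds = topic_time_bounds([10, 12, 17])  # the three timings from the example
```

Anything inside that range is consistent with the per-post rows, which is why the topic total cannot be reconstructed from them.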

It is not a huge deal and makes no big difference, but there is no way to “run an inventory” and check that the number in topic_users is 100% correct.
