Bug or bot? A 'new' user showing as having read a lot of posts

Someone who joined the forum only two weeks ago has shot up the Users listing page when sorting by read time.

(Second person in list)

I find it hard to believe that someone who has gone through threads spanning over a year has actually read all the posts. Their read time is 2 hours.

Any opinions on this? Think the member has visited all threads then just quickly scrolled through them? Are there any protections against this kind of thing in DC (should there be?)

I noticed a similar thing on my board a while ago. I think it’s a bug because the stats don’t match on /admin/users/list/active or anywhere else.

Nothing looks out of the ordinary in the logs :confused:

Also, these are the stats from the ACP (they seem to be similar to the public list)

I can truly understand the bafflement here.

I would suggest reaching out to the user via PM: introduce yourself, see how they’re doing, and gauge their interest. Some community managers do this as part of their retention strategy (i.e., give a user a reason to stay and become a member).

The responses (or lack of them, despite the user still logging in and reading) may help you gauge exactly what is going on behind those numbers.

Just bouncing this so @sam and @mcwumbly see it. Posts “read” is unfortunately a bit of a game-able stat:

  • create a new test account
  • go to topic list and sort by posts descending
  • open a high post count topic - but - DO NOT read, simply jump to the bottom
  • repeat for a few more high post topics
  • wait a while to make sure the job has run and check the User Directory
  • compare against Admin User page

End result:

image

@neil also said:

That’s how we count posts read:

SELECT SUM(posts_read) FROM user_visits

When you are shown a post in a topic, this updates the posts_read:

user.update_posts_read!(post_number - before_last_read)

So, for example, in a topic with 37450 posts where you jump to the end:

user.update_posts_read!(37450 - 1)

Pretty bad!
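
To make the arithmetic above concrete, here is a minimal Ruby sketch of a delta-based counter. ToyUser and record_highest_seen are made-up names for illustration only and are not Discourse's actual classes; the sketch just assumes the count is "highest post number seen minus the previous high-water mark", as described above.

```ruby
# Toy model of a delta-based "posts read" counter (not Discourse's real code).
class ToyUser
  attr_reader :posts_read

  def initialize
    @posts_read = 0
  end

  # Same idea as user.update_posts_read!(count): add to a running total.
  def update_posts_read!(count)
    @posts_read += count if count > 0
  end
end

# Hypothetical helper: credits every post between the previous high-water
# mark and the highest post number just seen, displayed or not.
def record_highest_seen(user, before_last_read, post_number)
  user.update_posts_read!(post_number - before_last_read)
end

user = ToyUser.new
record_highest_seen(user, 1, 37_450) # open the topic, jump straight to the end
puts user.posts_read                 # => 37449
```

One jump to the bottom credits the whole topic, which is why a handful of giant topics is enough to climb the directory.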

It’s probably possible to tweak how those read counts are calculated to use the post_timings table instead, similar to how “topics viewed” is calculated in this query (as opposed to posts_read, which currently suffers from the same issue):

tv as (
  select user_id,
         count(distinct(topic_id)) as topics_viewed
  from topic_views, t
  where viewed_at > t.start
    and viewed_at < t.end
  group by user_id
),

(In addition, a minimum viewing time for a post to count as “read” could be factored in.)

I realize this probably isn’t something that you’d want to “just do” since there are likely performance considerations. Just thinking aloud about possible ways out in the future.
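
As a rough illustration of that idea, here is a sketch in plain Ruby rather than SQL. It assumes timing rows carry a user id, topic id, post number and milliseconds on screen; the Timing struct, posts_read_from_timings and the min_msecs threshold are all hypothetical, and it does no date-range filtering at all.

```ruby
# Hypothetical in-memory stand-in for rows of a post_timings-like table.
Timing = Struct.new(:user_id, :topic_id, :post_number, :msecs)

# Counts distinct (topic, post) pairs per user whose recorded time meets
# a minimum. min_msecs is an illustrative threshold, not a real setting.
def posts_read_from_timings(timings, min_msecs: 1_000)
  timings
    .select { |t| t.msecs >= min_msecs }
    .group_by(&:user_id)
    .transform_values { |rows| rows.map { |t| [t.topic_id, t.post_number] }.uniq.size }
end

timings = [
  Timing.new(1, 10, 1, 4_000),
  Timing.new(1, 10, 2, 2_500),
  Timing.new(1, 10, 37_450, 300), # jumped to the end, barely on screen
  Timing.new(2, 10, 1, 1_200)
]
p posts_read_from_timings(timings) # => {1=>2, 2=>1}
```

Here the jump to post 37450 leaves one short timing row, so it adds nothing instead of tens of thousands of posts.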

Except the post_timings table doesn’t track when the posts were viewed. I think the source of the problem is in topic_user.rb:

user.update_posts_read!(post_number - before_last_read, mobile: opts[:mobile])

It should be counting how many of those posts actually qualify. If the topic is not a private message, only count regular posts (not whispers, etc.) between post numbers before_last_read and post_number. And don’t count more than the number of PostTiming records the user has in the topic. I’ll give it a try.
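
Roughly what I have in mind, as a standalone sketch. This is not the real topic_user.rb code: Post, qualifying_posts_read and the :regular / :whisper symbols are placeholders I made up to show the two rules (filter by post type outside private messages, then cap by the user's timing record count).

```ruby
# Placeholder post record; post_type is a plain symbol in this sketch.
Post = Struct.new(:post_number, :post_type)

# Counts posts in (before_last_read, post_number], skipping non-regular
# posts unless the topic is a private message, and never returning more
# than the number of timing records the user has in the topic.
def qualifying_posts_read(posts, before_last_read:, post_number:, timing_count:, private_message: false)
  in_range = posts.count do |p|
    p.post_number > before_last_read &&
      p.post_number <= post_number &&
      (private_message || p.post_type == :regular)
  end
  [in_range, timing_count].min
end

# 100 posts, every tenth one a whisper.
posts = (1..100).map { |n| Post.new(n, (n % 10).zero? ? :whisper : :regular) }

p qualifying_posts_read(posts, before_last_read: 1, post_number: 100, timing_count: 5)   # => 5
p qualifying_posts_read(posts, before_last_read: 1, post_number: 100, timing_count: 200) # => 89
```

Jumping from post 1 to post 100 with only 5 timing records now counts 5 posts rather than 99.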

I think I solved this one.

https://github.com/discourse/discourse/commit/87ec11e298e950203966a988fae4d1a9e197f9d7

We had the data when storing post timings, but weren’t using it. There are still a few cases where the count won’t be 100% accurate, but it’s so much better than counting every post in a topic as read.

As for repairing existing stats… :confounded: Maybe we don’t?

On repairing stats… looking at topics entered vs posts read on a popular Discourse, I see:

image

I think it might be safe to cap only users whose average posts read per topic is greater than 50, as those are very likely to be erroneous outliers, and to ignore all other users, who are mostly in range.

The only time that might get weird is someone who only entered 2 giant topics but actually read every post in them and this seems… unlikely.

Anyway whatever is easy and safe is fine by me.

The only way that I can think of to get a more accurate read count is to only count posts that have been displayed within the y-axis boundaries. (and even then, displayed does not necessarily equate to read).

However, tracking y-axis coordinates would be very expensive and IMHO the cost wouldn’t be worth any benefit it might give in terms of being more accurate. Or does the “blue read dot” already do this?

Similar to the “time taken to post”, factoring in a “time taken to read” would not be 100% perfect, but as long as it has a minimum high enough to account for a valid “skim” it would be an improvement and a fair compromise.

I’ll go with Jeff’s plan to cap posts read to 50 * topics_entered.

^ That’s a user with 1 topic entered. Stats like that are obviously wrong, so this “fix” will change it to have 50 posts read.
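
For anyone curious what the cap amounts to, here is a small sketch of the rule only, not the actual repair job. UserStat, capped_posts_read and MAX_AVG_POSTS_PER_TOPIC are made-up names, and the numbers are invented.

```ruby
# Illustrative constant: the "50 posts per topic entered" ceiling.
MAX_AVG_POSTS_PER_TOPIC = 50

# Placeholder for a per-user stats row.
UserStat = Struct.new(:user_id, :topics_entered, :posts_read)

# Leaves in-range users alone and clamps outliers to 50 * topics_entered.
def capped_posts_read(stat, max_avg: MAX_AVG_POSTS_PER_TOPIC)
  [stat.posts_read, max_avg * stat.topics_entered].min
end

stats = [
  UserStat.new(1, 1, 37_449), # one topic entered, obviously inflated
  UserStat.new(2, 40, 900)    # about 22 posts per topic, left untouched
]
stats.each { |s| s.posts_read = capped_posts_read(s) }
p stats.map(&:posts_read) # => [50, 900]
```

Because the cap is a minimum of the two values, it can only lower a count and never raises one.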

Is there anything left to do here?

All done. I’ll close it.
