Mysterious entries in topic_users table

I’m running a Data Explorer query to see which users read a specific post:

-- [params]
-- integer :topic = 0

SELECT *, total_msecs_viewed / (1000 * 60) as mins_read
FROM topic_users
WHERE topic_id = :topic
ORDER BY total_msecs_viewed DESC

The results look plausible, except for a mysterious nonexisting user:

This is the result for topic 200. The user id 200 is not used in my installation (I have fewer users).
The same thing is happening for all other topics, there is always an entry for a “user” whose id matches the topic id.

Am I overlooking a very obvious bug in my SQL code? Is this entry really there? Should it be?

1 Like

If you run

SELECT * 
FROM users 
WHERE id = 200

you might be able to spot the problem.

My guess would be a NULL name field

This returns 0 results, as expected.

So you know it’s a deleted account?

In that case you might want a more complex query that takes that into it.

1 Like

No, the account never existed – I only have about 150 users, so the id has never been used.

Hmmm, is that the MathJax topic id?

SELECT * 
FROM topics 
WHERE id = 200

Yes, it is. As I said above, this happens in multiple (or all?) topics. Topic 201 has a mysterious user 201, and so on.

1 Like

So you did.

I was getting some “invalid int” errors until I changed “:topic” to “:topicid” so I at first thought it might involve how “params” works.

But even after removing that and hard-coding in the ids, I still got the mysterious result.

I notice it doesn’t seem to happen when the id is 55 or below, but always for ids 56 or greater.

55 seems like an odd value for a switch to be happening.

Are your results consistent with <=55 OK, >=56 Not ?

EDIT
Hmmm, 55 is my highest users id + 2
(yes, I have that many test accounts on my localhost site)

1 Like

With the possibility of deleting users, 55 could plausibly be the highest user id.

Are you sure this isn’t happening for lower user IDs? The entries would look quite normal…

I tried to look at all entries where the user id equals the topic id:

SELECT *
FROM topic_users
WHERE user_id != topic_id
ORDER BY user_id

Even for low ids, the results look strange: Almost all have highest_seen_post_number = NULL, for example. In other rows in this table, this column almost always contains a plausible date.

I’m starting to believe that this is a bug It looks like something is inserting rows here and mistakes the topic id for the user id?

4 Likes

I’ve been looking at topic_users a bit lately.

I was confused as to why there would be these two fields

last_read_post_number    integer    null 
highest_seen_post_number    integer    null 

I mean, if I read it, I saw it, no?

Then I found this in app/models/topic_user.rb

      # In case anyone seens "highest_seen_post_number" and gets confused, like I do.
      # highest_seen_post_number represents the highest_post_number of the topic when
      # the user visited it. It may be out of alignment with last_read, meaning
      # ... user visited the topic but did not read the posts

In other words, if a topic has 100 posts, but I only read the first 50, the two will not be the same.

I just ran

SELECT *
FROM topic_users
WHERE last_read_post_number IS NULL 
OR  highest_seen_post_number IS NULL

In all results both read and seen were NULL
and I noticed in all results total_msecs_viewed was zero.

In some, it was not the topic id that was used, but the post id

My guess, without having looked more closely into the code yet, is that there is a “divide by 0” error happening somewhere.

3 Likes

That’s some additional weirdness – thanks for the research!
But even a division through zero doesn’t really explain the nonexistent
user ids… :frowning:

I’m not sure if it’s more to do with Discourse or Data Explorer or both.

I’ll try running some queries in the postgres console tomorrow (I need to eat and get some sleep) to see if the results are similar.

1 Like

An update after an interesting morning.

I was wrong about two things

This was wrong, I was misled by a seemingly unrelated weirdness. (that sent me off on a tangent for a good while)

What I had assumed to be post ids are in fact topic ids.

For whatever unknown reason both my topics and posts tables “skipped” ids.

In the topics table ids are serial from 1 - 166, then skipped to 1796, 1797, then skipped to 1830 and have been sequential serial values since.

Likewise the posts table ids are 1 - 188, skip to 2708, skip to 2747 then sequential since

I don’t know what happened, but I was exploring the differences between
“Global Notice” vs. “Banner Topic” vs. “Global Pinned Topic” vs. “Pinned Topic” so maybe that has something to do with it.

Topics
166	Pinned Stats query          		^^^^^
                        2016-05-04 07:12
1796	Banner Poll results
                        2016-05-06 01:40
1797	Limit Banner Use?
                        2016-05-06 07:15
1830	Topic for moved PM testing		vvvvvvvv
                        2016-05-06 20:48

Posts
888	Pinned Stats query (3)      		    ^^^^^
                        2016-05-04 08:42
2708	Banner testing - take two (13)      7d unbannered may 6 unpinned
                        2016-05-05 06:34
2747	Going off the grid for a while (35)	vvvvvvvv
                        2016-05-05 09:19

* created at datetimes

Anyway, I was also wrong about

Topic ids were actually being used as user ids, but because there were actual users with those id values it didn’t look obviously wrong at first.


The incorrect user ids are in the Postgres table and not a result of how Data Explorer works with the data

 user_id | topic_id | posted | last_read_post_number | highest_seen_post_number |
---------+----------+--------+-----------------------+--------------------------+
      56 |       56 | f      |                       |                          |
      
 last_visited_at       |      first_visited_at      | notification_level |
----------------------------+----------------------------+--------------------+
 2016-05-12 17:47:06.711945 | 2016-03-14 19:18:55.588447 |                  1 |

 notifications_changed_at | notifications_reason_id | total_msecs_viewed |
--------------------------+-------------------------+--------------------+
      |                         |                  0 |

 cleared_pinned_at | id  | last_emailed_post_number | liked | bookmarked
-------------------+-----+--------------------------+-------+------------
      | 434 |                          | f     | f

I don’t know how or why the topic ids could be getting used as user ids, but my feeling is it has to do with “no owner” posts. eg.
system = user id -1
“small actions”
anonymized accounts

I was alarmed when I saw that a March “post closed Flagged” message to FluffGuy1 with topic id 52
had RoyalGuy1 user id 52 as a topic_user.
But luckily RoyalGuy1 has no indication of it’s existence displayed anywhere, he is not included as a user in the message, and going to the URL gives “no permission”.

5 Likes

Awesome work, thanks!

So this definitely looks like a bug, I’ll move the topic.

1 Like

Yes, I think it’s a bug somewhere.

I’ve found a differentiation area

SELECT COUNT(dtid) 
FROM (SELECT DISTINCT id FROM topics) AS dtid
count = 182

SELECT COUNT(dtuid) 
FROM (SELECT DISTINCT topic_id FROM topic_users) AS dtuid
count = 182

SELECT COUNT(dtuntid) 
FROM (SELECT DISTINCT topic_id FROM topic_users WHERE last_read_post_number IS NULL) AS dtuntid
count = 170

SELECT COUNT(dtuptid) 
FROM (SELECT DISTINCT topic_id FROM topic_users WHERE user_id = topic_id) AS dtuptid
count = 170

The difference between the 182 / 170 / 12 ?

My install came with 12 “seeded” topics

id	title				 user
1	About the Lounge category	 system
2	About the Meta category		 system
3	About the Staff category	 system
4	Terms of Service		 system
5	FAQ/Guidelines			 system
6	Privacy Policy			 system
7	Assets for the site design	 system
8	Welcome to Discourse		 system
9	Welcome to the Lounge		 system
10	READ ME FIRST: Admin Quick Start Guide	 system
11	This is a test topic		 eviltrout
12	Welcome to Discourse!	 	 system
1 Like

I had AdminGuy1 create a new topic in the meta category
I then went directly to the Data Explorer and ran
SELECT * FROM topic_users ORDER BY topic_id DESC LIMIT 10
members that were Watching / Tracking the meta category were there, and the last id had a user_id = topic_id

I then had Mittineague create a new category - Temporary
I did not visit the new category’s generated About page
Going to Data Explorer and running the same query gave the same results
i.e. no new rows had been added to the topic_users table

I then viewed - aka visited - the new category’s About page but did not edit it.
Going to Data Explorer and running the same query showed that 2 rows had been added to the topic_users table.
The first was Mittineague.
The last was another user_id = topic_id

This suggested the “when”. and after poring through lines of code I think I may have found the “where”.

app/models/topic_user.rb lines 128 - 139 are

    def track_visit!(topic,user)
      topic_id = topic.is_a?(Topic) ? topic.id : topic
      user_id = user.is_a?(User) ? user.id : topic

      now = DateTime.now
      rows = TopicUser.where(topic_id: topic_id, user_id: user_id).update_all(last_visited_at: now)
      if rows == 0
        TopicUser.create(topic_id: topic_id, user_id: user_id, last_visited_at: now, first_visited_at: now)
      else
        observe_after_save_callbacks_for topic_id, user_id
      end
    end

I am far from being a Ruby expert, and I do not know all the inter-relationships in Discourse, but in particular line # 130 looks like the probable culprit. i.e.
user_id = user.is_a?(User) ? user.id : topic

That is, if I’m reading the code correctly, the line translates roughly to
if user is a User assign user.id to user_id else assign topic to user_id

If this is the problem, I don’t know what the best way to deal with it would be.

4 Likes

Preeeetty sure it should be user instead of topic here.

Awesome detective work :mag:

2 Likes

Just pushed a fix. Not sure how it got there in the first place :frowning:

https://github.com/discourse/discourse/commit/6137bb46d31f0bb3ef3a4488c68a17a750ddb84e

7 Likes

Should there be a migration that removes the invalid records from the table?

3 Likes

I was thinking of that too

First thought would be
WHERE topic_id = user_id
except there may be cases where an actual member with the same id as a topic should be there

Adding notification_level = 1 (assigned as the default) and

last_read_post_number IS NULL 
highest_seen_post_number IS NULL

would only falsely delete members that had set the category as Normal via the category drop-down which seems like that would be very rare, but is possible I guess.

The safest would be

topic_id > MAX(users.id)
```. 
but depending on how many members there are could leave quite a few.
2 Likes