数据探索器查询：列出“估计阅读时间”最长的主题？

Richie · 2021 年7 月 11 日 09:39

Hi everyone,

Is it possible to create a data explorer query which can list the “top X” topics by “estimated read time” ?

I’d love to know which topics on our Discourse are the most time consuming to read

(And off topic, further to this post by @simon , I can’t seem to add a data-explorer tag to this post?)

simon · 2021 年7 月 12 日 23:34

I think the approach I suggested in that topic needs to be improved. One problem with it is that only TL3 and above users can tag posts on Meta. That means that the majority of the site’s users wouldn’t be able to follow my instructions. The other issue is that we’ll end up with both unanswered topics and answered topics having the data-explorer tag. That won’t help much with searching for queries.

simon · 2021 年8 月 13 日 21:22

Sorry for the late answer. I got caught up in the question about how to organize Data Explorer queries on the site. Using the data-explorer tag seems like the ideal solution, but topics that contain a Data Explorer query will need to be tagged by a user with TL3 status.

I think that something like the following query will give you the information you’re looking for:

SELECT
topic_id,
category_id,
SUM(total_msecs_viewed) / 60000  AS estimated_minutes_read
FROM topic_users tu
JOIN topics t ON t.id = tu.topic_id
WHERE t.deleted_at IS NULL
AND t.archetype = 'regular'
GROUP BY tu.topic_id, category_id
ORDER BY estimated_minutes_read DESC
LIMIT 100

The LIMIT 100 statement in the last line of the query could be adjusted or removed if you want more results to be returned.

Interestingly, the topic with by far the most recorded read time on Meta is Setup DiscourseConnect - Official Single-Sign-On for Discourse (sso). It’s currently at 126048 minutes.

Richie · 2021 年8 月 16 日 12:41

Hi @simon

Is that formula correct?

If I pick four or five at random, and compare the estimated read time column result in this query, to the estimated read time in the topic itself, I’m getting two very different numbers?

JammyDodger · 2021 年8 月 16 日 13:51

It looks like that query gives you the Topics that have been read for the most time, rather than the Topics that take the longest to read?

Richie · 2021 年8 月 16 日 14:23

Ah, that could explain the issue.

I’m guessing total_msecs_viewed is the wrong column to use here?

michebs · 2021 年8 月 16 日 15:28

You can use the average time that the users take to read the topic.
In that case, you can just change the SUM function to AVG, it would look like this:

SELECT
    topic_id,
    category_id,
    AVG(total_msecs_viewed) / 60000  AS estimated_minutes_read
FROM topic_users tu
JOIN topics t ON t.id = tu.topic_id
WHERE t.deleted_at IS NULL
AND t.archetype = 'regular'
GROUP BY tu.topic_id, category_id
ORDER BY estimated_minutes_read DESC
LIMIT 100

Richie · 2021 年8 月 17 日 06:17

Thanks for the suggestion @michebs but I’m afraid that one is way off the mark too.

A few examples:

What the query says	What the topic says
438	61
353	58
335	40
196	24

JammyDodger · 2021 年8 月 17 日 06:30

But that would mean, on average, a person takes 438 minutes to read that top topic? That seems unlikely. This may sound silly, but did you have enough 0s in 60,000?

Edit: Or maybe the AVG includes all the re-readings of a topic too? So once through would be 61 minutes, but actually users spend on average 438 minutes in there.

Though now I’m quite interested to know how the Estimated Read Time is worked out for the Summary, as ideally you’d want those to match. Even shrinking those by a factor of ten would only ballpark it.

Richie · 2021 年8 月 17 日 06:56

Yes, exactly

JammyDodger · 2021 年8 月 17 日 08:33

I had a little search, and found this "There are 84 replies with an estimated read time of 0 minutes." - #9 by nbianca.

I struggle with deciphering these things, but it seems it’s uses a word count x time figure (plus a minimum time to cover posts with no words, like images).

There was also this one which gave a hint as to what the final value may be called: (though its old, so may have changed?)

[missing {{count}} value]

summary:
  description: "There are <b>{{replyCount}}</b> replies."
  description_time: "There are <b>{{replyCount}}</b> replies with an estimated read time of <b>{{readingTime}} minutes</b>."

Not massively helpful, I’m sure, but thought I’d share just in case.

I hope you get the answers you’re looking for.

JammyDodger · 2021 年11 月 26 日 21:23

我再看了一下，它似乎（最简单的形式）是 topic.word_count 乘以“阅读时间字数”管理员设置（默认为 500 字/分钟）。所以我认为这个查询将产生阅读时间最长的前 X 个主题：

-- [params]
-- integer :limit = 10

SELECT t.id as topic_id, (t.word_count)/500+1 AS estimated_read_time
FROM topics t
WHERE t.word_count IS NOT NULL
AND t.archetype = 'regular'
ORDER BY t.word_count DESC
LIMIT :limit

不过，也有另一种选择“4 秒最短”：（帖子数 x 4）/60。这是为了考虑没有字数的图片主题。所以它同时计算两者，并显示较大的那个。但我还没有完全弄清楚如何将其添加进去。

不幸的是，我的网站不够大，无法对其进行充分测试。它在小样本测试中似乎有效，但可能需要调整。

编辑： 我添加了一个“limit”参数，使其更接近 OP 的规格。

Richie · 2021 年11 月 26 日 21:42

太棒了，我想他明白了！

@JammyDodger 我运行了你的查询，这里有几张截图供参考。

首先是“前10名”：

果然如此：

有几个数字不太匹配，但非常接近了！

JammyDodger · 2021 年11 月 26 日 21:51

看来我肯定需要弄清楚如何添加照片。我还没有放弃。

JammyDodger · 2021 年11 月 27 日 10:49

我又试了一次。我对此不是100%确定，因为我没有足够大的样本来测试它，但它已经抓取了我的测试主题。

-- [params]
-- integer :limit = 10

WITH read_time AS (
SELECT t.id as topic_id,
(t.word_count)/500+1 as word_count_time,
(t.posts_count*4)/60+1 as post_count_time
FROM topics t
WHERE t.word_count IS NOT NULL
AND t.archetype = 'regular'
AND t.deleted_at IS NULL
)

SELECT topic_id, CONCAT (CASE WHEN word_count_time > post_count_time THEN word_count_time ELSE post_count_time END, ' min') AS estimated_reading_time
FROM read_time
ORDER BY estimated_reading_time DESC
LIMIT :limit

话题		回复	浏览量
Showing read time in the top categories pane in summary Feature	1	542	2018 年8 月 2 日
Discourse query for total read time Data & reporting sql-query	5	546	2023 年9 月 21 日
Get total list of topics and their view counts from Discourse API Dev	4	1762	2024 年12 月 4 日
Run Data Explorer queries with the Discourse API Integrations data-explorer , rest-api , how-to	14	13105	2025 年5 月 6 日
Is there any way to show approximate reading time in the topics list? Feature	11	1661	2021 年5 月 11 日

数据探索器查询：列出“估计阅读时间”最长的主题？

相关话题