By looking at the top 20…50 words used by a user, I hope to identify who cares about what the most, and then use the information to engage them in right topics and ask them right questions for better motivation.
What tables / columns would be useful for this task?
I suspect I should play with post_search_data.search_data, which is a tsvector, i.e. words and their frequencies per post. But if you think there’s a better way, please advise!
users are ordered by the number of words they posted in the last 90 days, most active posters first
only those messages are included that were posted in the last 90 days
instead of counting word tokens, the query counts lexems, so that “dogs” and “dog” are interpreted as the same word
Application
Control your language. I noticed I use the word “flag” too often. Maybe not a good habit for an admin who tries to be friendly!
Find who cares about what the most. For instance:
From the screenshot, I can see that user shcher86 used word cheese 517 times, cottage cheese 133 times and recipe 202 times. Now I know what to ask her about
Admin Brovarchanka really cares about goats (353), and also uses the :wink: smile all the time. I should care more about users and goats too, rather than flags!
SQL Query
WITH
posters AS (
SELECT SUM(p.word_count) as total_words,
u.*
FROM users u
LEFT JOIN posts p ON p.user_id = u.id
WHERE p.created_at > now()::date - 91
GROUP BY u.id
)
SELECT u.id as user_id,
(SELECT string_agg(word || ' - ' || nentry, ' ') FROM
(SELECT * FROM ts_stat('
SELECT search_data
FROM post_search_data psd
LEFT JOIN posts p ON p.id = psd.post_id
WHERE p.created_at > now()::date - 91
AND p.user_id = ' || u.id)
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 60
) freq
) as top_words
FROM posters u
ORDER BY u.total_words DESC
LIMIT 100
Tuning
LIMIT 100 limits to the top 100 posters; adjust to your needs
p.created_at > now()::date - 91 limits analysis to posts in last 90 days:
91 = 90 days
90 = 89 days
… and so on
You have to change it in 2 places in the query
LIMIT 60 limits output to 60 lexemes, that’s where you may change it if needs be
Enjoy learning your users’ language!
P.S. And don’t ask your users to flag too often
-- [params]
-- int :past_x_days = 90
-- int :top_x_words = 60
-- int :top_x_users = 100
WITH posters AS (
SELECT
sum (p.word_count ) AS total_words,
u.*
FROM
users u
LEFT JOIN posts p ON p.user_id = u.id
WHERE
p.created_at > now() ::date - ( :past_x_days + 1 )
GROUP BY
u.id
)
SELECT
u.id AS user_id,
(
SELECT
string_agg ( word || ' - ' || nentry, ' ' )
FROM (
SELECT
*
FROM
ts_stat ('
SELECT search_data FROM post_search_data psd
LEFT JOIN posts p
ON p.id = psd.post_id
WHERE p.created_at > now()::date - ' || (
:past_x_days + 1 )
|| ' AND p.user_id = ' || u.id
)
ORDER BY
nentry DESC,
ndoc DESC,
word
LIMIT :top_x_words
)
AS freq
)
AS top_words
FROM
posters AS u
ORDER BY
u.total_words DESC
LIMIT :top_x_users
A really cool query, but I’ve noticed some unexpected results.
Words from the Category and Topic tiles are being included in top_words, 1 time for each post made in the topic.
In other words, if the category is “ABC” and the topic is “XYZ” if a member makes 25 posts in various topics in the ABC category including 10 in the XXZ topic, even if they never made a post with either ABC or XYZ in it, the count will show as ABC - 25, XYZ - 10
On a positive note, words inside quotes, img tags and links don’t appear to be included.
I’ve been messing with queries using posts.raw and posts.cooked but it’s tricky business to not include words in quotes, image tags etc. i.e. non-posted words - Note, this is a very buggy example regexp_matches(posts.raw, '\[quote(?:[^]])*\][^[]*\[\/quote]', 'g')
I’m thinking it might be easier to subtract category and title words from those found in the post_search_data table.
As an example this for the default “Staff About” search_data 'admin':11 'categori':2,17 'discuss':5 'moder':13 'privat':1 'staff':4,16,18 'topic':6 'visibl':9
raw_data
Private category for staff discussions. Topics are only visible to admins and moderators. About the Staff category Staff
Yes, there are a few words that I guess aren’t included because they aren’t considered helpful as search terms.
I can see the logic for not including “for”, “are”, “only”, “to”, “and”, “the” and the like.
But I am a bit surprised “about” isn’t considered a word that might be searched for
lol That’s what I get for being lazy and typing only “Staff About” in hopes you would know what I meant.
The category is - Staff (18)
The topic title is - About the Staff category (14-17)
The post content is - Private category for staff discussions. Topics are only visible to admins and moderators. (1-13)
I wouldn’t say it’s entirely useless. Only that the results are corrupt.
I was thinking of trying some kind of query like
word count = word count - ((category title + topic title words) x number of posts in topic)
but my Postgres chops aren’t all that great so it will take me some time if ever to come up with it.
Still, I like using post_search_data over posts as I think it would be easier to deal with the cat and topic titles than it would be to deal with quotes, images, links etc. that are in raw / cooked
Unsure if I should start a new topic or not but… would it be possible to tweak this SQL to get a list of the X most common words used in the last Y posts, regardless of user? I’m trying to look for common technical words I can use to create a dictionary on my forum.