SQL: The N most-used words per user (speak their language!)

By looking at the top 20…50 words each user uses, I hope to identify who cares about what the most, and then use that information to engage them in the right topics and ask them the right questions for better motivation.

What tables / columns would be useful for this task?


I suspect I should play with post_search_data.search_data, which is a tsvector, i.e. words and their frequencies per post. But if you think there’s a better way, please advise!

“post_search_data.search_data” is fine if you want to work on lexemes, since

a “tsvector” value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word.

If you want to work on raw words, then it’s better to use “posts.raw” :wink:
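For illustration, here is how that normalization looks in practice (a minimal sketch, assuming the default `english` search configuration): variants of a word collapse into one lexeme, and stop words are dropped.

```sql
-- "dogs" and "dog" both normalize to the lexeme 'dog';
-- stop words ("the", "a") are dropped, but positions are preserved
SELECT to_tsvector('english', 'The dogs chased a dog');
-- → 'chase':3 'dog':2,5
```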


So, here it is:

“Speak their language” SQL query

This query fetches, per user, the 60 lexemes most frequently used by that user in their posts from the last 90 days.

Notes

  • users are ordered by the number of words they posted in the last 90 days, most active posters first
  • only posts from the last 90 days are included
  • instead of counting word tokens, the query counts lexemes, so that “dogs” and “dog” are counted as the same word
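As a quick sketch of how `ts_stat` produces those counts: it takes a query (as text) returning a `tsvector` column and reports, per lexeme, `ndoc` (number of rows containing it) and `nentry` (total occurrences). The inline example here is illustrative, not from the forum schema:

```sql
-- ts_stat aggregates lexeme statistics over a set of tsvectors
SELECT word, ndoc, nentry
FROM ts_stat($$ SELECT to_tsvector('english', 'dog dogs cat') $$)
ORDER BY nentry DESC;
-- → 'dog' has nentry = 2, because both tokens normalize to one lexeme
```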

Application

  • Control your language. I noticed I use the word “flag” too often. Maybe not a good habit for an admin who tries to be friendly!
  • Find who cares about what the most. For instance:
    • From the screenshot, I can see that user shcher86 used the word “cheese” 517 times, “cottage cheese” 133 times and “recipe” 202 times. Now I know what to ask her about :wink:
    • Admin Brovarchanka really cares about goats (353), and also uses the :wink: smiley all the time. I should care more about users and goats too, rather than flags!

SQL Query

WITH posters AS (
  SELECT SUM(p.word_count) AS total_words,
         u.*
  FROM users u
  LEFT JOIN posts p ON p.user_id = u.id
  WHERE p.created_at > now()::date - 91
  GROUP BY u.id
)

SELECT u.id AS user_id,
  (SELECT string_agg(word || ' - ' || nentry, ' ')
   FROM (
     SELECT *
     FROM ts_stat('
       SELECT search_data
       FROM post_search_data psd
       LEFT JOIN posts p ON p.id = psd.post_id
       WHERE p.created_at > now()::date - 91
         AND p.user_id = ' || u.id)
     ORDER BY nentry DESC, ndoc DESC, word
     LIMIT 60
   ) AS freq
  ) AS top_words
FROM posters u
ORDER BY u.total_words DESC
LIMIT 100

Tuning

  • LIMIT 100 limits output to the top 100 posters; adjust to your needs
  • p.created_at > now()::date - 91 limits analysis to posts from the last 90 days:
    91 = 90 days
    90 = 89 days
    … and so on.
    You have to change it in 2 places in the query.
  • LIMIT 60 limits output to 60 lexemes per user; that’s where you may change it if need be

Enjoy learning your users’ language!
P.S. And don’t ask your users to flag too often :wink:


Well done and thanks @meglio, interesting results across a few Discourse instances.

I turned this into a parametrised query for Data Explorer (@riking awesome job on this plugin!)

most-used-words-per-user.dcquery (1).json (1.8 KB)

-- [params]
-- int :past_x_days = 90
-- int :top_x_words = 60
-- int :top_x_users = 100

WITH posters AS (
    SELECT SUM(p.word_count) AS total_words,
           u.*
    FROM users u
    LEFT JOIN posts p ON p.user_id = u.id
    WHERE p.created_at > now()::date - (:past_x_days + 1)
    GROUP BY u.id
)

SELECT u.id AS user_id,
    (
        SELECT string_agg(word || ' - ' || nentry, ' ')
        FROM (
            SELECT *
            FROM ts_stat('
                SELECT search_data FROM post_search_data psd
                LEFT JOIN posts p ON p.id = psd.post_id
                WHERE p.created_at > now()::date - ' || (:past_x_days + 1)
                || ' AND p.user_id = ' || u.id)
            ORDER BY nentry DESC, ndoc DESC, word
            LIMIT :top_x_words
        ) AS freq
    ) AS top_words
FROM posters AS u
ORDER BY u.total_words DESC
LIMIT :top_x_users

A really cool query, but I’ve noticed some unexpected results.

Words from the category and topic titles are being included in top_words, once for each post made in the topic.

In other words, if the category is “ABC” and the topic is “XYZ”, and a member makes 25 posts in various topics in the ABC category, including 10 in the XYZ topic, then even if they never made a post containing either ABC or XYZ, the count will show as
ABC - 25, XYZ - 10

On a positive note, words inside quotes, img tags and links don’t appear to be included.


I don’t extract words on my own; I use the search data vector that Discourse builds in PostgreSQL for search purposes.

I’ve been messing with queries using posts.raw and posts.cooked, but it’s tricky to exclude words in quotes, image tags, etc. (i.e. non-posted words). Note: this is a very buggy example:
regexp_matches(posts.raw, '\[quote(?:[^]])*\][^[]*\[\/quote]', 'g')

I’m thinking it might be easier to subtract category and title words from those found in the post_search_data table.

So, do you think the words from the category name and topic title are automatically added to post_search_data? Hm, that’s not good. Yes, we should exclude them.

As an example this for the default “Staff About”
search_data
'admin':11 'categori':2,17 'discuss':5 'moder':13 'privat':1 'staff':4,16,18 'topic':6 'visibl':9

raw_data

Private category for staff discussions. Topics are only visible to admins and moderators. About the Staff category Staff

So, I can’t see the “about” word in the data (Staff About).

Yes, there are a few words that I guess aren’t included because they aren’t considered helpful as search terms.

I can see the logic for not including “for”, “are”, “only”, “to”, “and”, “the” and the like,
but I am a bit surprised “about” isn’t considered a word that might be searched for.

List of stop words here for reference:
https://apt-browse.org/browse/ubuntu/trusty/main/i386/postgresql-9.3/9.3.4-1/file/usr/share/postgresql/9.3/tsearch_data/english.stop

The idea is that these words appear in “most” English “texts” regardless of subject.
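The stop-word behaviour is easy to verify directly (a minimal sketch, assuming the `english` configuration), using the same “About the Staff category” title discussed above:

```sql
-- "About" and "the" are on the english stop word list, so they are
-- dropped from the vector; positions of the remaining words are kept
SELECT to_tsvector('english', 'About the Staff category');
-- → 'categori':4 'staff':3
```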


@Mittineague, back to your example, I still can’t understand where the issue is.
There are only 3 positions for the word staff:

'staff':4,16,18

They can all be found in the text:

Private category for staff discussions. Topics are only visible to admins and moderators. About the Staff category Staff

No “staff” word is taken from “Staff About”.

Or did I misinterpret your explanations?

Lol, that’s what I get for being lazy and typing only “Staff About” in the hope you would know what I meant.

The category is - Staff (18)
The topic title is - About the Staff category (14-17)
The post content is - Private category for staff discussions. Topics are only visible to admins and moderators. (1-13)

Thanks, now it makes sense. So, the query is useless as is unless it is fixed to exclude “category keywords” from the statistics.

I wouldn’t say it’s entirely useless. Only that the results are corrupt.

I was thinking of trying some kind of query like

word count = word count - ((category title + topic title words) x number of posts in topic)

but my Postgres chops aren’t all that great so it will take me some time if ever to come up with it.

Still, I prefer using post_search_data over posts, as I think it would be easier to deal with the category and topic titles than with the quotes, images, links, etc. that are in raw / cooked.

The solution would be to create a tsvector from the category title and topic title, and subtract that vector from the result, per topic.
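A minimal sketch of that idea, assuming PostgreSQL 9.6+ (for `ts_delete` and `tsvector_to_array`); the joins to `topics` and `categories` are assumptions about the schema, so adjust them to your instance:

```sql
-- Remove the lexemes of the topic title and category name from each
-- post's search vector before feeding the result to ts_stat.
-- NOTE: the topics/categories joins are assumed; verify column names.
SELECT ts_delete(
         psd.search_data,
         tsvector_to_array(to_tsvector('english', t.title || ' ' || c.name))
       ) AS cleaned_search_data
FROM post_search_data psd
JOIN posts p      ON p.id = psd.post_id
JOIN topics t     ON t.id = p.topic_id
JOIN categories c ON c.id = t.category_id
```

`ts_delete` removes the given lexemes wherever they occur, so posts that genuinely use a title word would lose those occurrences too; that is the trade-off of this approach.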


How can I change my website to let users see the search results for these words?

Just in case you’d like to show the query results in WordPress:

Unsure if I should start a new topic or not but… would it be possible to tweak this SQL to get a list of the X most common words used in the last Y posts, regardless of user? I’m trying to look for common technical words I can use to create a dictionary on my forum.
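Not the author here, but that variant looks straightforward: drop the per-user grouping and feed `ts_stat` the last Y posts directly. A sketch in Data Explorer style, under the assumption that the parameter names below are new (not from the queries above):

```sql
-- [params]
-- int :last_x_posts = 1000
-- int :top_x_words = 60

-- Most common lexemes across the last :last_x_posts posts, any user
SELECT word, ndoc, nentry
FROM ts_stat('
  SELECT psd.search_data
  FROM post_search_data psd
  JOIN posts p ON p.id = psd.post_id
  ORDER BY p.created_at DESC
  LIMIT ' || :last_x_posts)
ORDER BY nentry DESC, ndoc DESC, word
LIMIT :top_x_words
```

Note the caveat discussed earlier in this topic still applies: category and topic title words are counted once per post unless you subtract them.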