SQL: The most N used words per user (speak their language!)

meglio · January 27, 2016, 5:55am

By looking at the top 20…50 words used by a user, I hope to identify who cares about what the most, and then use the information to engage them in right topics and ask them right questions for better motivation.

What tables / columns would be useful for this task?

I suspect I should play with post_search_data.search_data, which is a tsvector, i.e. words and their frequencies per post. But if you think there’s a better way, please advise!

zogstrip · January 27, 2016, 9:01am

“post_search_data.search_data” is fine if you want to work on lexemes since

a “tsvector” value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word.

If you want to work on raw words, then it’s better to use “posts.raw”

meglio · January 28, 2016, 1:51am

So, here it is:

“Speak their language” SQL query

This query fetches 60 lexemes that are most frequently used by users in their messages from last 90 days, per user.

Notes

users are ordered by the number of words they posted in the last 90 days, most active posters first
only those messages are included that were posted in the last 90 days
instead of counting word tokens, the query counts lexems, so that “dogs” and “dog” are interpreted as the same word

Application

Control your language. I noticed I use the word “flag” too often. Maybe not a good habit for an admin who tries to be friendly!
Find who cares about what the most. For instance:
- From the screenshot, I can see that user shcher86 used word cheese 517 times, cottage cheese 133 times and recipe 202 times. Now I know what to ask her about
- Admin Brovarchanka really cares about goats (353), and also uses the :wink: smile all the time. I should care more about users and goats too, rather than flags!

SQL Query

WITH

posters AS (
  SELECT SUM(p.word_count) as total_words,
    u.*
    
  FROM users u
  LEFT JOIN posts p ON p.user_id = u.id
  WHERE p.created_at > now()::date - 91
  GROUP BY u.id
)


SELECT u.id as user_id,
  
  (SELECT string_agg(word || ' - ' || nentry, ' ') FROM

    (SELECT * FROM ts_stat('
    
      SELECT search_data
      FROM post_search_data psd
      LEFT JOIN posts p ON p.id = psd.post_id
      WHERE p.created_at > now()::date - 91
        AND p.user_id = ' || u.id)

    ORDER BY nentry DESC, ndoc DESC, word
    LIMIT 60
    ) freq
  ) as top_words
  
  
FROM posters u
ORDER BY u.total_words DESC
LIMIT 100

Tuning

LIMIT 100 limits to the top 100 posters; adjust to your needs
p.created_at > now()::date - 91 limits analysis to posts in last 90 days:
91 = 90 days
90 = 89 days
… and so on
You have to change it in 2 places in the query
LIMIT 60 limits output to 60 lexemes, that’s where you may change it if needs be

Enjoy learning your users’ language!
P.S. And don’t ask your users to flag too often

DeanMarkTaylor · April 10, 2016, 2:50am

Well done and thanks @meglio, interesting results across a few Discourse instances.

I turned this into a parametrised query for Data Explorer (@riking awesome job on this plugin!)

most-used-words-per-user.dcquery (1).json (1.8 KB)

-- [params]
-- int :past_x_days = 90
-- int :top_x_words = 60
-- int :top_x_users = 100

WITH posters AS (
    SELECT
        sum (p.word_count ) AS total_words,
        u.*
    FROM
        users u
    LEFT JOIN posts p ON p.user_id = u.id
WHERE
    p.created_at > now() ::date - ( :past_x_days + 1 )
GROUP BY
    u.id
)

SELECT
    u.id AS user_id,
    (
        SELECT
            string_agg ( word || ' - ' || nentry, ' ' )
        FROM (
                SELECT
                    *
                FROM
                    ts_stat ('
                            SELECT search_data FROM post_search_data psd
                            LEFT JOIN posts p
                            ON p.id = psd.post_id
                            WHERE p.created_at > now()::date - ' || (
                            :past_x_days + 1 )
                        || ' AND p.user_id = ' || u.id
                    )
                ORDER BY
                    nentry DESC,
                    ndoc DESC,
                    word
                LIMIT :top_x_words
        )
        AS freq
    )
    AS top_words
FROM
    posters AS u
ORDER BY
    u.total_words DESC
LIMIT :top_x_users

Mittineague · April 10, 2016, 5:30am

A really cool query, but I’ve noticed some unexpected results.

Words from the Category and Topic tiles are being included in top_words, 1 time for each post made in the topic.

In other words, if the category is “ABC” and the topic is “XYZ” if a member makes 25 posts in various topics in the ABC category including 10 in the XXZ topic, even if they never made a post with either ABC or XYZ in it, the count will show as
ABC - 25, XYZ - 10

On a positive note, words inside quotes, img tags and links don’t appear to be included.

meglio · April 10, 2016, 6:24am

I don’t extract words on my own - I use topic search data vector generated in PostgreSQL that Discourse builds for search purposes.

Mittineague · April 10, 2016, 6:34am

I’ve been messing with queries using posts.raw and posts.cooked but it’s tricky business to not include words in quotes, image tags etc. i.e. non-posted words - Note, this is a very buggy example
regexp_matches(posts.raw, '\[quote(?:[^]])*\][^[]*\[\/quote]', 'g')

I’m thinking it might be easier to subtract category and title words from those found in the post_search_data table.

meglio · April 10, 2016, 6:36am

So, do you think the words from category name and title are automatically added to post_search_data? Hm that’s not good. Yes we should extract them.

Mittineague · April 10, 2016, 7:08am

As an example this for the default “Staff About”
search_data
'admin':11 'categori':2,17 'discuss':5 'moder':13 'privat':1 'staff':4,16,18 'topic':6 'visibl':9

raw_data

Private category for staff discussions. Topics are only visible to admins and moderators. About the Staff category Staff

meglio · April 10, 2016, 11:02am

So, I can’t see the “about” word in the data (Staff About).

Mittineague · April 10, 2016, 6:53pm

Yes, there are a few words that I guess aren’t included because they aren’t considered helpful as search terms.

I can see the logic for not including “for”, “are”, “only”, “to”, “and”, “the” and the like.
But I am a bit surprised “about” isn’t considered a word that might be searched for

DeanMarkTaylor · April 10, 2016, 7:17pm

List of stop words here for reference:
https://apt-browse.org/browse/ubuntu/trusty/main/i386/postgresql-9.3/9.3.4-1/file/usr/share/postgresql/9.3/tsearch_data/english.stop

The idea is that these words appear in “most” English “texts” regardless of subject.

meglio · April 11, 2016, 1:52am

@Mittineague, back to your example, I still can’t understand where is the issue.
There are only 3 indexes for the word staff:

'staff':4,16,18

They can all be found in the text:

Private category for staff discussions. Topics are only visible to admins and moderators. About the Staff category Staff

No “staff” word is taken from “Staff About”.

Or did I misinterpret your explanations?

Mittineague · April 11, 2016, 2:50am

lol That’s what I get for being lazy and typing only “Staff About” in hopes you would know what I meant.

The category is - Staff (18)
The topic title is - About the Staff category (14-17)
The post content is - Private category for staff discussions. Topics are only visible to admins and moderators. (1-13)

meglio · April 11, 2016, 3:04am

Thanks, now it makes sense. So, the query is useless as is unless it is fixed to exclude “category keywords” from the statistics.

Mittineague · April 11, 2016, 3:17am

I wouldn’t say it’s entirely useless. Only that the results are corrupt.

I was thinking of trying some kind of query like

word count = word count - ((category title + topic title words) x number of posts in topic)

but my Postgres chops aren’t all that great so it will take me some time if ever to come up with it.

Still, I like using post_search_data over posts as I think it would be easier to deal with the cat and topic titles than it would be to deal with quotes, images, links etc. that are in raw / cooked

meglio · April 11, 2016, 3:22am

The solution would be to create a ts_vector from category title and topic title and subtract that vector from the result, per-topic.

saori · May 19, 2016, 5:51am

How can I change my website to led users see the search results of these words?

meglio · May 19, 2016, 6:06am

Just in case you’d like to show the query results in wordpress:

DiscourseMetrics · May 17, 2019, 8:50pm

Unsure if I should start a new topic or not but… would it be possible to tweak this SQL to get a list of the X most common words used in the last Y posts, regardless of user? I’m trying to look for common technical words I can use to create a dictionary on my forum.

Topic		Replies	Views
Understanding what our users are posting about Data & reporting sql-query	9	259	June 20, 2024
SQL query in Data Explorer to pull latest topics? Data & reporting sql-query	11	2819	November 21, 2022
Top X posters in a given timeframe Data & reporting sql-query	3	140	May 28, 2024
See what categories new users (& other trust levels) post in most Data & reporting sql-query	3	1839	February 13, 2018
User-posted code blocks in a category Data & reporting sql-query	3	221	May 13, 2024