Discourse AI + Data Explorer?

I decided I wanted a particularly query today. Not wanting to bug my engineers (too much) I decided to ask ChatGPT to write a data explorer query for me based on what I knew, and what I knew I wanted. I was pretty blown away that it got everything correct (aside from adding a semicolon at the end that it shouldn’t have).

Here’s the query it built for me when I gave it some parameters and asked it to build me a query that gave me a running total number of all users by month:

SELECT 
    timeframe "Month, Year",
    user_count as "Users Joined This Month",
    SUM(user_count) OVER (ORDER BY TO_DATE(timeframe, 'Month, YYYY'))::integer as "Users (Running Total)"
FROM (
    SELECT 
        TO_CHAR(u.created_at, 'Month, YYYY') as "timeframe",
        count(u.id) as "user_count"
    FROM 
        users u
    GROUP BY 
        timeframe
    ORDER BY
        timeframe
) subquery
ORDER BY 
    TO_DATE(timeframe, 'Month, YYYY')

Has there been consideration for integrating AI into the data explorer area? :slight_smile:

12 Likes

A post was split to a new topic: Could we re-open the topic to share Explorer queries?

Care to share the prompt, the version of ChatGPT and the model used with ChatGPT. :slightly_smiling_face:

For example


ChatGPT August 3 Version
Model: GPT-3.5
Prompt

What is Discourse Data Explorer?

Completion

As of my last update in September 2021, there is no specific tool or product called “Discourse Data Explorer” that I am aware of. However, I can provide a general explanation based on the information available up to that point.

Discourse is an open-source, modern forum software that allows online communities to have discussions and interactions. It’s widely used for hosting discussions, Q&A forums, and support communities. Discourse provides a rich set of features for managing discussions, user engagement, and moderation.

A “Data Explorer” typically refers to a tool or feature that allows users to query and explore data within a system or platform. It’s often used in the context of databases or data analytics platforms.

Given the two components, “Discourse Data Explorer” could potentially be a hypothetical tool or plugin that provides users with the ability to explore and query data within a Discourse forum. This could be helpful for community administrators or data analysts who want to gain insights into user behavior, content trends, or other metrics related to the discussions taking place on the forum.

It’s important to note that my information might be outdated, and new developments might have occurred after my last update. If “Discourse Data Explorer” is indeed a real product or feature, I recommend searching for the latest information to get accurate details about its capabilities and usage.


The reason for asking is that you have an excellent idea and it was nice to see it shared but in the future others may try to ask another prompt and sharing the prompt, completion and details of ChatGPT are needed if help is needed to understand any problems.

Note: I am a category moderator on the OpenAI Discourse forum so have lots of experience with those seeking help with ChatGPT prompts. (ref)


The version of ChatGPT can be found in the lower right below where a prompt is entered

The model can be found near the top of the page in the middle

image

1 Like

I…don’t like this bot.

1 Like

Extending this idea a bit, allowing users to interact with applications using natural language is pretty much my dream as far as AI goes. With Discourse and the Data Explorer plugin, it could be possible to get results from prompts like “return a list of users who have expressed interest in moderation.” Or even better, “send an invitation to the community_moderation group to all users who have expressed an interest in moderation.”

Getting back to what’s possible now:

Testing with ChatGPT 3.0 and 3.5, I’ve gotten good results when asking it to return a PostgreSQL query and copying the structure of the tables I’m interested in into the prompt. For example:

Summary

Me: Return a postgresql query that lists all usernames for users who are in the “community_moderators” group for the following database structure:

# == Schema Information
#
# Table name: groups
#
#  id                                 :integer          not null, primary key
#  name                               :string           not null

# == Schema Information
#
# Table name: group_users
#
#  id                 :integer          not null, primary key
#  group_id           :integer          not null

# == Schema Information
#
# Table name: users
#
#  id                        :integer          not null, primary key
#  username                  :string(60)       not null

ChatGPT 3.5: Sure, here’s a PostgreSQL query that lists all usernames for users who are in the “community_moderators” group based on the given database structure:

SELECT u.username
FROM users u
JOIN group_users gu ON u.id = gu.user_id
JOIN groups g ON gu.group_id = g.id
WHERE g.name = 'community_moderators';

This query retrieves the usernames from the users table by joining it with the group_users table on the user ID and then joining with the groups table on the group ID. It filters the results to only include usernames for users who are in the “community_moderators” group.

Me: Here is the structure of the application’s topics and posts tables:

# == Schema Information
#
# Table name: topics
#
#  id                        :integer          not null, primary key
#  title                     :string           not null

# == Schema Information
#
# Table name: posts
#
#  id                      :integer          not null, primary key
#  user_id                 :integer
#  topic_id                :integer          not null
#  post_number             :integer          not null
#  raw                     :text             not null

Return a query that lists all topic ids, post ids, and usernames for all posts created by members of the “community_moderators” group that give some indication that the poster is interested in how to moderate polarizing discussions.

ChatGPT 3.5: To achieve this, you would need to join the posts , topics , users , and group_users tables based on the provided schema. Here’s a PostgreSQL query that retrieves the requested information:

SELECT t.id AS topic_id, p.id AS post_id, u.username
FROM posts p
JOIN topics t ON p.topic_id = t.id
JOIN users u ON p.user_id = u.id
JOIN group_users gu ON u.id = gu.user_id
JOIN groups g ON gu.group_id = g.id
WHERE g.name = 'community_moderators'
  AND (p.raw ILIKE '%moderate%' OR p.raw ILIKE '%polarizing%' OR p.raw ILIKE '%discussion%');

This query retrieves the topic_id, post_id, and username for all posts created by members of the “community_moderators” group that contain certain keywords indicating an interest in moderating polarizing discussions. The ILIKE operator is used for case-insensitive pattern matching in the raw column of the posts table.

Please note that this query assumes that the keywords you provided are indicative of the desired posts. You might need to adjust or expand the keywords based on your specific criteria.

Me: Write a query that returns a count of posts created in each weekly period since September 2022, created by members of the community_moderators group that indicate some interest in how to moderate polarizing discussions.

ChatGPT 3.5: To achieve this, you can use the following PostgreSQL query to return a count of posts created in each weekly period since September 2022, created by members of the “community_moderators” group that indicate an interest in how to moderate polarizing discussions:

SELECT
    DATE_TRUNC('week', p.created_at) AS week_start,
    COUNT(p.id) AS post_count
FROM
    posts p
JOIN
    users u ON p.user_id = u.id
JOIN
    group_users gu ON u.id = gu.user_id
JOIN
    groups g ON gu.group_id = g.id
WHERE
    g.name = 'community_moderators'
    AND (p.raw ILIKE '%moderate%'
         OR p.raw ILIKE '%polarizing%'
         OR p.raw ILIKE '%political%'
         OR p.raw ILIKE '%contentious%'
         OR p.raw ILIKE '%debate%'
         OR p.raw ILIKE '%controversial%'
         OR p.raw ILIKE '%discussion%')
    AND p.created_at >= '2022-09-01' -- Start date: September 2022
GROUP BY
    week_start
ORDER BY
    week_start;

This query calculates the count of posts created by members of the “community_moderators” group that contain the specified keywords in each weekly period since September 2022. It groups the results by week and orders them chronologically. Please make sure to adjust the column names and table relationships based on your actual database schema.


The results seem good to me and represent the types of queries that took me a fair amount of time to write in the past. I’m assuming it would be possible to train a model on the Discourse database structure so that details about the structure could be left out of the prompts.

4 Likes

oh this is great, i’ve been having mixed results with chatGPT data explorer sql queries but doing it this way makes so much sense. i didn’t even think of providing table structure and asking for postgreSQL queries. :woman_facepalming:t2: thank you for the tip @simon :slight_smile:

1 Like

Is there anywhere I can see a complete schema of the discourse database(s)?

1 Like

hi @so-jordan-violet, i just use the collapsible table tool thing (i don’t know what it’s called! lol) on the right side in the data explorer query edit interface:

2 Likes

If we can get a text-based document (especially if it’s in the sql create table format) we could supply that as instructions and probably get pretty reliable results.

1 Like

Yes, we have this item on the AI team backlog since early this year. The idea is to ingest the whole (or the important part) of the Discourse schema in the prompt to allow it to create any queries.

We’ve been building GitHub - discourse/data-explorer-examples: Examples for Discourse Data Explorer to help provide examples to the model, so it can be better grounded.

7 Likes

I wish I had more time to peruse the code base, but do you have the SQL stored somewhere (publicly) for the crest table schemas?

I’m doing some LLM work with my own product/company right now and would love to play with this.

1 Like

oh this is great. thanks for the link Falco! :slight_smile:

3 Likes

I just generate this direct from the schema

Something like this should get you started:


I have been exploring all sorts of approaches for getting this going. One problem though is that GPT 3.5 is just so steerable it gets confused.

Giving it schema certainly helps, but I am finding that I need to be ultra strategic with examples.

Just telling it a story about “how to count the archived topics for a user” ends up steering a query such as “how many posts did a user have” towards archived topics.

I was thinking:

graph TD
A[Add generic helpful information] --> B[Generate embeddings for question]
B --> C[Do a similarity search across examples]
C --> D[Add 3 examples to the prompt]
D --> E[Add schema to the prompt]
E --> F[Ask question]
F --> G[Get SQL]

But it turns out that unless the examples are spectacular and you do not include any counter examples … stuff gets bad.

I am starting to think that the best we have for now is just giving it schema and a preamble. Maybe adding another step at the end for query validation and retry.

5 Likes

Would you have a complete schema that shows data types?

Not sure if it fits into token budget, will try

With ChatCPT 3.5 I’m getting unexpected results when I provide the full schema. For example, it keeps trying to find the topic author in the topic_users table.

I get successful results by limiting the information in the prompt to the tables and columns that are required to write the query. Also by adding some additional information about how the columns are used. For example:

# Table name: user_actions
#
#  id              :integer          not null, primary key
#  action_type     :integer          not null (:like=>1,:was_liked=>2,:new_topic=>4,:reply=>5,:response=>6,:mention=>7,:quote=>9,:edit=>11,:new_private_message=>12,:got_private_message=>13,:solved=>15,:assigned=>16)
#  user_id         :integer          not null (the user who is to be credited with the action)
#  target_topic_id :integer
#  acting_user_id  :integer (the user who performed the action, for example, a staff user can perform an action on behalf of a regular user)
#  created_at      :datetime         not null

Another way to look at getting this to work with the Data Explorer plugin would be to have users fill out a dynamically generate form that lists the data they are looking for, and what conditions they want to apply to the data. The prompt could be generated programmatically by Discourse, then sent to the LLM to have the query written.

Edit: @sp-jordan-violet, this might be a useful starting point. Here’s an annotated partial schema that’s working for me. It works well as long as I limit my queries so that they can be answered by the provided schema. I’ve added additional details to the schema to clarify details that seem to confuse ChatGPT. The obvious downside of this approach is that a fully annotated schema will exceed ChatGPT 3.5’s token limit. I’ve included some information in the schema that could be omitted to reduce the number of tokens that are used.

Summary
# == Schema Information
#
# Table name: application_requests
#
#  id       :integer          not null, primary key
#  date     :date             not null
#  req_type :integer          not null ("http_total"=>0,"http_2xx"=>1,"http_background"=>2,"http_3xx"=>3,"http_4xx"=>4,"http_5xx"=>5,"page_view_crawler"=>6,"page_view_logged_in"=>7,"page_view_anon"=>8,"page_view_logged_in_mobile"=>9,"page_view_anon_mobile"=>10,"api"=>11,"user_api"=>12)
#  count    :integer          default(0), not null
#
# Table name: users
#
#  id                        :integer          not null, primary key
#  username                  :string(60)       not null
#  created_at                :datetime         not null
#  updated_at                :datetime         not null
#  name                      :string           (the user's real name)
#  last_posted_at            :datetime
#  active                    :boolean          default(FALSE), not null
#  username_lower            :string(60)       not null
#  last_seen_at              :datetime
#  admin                     :boolean          default(FALSE), not null
#  trust_level               :integer          not null
#  approved                  :boolean          default(FALSE), not null
#  approved_by_id            :integer
#  approved_at               :datetime
#  previous_visit_at         :datetime
#  suspended_at              :datetime
#  suspended_till            :datetime
#  date_of_birth             :date
#  ip_address                :inet
#  moderator                 :boolean          default(FALSE)
#  title                     :string
#  locale                    :string(10)
#  primary_group_id          :integer
#  registration_ip_address   :inet
#  staged                    :boolean          default(FALSE), not null
#  first_seen_at             :datetime
#  silenced_till             :datetime
#
# Table name: topics
#
#  id                        :integer          not null, primary key
#  title                     :string           not null
#  last_posted_at            :datetime
#  created_at                :datetime         not null
#  updated_at                :datetime         not null
#  views                     :integer          default(0), not null
#  posts_count               :integer          default(0), not null
#  user_id                   :integer          (the id of the user who created the topic)
#  last_post_user_id         :integer          not null (the id of the user who created the last post in the topic)
#  reply_count               :integer          default(0), not null
#  deleted_at                :datetime
#  highest_post_number       :integer          default(0), not null
#  like_count                :integer          default(0), not null
#  category_id               :integer
#  visible                   :boolean          default(TRUE), not null
#  moderator_posts_count     :integer          default(0), not null
#  closed                    :boolean          default(FALSE), not null
#  archived                  :boolean          default(FALSE), not null
#  bumped_at                 :datetime         not null
#  archetype                 :string           default("regular"), not null (can be set to either "regular" or "private_message")
#  slug                      :string
#  deleted_by_id             :integer          (the id of the user who deleted the topic)
#  participant_count         :integer          default(1)
#  word_count                :integer
#  excerpt                   :string
#  highest_staff_post_number :integer          default(0), not null
#
# Table name: posts
#
#  id                      :integer          not null, primary key
#  user_id                 :integer          (the id of the user who created the post)
#  topic_id                :integer          not null
#  post_number             :integer          not null (indicates the post's order in its topic)
#  raw                     :text             not null (the post's content)
#  created_at              :datetime         not null
#  updated_at              :datetime         not null
#  reply_to_post_number    :integer          (the post_number that the post is a reply to)
#  reply_count             :integer          default(0), not null
#  deleted_at              :datetime
#  like_count              :integer          default(0), not null
#  bookmark_count          :integer          default(0), not null
#  reads                   :integer          default(0), not null (the number of times the post has been read)
#  post_type               :integer          default(1), not null (:regular=>1, :moderator_action=>2, :small_action=>3, :whisper=>4)
#  last_editor_id          :integer          (the id of the user who last edited the post)
#  hidden                  :boolean          default(FALSE), not null
#  hidden_reason_id        :integer          (:flag_threshold_reached=>1,:flag_threshold_reached_again=>2,:new_user_spam_threshold_reached=>3,:flagged_by_tl3_user=>4,:email_spam_header_found=>5,:flagged_by_tl4_user=>6,:email_authentication_result_header=>7,:imported_as_unlisted=>8)
#  edit_reason             :string
#  word_count              :integer
#  wiki                    :boolean          default(FALSE), not null
#
# Table name: categories
#
#  id                                        :integer          not null, primary key
#  name                                      :string(50)       not null
#  topic_id                                  :integer          (the id of the topic that is used for the category's description)
#  topic_count                               :integer          default(0), not null
#  created_at                                :datetime         not null
#  updated_at                                :datetime         not null
#  user_id                                   :integer          not null (the id of the user who created the topic)
#  topics_year                               :integer          default(0)
#  topics_month                              :integer          default(0)
#  topics_week                               :integer          default(0)
#  slug                                      :string           not null
#  description                               :text
#  text_color                                :string(6)        default("FFFFFF"), not null
#  read_restricted                           :boolean          default(FALSE), not null
#  auto_close_hours                          :float
#  post_count                                :integer          default(0), not null
#  latest_post_id                            :integer
#  latest_topic_id                           :integer
#  position                                  :integer
#  parent_category_id                        :integer
#  posts_year                                :integer          default(0)
#  posts_month                               :integer          default(0)
#  posts_week                                :integer          default(0)
#  topics_day                                :integer          default(0)
#  posts_day                                 :integer          default(0)
#
# Table name: groups
#
#  id                                 :integer          not null, primary key
#  name                               :string           not null
#  created_at                         :datetime         not null
#  automatic                          :boolean          default(FALSE), not null
#  user_count                         :integer          default(0), not null
#  title                              :string
#  bio_raw                            :text             (the group's description)
#  allow_membership_requests          :boolean          default(FALSE), not null
#  full_name                          :string
#  visibility_level                   :integer          default(0), not null (who can see the group :public=>0, :logged_on_users=>1, :members=>2, :staff=>3, :owners=>4)
#  messageable_level                  :integer          default(0) (who can message the group :public=>0, :logged_on_users=>1, :members=>2, :staff=>3, :owners=>4)
#  mentionable_level                  :integer          default(0) (who can mention the group :public=>0, :logged_on_users=>1, :members=>2, :staff=>3, :owners=>4)
#  members_visibility_level           :integer          default(0), not null (who see the group's members :public=>0, :logged_on_users=>1, :members=>2, :staff=>3, :owners=>4)
#
# Table name: group_users
#
#  id                 :integer          not null, primary key
#  group_id           :integer          not null
#  user_id            :integer          not null
#  created_at         :datetime         not null
#
# Table name: user_actions
#
#  id              :integer          not null, primary key
#  action_type     :integer          not null (:like=>1,:was_liked=>2,:new_topic=>4,:reply=>5,:response=>6,:mention=>7,:quote=>9,:edit=>11,:new_private_message=>12,:got_private_message=>13,:solved=>15,:assigned=>16)
#  user_id         :integer          not null (the user who is to be credited with the action)
#  target_topic_id :integer
#  target_post_id  :integer
#  target_user_id  :integer          (for example, the id of the user whose post was liked)
#  acting_user_id  :integer          (the user who performed the action, for example, a staff user can perform an action on behalf of a regular user)
#  created_at      :datetime         not null
#  updated_at      :datetime         not null
#
# Table name: topic_views
#
#  topic_id   :integer          not null
#  viewed_at  :date             not null
#  user_id    :integer         (will be set if the topic was viewed by a logged in user)
#  ip_address :inet            (will be set if the topic was viewed by an anonymous user)
#
# Table name: user_visits
#
#  id         :integer          not null, primary key
#  user_id    :integer          not null
#  visited_at :date             not null
#  posts_read :integer          default(0)
#  mobile     :boolean          default(FALSE) (will be set to TRUE if the user visited on a mobile device)
#  time_read  :integer          default(0), not null (the value returned is the number of seconds of the visit)

Here’s a link to a chat session where I tried it out: Total Hours Anonymous August

If you want to try expanding on this list, the easiest way to do it is to copy the schema from the bottom of the model you’re interested in, directly from the Discourse code: https://github.com/discourse/discourse/tree/main/app/models.

3 Likes

It would be great to get exactly this, but a comprehensive list of every table. Where did you get this from?

EDIT: Ah, I see your link at the bottom, sorry. I probably won’t have the time to put them all together. Was hoping there was a master SQL schema somewhere :confused:

The problem is that just providing the schema isn’t enough information for ChatGPT. You need to provide it with details about things like:

  • what the application_requests req_type integer codes stand for
  • what the topics user_id column is used for
  • what the user_actions table’s action_type codes stand for a what’s the difference between that table’s user_id, target_user_id, and acting_user_id columns

With those types of details, GPT 3.5 seems to do a good job without any additional training. The problem then becomes that to provide this level of detail about the entire database will result in the prompt exceeding the ChatGPT token limit (4096 tokens, including both the prompt text and the generated output.) If this type of approach was used, there’s need to be a way to limit what gets set in the prompt based on what information the user wanted to get from the Data Explorer query.

1 Like

That’s not been my experience at all. I’ve had a pretty enjoyable experience with it this evening. I’ll share some results later.

It gives me the feeling of a competent, but very entry level/juniornrole. It gets things close, I tell it what it did wrong, and it fixes it. Repeat.

I’m actually really enjoying what it’s giving me, and the time that it saves me based on my usage tonight.

1 Like

GPT 3.5 does 16k just fine today

The trouble starts happening when the requests are complicated, accounting for edge cases and so on

1 Like