Discourse AI + Data Explorer?

I wish I had more time to peruse the code base, but do you have the SQL stored somewhere (publicly) for the crest table schemas?

I’m doing some LLM work with my own product/company right now and would love to play with this.

1 Like

oh this is great. thanks for the link Falco! :slight_smile:

3 Likes

I just generate this direct from the schema

Something like this should get you started:


I have been exploring all sorts of approaches for getting this going. One problem though is that GPT 3.5 is just so steerable it gets confused.

Giving it schema certainly helps, but I am finding that I need to be ultra strategic with examples.

Just telling it a story about “how to count the archived topics for a user” ends up steering a query such as “how many posts did a user have” towards archived topics.

I was thinking:

graph TD
A[Add generic helpful information] --> B[Generate embeddings for question]
B --> C[Do a similarity search across examples]
C --> D[Add 3 examples to the prompt]
D --> E[Add schema to the prompt]
E --> F[Ask question]
F --> G[Get SQL]

But it turns out that unless the examples are spectacular and you do not include any counter examples … stuff gets bad.

I am starting to think that the best we have for now is just giving it schema and a preamble. Maybe adding another step at the end for query validation and retry.

5 Likes

Would you have a complete schema that shows data types?

Not sure if it fits into token budget, will try

With ChatCPT 3.5 I’m getting unexpected results when I provide the full schema. For example, it keeps trying to find the topic author in the topic_users table.

I get successful results by limiting the information in the prompt to the tables and columns that are required to write the query. Also by adding some additional information about how the columns are used. For example:

# Table name: user_actions
#
#  id              :integer          not null, primary key
#  action_type     :integer          not null (:like=>1,:was_liked=>2,:new_topic=>4,:reply=>5,:response=>6,:mention=>7,:quote=>9,:edit=>11,:new_private_message=>12,:got_private_message=>13,:solved=>15,:assigned=>16)
#  user_id         :integer          not null (the user who is to be credited with the action)
#  target_topic_id :integer
#  acting_user_id  :integer (the user who performed the action, for example, a staff user can perform an action on behalf of a regular user)
#  created_at      :datetime         not null

Another way to look at getting this to work with the Data Explorer plugin would be to have users fill out a dynamically generate form that lists the data they are looking for, and what conditions they want to apply to the data. The prompt could be generated programmatically by Discourse, then sent to the LLM to have the query written.

Edit: @sp-jordan-violet, this might be a useful starting point. Here’s an annotated partial schema that’s working for me. It works well as long as I limit my queries so that they can be answered by the provided schema. I’ve added additional details to the schema to clarify details that seem to confuse ChatGPT. The obvious downside of this approach is that a fully annotated schema will exceed ChatGPT 3.5’s token limit. I’ve included some information in the schema that could be omitted to reduce the number of tokens that are used.

Summary
# == Schema Information
#
# Table name: application_requests
#
#  id       :integer          not null, primary key
#  date     :date             not null
#  req_type :integer          not null ("http_total"=>0,"http_2xx"=>1,"http_background"=>2,"http_3xx"=>3,"http_4xx"=>4,"http_5xx"=>5,"page_view_crawler"=>6,"page_view_logged_in"=>7,"page_view_anon"=>8,"page_view_logged_in_mobile"=>9,"page_view_anon_mobile"=>10,"api"=>11,"user_api"=>12)
#  count    :integer          default(0), not null
#
# Table name: users
#
#  id                        :integer          not null, primary key
#  username                  :string(60)       not null
#  created_at                :datetime         not null
#  updated_at                :datetime         not null
#  name                      :string           (the user's real name)
#  last_posted_at            :datetime
#  active                    :boolean          default(FALSE), not null
#  username_lower            :string(60)       not null
#  last_seen_at              :datetime
#  admin                     :boolean          default(FALSE), not null
#  trust_level               :integer          not null
#  approved                  :boolean          default(FALSE), not null
#  approved_by_id            :integer
#  approved_at               :datetime
#  previous_visit_at         :datetime
#  suspended_at              :datetime
#  suspended_till            :datetime
#  date_of_birth             :date
#  ip_address                :inet
#  moderator                 :boolean          default(FALSE)
#  title                     :string
#  locale                    :string(10)
#  primary_group_id          :integer
#  registration_ip_address   :inet
#  staged                    :boolean          default(FALSE), not null
#  first_seen_at             :datetime
#  silenced_till             :datetime
#
# Table name: topics
#
#  id                        :integer          not null, primary key
#  title                     :string           not null
#  last_posted_at            :datetime
#  created_at                :datetime         not null
#  updated_at                :datetime         not null
#  views                     :integer          default(0), not null
#  posts_count               :integer          default(0), not null
#  user_id                   :integer          (the id of the user who created the topic)
#  last_post_user_id         :integer          not null (the id of the user who created the last post in the topic)
#  reply_count               :integer          default(0), not null
#  deleted_at                :datetime
#  highest_post_number       :integer          default(0), not null
#  like_count                :integer          default(0), not null
#  category_id               :integer
#  visible                   :boolean          default(TRUE), not null
#  moderator_posts_count     :integer          default(0), not null
#  closed                    :boolean          default(FALSE), not null
#  archived                  :boolean          default(FALSE), not null
#  bumped_at                 :datetime         not null
#  archetype                 :string           default("regular"), not null (can be set to either "regular" or "private_message")
#  slug                      :string
#  deleted_by_id             :integer          (the id of the user who deleted the topic)
#  participant_count         :integer          default(1)
#  word_count                :integer
#  excerpt                   :string
#  highest_staff_post_number :integer          default(0), not null
#
# Table name: posts
#
#  id                      :integer          not null, primary key
#  user_id                 :integer          (the id of the user who created the post)
#  topic_id                :integer          not null
#  post_number             :integer          not null (indicates the post's order in its topic)
#  raw                     :text             not null (the post's content)
#  created_at              :datetime         not null
#  updated_at              :datetime         not null
#  reply_to_post_number    :integer          (the post_number that the post is a reply to)
#  reply_count             :integer          default(0), not null
#  deleted_at              :datetime
#  like_count              :integer          default(0), not null
#  bookmark_count          :integer          default(0), not null
#  reads                   :integer          default(0), not null (the number of times the post has been read)
#  post_type               :integer          default(1), not null (:regular=>1, :moderator_action=>2, :small_action=>3, :whisper=>4)
#  last_editor_id          :integer          (the id of the user who last edited the post)
#  hidden                  :boolean          default(FALSE), not null
#  hidden_reason_id        :integer          (:flag_threshold_reached=>1,:flag_threshold_reached_again=>2,:new_user_spam_threshold_reached=>3,:flagged_by_tl3_user=>4,:email_spam_header_found=>5,:flagged_by_tl4_user=>6,:email_authentication_result_header=>7,:imported_as_unlisted=>8)
#  edit_reason             :string
#  word_count              :integer
#  wiki                    :boolean          default(FALSE), not null
#
# Table name: categories
#
#  id                                        :integer          not null, primary key
#  name                                      :string(50)       not null
#  topic_id                                  :integer          (the id of the topic that is used for the category's description)
#  topic_count                               :integer          default(0), not null
#  created_at                                :datetime         not null
#  updated_at                                :datetime         not null
#  user_id                                   :integer          not null (the id of the user who created the topic)
#  topics_year                               :integer          default(0)
#  topics_month                              :integer          default(0)
#  topics_week                               :integer          default(0)
#  slug                                      :string           not null
#  description                               :text
#  text_color                                :string(6)        default("FFFFFF"), not null
#  read_restricted                           :boolean          default(FALSE), not null
#  auto_close_hours                          :float
#  post_count                                :integer          default(0), not null
#  latest_post_id                            :integer
#  latest_topic_id                           :integer
#  position                                  :integer
#  parent_category_id                        :integer
#  posts_year                                :integer          default(0)
#  posts_month                               :integer          default(0)
#  posts_week                                :integer          default(0)
#  topics_day                                :integer          default(0)
#  posts_day                                 :integer          default(0)
#
# Table name: groups
#
#  id                                 :integer          not null, primary key
#  name                               :string           not null
#  created_at                         :datetime         not null
#  automatic                          :boolean          default(FALSE), not null
#  user_count                         :integer          default(0), not null
#  title                              :string
#  bio_raw                            :text             (the group's description)
#  allow_membership_requests          :boolean          default(FALSE), not null
#  full_name                          :string
#  visibility_level                   :integer          default(0), not null (who can see the group :public=>0, :logged_on_users=>1, :members=>2, :staff=>3, :owners=>4)
#  messageable_level                  :integer          default(0) (who can message the group :public=>0, :logged_on_users=>1, :members=>2, :staff=>3, :owners=>4)
#  mentionable_level                  :integer          default(0) (who can mention the group :public=>0, :logged_on_users=>1, :members=>2, :staff=>3, :owners=>4)
#  members_visibility_level           :integer          default(0), not null (who see the group's members :public=>0, :logged_on_users=>1, :members=>2, :staff=>3, :owners=>4)
#
# Table name: group_users
#
#  id                 :integer          not null, primary key
#  group_id           :integer          not null
#  user_id            :integer          not null
#  created_at         :datetime         not null
#
# Table name: user_actions
#
#  id              :integer          not null, primary key
#  action_type     :integer          not null (:like=>1,:was_liked=>2,:new_topic=>4,:reply=>5,:response=>6,:mention=>7,:quote=>9,:edit=>11,:new_private_message=>12,:got_private_message=>13,:solved=>15,:assigned=>16)
#  user_id         :integer          not null (the user who is to be credited with the action)
#  target_topic_id :integer
#  target_post_id  :integer
#  target_user_id  :integer          (for example, the id of the user whose post was liked)
#  acting_user_id  :integer          (the user who performed the action, for example, a staff user can perform an action on behalf of a regular user)
#  created_at      :datetime         not null
#  updated_at      :datetime         not null
#
# Table name: topic_views
#
#  topic_id   :integer          not null
#  viewed_at  :date             not null
#  user_id    :integer         (will be set if the topic was viewed by a logged in user)
#  ip_address :inet            (will be set if the topic was viewed by an anonymous user)
#
# Table name: user_visits
#
#  id         :integer          not null, primary key
#  user_id    :integer          not null
#  visited_at :date             not null
#  posts_read :integer          default(0)
#  mobile     :boolean          default(FALSE) (will be set to TRUE if the user visited on a mobile device)
#  time_read  :integer          default(0), not null (the value returned is the number of seconds of the visit)

Here’s a link to a chat session where I tried it out: Total Hours Anonymous August

If you want to try expanding on this list, the easiest way to do it is to copy the schema from the bottom of the model you’re interested in, directly from the Discourse code: https://github.com/discourse/discourse/tree/main/app/models.

3 Likes

It would be great to get exactly this, but a comprehensive list of every table. Where did you get this from?

EDIT: Ah, I see your link at the bottom, sorry. I probably won’t have the time to put them all together. Was hoping there was a master SQL schema somewhere :confused:

The problem is that just providing the schema isn’t enough information for ChatGPT. You need to provide it with details about things like:

  • what the application_requests req_type integer codes stand for
  • what the topics user_id column is used for
  • what the user_actions table’s action_type codes stand for a what’s the difference between that table’s user_id, target_user_id, and acting_user_id columns

With those types of details, GPT 3.5 seems to do a good job without any additional training. The problem then becomes that to provide this level of detail about the entire database will result in the prompt exceeding the ChatGPT token limit (4096 tokens, including both the prompt text and the generated output.) If this type of approach was used, there’s need to be a way to limit what gets set in the prompt based on what information the user wanted to get from the Data Explorer query.

1 Like

That’s not been my experience at all. I’ve had a pretty enjoyable experience with it this evening. I’ll share some results later.

It gives me the feeling of a competent, but very entry level/juniornrole. It gets things close, I tell it what it did wrong, and it fixes it. Repeat.

I’m actually really enjoying what it’s giving me, and the time that it saves me based on my usage tonight.

1 Like

GPT 3.5 does 16k just fine today

The trouble starts happening when the requests are complicated, accounting for edge cases and so on

1 Like

Can you share some examples of edge cases?

1 Like

Sure I plan to do another round next week, you can try the ruby script in the repo I shared

@simon I think the real value is understanding that, at least for the foreseeable future, this will never just get something right the first time. But if you know what you want, you can guide it there like a relentless intern that can go and do the grunt work.

So, without having any current knowledge of SQL beyond remembering select from where, and knowing what I wanted my end result to be, I got it to build the query I wanted just by having a conversation with it on the side, and not really taking away from me doing my day job. It really is like having a free personal assistant/intern that I just need to keep guiding in the right direction.

First and foremost, here’s my final query. I wanted a query that would return the top 100 users with the most total likes, and give me their user, username, count of total links, and a link to their one post that had the single most likes. I did not/do not know how to get this at all, and frankly don’t even know where to begin. When I want something like this, I go bother one of my engineers. I was able to not bother them, not slow down from my own work really, and still guide/instruct ChatGPT to get done what I needed done.

Final query:

WITH Most_Liked_Posts AS (
  SELECT
    p.user_id,
    p.topic_id,
    p.post_number,
    ROW_NUMBER() OVER (PARTITION BY p.user_id ORDER BY likes_count DESC) AS row_number
  FROM (
    SELECT
      p.user_id,
      p.topic_id,
      p.post_number,
      COUNT(l.id) AS likes_count
    FROM
      posts p
      LEFT JOIN post_actions l ON p.id = l.post_id AND l.post_action_type_id = 2
    WHERE
      p.user_id NOT IN (SELECT id FROM users WHERE username = 'codey')
    GROUP BY
      p.user_id,
      p.topic_id,
      p.post_number
  ) p
),
User_Likes AS (
  SELECT
    u.id AS user_id,
    COUNT(pa.id) AS total_likes
  FROM
    users u
    LEFT JOIN posts p ON u.id = p.user_id
    LEFT JOIN post_actions pa ON p.id = pa.post_id AND pa.post_action_type_id = 2
  WHERE
    u.username != 'codey'
  GROUP BY
    u.id
)
SELECT
  ul.user_id,
  u.username AS username,
  ul.total_likes,
  '<a href="/discuss/t/' || mlp.topic_id || '/' || mlp.post_number || '">' || 'Link to most-liked post' || '</a>' AS html$post
FROM
  User_Likes ul
  JOIN users u ON ul.user_id = u.id
  LEFT JOIN Most_Liked_Posts mlp ON ul.user_id = mlp.user_id AND mlp.row_number = 1
ORDER BY
  ul.total_likes DESC
LIMIT 100

Instructions in my ChatGPT account before starting the prompt:

I am only going to ask you quested related to PostgreSQL, and I only want you to respond with relevant SQL queries.

These queries are all related to databases in my community platform, Discourse.
Respond with SQL queries, and explanations of the queries. Do not use any semicolons to end SQL statements as they are not needed.

Follow my entire conversation on ChatGPT building this query here:

https://chat.openai.com/share/6a62ed59-b993-4306-8793-9aaef76f03e9

3 Likes

@sam with a complete schema like the above (again, sorry, I don’t know where to actually get this and/or get it in this format) and using langchain and a vector database to process the documents correctly before sending them into ChatGPT, plus any docs you may have on using the Data Explorer…I’m pretty confident this process would be pretty close to magic.

I tried a vector db and feeding it similar examples, it steers it too hard, we are probably going to need 1000 very close examples to be magic

1 Like

I’ll have my engineer try it as well as we’ve built a sort of “factory” to produce these pretty quickly.

Is there somewhere that I can get a complete and exhaustive document of all db schemas in this format?

# == Schema Information
#
# Table name: application_requests
#
#  id       :integer          not null, primary key
#  date     :date             not null
#  req_type :integer          not null ("http_total"=>0,"http_2xx"=>1,"http_background"=>2,"http_3xx"=>3,"http_4xx"=>4,"http_5xx"=>5,"page_view_crawler"=>6,"page_view_logged_in"=>7,"page_view_anon"=>8,"page_view_logged_in_mobile"=>9,"page_view_anon_mobile"=>10,"api"=>11,"user_api"=>12)
#  count    :integer          default(0), not null
#
# Table name: users
#
#  id                        :integer          not null, primary key
#  username                  :string(60)       not null
#  created_at                :datetime         not null
#  updated_at                :datetime         not null
#  name                      :string           (the user's real name)
#  last_posted_at            :datetime
#  active                    :boolean          default(FALSE), not null
#  username_lower            :string(60)       not null
#  last_seen_at              :datetime
#  admin                     :boolean          default(FALSE), not null
#  trust_level               :integer          not null
#  approved                  :boolean          default(FALSE), not null
#  approved_by_id            :integer
#  approved_at               :datetime
#  previous_visit_at         :datetime
#  suspended_at              :datetime
#  suspended_till            :datetime
#  date_of_birth             :date
#  ip_address                :inet
#  moderator                 :boolean          default(FALSE)
#  title                     :string
#  locale                    :string(10)
#  primary_group_id          :integer
#  registration_ip_address   :inet
#  staged                    :boolean          default(FALSE), not null
#  first_seen_at             :datetime
#  silenced_till             :datetime
1 Like

That format is super wasteful token wise, but you can just query it out of the pg schema tables in data explorer :slight_smile:

2 Likes

Haha okay I’ll have him check it out. Like I said, I’m not the one to take this much further. If I were, I’d probably not be asking these questions!

1 Like

I really want to get something working, but this is a very hard problem

3 Likes

This is just an idea, I have not tried this.

The queries needed IMO break down into uses based on duty

  • moderator
  • admin
  • developer

As such the tables needed can be grouped into ever growing sets with the smallest set being the tables needed by a moderator.

Now, much of the data needed by a moderator will be based on common joins of tables with specific columns being needed, thus a view.

If the common views are used instead of the entire schema, then hopefully many of the prompts would only need to pass along the views and not the entire schema and thus make it much easier for the LLM to generate a possible solution.

HTH


For the much harder queries, odds are if you know enough to need such a query you know enough to build the query.

1 Like