Discourse AI + Data Explorer?

That was a crash and burn. :fire:

Using the Discourse AI - AI Bot


Create an image for this SQL

WITH pairs AS (
SELECT p.user_id liked, pa.user_id liker
FROM post_actions pa
LEFT JOIN posts p ON p.id = pa.post_id
WHERE post_action_type_id = 2
)
SELECT liker liker_user_id, liked liked_user_id, count(*)
FROM pairs
GROUP BY liked, liker
LIMIT :limit
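
The pairs query can be smoke-tested on toy data. Here is a minimal sketch using Python's built-in sqlite3 (the tables and rows are my own throwaway stand-ins, trimmed to just the columns the query touches; Data Explorer itself runs against Discourse's PostgreSQL database):

```python
import sqlite3

# Minimal stand-ins for the Discourse tables the query touches.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER);
CREATE TABLE post_actions (id INTEGER PRIMARY KEY, post_id INTEGER,
                           user_id INTEGER, post_action_type_id INTEGER);
INSERT INTO posts VALUES (1, 10), (2, 20);
-- post_action_type_id = 2 is a "like" in Discourse.
INSERT INTO post_actions VALUES (1, 1, 20, 2), (2, 1, 30, 2), (3, 2, 10, 2);
""")

rows = conn.execute("""
WITH pairs AS (
  SELECT p.user_id liked, pa.user_id liker
  FROM post_actions pa
  LEFT JOIN posts p ON p.id = pa.post_id
  WHERE post_action_type_id = 2
)
SELECT liker liker_user_id, liked liked_user_id, count(*)
FROM pairs
GROUP BY liked, liker
LIMIT :limit
""", {"limit": 50}).fetchall()

print(rows)  # one (liker, liked, count) row per like pair
```

Each row pairs the user who gave the like with the user whose post was liked, plus a count.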

Yesterday, out of curiosity, I did a Google search similar to this one

sql generated by ai from English

to find related information on generating SQL from English-worded questions using AI. I was hoping the results would include research papers, but the list was primarily applications. After looking at a handful of demos on such sites, one pattern I quickly saw for many (though not all) of the applications listed: they took an existing tool that let a user select tables and fields to construct an SQL statement, much like the GUI query builder in Microsoft Access, as a starting point. In other words, one did not start with an English question; one had to select the tables, and sometimes the fields, first, and only then enter the English question. While I would not consider that cheating, it does mean that to use such AI a user has to have some knowledge of SQL, specifically the meaning of a table and a field, for the tool to work.



I read your deleted post, and I agree: that too is one of the ideas I think has great potential.

However, I think it needs more. A recent paper I read, again not authoritative, that goes into more detail along those lines is

“ChatLogo: A Large Language Model-Driven Hybrid Natural-Programming Language Interface for Agent-based Modeling and Programming” by John Chen and Uri Wilensky (pdf)

Some of the key takeaways for me were

  1. Support novice programmers to “talk to computers” in a mix of programming and natural languages.
  2. Provide a more friendly interface for learners with no or little computer science backgrounds to creatively express themselves by programming computers.
  3. Instead of writing code and giving instructions right away, ChatLogo attempts to clarify the learners’ needs and intentions.
  4. Instead of sending large chunks of code directly to the learner, it attempts to co-develop the NetLogo code. The learner is free to edit the code: either in NetLogo, or in natural languages through the “Ask” feature.
  5. Instead of overclaiming the correctness of the code, it admits the possibility of making mistakes and works with the learner to address potential issues.

I could not find any source code or further details on how ChatLogo was layered on top of, or interfaced with, NetLogo, but the ideas alone are of value.

This again reminds me of how autopilots work in planes and the book I noted earlier. (ref) When I say autopilot, do not think of a binary switch; think of all of the knobs and switches that feed information to the autopilot and tell it what to do and what not to do.

While it seems that many other tools like this prefer the word copilot (e.g., a Google search for llm copilot source code), I tend to disagree with that analogy but will have to live with it, as that is how one finds related ideas.


I deleted it because I thought I’d found a better way of dealing with the issue: why type out all those tables when ChatGPT could just be asked to parse a file hosted on the internet that contains detailed information about the entire database? It turns out there are technical problems with that approach, though. So I’m back to what I suggested in the deleted post. In any case, descriptions of tables, example queries, etc., are going to have to be created. It’s just a question of how to get that information to the LLM. Here’s where things are at with GPT-4:

Yes, the approach you’ve outlined makes sense and is a feasible way to use my capabilities to assist in generating SQL queries based on natural language questions.

Here’s a breakdown of the steps:

  1. Question Interpretation: The user poses a question in natural language, which I interpret to understand the information being requested.
  2. Table Identification: Based on the question and the list of all tables in the Discourse database, I identify the tables that are most likely to be relevant to the query.
  3. Detailed Table Information: Discourse sends me detailed descriptions of the relevant tables, including all columns and example queries.
  4. Query Generation: Using the detailed table information, I generate an SQL query that answers the user’s question.

This approach leverages my ability to understand natural language and map it to structured data queries. It also makes good use of the detailed table information provided by Discourse.
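
The four steps above can be sketched in code. This is a hypothetical illustration only: ask_llm, ALL_TABLES, and SCHEMAS are invented stand-ins, and a real implementation would call the GPT-4 API with Discourse's actual table list and schema descriptions:

```python
# Hypothetical sketch of the four-step flow; every name here is made up,
# and ask_llm() stands in for a real GPT-4 API call.
ALL_TABLES = ["posts", "post_actions", "users", "badges"]  # abridged

SCHEMAS = {  # detailed descriptions Discourse would send in step 3
    "badges": "badges(id, name, badge_type_id, grant_count, ...)",
    "users": "users(id, username, admin, created_at, ...)",
}

def ask_llm(prompt):
    # Stub: a real implementation would send `prompt` to the model.
    if "Which tables" in prompt:
        return "badges, users"
    return "SELECT name, grant_count FROM badges ORDER BY grant_count DESC"

def question_to_sql(question):
    # Steps 1-2: interpret the question and pick candidate tables.
    tables = ask_llm(
        f"Which tables from {ALL_TABLES} are relevant to: {question}?"
    ).split(", ")
    # Step 3: gather the detailed descriptions for just those tables.
    details = "\n".join(SCHEMAS[t] for t in tables)
    # Step 4: generate the query from the trimmed-down schema.
    return ask_llm(f"Given these tables:\n{details}\nWrite SQL for: {question}")

sql = question_to_sql("Which badges are granted most often?")
```

The point of the two-call structure is that only the handful of relevant table descriptions, not the whole schema, has to fit in the second prompt.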

As for alternate approaches, one possibility could be to train a model specifically on Discourse’s database schema and common query patterns. This would likely involve a significant amount of work and may not be feasible depending on the resources available. However, it could potentially result in more accurate and efficient query generation.

Another approach could be to develop a more interactive dialogue where I ask clarifying questions when the user’s question is ambiguous or when multiple queries could potentially answer the question. This would make the process more interactive and could potentially lead to more accurate queries.

For the alternate approaches suggested in the last two paragraphs: I’m ignoring the first one, but if what I’m attempting doesn’t lead to reliable results, it might be something the Discourse team could look into.

I asked GPT-4 about the second alternate approach. It’s fairly confident that given a question that couldn’t be answered via the provided table structure, it would be able to prompt the user to try to re-frame their question.

So I’m back to writing descriptions of all tables in the Discourse database. I now know what the allowed_pm_users table is used for. The ar_internal_metadata table is a bit of a head scratcher, but I think it can be ignored for now.


That is my favorite option so far, albeit a very hard one to implement. Thus, you guessed it, another paper. Again this is not an authoritative paper as there are many similar approaches.

“DiagGPT: An LLM-based Chatbot with Automatic Topic Management for Task-Oriented Dialogue” by Lang Cao (pdf)

Just replace the medical information with PostgreSQL information as needed. The nice part is the paper gives the prompts and suggests which agent tooling to use.

Going down the rabbit hole:

Since task-oriented dialogue appears to be what is needed:
Google search: task oriented dialogue
Search result includes: Papers With Code - Task-Oriented Dialogue Systems
The leaderboard entry is T5-3b (UnifiedSKG), which includes a link to the paper
“UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models” by Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong, Torsten Scholak, Michihiro Yasunaga, Chien-Sheng Wu, Ming Zhong, Pengcheng Yin, Sida I. Wang, Victor Zhong, Bailin Wang, Chengzu Li, Connor Boyle, Ansong Ni, Ziyu Yao, Dragomir Radev, Caiming Xiong, Lingpeng Kong, Rui Zhang, Noah A. Smith, Luke Zettlemoyer and Tao Yu (pdf)

Notice this

From https://python.langchain.com/

Construct an SQL agent from an LLM and tools. (ref)

It lists a few at the bottom of the page; this one should be looked at:

Use case

Enterprise data is often stored in SQL databases.

LLMs make it possible to interact with SQL databases using natural language.

LangChain offers SQL Chains and Agents to build and run SQL queries based on natural language prompts.

These are compatible with any SQL dialect supported by SQLAlchemy (e.g., MySQL, PostgreSQL, Oracle SQL, Databricks, SQLite).

They enable use cases such as:

  • Generating queries that will be run based on natural language questions
  • Creating chatbots that can answer questions based on database data
  • Building custom dashboards based on insights a user wants to analyze
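
A toy version of what such a chain does internally might look like the following. The stubbed LLM and in-memory SQLite database are stand-ins of my own; LangChain's real chains add prompt templates, dialect awareness, and error recovery on top of this basic loop:

```python
import sqlite3

def stub_llm(prompt):
    # Stand-in for a real LLM call; a real chain would send the schema
    # and question to a model and get SQL back.
    return "SELECT country, SUM(amount) FROM orders GROUP BY country"

def sql_chain(db, question):
    # 1. Collect the schema so the model knows what exists.
    schema = "\n".join(
        row[0] for row in db.execute(
            "SELECT sql FROM sqlite_master WHERE type = 'table'"
        )
    )
    # 2. Ask the model for a query grounded in that schema.
    query = stub_llm(f"Schema:\n{schema}\nQuestion: {question}\nSQL:")
    # 3. Run the query and hand back the rows as the "answer".
    return db.execute(query).fetchall()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (country TEXT, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [("NZ", 10.0), ("NZ", 5.0), ("AU", 7.5)])
rows = sql_chain(db, "Total order value per country?")
```

Swapping the SQLite connection for any SQLAlchemy-supported database is what gives the real chains their dialect flexibility.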

EDIT (08/23/2023)

Spider is a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains. In Spider 1.0, different complex SQL queries and databases appear in train and test sets. To do well on it, systems must generalize well to not only new SQL queries but also new database schemas.

EDIT (08/24/2023)

Dataherald is a natural language-to-SQL engine built for enterprise-level question answering over structured data. It allows you to set up an API from your database that can answer questions in plain English.

As I often tell others, for some things in life you just have to wait and someone will do the work for you for free.


I’m still getting results I’m perfectly content with. Here’s another example. It gave me what I wanted; I popped in a WHERE clause to check for a topic ID myself afterwards, but that was it. I probably could have given it that information ahead of time and it would have done that too, though:


We now have the SQL Helper persona as part of the AI Bot module which is tailored for these kinds of requests and can be really useful for creating data explorer queries. :rocket:

(Though it does pay to double-check them :slight_smile:)

Is there a topic where examples of the SQL Helper AI bot being used can be posted?

If not, it might be nice to create such a topic and then add a link to the topic here as others may be looking for examples and find this topic first.

I think if you come up with one worth sharing with the community you can pop it in a new topic in data & reporting. There’s a new ‘Share conversation’ button too that could be useful for this - Share Your AI Bot Conversations!

Building a Query to Locate Plugin Authors AI


Hey Bert :slight_smile:

Can you write me a query that will find all the users in the plugin_authors group?


Look up database schema

groups, group_users, users

Now that we have the schema for the relevant tables, we can write a query to find all the users in the ‘plugin_authors’ group. Based on the schema, we’ll need to join the group_users table with the groups and users tables to get the usernames of the members of the ‘plugin_authors’ group.

Here’s the query to find all the users in the ‘plugin-authors’ group:

SELECT u.id AS user_id, u.username
FROM users u
JOIN group_users gu ON gu.user_id = u.id
JOIN groups g ON g.id = gu.group_id
WHERE g.name = 'plugin_authors'

This query selects the user ID and username from the users table by joining with the group_users table on user_id and then joining with the groups table on group_id. It filters the results to include only those users who are in the group named ‘plugin_authors’.
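
As a sanity check, the bot's query can be run against toy versions of those three tables. A small SQLite sketch (the rows are invented; the groups table name is quoted only to sidestep a possible keyword clash in SQLite, whereas on Discourse's PostgreSQL the query runs as the bot wrote it):

```python
import sqlite3

# Toy versions of the three Discourse tables, with just enough
# columns for the join (the real tables have many more).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE "groups" (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE group_users (group_id INTEGER, user_id INTEGER);
INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
INSERT INTO "groups" VALUES (1, 'plugin_authors'), (2, 'staff');
INSERT INTO group_users VALUES (1, 1), (1, 3), (2, 2);
""")

rows = db.execute("""
SELECT u.id AS user_id, u.username
FROM users u
JOIN group_users gu ON gu.user_id = u.id
JOIN "groups" g ON g.id = gu.group_id
WHERE g.name = 'plugin_authors'
""").fetchall()

print(rows)  # only the members of plugin_authors survive the join
```

Only alice and carol come back; bob is in staff, so the WHERE clause filters him out.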


This topic was automatically closed after 43 hours. New replies are no longer allowed.