I’ve tried that. The results are all over the place. An interesting experiment is to provide GPT-3.5 with a minimal annotated version of the Discourse database schema just to test its SQL ability. I realize this is inefficient in terms of tokens, but it’s readable:
Minimal schema
# == Schema Information
#
# Table name: users
#
# id :integer not null, primary key
# username :string(60) not null
# created_at :datetime not null
#
# Table name: groups
#
# id :integer not null, primary key
# name :string not null
# created_at :datetime not null
#
# Table name: group_users
#
# id :integer not null, primary key
# group_id :integer not null
# user_id :integer not null
#
# Table name: posts
#
# id :integer not null, primary key
# user_id :integer
# topic_id :integer not null
# deleted_at :datetime (The application "soft deletes" posts. When a post is deleted, its `deleted_at` property is set to a :datetime. Unless explicity asked to return deleted posts, make sure the `deleted_at` column is `NOT NULL` when writing queries that are asking for data related to posts.)
#
# Table name: topics
#
# id :integer not null, primary key
# title :string not null
# category_id :integer
# created_at :datetime not null
# user_id :integer (the id of the user who created the topic)
# deleted_at :datetime (The application "soft deletes" topics. When a topic is deleted, its `deleted_at` property is set to a :datetime. Unless explicity asked to return deleted topics, make sure the `deleted_at` column is `NOT NULL` when writing queries that are asking for data related to topics.)
#
# Table name: categories
#
# id :integer not null, primary key
# name :string(50) not null
So far, the results I’m getting are impressive. This gets me thinking that the problem could be solved just by providing an annotated schema in the prompt.
For fun, I asked it what kinds of queries I could expect it to struggle with:
GPT-3.5 limitations
- Hierarchical Queries: Queries involving hierarchical data structures like recursive relationships can be complex. For example, querying for parent-child relationships within a category hierarchy.
- Temporal Queries: Queries involving time-based data, such as calculating historical trends or analyzing data over specific time intervals, might require more intricate SQL constructs.
- Window Functions: While I’ve used window functions in previous queries, more advanced uses of window functions for complex analytical tasks might be challenging.
- Geospatial Queries: Spatial databases and querying based on geographical data, like finding nearby locations or complex spatial relationships, might be complex.
- Complex Aggregations: Advanced aggregations involving multiple levels of grouping and calculations might require more intricate SQL knowledge.
- Optimization Queries: Writing highly optimized queries that involve indexes, query tuning, and performance improvements can be intricate.
- Custom Algorithms: Implementing custom algorithms or logic within a SQL query might be challenging due to the complexity of the logic involved.
- Advanced JOINs: Complex join conditions or combining multiple joins might require careful consideration and possibly more complex SQL constructs.
- Subqueries and Correlated Subqueries: While I can handle subqueries, extremely complex or deeply nested subqueries might be more challenging.
- Pivoting and Unpivoting: Transforming data between wide and tall formats (pivoting and unpivoting) can become complex in certain scenarios.
Where I’m finding problems is with attempting to disambiguate the full database schema. For example, finding a way of annotating the user_actions
table. Just providing definitions of its action_type
codes isn’t enough. It starts guessing about user_id
, target_user_id
, and acting_user_id
.
The most commonly asked for queries don’t make use of most of the tables and columns in the database. If AI is added to the Data Explorer, it might be worth looking at having “basic” and “advanced” modes. Basic mode could supply a prompt that covers most use cases. Advanced mode could allow users to select what information is set in the prompt.
It might be interesting to work backwards from a few requests for queries on meta, to see what would need to be supplied to the prompt in order to have GPT-3.5 successfully create the query.