This guide is a continuation of the Data Explorer Tutorial - Part 1 - Writing Your First Query.
Now that you’ve seen how the Data Explorer works in action, let’s talk about some SQL basics that will be helpful to know as you go about your data exploring adventures.
SQL Statements
All of the actions you’ll need to perform on Discourse’s database are done with SQL statements.
SQL keywords are NOT case sensitive:
select
is the same asSELECT
, but in this tutorial, and for best practice, we will write all SQL keywords in upper-case.
Here are some of the most common SQL statements you might use. Notice that we used some of these in our previous example query:
- SELECT: Used to select data from a database. The data returned is stored in a result table, called the result-set.
SELECT column1, column2, ...
FROM table_name;
- WHERE: Used to filter records.
SELECT column1, column2, ...
FROM table_name
WHERE condition;
- ORDER BY: Used to sort the result-set in ascending or descending order.
SELECT column1, column2, ...
FROM table_name
ORDER BY column1, column2, ... ASC|DESC;
- COUNT: Returns the number of rows that matches a specified criterion.
SELECT COUNT(column_name)
FROM table_name
WHERE condition;
- GROUP BY: Often used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result-set by one or more columns.
SELECT column_name, COUNT(*)
FROM table_name
GROUP BY column_name;
- HAVING: Used to filter the results of a GROUP BY operation. The HAVING clause is used in SQL because the WHERE keyword cannot be used with aggregate functions (Ex:
COUNT()
,MAX()
,MIN()
,SUM()
,AVG()
).
SELECT column_name, COUNT(*)
FROM table_name
GROUP BY column_name
HAVING COUNT(*) > value;
- IN: Used to check if a value is present in a set of values or returned by a subquery. It’s a shorthand for multiple
OR
conditions, making your SQL queries more concise and easier to read.
SELECT column_name(s)
FROM table_name
WHERE column_name IN (value1, value2, ...);
Semicolons are not necessary when writing SQL queries in the Data Explorer.
PostgresSQL
Discourse currently uses PostgreSQL version 13 for its database, an open-source relational database management system (RDBMS) that emphasizes extensibility and SQL compliance.
For a detailed understanding of PostgreSQL, you may want to read through the PostgreSQL Official Documentation. Specifically, you may find the sections on Queries, Data Types, and Functions particularly relevant when writing SQL queries.
Database Tables
The Discourse Database is made up of tables. Tables are the fundamental way to store data in a relational database management system.
Each table in Discourse is identified by a name (e.g. posts
or topics
), and each tables contain records (rows) with data. Columns in tables are commonly referred to as fields.
The Discourse database contains 240+ Tables!
The Data explorer lists the 9 most important tables first in the query edit UI panel, and you can see all of the tables column structure and types with a click:
Database Schema
In PostgreSQL, as well as in other relational databases, the concepts of Primary Key and Foreign Key are fundamental. They are used to establish relationships between tables and ensure the consistency and integrity of the data.
Primary Keys
A primary key is a column or a set of columns in a table that uniquely identifies each row in that table. No two rows in a table can have the same primary key value. You can’t have a null primary key value, and the value must be unique. The primary key is used to index the data in the table, making data retrieval much faster. In the Discourse schema explorer, the primary key of a table is always listed first.
Example: The primary key
of the posts
table is the id
field.
Foreign Keys
A foreign key is a column or a set of columns in a table that is used to establish a link between the data in two tables. It acts as a cross-reference between tables because it references the primary key of another table, thereby establishing a link between them. The table with the foreign key is called the child table, and the table with the primary key is called the referenced or parent table. In the Discourse schema explorer, a foreign key is indicated by the text fkey
followed by the name of the parent table.
Example: In the posts
table, the user_id
is a foreign key
of the users
table.
Using Primary and Foreign Keys
When you’re writing queries in the Data Explorer, you can use these primary and foreign keys to JOIN
tables together and get more complex data. For example, if you wanted to find all the posts made by a particular user, you could write a query like this:
SELECT p.id, p.created_at, p.raw
FROM posts p
JOIN users u ON p.user_id = u.id
WHERE u.username = 'username_here'
In this query, we’re using the user_id
foreign key in the posts
table to join with the id
primary key in the users
table. This lets us find all the posts made by a particular user.
SQL JOINS
As previously mentioned , the JOIN
statement is used to combine rows from two or more tables, based on a related column between them.
It’s important to note that there are several types of SQL JOIN
statements, each serving a different purpose:
-
INNER JOIN: The
INNER JOIN
keyword selects records that have matching values in both tables. It returns the rows from both tables where there is a match. UseINNER JOIN
when you only want to return the records where there is a match in both tables. WhenJOIN
is used without specifying the type of join (likeLEFT
,RIGHT
, orFULL
), it defaults toINNER JOIN
. -
LEFT (OUTER) JOIN: The
LEFT JOIN
keyword returns all records from the left table (table1), and the matched records from the right table (table2). The result is NULL from the right side, if there is no match. UseLEFT JOIN
when you want to return all the records from the left table and the matched records from the right table. If there is no match, the result is NULL on the right side. -
RIGHT (OUTER) JOIN: The
RIGHT JOIN
keyword returns all records from the right table (table2), and the matched records from the left table (table1). The result is NULL from the left side, when there is no match. UseRIGHT JOIN
when you want to return all the records from the right table and the matched records from the left table. If there is no match, the result is NULL on the left side. -
FULL (OUTER) JOIN: The
FULL JOIN
keyword returns all records when there is a match in either left (table1) or right (table2) table records. UseFULL JOIN
when you want to return all records when there is a match in one of the tables.
You can also combine multiple JOIN statements to join more multiple tables together. For example, we can join the topic_tags
table with the topics
table, and then join the tags
table with the topic_tags
table to gain access to the names of the tags associated with a topic.
SELECT
t.id AS topic_id,
tg.name as tag_name
FROM
topics t
JOIN topic_tags tt ON t.id = tt.topic_id
JOIN tags tg ON tg.id = tt.tag_id
WITH Statements and Subqueries
The SQL WITH
statement, also known as a Common Table Expression (CTE), is used to create a temporary result set that can be referenced within another SELECT statement. This can be particularly useful when working with complex SQL queries as it can help to simplify them, making them easier to read and maintain.
Here’s a basic example of a WITH
statement:
WITH post_counts AS (
SELECT user_id, COUNT(*) as post_count
FROM posts
GROUP BY user_id
)
SELECT u.username, post_counts.post_count
FROM users u
JOIN post_counts ON u.id = post_counts.user_id
ORDER BY post_counts.post_count DESC;
In this query, the WITH
statement creates a temporary table post_counts
that contains each user’s ID and their total post count. The main query then joins this table with the users
table to get the usernames associated with each ID, and orders the result by the post count in descending order.
This query would return the number of posts each user has made.
Subqueries are queries nested inside another query. A subquery can be used inside SELECT statements or inside another subquery. A subquery can return a set of records, a single record, or a single value.
Here’s an example of a subquery:
SELECT u.username
FROM users u
WHERE u.id IN (
SELECT p.user_id
FROM posts p
GROUP BY p.user_id
HAVING COUNT(p.id) > 100
)
In this example, the subquery (SELECT p.user_id FROM posts p GROUP BY p.user_id HAVING COUNT(p.id) > 100)
selects the user_id
from the posts
table for users who have made more than 100 posts. The main query then gets the username
from the users
table for those user_id
s.
This query would return a list of usernames for users who have made more than 100 posts.