Data Explorer Tutorial - Part 2 - Discourse SQL Basics

:discourse: This guide is a continuation of the Data Explorer Tutorial - Part 1 - Writing Your First Query.

Now that you’ve seen how the Data Explorer works in action, let’s talk about some SQL basics that will be helpful to know as you go about your data exploring adventures.

SQL Statements

All of the actions you’ll need to perform on Discourse’s database are done with SQL statements.

:discourse: SQL keywords are NOT case sensitive: select is the same as SELECT, but in this tutorial, and for best practice, we will write all SQL keywords in upper-case.

Here are some of the most common SQL statements you might use. Notice that we used some of these in our previous example query:

  • SELECT: Used to select data from a database. The data returned is stored in a result table, called the result-set.
SELECT column1, column2, ...
FROM table_name;
  • WHERE: Used to filter records.
SELECT column1, column2, ...
FROM table_name
WHERE condition;
  • ORDER BY: Used to sort the result-set in ascending or descending order.
SELECT column1, column2, ...
FROM table_name
ORDER BY column1, column2, ... ASC|DESC;
  • COUNT: Returns the number of rows that matches a specified criterion.
SELECT COUNT(column_name)
FROM table_name
WHERE condition;
  • GROUP BY: Often used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result-set by one or more columns.
SELECT column_name, COUNT(*)
FROM table_name
GROUP BY column_name;
  • HAVING: Used to filter the results of a GROUP BY operation. The HAVING clause is used in SQL because the WHERE keyword cannot be used with aggregate functions (Ex: COUNT() , MAX() , MIN() , SUM() , AVG()).
SELECT column_name, COUNT(*)
FROM table_name
GROUP BY column_name
HAVING COUNT(*) > value;
  • IN: Used to check if a value is present in a set of values or returned by a subquery. It’s a shorthand for multiple OR conditions, making your SQL queries more concise and easier to read.
SELECT column_name(s)
FROM table_name
WHERE column_name IN (value1, value2, ...);

:discourse: Semicolons are not necessary when writing SQL queries in the Data Explorer.

PostgresSQL

Discourse currently uses PostgreSQL version 13 for its database, an open-source relational database management system (RDBMS) that emphasizes extensibility and SQL compliance.

For a detailed understanding of PostgreSQL, you may want to read through the PostgreSQL Official Documentation. Specifically, you may find the sections on Queries, Data Types, and Functions particularly relevant when writing SQL queries.

Database Tables

The Discourse Database is made up of tables. Tables are the fundamental way to store data in a relational database management system.

Each table in Discourse is identified by a name (e.g. posts or topics), and each tables contain records (rows) with data. Columns in tables are commonly referred to as fields.

:discourse: The Discourse database contains 240+ Tables!

The Data explorer lists the 9 most important tables first in the query edit UI panel, and you can see all of the tables column structure and types with a click:

Database Schema

In PostgreSQL, as well as in other relational databases, the concepts of Primary Key and Foreign Key are fundamental. They are used to establish relationships between tables and ensure the consistency and integrity of the data.

Primary Keys

A primary key is a column or a set of columns in a table that uniquely identifies each row in that table. No two rows in a table can have the same primary key value. You can’t have a null primary key value, and the value must be unique. The primary key is used to index the data in the table, making data retrieval much faster. In the Discourse schema explorer, the primary key of a table is always listed first.

Example: The primary key of the posts table is the id field.

Foreign Keys

A foreign key is a column or a set of columns in a table that is used to establish a link between the data in two tables. It acts as a cross-reference between tables because it references the primary key of another table, thereby establishing a link between them. The table with the foreign key is called the child table, and the table with the primary key is called the referenced or parent table. In the Discourse schema explorer, a foreign key is indicated by the text fkey followed by the name of the parent table.

Example: In the posts table, the user_id is a foreign key of the users table.

Using Primary and Foreign Keys

When you’re writing queries in the Data Explorer, you can use these primary and foreign keys to JOIN tables together and get more complex data. For example, if you wanted to find all the posts made by a particular user, you could write a query like this:

SELECT p.id, p.created_at, p.raw
FROM posts p
JOIN users u ON p.user_id = u.id
WHERE u.username = 'username_here'

In this query, we’re using the user_id foreign key in the posts table to join with the id primary key in the users table. This lets us find all the posts made by a particular user.

SQL JOINS

As previously mentioned , the JOIN statement is used to combine rows from two or more tables, based on a related column between them.

It’s important to note that there are several types of SQL JOIN statements, each serving a different purpose:

  1. INNER JOIN: The INNER JOIN keyword selects records that have matching values in both tables. It returns the rows from both tables where there is a match. Use INNER JOIN when you only want to return the records where there is a match in both tables. When JOIN is used without specifying the type of join (like LEFT , RIGHT , or FULL ), it defaults to INNER JOIN .

  2. LEFT (OUTER) JOIN: The LEFT JOIN keyword returns all records from the left table (table1), and the matched records from the right table (table2). The result is NULL from the right side, if there is no match. Use LEFT JOIN when you want to return all the records from the left table and the matched records from the right table. If there is no match, the result is NULL on the right side.

  3. RIGHT (OUTER) JOIN: The RIGHT JOIN keyword returns all records from the right table (table2), and the matched records from the left table (table1). The result is NULL from the left side, when there is no match. Use RIGHT JOIN when you want to return all the records from the right table and the matched records from the left table. If there is no match, the result is NULL on the left side.

  4. FULL (OUTER) JOIN: The FULL JOIN keyword returns all records when there is a match in either left (table1) or right (table2) table records. Use FULL JOIN when you want to return all records when there is a match in one of the tables.

You can also combine multiple JOIN statements to join more multiple tables together. For example, we can join the topic_tags table with the topics table, and then join the tags table with the topic_tags table to gain access to the names of the tags associated with a topic.

SELECT 
    t.id AS topic_id, 
    tg.name as tag_name
FROM 
    topics t
JOIN topic_tags tt ON t.id = tt.topic_id
JOIN tags tg ON tg.id = tt.tag_id

WITH Statements and Subqueries

The SQL WITH statement, also known as a Common Table Expression (CTE), is used to create a temporary result set that can be referenced within another SELECT statement. This can be particularly useful when working with complex SQL queries as it can help to simplify them, making them easier to read and maintain.

Here’s a basic example of a WITH statement:

WITH post_counts AS (
  SELECT user_id, COUNT(*) as post_count
  FROM posts
  GROUP BY user_id
)

SELECT u.username, post_counts.post_count
FROM users u
JOIN post_counts ON u.id = post_counts.user_id
ORDER BY post_counts.post_count DESC;

In this query, the WITH statement creates a temporary table post_counts that contains each user’s ID and their total post count. The main query then joins this table with the users table to get the usernames associated with each ID, and orders the result by the post count in descending order.

This query would return the number of posts each user has made.

Subqueries are queries nested inside another query. A subquery can be used inside SELECT statements or inside another subquery. A subquery can return a set of records, a single record, or a single value.

Here’s an example of a subquery:

SELECT u.username 
FROM users u
WHERE u.id IN (
  SELECT p.user_id 
  FROM posts p
  GROUP BY p.user_id 
  HAVING COUNT(p.id) > 100
)

In this example, the subquery (SELECT p.user_id FROM posts p GROUP BY p.user_id HAVING COUNT(p.id) > 100) selects the user_id from the posts table for users who have made more than 100 posts. The main query then gets the username from the users table for those user_ids.

This query would return a list of usernames for users who have made more than 100 posts.

More Topics in this Series

8 Likes