This Data Explorer SQL query is designed to extract and count links from posts within public topics over a specified date range. It also allows for optional searching of links based on a specified URL pattern.
This query can be helpful in scenarios where understanding the usage of internal or external links in public discussions is important.
--[params]
--date :start_date = 2023-12-01
--date :end_date = 2024-01-01
--text :link_pattern = %example.com%
--boolean :search = false
SELECT
p.created_at,
p.id AS post_id,
p.topic_id AS topic_id,
COUNT(DISTINCT links) AS link_count,
array_agg(DISTINCT links) AS all_links
FROM
posts p
JOIN
topics t ON p.topic_id = t.id
CROSS JOIN LATERAL
regexp_matches(p.cooked, 'href="((?:http|https)://[^"]+)"', 'g') AS links
WHERE
p.created_at >= :start_date
AND p.created_at <= :end_date
AND p.cooked ~ 'href="http[s]?://'
AND t.archetype <> 'private_message' -- Exclude private messages
AND (:search = false OR links[1] LIKE :link_pattern) -- Filter links by pattern
GROUP BY
p.id,
p.topic_id,
p.post_number
ORDER BY
p.created_at ASC
SQL Query Explanation
The query performs the following operations:
- Join Operations: It joins the
posts
table with thetopics
table to ensure that only posts belonging to public topics are considered. This is done by checking that thearchetype
of the topic is not ‘private_message’. - Pattern Matching: It uses a regular expression to extract all URLs from the
cooked
field of each post. This field contains the HTML-rendered version of the post content. - Filtering:
- Date Filtering: Only posts created within the user-specified date range (
start_date
toend_date
) are included. - Link Filtering: Optionally, if the
search
parameter is set to true, only links that match thelink_pattern
are considered.
- Date Filtering: Only posts created within the user-specified date range (
- Aggregation: For each post, the query counts the distinct links and aggregates them into an array. This helps in understanding the variety of links shared in a single post.
- Output: The query outputs the creation date of the post, post ID, topic ID, count of unique links, and an array of these links.
Parameters
start_date
andend_date
: Define the date range for the posts to be analyzed.link_pattern
: A pattern to filter the links. Only links containing this pattern will be included ifsearch
is set to true. In the:link_pattern
parameter the%
symbols are wildcards in SQL that match zero or more characters.search
: A boolean flag to activate the link pattern filtering.
Results
created_at
: The date and time when the post was created.post_id
: The unique identifier of the post.topic_id
: The unique identifier of the topic to which the post belongs.link_count
: The number of unique links found in the post. This indicates how many different external links were included in the post.all_links
: An array of all unique links found in the post.
Example Results
created_at | post_id | topic_id | link_count | all_links |
---|---|---|---|---|
2023-12-05 14:30:00 | 10234 | 543 | 2 | {“http://example.com/page1”, “http://example.com/page2”} |
2023-12-12 09:15:00 | 10567 | 550 | 1 | {“http://example.com/page3”} |
2023-12-20 16:45:00 | 10894 | 560 | 3 | {“http://example.com/page1”, “http://example.com/info”, “http://example.com/contact”} |
… | … | … | … | … |