This Data Explorer SQL query is designed to extract and count links from posts within public topics over a specified date range. It also allows for optional searching of links based on a specified URL pattern.
This query can be helpful in scenarios where understanding the usage of internal or external links in public discussions is important.
--[params]
--date :start_date = 2023-12-01
--date :end_date = 2024-01-01
--text :link_pattern = %example.com%
--boolean :search = false
SELECT
p.created_at,
p.id AS post_id,
p.topic_id AS topic_id,
COUNT(DISTINCT links) AS link_count,
array_agg(DISTINCT links) AS all_links
FROM
posts p
JOIN
topics t ON p.topic_id = t.id
CROSS JOIN LATERAL
regexp_matches(p.cooked, 'href="((?:http|https)://[^"]+)"', 'g') AS links
WHERE
p.created_at >= :start_date
AND p.created_at <= :end_date
AND p.cooked ~ 'href="http[s]?://'
AND t.archetype <> 'private_message' -- Exclude private messages
AND (:search = false OR links[1] LIKE :link_pattern) -- Filter links by pattern
GROUP BY
p.id,
p.topic_id,
p.post_number
ORDER BY
p.created_at ASC
SQL Query Explanation
The query performs the following operations:
- Join Operations: It joins the
poststable with thetopicstable to ensure that only posts belonging to public topics are considered. This is done by checking that thearchetypeof the topic is not ‘private_message’. - Pattern Matching: It uses a regular expression to extract all URLs from the
cookedfield of each post. This field contains the HTML-rendered version of the post content. - Filtering:
- Date Filtering: Only posts created within the user-specified date range (
start_datetoend_date) are included. - Link Filtering: Optionally, if the
searchparameter is set to true, only links that match thelink_patternare considered.
- Date Filtering: Only posts created within the user-specified date range (
- Aggregation: For each post, the query counts the distinct links and aggregates them into an array. This helps in understanding the variety of links shared in a single post.
- Output: The query outputs the creation date of the post, post ID, topic ID, count of unique links, and an array of these links.
Parameters
start_dateandend_date: Define the date range for the posts to be analyzed.link_pattern: A pattern to filter the links. Only links containing this pattern will be included ifsearchis set to true. In the:link_patternparameter the%symbols are wildcards in SQL that match zero or more characters.search: A boolean flag to activate the link pattern filtering.
Results
created_at: The date and time when the post was created.post_id: The unique identifier of the post.topic_id: The unique identifier of the topic to which the post belongs.link_count: The number of unique links found in the post. This indicates how many different external links were included in the post.all_links: An array of all unique links found in the post.
Example Results
| created_at | post_id | topic_id | link_count | all_links |
|---|---|---|---|---|
| 2023-12-05 14:30:00 | 10234 | 543 | 2 | {“http://example.com/page1”, “http://example.com/page2”} |
| 2023-12-12 09:15:00 | 10567 | 550 | 1 | {“http://example.com/page3”} |
| 2023-12-20 16:45:00 | 10894 | 560 | 3 | {“http://example.com/page1”, “http://example.com/info”, “http://example.com/contact”} |
| … | … | … | … | … |