Find Links in Topics

This Data Explorer SQL query is designed to extract and count links from posts within public topics over a specified date range. It also allows for optional searching of links based on a specified URL pattern.

This query can be helpful in scenarios where understanding the usage of internal or external links in public discussions is important.

--[params]
--date :start_date = 2023-12-01
--date :end_date = 2024-01-01
--text :link_pattern = %example.com%
--boolean :search = false

SELECT
  p.created_at,
  p.id AS post_id,
  p.topic_id AS topic_id,
  COUNT(DISTINCT links) AS link_count,
  array_agg(DISTINCT links) AS all_links
FROM
  posts p
JOIN
  topics t ON p.topic_id = t.id
CROSS JOIN LATERAL
  regexp_matches(p.cooked, 'href="((?:http|https)://[^"]+)"', 'g') AS links
WHERE
  p.created_at >= :start_date 
  AND p.created_at <= :end_date 
  AND p.cooked ~ 'href="http[s]?://'
  AND t.archetype <> 'private_message' -- Exclude private messages
  AND (:search = false OR links[1] LIKE :link_pattern) -- Filter links by pattern
GROUP BY
  p.id,
  p.topic_id,
  p.post_number
ORDER BY 
  p.created_at ASC

SQL Query Explanation

The query performs the following operations:

  • Join Operations: It joins the posts table with the topics table to ensure that only posts belonging to public topics are considered. This is done by checking that the archetype of the topic is not ‘private_message’.
  • Pattern Matching: It uses a regular expression to extract all URLs from the cooked field of each post. This field contains the HTML-rendered version of the post content.
  • Filtering:
    • Date Filtering: Only posts created within the user-specified date range (start_date to end_date) are included.
    • Link Filtering: Optionally, if the search parameter is set to true, only links that match the link_pattern are considered.
  • Aggregation: For each post, the query counts the distinct links and aggregates them into an array. This helps in understanding the variety of links shared in a single post.
  • Output: The query outputs the creation date of the post, post ID, topic ID, count of unique links, and an array of these links.

Parameters

  • start_date and end_date: Define the date range for the posts to be analyzed.
  • link_pattern: A pattern to filter the links. Only links containing this pattern will be included if search is set to true. In the :link_pattern parameter the % symbols are wildcards in SQL that match zero or more characters.
  • search: A boolean flag to activate the link pattern filtering.

Results

  • created_at: The date and time when the post was created.
  • post_id: The unique identifier of the post.
  • topic_id: The unique identifier of the topic to which the post belongs.
  • link_count: The number of unique links found in the post. This indicates how many different external links were included in the post.
  • all_links: An array of all unique links found in the post.

Example Results

created_at post_id topic_id link_count all_links
2023-12-05 14:30:00 10234 543 2 {“http://example.com/page1”, “http://example.com/page2”}
2023-12-12 09:15:00 10567 550 1 {“http://example.com/page3”}
2023-12-20 16:45:00 10894 560 3 {“http://example.com/page1”, “http://example.com/info”, “http://example.com/contact”}
1 Like