I won’t get too carried away with this, but I was using Nokogiri for something else over the weekend. It’s kind of addictive. I figured I’d have a look at the embedding code while Nokogiri was fresh on my mind.
My interest in this is that I’d like to see Discourse used more widely by news and blogging sites. If that was to happen, I can imagine new site owners getting frustrated with the current embedding functionality. Here’s one idea for improving it:
Add two new optional attributes to the EmbeddableHost
model:
target_selector
: the outer css selector that contains the content that is to be embedded
exclude_selectors
: a list of css selectors that are to be excluded from the content selected by the target_selector
.
A “Configure” button should be added to each Embeddable Host row on the Admin / Embedding page. Clicking that button opens a page that’s similar to the Emails / Preview Summary page.
The Configure Host page would have a form with fields for entering the host’s target_selector
and exclude_selectors
settings, and a URL field that allowed the supplied values to be tested against a specific web page. The test would essentially just run TopicEmbed.parse_html
with the supplied target_selector
and exclude_selectors
values, then display the results.
Changes to the parse_html
code are easy to test. Here’s a possible approach. Note this code is just a proof of concept:
edited into topic_embed.rb
(https://github.com/discourse/discourse/blob/main/app/models/topic_embed.rb#L157)
###########################################################################
# `target_selector` and `exclude_selectors` would ideally be found from the domain's `EmbeddableHost` record
# these particular settings were used for testing against boingboing.net
target_selector = 'article'
exclude_selectors = ['.article-header, .share-comments-container', '.boing-single-post-rev-content', '.next-post-list-container', '.boing-end-of-article-container-on-single-post-pages']
if defined?(target_selector) && target_selector.present?
read_doc = article_content(html, target_selector, exclude_selectors)
else
# fallback to Readability if `target_selector` isn't set for the host
read_doc = Readability::Document.new(html, opts)
end
###########################################################################
For testing without creating a new class, here’s a basic article_content
method added to the TopicEmbed
class:
def self.article_content(html, target_selector, exclude_selectors = [])
doc = Nokogiri::HTML(html)
# remove comments and script tags
doc.xpath('//comment()').each { |i| i.remove }
doc.css("script, style").each { |i| i.remove }
# get the NodeSet for the target_selector
# maybe fall back to using Readability here if the retured set is empty
selected_nodes = doc.css(target_selector)
# exclude nodes
unless exclude_selectors.empty?
selected_nodes.css(*exclude_selectors).each do |node|
node.remove
end
end
# deal with image sizes, might need improvement
selected_nodes.css('img').each do |img|
img.remove_attribute('width')
img.remove_attribute('height')
end
# just for the heck of it, allow iframes if their source is allowed
# use `[data-sanitized="true"]` to prevent iframes from being stripped in the remove_empty_nodes step
allowed_iframe_sources = SiteSetting.allowed_iframes.split('|')
selected_nodes.css('iframe').each do |iframe|
allowed = allowed_iframe_sources.any? do |allowed_source|
iframe['src'].start_with?(allowed_source)
end
if allowed
iframe['data-sanitized'] = 'true'
iframe['width'] = '690'
iframe['height'] = '388'
else
iframe.remove
end
end
# remove empty 'p' and 'div' nodes
selected_nodes.css('p', 'div').each do |node|
node.remove if node.content.strip.empty? && !node.at_css('iframe[data-sanitized="true"]')
end
# convert the nodes to a string and return an object with a `content` method
content = selected_nodes.to_s
OpenStruct.new(content: content)
end
I’m fairly sure it would just take a bit of fiddling against multiple domains to get it right. The results I’ve been getting for BBS are good so far.
The goal is to come up with something that site owners can easily understand and configure on their own. With this approach, the more specific the target_selector
is, the easier it would be to configure the exclude_selectors
. For example, for a WordPress site, if .entry-content
was selected as the target_selector
, no further configuration would be required. If site owners wanted to get more than the basic .entry-content
html, they could figure out how to do that on the Configure Host page.
The only real issue I can see is for hosts with very inconsistent HTML. That case could be dealt with by keeping Ruby Readability as a fallback.