Topic embedding needs some love

I was reminded of this today after clicking the “Show Full Post” button for Introducing Discourse AI. The full post that is displayed on Discourse is missing all images and many headings. Adding to the confusion, image captions are displayed, but without their associated images.

It might be possible to fix the issue on Meta for its (Ghost?) blog by adjusting Meta’s allowed embed selectors site setting: Configure the Allowed Embed Selectors Setting. From past experience, I know that getting this setting can be a tricky process. If you try adjusting it, pay close attention to the results.

Discourse has a lot of potential to function as a comment system for external posts, but to do a good job of this, clicking the “Show Full Post” button needs to reliably pull in all elements of the external post. I think the issue is that the Ruby Readability gem that’s used for parsing external posts isn’t intended for the job that Discourse is using it for. It’s also not being actively maintained: GitHub - cantino/ruby-readability: Port of arc90's readability project to Ruby.

3 Likes

Yes, at this point we either move to something else that makes it slightly better or just change the embedding strategy into making the Show Full Post into a Read Full Post that is a simple link to the original post. It may be pointless fighting with all the possible embed problems in every website afterall.

4 Likes

@sam just fixed this, take a look.

3 Likes

We’re getting ready to release our blog on Ghost and make use of the Ghost > Discourse integrations. Really happy to see this change!

4 Likes

The images are now getting pulled in. I’m not great at “spot the difference” types of puzzles, but I’m still seeing some differences:

  • Semantic Related Topics title missing
  • Community Sentiment title missing
  • missing unordered list in the Modules Providers section
  • Installing Discourse AI on your community title missing

Ideally, the “Sign up for our newsletter” prompt would be excluded from the embedded post.

Having the ability to easily quote the embedded post seems important. Thinking about that now, I’m not sure what the expected behaviour is when the “expand/collapse” and “go to post” buttons are clicked for an embedded post’s quotes.

It’s a tricky problem. It should be as simple as sanitizing the HTML that’s contained in a post’s article or main element, but I suspect there would still be issues with that approach. For example, it would require some special handling to prevent duplication of a blog post’s h1 element if the header exists inside of the article.

1 Like

I think this is all happening even in readablity.js, this is firefox reader view:

<h2 id="installing-discourse-ai-on-your-community">
      <strong>Installing Discourse AI on your community</strong>
</h2>

Will see if there is an easy way to fix this…

Not sure about this… but if we really really want to do that we can add .discourse-newsletter-signup to blocked_embed_selectors

4 Likes

Yeah, readablity.js is based on the same code as GitHub - cantino/ruby-readability: Port of arc90's readability project to Ruby, so probably the same logic is being used to remove those elements. readablity.js generally does a better job than Ruby Readability though.

The email CTA is confusing because the email input gets stripped from the embedded post. Technically, I’m not sure the CTA belongs inside the article.

1 Like

Just bumping this, as I agree with @simon that this should be re-thought at some point.

A fair chunk of support requests for the WP Discourse plugin are actually readability crawling issues of some form or another.

I think that sums up my gut on this.

That said, I don’t have a great solution at the moment besides this.

But I’m keen to contribute to a better solution than the status quo, as it would reduce the WP Discourse support workload.

1 Like

They do look at issues, but are slow to fix them…

Setting it up so MiniRacer wraps up readability is not too hard … I did a prototype of this.

It is possible that we could transition to this implementation, but we also diverged already so we would end up giving up features

This is not an easy problem to solve.

2 Likes

Yeah, fair, however I feel like that will be an endless game of whack-a-mole. There will always be some version of:

The post on my website looks like X and when I click “Show Full Post” it looks like Y and I want them to be identical.

I guess my deeper question is is there a real benefit to this functionality, which will never be perfect, over

By making it a “Show Full Post” button people expect a fidelity that Discourse can never fully deliver. My concern is more expectation management.

I guess what you are calling for is removal of the embedding feature. I am not sure I am behind it. I think sites that are embedding very messy content should use this simple “link to original” form. However sites that are embedding better structured contents can make use of the reader mode albeit it being imperfect.

1 Like

Not necessarily. I’m saying there needs to be better expectation management.

99% of people running a website are not going to know wether their HTML is sufficiently semantic to be easily parsed by a gem like readability, or even that that is what is determinative of how the feature works. The default assumption of users is that there is an issue “in Discourse” (or more often in the WP Discourse plugin) when there isn’t 100% fidelity between the post on their site and the content that appears when the user clicks “Show Full Post”.

Making an option like having a “Read Full Post” CTA easy to enable, and perhaps the default, would help I think.

2 Likes

What I meant by this is that Ruby Readability is “a tool for extracting the primary readable content of a webpage.” For the case of a site that’s publishing posts to Discourse, I think it’s safe to assume that the primary readable content of the webpage is known and can be defined by an outer css selector. For example, article, .entry-content, .post, etc.

The kind of tool I’m imagining would just allow sites to define an outer selector for their post’s content, then sanitize the HTML that was contained within that selector. A slightly more sophisticated version would allow sites to define inner selectors that they wanted to exclude from Discourse.

On my WordPress site I have a post with completely standard markup. I’d like to publish everything that’s in the .entry-content div to Discourse. It almost works, but I can’t figure out how to configure the allowed embed selector setting on Discourse to pull in the post’s list elements. This is the kind of issue I’ve seen site’s struggle with. Without being able to run Rails.cache.clear, it’s really tricky to configure.

Publishing the post as a onebox is a reasonable solution for this.

Edit: the debug option is useful for figuring out what’s going on: GitHub - cantino/ruby-readability: Port of arc90's readability project to Ruby. For the case of the excluded lists in my WordPress post:

Conditionally cleaned ul#. with weight 0 and content score 0 because it has too many links for its weight (0).

It’s a perfectly legitimate list though.

A much asked for feature with expanded embeds is to allow Youtube videos to appear in the expanded content. Preventing that from happening is hardcoded into the gem: https://github.com/cantino/ruby-readability/blob/master/lib/readability.rb#L410-L412. I’m not sure if it’s worth making a PR to be able to override that list with an option.

2 Likes

I won’t get too carried away with this, but I was using Nokogiri for something else over the weekend. It’s kind of addictive. I figured I’d have a look at the embedding code while Nokogiri was fresh on my mind.

My interest in this is that I’d like to see Discourse used more widely by news and blogging sites. If that was to happen, I can imagine new site owners getting frustrated with the current embedding functionality. Here’s one idea for improving it:

Add two new optional attributes to the EmbeddableHost model:

  • target_selector: the outer css selector that contains the content that is to be embedded
  • exclude_selectors: a list of css selectors that are to be excluded from the content selected by the target_selector.

A “Configure” button should be added to each Embeddable Host row on the Admin / Embedding page. Clicking that button opens a page that’s similar to the Emails / Preview Summary page.

The Configure Host page would have a form with fields for entering the host’s target_selector and exclude_selectors settings, and a URL field that allowed the supplied values to be tested against a specific web page. The test would essentially just run TopicEmbed.parse_html with the supplied target_selector and exclude_selectors values, then display the results.


Changes to the parse_html code are easy to test. Here’s a possible approach. Note this code is just a proof of concept:

edited into topic_embed.rb (https://github.com/discourse/discourse/blob/main/app/models/topic_embed.rb#L157)

###########################################################################
    # `target_selector` and `exclude_selectors` would ideally be found from the domain's `EmbeddableHost` record
    # these particular settings were used for testing against boingboing.net
    target_selector = 'article'
    exclude_selectors = ['.article-header, .share-comments-container', '.boing-single-post-rev-content', '.next-post-list-container', '.boing-end-of-article-container-on-single-post-pages']

    if defined?(target_selector) && target_selector.present?
      read_doc = article_content(html, target_selector, exclude_selectors)
    else
      # fallback to Readability if `target_selector` isn't set for the host
      read_doc = Readability::Document.new(html, opts)
    end
    ###########################################################################

For testing without creating a new class, here’s a basic article_content method added to the TopicEmbed class:

  def self.article_content(html, target_selector, exclude_selectors = [])
    doc = Nokogiri::HTML(html)
    # remove comments and script tags
    doc.xpath('//comment()').each { |i| i.remove }
    doc.css("script, style").each { |i| i.remove }

    # get the NodeSet for the target_selector
    # maybe fall back to using Readability here if the retured set is empty
    selected_nodes = doc.css(target_selector)

    # exclude nodes
    unless exclude_selectors.empty?
      selected_nodes.css(*exclude_selectors).each do |node|
        node.remove
      end
    end

    # deal with image sizes, might need improvement
    selected_nodes.css('img').each do |img|
      img.remove_attribute('width')
      img.remove_attribute('height')
    end

    # just for the heck of it, allow iframes if their source is allowed
    # use `[data-sanitized="true"]` to prevent iframes from being stripped in the remove_empty_nodes step
    allowed_iframe_sources = SiteSetting.allowed_iframes.split('|')
    selected_nodes.css('iframe').each do |iframe|
      allowed = allowed_iframe_sources.any? do |allowed_source|
        iframe['src'].start_with?(allowed_source)
      end

      if allowed
        iframe['data-sanitized'] = 'true'
        iframe['width'] = '690'
        iframe['height'] = '388'
      else
        iframe.remove
      end
    end

    # remove empty 'p' and 'div' nodes
    selected_nodes.css('p', 'div').each do |node|
      node.remove if node.content.strip.empty? && !node.at_css('iframe[data-sanitized="true"]')
    end

    # convert the nodes to a string and return an object with a `content` method
    content = selected_nodes.to_s
    OpenStruct.new(content: content)
  end

I’m fairly sure it would just take a bit of fiddling against multiple domains to get it right. The results I’ve been getting for BBS are good so far.

The goal is to come up with something that site owners can easily understand and configure on their own. With this approach, the more specific the target_selector is, the easier it would be to configure the exclude_selectors. For example, for a WordPress site, if .entry-content was selected as the target_selector, no further configuration would be required. If site owners wanted to get more than the basic .entry-content html, they could figure out how to do that on the Configure Host page.

The only real issue I can see is for hosts with very inconsistent HTML. That case could be dealt with by keeping Ruby Readability as a fallback.