Topic embedding needs some attention

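simon · April 24, 2023, 22:32

I was reminded of this today after clicking the "Show Full Post" button on Introducing Discourse AI. The full post displayed on Discourse is missing all of the images and many of the headings. More confusingly, the image captions are displayed without their associated images.

It might be possible to fix the issue for Meta's (Ghost?) blog by adjusting Meta's allowed embed selectors site setting: https://meta.discourse.org/t/configure-the-allowed-embed-selectors-setting/134481. From past experience, I know that getting this setting right can be a tricky process. If you try adjusting it, keep a close eye on the results.

Discourse has a lot of potential as a comment system for external posts, but to do that well, the "Show Full Post" button needs to reliably pull in all elements of the external post. I think the problem is that the Ruby Readability gem that's used to parse external posts wasn't designed for the purpose Discourse is using it for. It's also not actively maintained: https://github.com/cantino/ruby-readability.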

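Falco · April 24, 2023, 22:39

Yes, at this point we either move to something else that does slightly better, or change the embedding strategy so that Show Full Post becomes Read Full Post, a simple link to the original post. After all, fighting every possible embedding problem on every website is probably futile.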

Falco · April 27, 2023, 14:38

@sam just fixed it, take a look.

jordan-violet · April 27, 2023, 15:21

We're about to launch our blog on Ghost and use the Ghost integration with Discourse. Glad to see this change!

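simon · April 27, 2023, 17:54

Images are now being pulled in. I'm not good at "spot the difference" puzzles, but I'm still seeing a few differences:

The "Semantic related topics" heading is missing
The "Community sentiment" heading is missing
The unordered list in the "Module providers" section is missing
The "Installing Discourse AI on your community" heading is missing

Ideally, the "Sign up for our newsletter" prompt would be excluded from the embedded post.

Being able to easily quote embedded posts seems important. Now that I think about it, I'm not sure what the expected behavior is when clicking the "expand/collapse" and "go to post" buttons on a quote from an embedded post.

This is a tricky problem. It should be as simple as sanitizing the HTML contained in the post's article or main element, but I suspect there would still be issues with that approach. For example, if a header exists inside the article, some special handling would be needed to avoid duplicating the blog post's h1 element.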

sam · April 28, 2023, 01:13

I think this happens in readability.js as well. This is Firefox's reader view:

<h2 id="installing-discourse-ai-on-your-community">
      <strong>Installing Discourse AI on your community</strong>
    </h2>

GPT_Bot:

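The HTML you provided is technically correct, but using <strong> inside an <h2> tag is not best practice. The <h2> tag already conveys importance on its own and is usually rendered bold by default. Using <strong> inside a heading is redundant and can make the code unnecessarily complex.

A better approach would be: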
<h2 id="installing-discourse-ai-on-your-community">
    Installing Discourse AI on your community
</h2>
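If you want to add specific styling to the <h2> tag, it's best to do that with CSS.

Will see if there's a simple way to work around this…

Not sure about this… but if we really, really wanted to, we could add .discourse-newsletter-signup to blocked_embed_selectors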

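simon · April 28, 2023, 01:54

Yes, readability.js is based on the same code as Ruby Readability (GitHub - cantino/ruby-readability: Port of arc90's readability project to Ruby), so it's likely using the same logic to remove these elements. That said, readability.js generally does a better job than Ruby Readability.

The email call to action (CTA) is confusing because the email input is stripped from the embedded post. Technically, I'm not sure the call to action should be inside the article tag.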

angus · November 2, 2023, 00:51

Just bumping this, as I agree with @simon that this should be re-thought at some point.

A fair chunk of support requests for the WP Discourse plugin are actually readability crawling issues of some form or another.

I think that sums up my gut on this.

That said, I don’t have a great solution at the moment besides this.

But I’m keen to contribute to a better solution than the status quo, as it would reduce the WP Discourse support workload.

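sam · November 2, 2023, 04:26

They do look at these issues, but they're slow to fix them…

Setting up MiniRacer to wrap readability isn't hard… I built a prototype.

It's possible we could transition to that implementation, but we've already forked, so we'd end up giving up some functionality.

This isn't an easy problem to solve.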

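angus · November 2, 2023, 05:16

Yeah, fair enough, but I think that would be an endless game of whack-a-mole. There will always be someone asking:

The post on my site looks like X, and when I click "Show Full Post" it looks like Y, and I want them to be the same.

I guess my deeper question is whether a feature that can never be perfect is really beneficial, rather than

By making it a "Show Full Post" button, people expect a level of fidelity that Discourse can never fully deliver. My concern is more about expectation management.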

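sam · November 2, 2023, 05:27

I guess you're calling for the removal of the embedding feature. I'm not sure I support that. I think sites embedding very messy content should use the simple "link to the original" form. Sites embedding better-structured content, however, can use reader mode, even though it isn't perfect.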

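angus · November 2, 2023, 07:30

Not necessarily. I'm saying that expectations need to be managed better.

99% of site admins don't know whether their HTML is semantic enough to be easily parsed by a gem like readability, or even how this feature works. When there isn't 100% fidelity between a post on their site and what appears when a user clicks "Show Full Post", the default assumption is that there's a problem with "Discourse" (or, more commonly, the WP Discourse plugin).

I think offering a call to action (CTA) option like "Read Full Post", and making it easy to enable, or even the default, would help.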

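simon · November 2, 2023, 10:21

My point is that Ruby Readability is "a tool for extracting the primary readable content of a webpage." For sites that publish posts to Discourse, I think it's safe to assume that the primary readable content of the page is known and can be defined by an outer CSS selector. For example, article, .entry-content, .post, etc.

The tool I'm imagining would let a site define an outer selector for its post content, and then sanitize the HTML inside that selector. A slightly more sophisticated version would let sites define inner selectors that they want excluded from Discourse.

On my WordPress site, I have a post with completely standard markup. I want to publish everything inside the .entry-content div to Discourse. This almost works, but I can't figure out how to configure the allowed embed selector setting on Discourse to bring in the post's list items. This is a problem I've seen sites struggle with. Without a way to run Rails.cache.clear, it's really hard to configure.

Publishing the post as a onebox would be a reasonable solution.

Edit: the debug option helps to figure out what's going on: GitHub - cantino/ruby-readability: Port of arc90's readability project to Ruby. For the list that's excluded from the WordPress post: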

Conditionally cleaned ul#. with weight 0 and content score 0 because it has too many links for its weight (0).

Even so, it's a perfectly legitimate list.
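The debug message points at a link-density check. The sketch below is a simplified illustration of that kind of heuristic, not the gem's actual code. The method names, the weight cutoff of 25, and the 0.2 threshold are assumptions for illustration only.

```ruby
# Simplified illustration of a link-density heuristic (NOT ruby-readability's code).
# link density = characters of text inside links / characters of all text in the element.
def link_density(total_text, link_texts)
  total = total_text.strip.length
  return 0.0 if total.zero?
  link_texts.sum { |text| text.strip.length } / total.to_f
end

# Hypothetical rule: a low-weight element whose text is mostly links gets dropped.
# The cutoff (25) and threshold (0.2) are assumed values.
def too_many_links?(weight:, density:, threshold: 0.2)
  weight < 25 && density > threshold
end

# A list whose items are all links: nearly every character sits inside an <a>.
items = ["Home page", "Project docs", "Contact"]
list_text = items.join(" ")
density = link_density(list_text, items)

puts too_many_links?(weight: 0, density: density) # prints true
```

Under a rule like this, a navigation-style list and a legitimate list of links inside an article look identical, which would explain why the list was removed.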

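A frequently requested feature is to allow YouTube videos to appear in the expanded content of an embed. What prevents this is hard-coded in the gem: ruby-readability/lib/readability.rb at master · cantino/ruby-readability · GitHub. A PR could make this list overridable via an option.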

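simon · November 6, 2023, 09:13

I wouldn't get too excited about this, but I've been working on something else with Nokogiri over the weekend. It's kind of addictive. I wanted to take a look at the embedding code while Nokogiri was still fresh in my mind.

I'm interested in this because I'd like to see Discourse used more widely by news and blog sites. If that happens, I can imagine new site owners getting frustrated with the current embedding functionality. Here's an idea for improving it:

Add two new optional attributes to the EmbeddableHost model:

target_selector: the outer CSS selector that contains the content to be embedded
exclude_selectors: a list of CSS selectors to exclude from the content selected by target_selector.

A "Configure" button should be added to each Embeddable Host row on the Admin / Embedding page. Clicking the button would open a page similar to the Emails / Preview Summary page.

The Configure Host page would have a form for entering the host's target_selector and exclude_selectors settings, along with a URL field for testing those settings against a specific web page. The test would essentially just run TopicEmbed.parse_html with the provided target_selector and exclude_selectors values, then display the result.

Changes to the parse_html code are easy to test. Here's one possible approach. Note that this code is only a proof of concept:

Edits to topic_embed.rb (discourse/app/models/topic_embed.rb at main · discourse/discourse · GitHub)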

    ###########################################################################
    # `target_selector` and `exclude_selectors` would ideally be looked up from the
    # domain's `EmbeddableHost` record.
    # These particular settings are for testing against boingboing.net
    target_selector = 'article'
    exclude_selectors = ['.article-header', '.share-comments-container', '.boing-single-post-rev-content', '.next-post-list-container', '.boing-end-of-article-container-on-single-post-pages']

    if target_selector.present?
      read_doc = article_content(html, target_selector, exclude_selectors)
    else
      # Fall back to Readability if the host has no `target_selector` set
      read_doc = Readability::Document.new(html, opts)
    end
    ###########################################################################

To test this without creating a new class, here's a basic article_content method added to the TopicEmbed class:

  def self.article_content(html, target_selector, exclude_selectors = [])
    doc = Nokogiri::HTML(html)
    # Remove comments, scripts, and styles
    doc.xpath('//comment()').each { |i| i.remove }
    doc.css("script, style").each { |i| i.remove }

    # Get the NodeSet for the target_selector
    # If the returned set is empty, possibly fall back to using Readability
    selected_nodes = doc.css(target_selector)

    # Remove the excluded nodes
    unless exclude_selectors.empty?
      selected_nodes.css(*exclude_selectors).each do |node|
        node.remove
      end
    end

    # Deal with image sizes, this could probably be improved
    selected_nodes.css('img').each do |img|
      img.remove_attribute('width')
      img.remove_attribute('height')
    end

    # Just for fun, allow iframes if their source is allowed
    # Use `[data-sanitized="true"]` to keep iframes from being stripped in the empty-node step below
    allowed_iframe_sources = SiteSetting.allowed_iframes.split('|')
    selected_nodes.css('iframe').each do |iframe|
      # `to_s` guards against iframes that have no `src` attribute
      src = iframe['src'].to_s
      allowed = allowed_iframe_sources.any? do |allowed_source|
        src.start_with?(allowed_source)
      end

      if allowed
        iframe['data-sanitized'] = 'true'
        iframe['width'] = '690'
        iframe['height'] = '388'
      else
        iframe.remove
      end
    end

    # Remove empty 'p' and 'div' nodes
    selected_nodes.css('p', 'div').each do |node|
      node.remove if node.content.strip.empty? && !node.at_css('iframe[data-sanitized="true"]')
    end

    # Convert the nodes to a string and return an object that responds to `content`
    content = selected_nodes.to_s
    OpenStruct.new(content: content)
  end
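The iframe allow-list check can also be pulled out into a small standalone helper that tolerates iframes without a src attribute. This is only a sketch: allowed_iframe_src? is a hypothetical name, and the example prefixes are made up rather than taken from any real site's settings.

```ruby
# Hypothetical helper: true when an iframe's src starts with one of the allowed prefixes.
# A missing or empty src is never allowed.
def allowed_iframe_src?(src, allowed_sources)
  return false if src.nil? || src.empty?
  allowed_sources.any? { |prefix| !prefix.empty? && src.start_with?(prefix) }
end

# Assumes a pipe-delimited allow-list string, as in the method above.
# These prefixes are illustrative values only.
allowed = "https://www.youtube.com/embed/|https://player.vimeo.com/video/".split("|")

puts allowed_iframe_src?("https://www.youtube.com/embed/abc123", allowed) # prints true
puts allowed_iframe_src?("https://example.com/widget", allowed)           # prints false
puts allowed_iframe_src?(nil, allowed)                                    # prints false
```

Skipping empty prefixes matters because every string starts with "", so a stray empty entry in the list would otherwise allow any source.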

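I'm fairly confident this could be made to work across many domains with just a few tweaks. So far, the results I've gotten from BBS have been good.

The goal is to come up with something site owners can easily understand and configure themselves. With this approach, the more specific the target_selector, the easier it is to configure exclude_selectors. For example, for a WordPress site, if .entry-content is chosen as the target_selector, no further configuration is needed. If site owners want more than the basic .entry-content HTML, they can figure out how to do that on the Configure Host page.

The only real problem is hosts with very inconsistent HTML. That case can be handled by keeping Ruby Readability as a fallback.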
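A minimal sketch of that per-host decision, assuming the two proposed attributes exist. HostConfig is a hypothetical stand-in for an EmbeddableHost record, and extraction_strategy is an illustrative name, not existing Discourse code.

```ruby
# Hypothetical stand-in for an EmbeddableHost record with the two proposed attributes.
HostConfig = Struct.new(:host, :target_selector, :exclude_selectors, keyword_init: true)

# Decide which extraction path to use for a host:
#   :selector    when a non-blank target_selector is configured
#   :readability otherwise (the existing Ruby Readability behavior)
def extraction_strategy(config)
  selector = config.target_selector.to_s.strip
  selector.empty? ? :readability : :selector
end

wordpress = HostConfig.new(host: "blog.example.com", target_selector: ".entry-content", exclude_selectors: [])
legacy    = HostConfig.new(host: "old.example.com", target_selector: nil, exclude_selectors: [])

puts extraction_strategy(wordpress) # prints selector
puts extraction_strategy(legacy)    # prints readability
```

Hosts that never configure a target_selector would keep today's behavior, so the change would be opt-in.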
