I was reminded of this today after clicking the “Show Full Post” button for Introducing Discourse AI. The full post that is displayed on Discourse is missing all images and many headings. Adding to the confusion, image captions are displayed, but without their associated images.
It might be possible to fix the issue on Meta for its (Ghost?) blog by adjusting Meta’s allowed embed selectors site setting: Configure the Allowed Embed Selectors Setting. From past experience, I know that getting this setting right can be a tricky process. If you try adjusting it, pay close attention to the results.
Discourse has a lot of potential to function as a comment system for external posts, but to do a good job of this, clicking the “Show Full Post” button needs to reliably pull in all elements of the external post. I think the issue is that the Ruby Readability gem that’s used for parsing external posts isn’t intended for the job that Discourse is using it for. It’s also not being actively maintained: GitHub - cantino/ruby-readability: Port of arc90's readability project to Ruby.
Yes, at this point we either move to something else that does a slightly better job, or we change the embedding strategy so that “Show Full Post” becomes a “Read Full Post” button that simply links to the original post. It may be pointless to fight every possible embed problem on every website after all.
The images are now getting pulled in. I’m not great at “spot the difference” types of puzzles, but I’m still seeing some differences:
“Semantic Related Topics” title missing
“Community Sentiment” title missing
missing unordered list in the “Modules Providers” section
“Installing Discourse AI on your community” title missing
Ideally, the “Sign up for our newsletter” prompt would be excluded from the embedded post.
Having the ability to easily quote the embedded post seems important. Thinking about that now, I’m not sure what the expected behaviour is when the “expand/collapse” and “go to post” buttons are clicked for an embedded post’s quotes.
It’s a tricky problem. It should be as simple as sanitizing the HTML that’s contained in a post’s article or main element, but I suspect there would still be issues with that approach. For example, it would require some special handling to prevent duplication of a blog post’s h1 element if the header exists inside of the article.
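To make that idea concrete, here is a rough sketch in Python (standard library only) of what I mean. It is not how Discourse or Ruby Readability works. The allowlists, the `ArticleExtractor` class, and the `extract_article` function are all names I made up for illustration, and dropping every h1 and every form inside the article is an assumption on my part. A real implementation would want a maintained HTML sanitizer instead of this hand-rolled filtering.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical allowlists for illustration; not Discourse's actual rules.
ALLOWED_TAGS = {"p", "h2", "h3", "h4", "ul", "ol", "li", "a", "img", "figure",
                "figcaption", "blockquote", "pre", "code", "em", "strong"}
ALLOWED_ATTRS = {"a": {"href"}, "img": {"src", "alt"}}
URL_ATTRS = {"href", "src"}
VOID_TAGS = {"img"}
CONTAINERS = {"article", "main"}
# Dropped together with their contents. h1 is here because the topic title
# already shows the post title (assumption: every h1 is a duplicate).
DROPPED = {"script", "style", "form", "h1"}


class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.container_depth = 0  # > 0 while inside <article> or <main>
        self.skip_tag = None      # tag whose whole subtree is being dropped
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.container_depth += 1
            return
        if self.container_depth == 0:
            return
        if self.skip_tag:
            if tag == self.skip_tag:
                self.skip_depth += 1
            return
        if tag in DROPPED:
            self.skip_tag, self.skip_depth = tag, 1
            return
        if tag in ALLOWED_TAGS:
            parts = []
            for name, value in attrs:
                if value is None or name not in ALLOWED_ATTRS.get(tag, ()):
                    continue
                # Keep only relative, http, and https URLs.
                if name in URL_ATTRS and urlsplit(value).scheme not in ("", "http", "https"):
                    continue
                parts.append(f' {name}="{escape(value, quote=True)}"')
            self.out.append(f"<{tag}{''.join(parts)}>")

    def handle_endtag(self, tag):
        if tag in CONTAINERS:
            self.container_depth = max(0, self.container_depth - 1)
            return
        if self.container_depth == 0:
            return
        if self.skip_tag:
            if tag == self.skip_tag:
                self.skip_depth -= 1
                if self.skip_depth == 0:
                    self.skip_tag = None
            return
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.container_depth and not self.skip_tag:
            self.out.append(escape(data, quote=False))


def extract_article(page_html):
    """Return sanitized markup found inside <article>/<main> elements."""
    parser = ArticleExtractor()
    parser.feed(page_html)
    parser.close()
    return "".join(parser.out)
```

Calling `extract_article` on a full page keeps h2 to h4 headings, lists, and images from inside the article, and drops navigation, footers, scripts, and the duplicate h1. It still would not handle a blog that has no article or main element, or one that puts comments and related-post widgets inside the article, which is the kind of issue I suspect would remain.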