Multi-line HTML comments not hidden on post from wp-discourse

Hello,

I’m not sure if this is an issue with wp-discourse specifically or with the HTML parser in Discourse itself, but it looks like multi-line HTML comments aren’t hidden in posts generated by wp-discourse.

I’ve just finished setting up chinwag.pluralistic.net, based on the pluralistic.net posts from Cory Doctorow. His posts include some metadata at the top in the form of HTML comments. Recently, this format changed to multiline instead of single-line.

You can see it in this post:

The single-line comment from the origin article is removed:

https://pluralistic.net/2020/07/04/pluralistic-04-jul-2020/

However, the multiline comment from a later post is displayed directly:

2 Likes

I don’t think the problem is being caused by the WordPress plugin. Here’s some example markup that is not interpreted as a comment by Discourse:

<p><!--
Tags:


Summary:
New podcast; Europe's interop coalition; Scarfolk beermats; Miami cop owns illegal mansion nightclub; Video and transcript of my OII talk; Shower temperature vs handle position

URL:
https://pluralistic.net/2020/07/06/polbathic/

Title:
Pluralistic: 06 Jul 2020 polbathic

Bullet:
🧔🏿

Separator:
_,.-'~'-.,__,.-'~'-.,__,.-'~'-.,__,.-'~'-.,__,.-'~'-.,_

Top Sources:
Today's top sources: Fipi Lele, Naked Capitalism (https://www.nakedcapitalism.com/).

--><br></p>

Renders as:

The problem is that the empty lines within the comment are interpreted as paragraphs by Discourse. When that is done, the HTML comment is no longer valid markup. If the empty lines are removed from within the comment, it will be interpreted correctly by Discourse:

<p><!--
Tags:
Summary:
New podcast; Europe's interop coalition; Scarfolk beermats; Miami cop owns illegal mansion nightclub; Video and transcript of my OII talk; Shower temperature vs handle position
URL:
https://pluralistic.net/2020/07/06/polbathic/
Title:
Pluralistic: 06 Jul 2020 polbathic
Bullet:
🧔🏿
Separator:
_,.-'~'-.,__,.-'~'-.,__,.-'~'-.,__,.-'~'-.,__,.-'~'-.,_
Top Sources:
Today's top sources: Fipi Lele, Naked Capitalism (https://www.nakedcapitalism.com/).
--><br></p>
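
For anyone who would rather keep the metadata comment in the post, the blank-line removal described above could be automated. The following is only a minimal sketch, not something the plugin does: the function name `wpdc_collapse_comment_blank_lines` is hypothetical, and it assumes comments are well formed (`<!--` followed by `-->`).

```php
<?php
/**
 * Hypothetical helper (not part of WP Discourse): removes empty lines inside
 * HTML comments so that Discourse does not split the comment into paragraphs.
 */
function wpdc_collapse_comment_blank_lines( $html ) {
	return preg_replace_callback(
		// The "s" modifier lets "." match line breaks, so multi-line comments are matched.
		'/<!--.*?-->/s',
		function ( $matches ) {
			// Collapse a line break followed by one or more blank lines into a single line break.
			return preg_replace( '/(\r?\n)(?:[ \t]*\r?\n)+/', '$1', $matches[0] );
		},
		$html
	);
}

// Shortened version of the example markup above.
$sample = "<p><!--\nTags:\n\n\nSummary:\nNew podcast\n\n--><br></p>";
echo wpdc_collapse_comment_blank_lines( $sample );
```

Content outside of comments is left untouched, so paragraph breaks in the post body are preserved.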

Other than editing the post content before publishing the posts to Discourse, I’m not sure what the best approach would be for dealing with this.

3 Likes

Probably the best way to deal with this would be to remove the comments from the post before publishing the post to Discourse. Maybe the plugin should do this by default, but this is the first time I’ve come across this issue.

The WP Discourse plugin has a filter, wp_discourse_excerpt, that can be hooked into to modify post content before it is published to Discourse. The post content is passed to the filter as its argument. Here is how that filter can be used to remove all comments from the WordPress post before it gets published to Discourse. This pattern could be used to alter the post in other ways too:

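add_filter( 'wp_discourse_excerpt', 'wpdc_custom_discourse_excerpt' );
/**
 * Removes HTML comments from the post content before it is published to Discourse.
 *
 * @param string $html The post content passed in by the wp_discourse_excerpt filter.
 * @return string The post content with all comment nodes removed.
 */
function wpdc_custom_discourse_excerpt( $html ) {
	// Without libxml the content cannot be parsed, so return it unchanged.
	if ( ! extension_loaded( 'libxml' ) ) {
		return $html;
	}

	$use_internal_errors = libxml_use_internal_errors( true );
	// libxml_disable_entity_loader() is deprecated as of PHP 8.0, so only call it on older versions.
	$legacy_php = PHP_VERSION_ID < 80000;
	if ( $legacy_php ) {
		$disable_entity_loader = libxml_disable_entity_loader( true );
	}

	// Wrap the fragment in a full document that declares UTF-8 so multibyte characters survive parsing.
	$html_doc = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/></head><body>' . $html . '</body></html>';
	$doc      = new \DOMDocument( '1.0', 'utf-8' );
	$doc->loadHTML( $html_doc );
	$finder   = new \DOMXPath( $doc );
	$comments = $finder->query( '//comment()' );
	if ( $comments->length ) {
		foreach ( $comments as $comment ) {
			$comment->parentNode->removeChild( $comment );
		}

		// Serialize the document, then strip the wrapper tags that were added above.
		$parsed = $doc->saveHTML( $doc->documentElement );
		$html   = preg_replace( '~<(?:!DOCTYPE|/?(?:html|head|meta|body))[^>]*>\s*~i', '', $parsed );
	}

	// Restore the previous libxml state.
	libxml_clear_errors();
	libxml_use_internal_errors( $use_internal_errors );
	if ( $legacy_php ) {
		libxml_disable_entity_loader( $disable_entity_loader );
	}

	return $html;
}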

Adding that code to your WordPress theme’s functions.php file should solve the problem. If it is not working for you, check to be sure that the libxml extension is enabled on your WordPress site’s server.
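
If libxml is not available on the server, a rougher fallback is to strip the comments with a regular expression instead of a DOM parser. This is only a minimal sketch under assumptions: the function name `wpdc_strip_comments_regex` is hypothetical and not part of WP Discourse, and it assumes the post contains no `<!--` sequences (for example inside code samples) that should be kept.

```php
<?php
/**
 * Hypothetical fallback (not part of WP Discourse): strips HTML comments
 * with a regular expression instead of a DOM parser.
 */
function wpdc_strip_comments_regex( $html ) {
	// The "s" modifier lets "." match line breaks, so multi-line comments are removed too.
	$stripped = preg_replace( '/<!--.*?-->/s', '', $html );

	// preg_replace() returns null on failure; fall back to the original content in that case.
	return null === $stripped ? $html : $stripped;
}

// Only register the filter when running inside WordPress.
if ( function_exists( 'add_filter' ) ) {
	add_filter( 'wp_discourse_excerpt', 'wpdc_strip_comments_regex' );
}
```

A regular expression does not understand HTML structure, so the DOM-based version above is the safer choice whenever libxml is enabled.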

3 Likes

I will indeed give this a shot, but it sounds like a larger issue if HTML entities in a comment result in Discourse deciding that comment is now invalid HTML, no?

1 Like

I’m not sure how Discourse should handle this. I think that for posts that are created directly on Discourse it makes sense to say that only a limited subset of HTML is supported, but for posts that are created through the API, or by pulling in an RSS feed, it’s not clear to me how far Discourse should go to support HTML.

I think it would make sense to add the code that I posted above directly to the WP Discourse plugin. The plugin is already using similar functionality to clean up comments that are pulled to WordPress from Discourse. That has been in use for some time now with no reports of problems. I’ll get back to you about this within the next couple of days.

3 Likes

This has been added in WP Discourse version 2.0.6. When the full post content is published from WordPress to Discourse, any comment blocks in the post will be removed before the post is sent to Discourse.

4 Likes

Thanks, Simon! I’ll install as soon as this is available and report back.

ETA: Works like a charm. Thanks!

4 Likes
