Wordpress plugin and html-as-text (especially for mail)

We have the Wordpress plugin set up to post full topics to an announcements category on our forum. This works pretty well, but the email messages Discourse sends out are marked Content-Type: text/plain; charset=UTF-8.

As I understand it, this is Technically True, as the expected format is markdown, which is kind-of-plain-text. But, of course, markdown also, for better or worse, includes “html is valid too, yolo!”.

And, in fact, the posts coming from Wordpress are a bunch of HTML.

How are other people dealing with this? Is there a way to get the posted-to-Discourse versions to be some reasonable-as-human-text markdown after all? Or is there a way to force an HTML content type for the mail generated in this category? Or… any other ideas at all? Thanks!


Just to clarify @mattdm: by “messages”, do you mean the email notifications you get from Discourse about new posts in your announcements category?

I’m just trying to understand the specific user pain point you’re trying to address here.

Yes, sorry. By “messages” I meant “e-mail messages”. And by “it” I meant Discourse, not the Wordpress plugin. I’ll edit the top post to make that clear.

In my imagination, the nicest thing would be for the email messages to be multi-part with a clean-text rendered markdown text/plain and a separate text/html. But I am not even sure how that would work.

Sorry, again just to clarify, are you raising this because email notifications of Discourse posts containing HTML (e.g. posts linked to Wordpress posts in your announcements category) incorrectly contain HTML entities? If so, could you share an example?

The reason I’m pressing on this is that the generation of email notifications in Discourse is quite separate from anything related to the Wordpress plugin. Email notifications have their own pipeline, and there are multiple ways you can end up with HTML entities in a Discourse post, a post from Wordpress being only one.

In other words the fact that there is HTML in a Discourse post is a different issue from what email notifications about that post contain and how they’re encoded. Understanding the specific issue you’re having / raising will help to get the right eyes and attention on it.

I might be misunderstanding what’s going on, but here’s what I think is happening:

  1. The WordPress post is published.
  2. The plugin responds to that and creates a Discourse post
  3. Discourse posts are all Markdown, but it happens that the post coming from WordPress via the plugin is a mess of HTML (which is perfectly valid in Markdown)
  4. Users subscribed to notifications via email get an email containing that text, which looks like a mess of HTML.

I realize that someone could manually make a Discourse post with a whole bunch of HTML in the same way, but in practice that’s not usually an issue (and if it were, it could mostly be resolved by a polite “hey, could you not?”).

I hope this makes sense.

An example: this post

Looks like this, both in Discourse if you go to edit the post and when sent out:

<small>Originally published at:			https://communityblog.fedoraproject.org/cpe-hiring-a-software-engineer/
		</small><br><p>The Community Platform Engineering group, or CPE for short, is the Red Hat team combining IT and release engineering for Fedora and CentOS. We currently have a position open for a <a href="https://global-redhat.icims.com/jobs/96157/software-engineer/job?mobile=true&amp;width=412&amp;height=732&amp;bga=true&amp;needsRedirect=false&amp;jan1offset=-480&amp;jun1offset=-420">software engineer in India</a>.</p>
<h2>About the role</h2>
<p>We are <a href="https://global-redhat.icims.com/jobs/96157/software-engineer/job?mobile=true&amp;width=412&amp;height=732&amp;bga=true&amp;needsRedirect=false&amp;jan1offset=-480&amp;jun1offset=-420">hiring new talent</a> to come work full time on Fedora, primarily as part of our Release Engineering group. You’ll get to work on the infrastructure that builds and ships the Fedora Linux release artifacts and updates. This role is perfect for anyone with experience or interest in Release Engineering.</p>
<h2>About CPE</h2>
<p>Our goal is to keep core servers and services running and maintained, build releases, and perform other strategic tasks that need more dedicated time than volunteers can give.</p>
<p>See <a href="https://docs.fedoraproject.org/en-US/cpe/">our docs</a> for more information. We are looking forward to meeting you and hopefully working with you soon!</p>

… which is not very useful.

Ok, I’m going to respond to the issues you’re raising here separately. I understand why you’re connecting them, but hopefully you’ll see why they’re separate issues.

HTML entities in plain text email notifications

the nicest thing would be for the email messages to be multi-part with a clean-text rendered markdown text/plain and a separate text/html

This is actually how Discourse email notifications currently work. If you look at the “original” of a Discourse email notification you’ll see there is a text version and an HTML version.
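For anyone unfamiliar with what “a text version and an HTML version” means at the message level, here is a minimal Python sketch of a multipart/alternative email and of listing the parts of a raw message. It is only an illustration using Python’s standard library, not Discourse’s own mail code; the addresses, subject and bodies are made up.

```python
from email import message_from_string, policy
from email.message import EmailMessage


def build_notification(plain_body: str, html_body: str) -> EmailMessage:
    """Build a message carrying both a text/plain and a text/html rendering."""
    msg = EmailMessage()
    msg["Subject"] = "[Announcements] New topic"  # made-up subject
    msg["From"] = "forum@example.com"             # made-up addresses
    msg["To"] = "reader@example.com"
    msg.set_content(plain_body)                     # becomes the text/plain part
    msg.add_alternative(html_body, subtype="html")  # turns msg into multipart/alternative
    return msg


def list_parts(raw_message: str) -> list:
    """Return the content type of every MIME part, container first."""
    parsed = message_from_string(raw_message, policy=policy.default)
    return [part.get_content_type() for part in parsed.walk()]


msg = build_notification("About the role\n\nWe are hiring.",
                         "<h2>About the role</h2><p>We are hiring.</p>")
print(list_parts(msg.as_string()))
# ['multipart/alternative', 'text/plain', 'text/html']
```

Running `list_parts` on the “original” of a real notification is a quick way to confirm which parts it contains and which one your client is showing you.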

What you seem to be saying, but I’m still not 100% clear on this, is that you’re getting HTML entities in the plain text version of Discourse email notifications. The upshot is that you see the actual HTML entities in the body of the email when looking at it in an email client that doesn’t support HTML. Is that what you’re saying? Could you share a screenshot of this from an email client (that doesn’t support HTML)?

If this is the case, this is an issue specific to Discourse email content generation and formatting, and it’d be best to split that off into a more targeted topic in #support or #bug.

HTML in Discourse posts

You’re raising a relevant issue here, but from a technical perspective the question lies with how Discourse approaches imported content more broadly. The current default for imported content is HTML, not markdown.

Another context in which you can see this is the RSS Polling plugin, which, like the WP Discourse plugin, imports HTML into the post content. Note also that the embed support markdown site setting is off by default, and note the other site settings that deal with embedded HTML in posts (e.g. allowed embed selectors).

I’m partly guessing here, but the most likely reason(s) this strategic decision was taken in the early days of Discourse handling imported content was a combination of simplicity and fidelity, i.e. conversions from HTML to markdown will be imperfect. There is one key exception to this which I’ll mention below.

The WP Discourse plugin could attempt to convert the HTML of Wordpress posts to markdown before sending them to Discourse. Yes, there are existing PHP libraries that convert HTML to markdown, but it’s never as simple as that when converting a markup language, particularly considering the different flavours of markdown.

Indeed the WP Discourse plugin attempting to handle the conversion would actually be misguided, considering there is already a custom HtmlToMarkdown converter in Discourse. Currently this converter handles the conversion of HTML to markdown in emails imported into Discourse. If the HTML of posts from Wordpress were to be converted to Discourse markdown it would need to be handled by that converter.
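To give a feel for what such a conversion involves, and why it is lossy, here is a toy Python sketch. It is not Discourse’s HtmlToMarkdown converter and does not try to mirror its behaviour; the class and function names are made up, and it handles only headings, paragraphs, line breaks and links, silently dropping every other tag while keeping its text.

```python
import re
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class TinyHtmlToMarkdown(HTMLParser):
    """Toy converter: supports <h1>-<h6>, <p>, <br> and <a href> only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &amp; and friends
        self.out = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag == "br":
            self.out.append("\n")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag == "a":
            self.out.append("](" + (self.href or "") + ")")
            self.href = None

    def handle_data(self, data):
        # Collapse whitespace runs, as a browser would when rendering.
        self.out.append(re.sub(r"\s+", " ", data))


def convert(html: str) -> str:
    parser = TinyHtmlToMarkdown()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.out).split("\n")]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


sample = ('<h2>About CPE</h2>\n<p>See <a href="https://docs.fedoraproject.org/en-US/cpe/">'
          'our docs</a> for more information.</p>')
print(convert(sample))
# ## About CPE
#
# See [our docs](https://docs.fedoraproject.org/en-US/cpe/) for more information.
```

Even this small case needs decisions about whitespace, entities and unsupported tags, which is the argument for keeping the conversion in one place rather than reimplementing it in each client.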

Currently the WP Discourse plugin uses the Discourse API to publish posts, i.e. the /posts endpoint. So essentially what you’re saying is that you want HtmlToMarkdown converter support to be added to the Discourse /posts endpoint (i.e. as an optional query param). You could advocate for this and if implemented the WP Discourse plugin would adopt it as an optional setting.
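As a rough sketch of what that request could look like from a client’s side, the Python below builds (but does not send) a create-post request. The `html_to_markdown` query parameter is purely hypothetical: it does not exist in Discourse today and is shown only to illustrate the proposal. The helper name, base URL, credentials and category id are also placeholders.

```python
import json
from urllib.request import Request


def build_create_post_request(base_url, api_key, api_username,
                              title, raw, category_id,
                              html_to_markdown=False):
    """Build a POST to the Discourse /posts endpoint without sending it."""
    url = base_url.rstrip("/") + "/posts.json"
    if html_to_markdown:
        # HYPOTHETICAL parameter: not part of the Discourse API today.
        url += "?html_to_markdown=true"
    payload = {"title": title, "raw": raw, "category": category_id}
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Api-Key": api_key,
            "Api-Username": api_username,
        },
        method="POST",
    )


req = build_create_post_request(
    "https://forum.example.com", "placeholder-key", "system",
    title="CPE hiring a software engineer",
    raw="<h2>About the role</h2><p>We are hiring.</p>",
    category_id=5,               # placeholder category id
    html_to_markdown=True,
)
print(req.get_method(), req.full_url)
# POST https://forum.example.com/posts.json?html_to_markdown=true
```

With the flag off, the request matches what happens now: the HTML is sent as the post’s raw content and stored as-is.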
