Detect encoding of RSS and Atom feeds from the XML declaration

Hi,

When I use the RSS/ATOM import with the following RSS (Posts on NosChangements):

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0"
	xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>Posts on NosChangements</title>
		<link>https://www.noschangements.fr/posts/</link>
		<description>Recent content in Posts on NosChangements</description>
		<generator>Hugo -- gohugo.io</generator>
		<language>fr_FR</language>
		<copyright>Créé par [Camille Roux](https://www.camilleroux.com/) &lt;/br&gt; Sauf mention contraire, le contenu de ce site est sous licence [Creative Commons BY-NC-SA](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.fr)</copyright>
		<lastBuildDate>Sat, 19 Aug 2017 00:00:00 +0000</lastBuildDate>
		<atom:link href="https://www.noschangements.fr/posts/index.xml" rel="self" type="application/rss+xml"/>
		<item>
			<title>#1 Je demande à ce qu&#39;on ne prescrive plus d&#39;homéopathie</title>
			<link>https://www.noschangements.fr/posts/1-homeophatie/</link>
			<pubDate>Sat, 19 Aug 2017 00:00:00 +0000</pubDate>
			<guid>https://www.noschangements.fr/posts/1-homeophatie/</guid>
			<description>🔍 Contexte Après avoir lu plusieurs articles scientifiques dans le passé sur le sujet, je savais que l&amp;rsquo;homéopathie n&amp;rsquo;avait pas plus d&amp;rsquo;effet que l&amp;rsquo;effet placebo.
Un jour, je suis allé chez ma médecin. Etant stressé par ce que j&amp;rsquo;avais, elle m&amp;rsquo;a prescrit un médicament contre l&amp;rsquo;anxiété. Une fois à la pharmacie, par curiosité, j&amp;rsquo;ai demandé si c&amp;rsquo;était de l&amp;rsquo;homéopathie. Réponse : &amp;ldquo;oui, monsieur&amp;rdquo;
Ma docteure m&amp;rsquo;avait prescrit de l&amp;rsquo;homéopathie sans demander mon consentement ni me prévenir !</description>
		</item>
	</channel>
</rss>

I get an import in which the emoji and accented characters are badly encoded.

@gerhard can you have a quick look

@CamilleRoux can you confirm you are on latest

yep!

The server doesn’t return an encoding for the feed and unfortunately the charset detection chooses the wrong encoding.
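
To illustrate what a wrong guess does to the text, here is a minimal Python sketch. It assumes the detector picked ISO-8859-1; the thread does not say which encoding was actually chosen, so treat that as a stand-in. The helper name decode_feed is hypothetical.

```python
def decode_feed(raw: bytes, encoding: str) -> str:
    """Decode raw feed bytes using the given encoding name."""
    return raw.decode(encoding)


# The feed is served as UTF-8 bytes.
raw = "homéopathie".encode("utf-8")

# Decoding with the right encoding round-trips the text.
good = decode_feed(raw, "utf-8")

# Decoding with a wrongly guessed single-byte encoding (assumed here to be
# ISO-8859-1) turns each multi-byte UTF-8 sequence into several characters.
bad = decode_feed(raw, "iso-8859-1")

print(good)
print(bad)
```

The second line printed shows the typical mojibake pattern: the two UTF-8 bytes of "é" are displayed as two separate Latin-1 characters.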

Could you configure the webserver which serves https://www.noschangements.fr/posts/index.xml to set a Content-Type of application/xml;charset=utf-8 for that file?

@sam I wonder if we should add a setting that lets users force an encoding for their feed. Detecting an encoding works only in some cases, and short XML documents produce a lot of false positives.

Obviously the best way to solve the problem is to always serve RSS feeds with a Content-Type HTTP header that contains the correct charset.
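
For reference, this is roughly how a client could read the charset from that header. It is only a sketch using Python's standard library, not the code the importer uses, and the function name charset_from_content_type is made up for the example.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: Optional[str]) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type header,
    or None when the header is missing or has no charset parameter."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


# Header as suggested above: the charset is explicit.
print(charset_from_content_type("application/xml;charset=utf-8"))

# Header without a charset: the client has to fall back to guessing.
print(charset_from_content_type("application/rss+xml"))
```

When the second case happens, the client has nothing to go on except the document itself.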

I think it is legit to require that whoever you are pulling from specifies the right charset. It is also possibly worth reporting this issue to the gem we are using.

The charset is already specified in the feed itself, in the XML declaration: <?xml version="1.0" encoding="utf-8" standalone="yes"?>

By the way, I suggested the idea of adding the header to the Hugo team.

Yeah, we currently aren’t relying on the XML encoding declaration. I guess the system could try to detect the correct encoding as described in "Autodetection of Character Encodings" (Appendix F of the XML specification) before it finally falls back to the rchardet gem to detect the most likely encoding.
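
A rough sketch of that idea in Python, for anyone who wants to pick this up. It is a simplification of Appendix F, not a full implementation: it only checks for a byte order mark and then for an ASCII-compatible XML declaration, so UTF-16 or UTF-32 feeds without a BOM are not handled. The names sniff_xml_encoding and choose_encoding, and the fallback parameter standing in for the charset-detection step, are hypothetical.

```python
import codecs
import re
from typing import Optional

# Byte order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches encoding="..." inside an XML declaration at the very start of the data.
_DECL = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def sniff_xml_encoding(raw: bytes) -> Optional[str]:
    """Return the encoding implied by a BOM or the XML declaration, else None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    match = _DECL.match(raw[:200])
    if match:
        return match.group(1).decode("ascii").lower()
    return None


def choose_encoding(raw: bytes, header_charset: Optional[str] = None,
                    fallback: str = "utf-8") -> str:
    """Prefer the HTTP charset, then the document itself, then a fallback.

    `fallback` stands in for whatever a statistical detector would return.
    """
    return header_charset or sniff_xml_encoding(raw) or fallback


feed = b'<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0"/>'
print(choose_encoding(feed))
```

With the feed from this topic, the declaration alone is enough to pick UTF-8, so the statistical guess would never be consulted.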

I’m adding the #pr-welcome tag in case someone from the community wants to improve the behaviour.
