Detect encoding of RSS and Atom feeds from the XML declaration

Hi,

When I use the RSS/ATOM import with the following RSS (Posts on NosChangements):

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0"
	xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>Posts on NosChangements</title>
		<link>https://www.noschangements.fr/posts/</link>
		<description>Recent content in Posts on NosChangements</description>
		<generator>Hugo -- gohugo.io</generator>
		<language>fr_FR</language>
		<copyright>Créé par [Camille Roux](https://www.camilleroux.com/) &lt;/br&gt; Sauf mention contraire, le contenu de ce site est sous licence [Creative Commons BY-NC-SA](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.fr)</copyright>
		<lastBuildDate>Sat, 19 Aug 2017 00:00:00 +0000</lastBuildDate>
		<atom:link href="https://www.noschangements.fr/posts/index.xml" rel="self" type="application/rss+xml"/>
		<item>
			<title>#1 Je demande à ce qu&#39;on ne prescrive plus d&#39;homéopathie</title>
			<link>https://www.noschangements.fr/posts/1-homeophatie/</link>
			<pubDate>Sat, 19 Aug 2017 00:00:00 +0000</pubDate>
			<guid>https://www.noschangements.fr/posts/1-homeophatie/</guid>
			<description>🔍 Contexte Après avoir lu plusieurs articles scientifiques dans le passé sur le sujet, je savais que l&amp;rsquo;homéopathie n&amp;rsquo;avait pas plus d&amp;rsquo;effet que l&amp;rsquo;effet placebo.
Un jour, je suis allé chez ma médecin. Etant stressé par ce que j&amp;rsquo;avais, elle m&amp;rsquo;a prescrit un médicament contre l&amp;rsquo;anxiété. Une fois à la pharmacie, par curiosité, j&amp;rsquo;ai demandé si c&amp;rsquo;était de l&amp;rsquo;homéopathie. Réponse : &amp;ldquo;oui, monsieur&amp;rdquo;
Ma docteure m&amp;rsquo;avait prescrit de l&amp;rsquo;homéopathie sans demander mon consentement ni me prévenir !</description>
		</item>
	</channel>
</rss>

I get the following import (the emoji and accented characters are badly encoded):


@gerhard can you have a quick look?

@CamilleRoux can you confirm you are on the latest version?


yep!


The server doesn’t return an encoding for the feed and unfortunately the charset detection chooses the wrong encoding.
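To illustrate the symptom, here is a minimal Python sketch of what happens when UTF-8 bytes are decoded with the wrong codec. The sample string is made up from words in the feed, and ISO-8859-1 is only an assumption about what the detector guessed; the actual wrong guess is not known from this thread.

```python
# Hypothetical illustration: UTF-8 bytes read back with the wrong codec.
original = "🔍 homéopathie"
raw = original.encode("utf-8")

# Assumption: the detector guessed a single-byte encoding such as ISO-8859-1.
garbled = raw.decode("iso-8859-1")

# Decoding with the encoding the feed actually uses round-trips cleanly.
correct = raw.decode("utf-8")

print(repr(garbled))  # every UTF-8 byte becomes its own character
print(correct)
```

Each multi-byte UTF-8 sequence turns into several unrelated characters, which matches the kind of breakage seen with emoji and accented letters.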

Could you configure the webserver which serves https://www.noschangements.fr/posts/index.xml to set a Content-Type of application/xml;charset=utf-8 for that file?
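For reference, a small Python sketch of how a consumer could read the charset from such a header using only the standard library. The function name `charset_from_content_type` is hypothetical and this is not how the importer is implemented; it only shows that the charset parameter is trivial to pick up once the server sends it.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


print(charset_from_content_type("application/xml;charset=utf-8"))
print(charset_from_content_type("application/rss+xml"))
```

When the header has no charset parameter the function returns `None`, which is the situation where the importer has to fall back to guessing.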

@sam I wonder if we should add a setting that lets users force an encoding for their feed. Detecting an encoding works only in some cases, and short XML documents produce a lot of false positives.

Obviously the best way to solve the problem is to always serve RSS feeds with a Content-Type HTTP header that contains the correct charset.


I think it is legit to require that whoever you are pulling from specifies the right charset. It is also possibly worth reporting this issue to the gem we are using.


The charset is already specified here:

By the way, I reported the idea of adding the header to the Hugo team.


Yeah, we currently aren’t relying on the XML encoding declaration. I guess the system could try to detect the correct encoding as described in "XML - Autodetection of Character Encodings" before it finally falls back to the rchardet gem to detect the most likely encoding.
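A rough sketch of that idea, written in Python rather than Ruby purely for illustration. It is a simplified take on the autodetection appendix (byte order mark first, then the `<?` byte pattern for UTF-16 without a BOM, then the `encoding` pseudo-attribute), not a complete implementation. The names `sniff_xml_encoding`, `decode_feed` and `guess_encoding` are hypothetical, and `guess_encoding` merely stands in for the statistical detection step.

```python
import codecs
import re
from typing import Callable, Optional

# Byte order marks; UTF-32 comes first because its LE mark starts with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches the encoding pseudo-attribute of an XML declaration at the start of the data.
_DECL_RE = re.compile(
    rb"""<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data: bytes) -> Optional[str]:
    """Return a Python codec name derived from the BOM or XML declaration, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # "<?" encoded as UTF-16 without a byte order mark.
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    match = _DECL_RE.match(data)
    if match is None:
        return None
    try:
        return codecs.lookup(match.group(1).decode("ascii")).name
    except LookupError:
        return None  # declared name is unknown to Python; let the caller guess


def decode_feed(
    data: bytes,
    guess_encoding: Callable[[bytes], str] = lambda _data: "utf-8",
) -> str:
    """Prefer what the document declares; guess only when it declares nothing."""
    encoding = sniff_xml_encoding(data) or guess_encoding(data)
    return data.decode(encoding, errors="replace")


feed = (
    '<?xml version="1.0" encoding="utf-8" standalone="yes"?>\n'
    "<rss><title>🔍 homéopathie</title></rss>"
).encode("utf-8")
print(sniff_xml_encoding(feed))
print(decode_feed(feed))
```

With the declaration from the feed above, the sniffing step would settle on UTF-8 before any statistical guess is needed.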

I’m adding the pr-welcome tag in case someone from the community wants to improve the behaviour.
