Detect encoding of RSS and Atom feeds from the XML declaration


(Camille Roux) #1


When I use the RSS/ATOM import with the following RSS (Posts on NosChangements):

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0"
		<title>Posts on NosChangements</title>
		<description>Recent content in Posts on NosChangements</description>
		<generator>Hugo --</generator>
		<copyright>Créé par [Camille Roux]( &lt;/br&gt; Sauf mention contraire, le contenu de ce site est sous licence [Creative Commons BY-NC-SA](</copyright>
		<lastBuildDate>Sat, 19 Aug 2017 00:00:00 +0000</lastBuildDate>
		<atom:link href="" rel="self" type="application/rss+xml"/>
			<title>#1 Je demande à ce qu&#39;on ne prescrive plus d&#39;homéopathie</title>
			<pubDate>Sat, 19 Aug 2017 00:00:00 +0000</pubDate>
			<description>🔍 Contexte Après avoir lu plusieurs articles scientifiques dans le passé sur le sujet, je savais que l&amp;rsquo;homéopathie n&amp;rsquo;avait pas plus d&amp;rsquo;effet que l&amp;rsquo;effet placebo.
Un jour, je suis allé chez ma médecin. Etant stressé par ce que j&amp;rsquo;avais, elle m&amp;rsquo;a prescrit un médicament contre l&amp;rsquo;anxiété. Une fois à la pharmacie, par curiosité, j&amp;rsquo;ai demandé si c&amp;rsquo;était de l&amp;rsquo;homéopathie. Réponse : &amp;ldquo;oui, monsieur&amp;rdquo;
Ma docteure m&amp;rsquo;avait prescrit de l&amp;rsquo;homéopathie sans demander mon consentement ni me prévenir !</description>

I get the following import (the emoji and accented characters are badly encoded):

(Sam Saffron) #2

@gerhard can you have a quick look?

@CamilleRoux can you confirm you are on the latest version?

(Camille Roux) #3


(Gerhard Schlager) #4

The server doesn’t return an encoding for the feed and unfortunately the charset detection chooses the wrong encoding.

Could you configure the web server that serves the feed to send a Content-Type of application/xml;charset=utf-8 for that file?

@sam I wonder if we should add a setting that lets users force an encoding for their feed. Detecting an encoding works only in some cases, and short XML documents produce a lot of false positives.

Obviously the best way to solve the problem is to always serve RSS feeds with a Content-Type HTTP header that contains the correct charset.
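
To illustrate what the importer can rely on when the header is present, here is a minimal Python sketch (not Discourse code, which is Ruby) that reads the charset parameter from a Content-Type value. The function names and the `forced_encoding` parameter are hypothetical; the latter only stands in for the "force an encoding" setting suggested above.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type header, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()


def choose_encoding(content_type, forced_encoding=None):
    """Prefer a user-forced encoding (hypothetical setting), then the HTTP header."""
    if forced_encoding:
        return forced_encoding.lower()
    return charset_from_content_type(content_type)


print(choose_encoding("application/xml;charset=utf-8"))
print(choose_encoding("application/rss+xml"))
```

When the second call returns `None`, as it would for the feed in this topic, the importer has nothing to go on except the document itself.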

(Sam Saffron) #5

I think it is legit to require that whoever you are pulling from specifies the right charset. It is also possibly worth reporting this issue to the gem we are using.

(Camille Roux) #6

The charset is already specified here:

By the way, I reported the idea of adding the header to the Hugo team.

(Gerhard Schlager) #7

Yeah, we currently aren’t relying on the XML encoding declaration. I guess the system could try to detect the correct encoding as described in XML - Autodetection of Character Encodings before it finally uses the rchardet gem to detect the most likely encoding.
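
For anyone picking this up, the idea could look roughly like the following Python sketch. It is only an illustration of the approach, not the Discourse implementation (which would be Ruby), and it covers only the BOM and the common `<?xml` byte patterns, not EBCDIC. All names are hypothetical, and `guess_encoding` is a stand-in for a heuristic detector such as rchardet.

```python
import codecs
import re

# Byte order marks, UTF-32 first so its BOM is not mistaken for UTF-16.
# The Python codecs named here consume the BOM while decoding.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
]

# How "<?xml" starts in each encoding family when there is no BOM.
_PREFIXES = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    (b"<?xm", "ascii"),
]

_DECL_RE = re.compile(
    r"""^<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def encoding_from_xml_declaration(data):
    """Return an encoding name derived from the BOM or XML declaration, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    for prefix, family in _PREFIXES:
        if data.startswith(prefix):
            break
    else:
        return None  # no XML declaration at the start of the document
    if family != "ascii":
        return family  # the byte pattern already fixes width and endianness
    # The declaration itself is plain ASCII in every ASCII-compatible encoding.
    match = _DECL_RE.match(data[:1024].decode("ascii", errors="replace"))
    return match.group(1).lower() if match else None


def decode_feed(data, guess_encoding=lambda raw: "utf-8"):
    """Decode feed bytes, trusting the declaration before the heuristic guess."""
    encoding = encoding_from_xml_declaration(data)
    try:
        codecs.lookup(encoding or "")
    except LookupError:
        encoding = guess_encoding(data)  # declared name missing or unknown
    return data.decode(encoding, errors="replace")


sample = '<?xml version="1.0" encoding="utf-8" standalone="yes"?><title>Créé 🔍</title>'
print(decode_feed(sample.encode("utf-8")))
```

With a check like this, the feed from the first post would be decoded as UTF-8 without ever reaching the heuristic detection.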

I’m adding the #pr-welcome tag in case someone from the community wants to improve the behaviour.