SEO Problems with RSS duplicate content

We use the RSS plugin (or manual copies) to create a topic for each of our blog posts in a category, as an anchor for discussion. Google doesn’t like such “copied” content, and it threatens to damage the blog’s SEO reputation.

Of course we can stop Google from indexing the Discourse site (or the category), but does anybody have experience with declaring that the duplication is intended? Something like “this subdomain belongs to the blog, we’re not trying to build link farms”? If so, how is that implemented in the Discourse settings?

If I remember correctly, rel=nofollow or similar should at least address the backlink-farming aspect (I’m not sure about the duplicate-content aspect). Is there maybe a “this is a copy of” header that appeases Google?

The embed set canonical url site setting might help with the problem:

It’s worth looking at Google’s documentation though:

I linked to the docs because I’m not certain what happens when the embed set canonical url setting is enabled together with the embed truncate setting. When embed truncate is enabled, only a snippet of the original article is actually available to be crawled by Google on Discourse. The full article is displayed in an iframe if users click the “Show Full Post” button. I’m fairly sure the iframe content is not crawled by Google. The first point in the “5 common mistakes” article partly addresses that issue.

Thanks for the hint, Simon! Indeed, canonical seems to be what I was thinking about. I tried it, but it does not completely work: it embeds a canonical URL that points to the topic itself, not to the RSS source:

<link rel="canonical" href="https://community.domain.com/t/invoicing-mandate/537?page=0" />

instead of

<link rel="canonical" href="https://blog.domain.com/t/invoicing-mandate/537" />

(This is stable/3.2.0)

Also, can/should this be set as the featured link as well?

In our case we don’t use truncate (however, the RSS feed itself is already truncated). I hope Google will accept it anyway.

Try viewing the page source, instead of viewing it with your browser’s web inspector. I think you’ll see that the canonical URL is set to the RSS post’s URL when you view the page’s source, and set to the Discourse topic’s URL when you view the HTML with the web inspector. If that’s correct, you shouldn’t be getting duplicate content warnings for the RSS topics.

Here’s what I’m seeing (with embed set canonical url enabled) when I view a topic pulled in from Discourse’s RSS feed in my web inspector:

And here’s the canonical URL when I view the page’s source (by right-clicking on the page and selecting “View page source” from the menu):

<link rel="canonical" href="https://blog.discourse.org/2023/03/how-discourse-scaled-to-10m-arr-with-only-1-salesperson" />

The view page source version with the canonical URL correctly set is what a crawler will see.

The other way to see the difference is to use the web inspector, but select Googlebot as the user agent:

I think the setting is working as expected in terms of what crawlers see, but it was confusing me. The issue seems to be that Discourse overwrites the canonical URL attribute with JavaScript when the page is viewed with JavaScript enabled. For reference, that happens here:

I don’t think it’s (currently) possible to have the featured link set for topics created from RSS feeds. I’m not an SEO expert, but I don’t think that setting it or not setting it would have any effect on SEO.

I used curl -i | grep canon and saw the wrong URL in the tag (and no header), but I can try again with a different UA (that is a bit strange, though). I had to re-create the posts a few times, so maybe I got confused. I will update here.

True, the featured link is not for SEO, but internally we wanted to make the blog link more visible. And since it’s the same URL…

But it looks like I am collecting a longer list of requirements, so I might need to fork rss-poll (unfortunately, it looks like most of the work is not done in the plugin, though). Is the embed code also extensible?

For a topic created from an RSS feed, with embed set canonical url enabled, I’d expect curl -i to return the RSS item’s URL as the canonical URL. That works when I test it on my local site.
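If you would rather script that check than eyeball curl output, here is a minimal Ruby sketch. It fetches the raw HTML without running any JavaScript and pulls out the canonical link. The helper names canonical_href and fetch_raw_html are hypothetical (they are not part of Discourse), the Googlebot user agent string is only an assumed default, and the regex matching is a rough approximation rather than a real HTML parser.

```ruby
require "net/http"
require "uri"

# Return the href of the first <link rel="canonical"> tag in raw HTML,
# or nil if there is none. Regex-based on purpose: we want the markup
# exactly as served, before any JavaScript rewrites it.
def canonical_href(html)
  html.scan(/<link\b[^>]*>/i).each do |tag|
    next unless tag =~ /\brel\s*=\s*["']canonical["']/i
    match = tag.match(/\bhref\s*=\s*["']([^"']*)["']/i)
    return match[1] if match
  end
  nil
end

# Fetch a page body with a crawler-style user agent (assumed value).
# Redirects are not followed in this sketch.
def fetch_raw_html(url, user_agent: "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
  uri = URI(url)
  request = Net::HTTP::Get.new(uri)
  request["User-Agent"] = user_agent
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  response.body
end

# Usage (replace the URL with one of your RSS-created topics):
#   puts canonical_href(fetch_raw_html("https://community.domain.com/t/invoicing-mandate/537"))
```

If the setting is doing its job, the printed URL should be the blog post’s URL rather than the Discourse topic’s URL.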

Assuming you have access to the Discourse Rails console, you can confirm what’s going on by finding the topic, then checking its topic_embed property. For example:

t = Topic.find 495
t.topic_embed

or just:

TopicEmbed.find_by(topic_id: 495)

A TopicEmbed should be returned. Its embed_url is what’s expected to be used by Discourse to set the topic’s canonical URL.

I’ve wondered about that myself. It would be more difficult than making changes to the RSS plugin, because embedding is part of the core Discourse code.