Block crawlers from embedded topics only?

Other than manually, is it possible to set up robots.txt to block all WP-Discourse-connected forum threads?
(so that the forum post isn’t indexed)

Is there a simple toggle that can block all connected forum threads from crawling? Or even a per-post toggle when publishing or editing a WP post that is linked to a WP Discourse forum thread?

Why do you want to do that? I’m fairly certain that the forum post includes a pointer to the canonical version on your WP site.


I have seen a drop in the ranking/reputation of the original WP articles since adding it. I was not looking to start a debate or complain, just asking how to achieve this. I’m sure there are many other reasons others may want this level of control over what’s indexed.


Just a suggestion: perhaps you should redefine your goals, because you are heading in the wrong direction.

But with a reverse proxy that is quite a trivial task.
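To make the reverse proxy idea a bit more concrete, here is a minimal Python sketch of the decision a proxy would have to make: add `X-Robots-Tag: noindex` only to responses for the embedded topics. It is not a proxy itself and not anything Discourse ships. The function name `with_noindex`, the set of topic ids, and the assumed `/t/<slug>/<topic_id>` URL shape are all illustrative assumptions.

```python
import re

# Assumption: topic pages look like /t/<slug>/<topic_id>[/<post_number>].
TOPIC_PATH = re.compile(r"^/t/[^/]+/(\d+)(?:/\d+)?/?$")


def with_noindex(path, headers, embedded_topic_ids):
    """Return a copy of headers, adding X-Robots-Tag for embedded topics.

    embedded_topic_ids is a hypothetical set of topic ids that you would
    have to maintain yourself (for example from the WP side).
    """
    result = dict(headers)
    match = TOPIC_PATH.match(path)
    if match and int(match.group(1)) in embedded_topic_ids:
        result["X-Robots-Tag"] = "noindex"
    return result
```

The hard part in practice is not the header but knowing which topic ids are the embedded ones, since nothing in the URL marks them.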

Not all web crawlers honor the robots.txt file.


Hey @haydenjames,

There’s nothing you can do in the WP Discourse plugin to add posts it creates in Discourse to a robots.txt file. This is actually just a pure Discourse question, namely “Can I automatically noindex embedded topics?” (or something along those lines). A topic embedded from WordPress is functionally the same as any other embedded topic. The avenue of investigation you want to pursue is there, for example the origin of the embed set canonical url site setting and related discussions.

I don’t think (but happy to be corrected) that what you want to do is a current Discourse feature. Discourse currently adds an X-Robots-Tag: noindex header to GET requests for hidden topics. You could do the same for embedded topics via a plugin.
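If you want to verify whether a given response carries that header, a small check like the following may help. It is only a sketch: `is_noindexed` is a made-up helper that takes response headers as a plain dict, and it deliberately ignores agent-specific forms such as `googlebot: noindex`.

```python
def is_noindexed(headers):
    """Return True if an X-Robots-Tag header contains a noindex directive.

    Header names are compared case-insensitively. Only plain,
    comma-separated directives are handled in this sketch.
    """
    for name, value in headers.items():
        if name.lower() != "x-robots-tag":
            continue
        directives = [part.strip().lower() for part in value.split(",")]
        if "noindex" in directives:
            return True
    return False
```

You could feed it the headers from a GET request to a hidden topic and to an embedded topic and compare the two results.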


Heading in the wrong direction to block indexing of a forum thread with the duplicate article that I prefer Google search users to find via the WP blog? I’m ok with that. The main benefit of WP Discourse for me has been allowing discussion of blog posts without having to use solutions like Disqus or the very limited default WP comments. I don’t need any SEO benefit from the forums, except for unique threads that are not connected to already existing content.

There aren’t any duplicates unless you have changed something.

Because of:

And:


Thanks @angus

To clarify: if I make the category that stores the WP Discourse connected posts hidden (is hidden different from private?), then it will hide the posts from the forums/public/crawlers, but the comments inserted at the end of each WordPress blog post will still be visible?

Sorry about the noob questions, I’m not experienced with Discourse and want to make sure I’m not misinterpreting your response.

…depends on your definition of duplicate. Canonical is in place, but since both the blog post and the forum thread contain the exact same text (a duplicate), I would personally like to just block those threads altogether. That’s just my preference. Maybe in the future, the reasoning behind this topic will make more sense. But for now, I am honestly not trying to provoke a debate or anything like that. I think that blocking is a more absolute solution for me.

It’s like going to your mechanic and asking him to “change your oil twice”. I understand the initial “why” by @angus - but in the end, it’s just about whether it can be done somehow or not.

Edit: Now thinking about it, I could then just add the blog post forum category to robots.txt, correct? Or will that be overwritten? (I will search the forums for how Discourse robots.txt works/can be edited.)

So something like:
forum.domain.com/c/blog-articles/xx/*

A “hidden” topic is one that isn’t listed on topic lists, i.e. it isn’t “discoverable” in the normal fashion. You can recognize a hidden topic by the symbol of an eye with a line through it.


Actually there is a way to automatically make posts from the WP Discourse plugin “hidden” :slight_smile: You can use the “Publish as Unlisted Topics” setting.

Keep in mind both what I said up top, and what it says next to that setting. This will mean that topics published from WordPress to Discourse do not appear on the topic lists of your forum. Comments will work in the normal fashion. If you have the sync comment data webhook enabled, the topic will no longer be hidden after the first comment. That feature wasn’t exactly designed for this purpose. See further

If you just want to add an X-Robots-Tag: noindex header to an embedded topic (without bothering about this hidden business), you’ll need to either request that as a new feature of Discourse itself or add it via a plugin.


This is cool. Thanks for clarifying and sharing the WP Discourse setting.

Question: if I manually edit my Discourse robots.txt file, will the change persist?

I’m still in the process of searching the forums for that answer. Will insert any links I find that answer that.

If you do it via /admin/customize/robots it will persist.

It gets stored in a hidden site setting called overridden_robots_txt. If that’s filled it will always be served as your robots.txt file.


@haydenjames The one final thing I’d note is that there seems to have been an issue with the canonical url of embedded topics recently. Something to keep in mind if you only just noticed this problem.
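One way to see what canonical URL a topic page actually declares is to pull the `rel="canonical"` link out of its HTML. The sketch below uses only Python’s standard `html.parser`; `find_canonical` is a hypothetical helper name, and you would pass it the HTML of one of your embedded topic pages fetched however you like.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of the first <link rel="canonical"> in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in html, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

If the result for an embedded topic is the forum URL rather than the WP article URL, that would point to the canonical issue rather than to a need for blocking.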


Thanks. Ahh, it’s not that simple, because each thread’s URL does not include the category. So I would have to add them manually, one by one.
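For anyone following along, this can be checked with Python’s standard `urllib.robotparser`. The sketch below is only an illustration: the category id `12`, the topic path `/t/my-post/345`, and the helper `is_allowed` are made up, and Python’s parser does plain prefix matching, so the `*` wildcard is left out.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, agent="*"):
    """Return True if the given robots.txt lines let agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


# Hypothetical URLs: the category id and topic path are placeholders.
category_url = "https://forum.domain.com/c/blog-articles/12"
topic_url = "https://forum.domain.com/t/my-post/345"

# A rule on the category path covers the category page only.
category_rule = ["User-agent: *", "Disallow: /c/blog-articles/"]
category_blocked = not is_allowed(category_rule, category_url)
topic_blocked_by_category = not is_allowed(category_rule, topic_url)

# Blocking the topic itself needs a rule on the topic path.
per_topic_rule = ["User-agent: *", "Disallow: /t/my-post/345"]
topic_blocked_per_topic = not is_allowed(per_topic_rule, topic_url)
```

Here `category_blocked` and `topic_blocked_per_topic` come out true while `topic_blocked_by_category` stays false, which is the one-by-one problem described above.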

Noted, thanks. That is partly why I would like the nuke approach of just blocking all WP Discourse embed posts via robots.txt. These things can happen. It’s understandable.

My definition, or yours, is meaningless. Only Google’s definition is important, and by that definition there are no duplicates.

There is also a chance that Google values your forum higher than WordPress. Then the solution is not to try to block indexing, but to fix the root cause.


Even though the rel=canonical tag can help you avoid a duplicate content penalty when you republish posts, you can still get penalized if you misuse the tag. I’ll find a solution. Will bump this thread at a later date.