How to Noindex thin & duplicate pages in Discourse?


(Ultra Noob) #1

Just some feedback: it would be great if we could avoid thin content by using the noindex tag.

Indexing of the Safe mode page can be disallowed using the noindex tag.

Search results pages probably don’t need to show up in the SERPs either; maybe we can noindex them too.

Login page (screenshot)

Subpages (screenshot)

Result (screenshot)


How can we make this better?

I understand this will take time and needs to be done carefully.

Step 1. We need to identify which pages are likely to create thin or duplicate content:

1. /search?*
2. /c/
3. /u
4. /top*
5. /latest?*
6. ?preview_theme_id=*
7. *?*
8. /tags*

Step 2. We should apply noindex wherever we see thin pages.

Step 3. We should use the noindex tag instead of a robots.txt disallow rule (see the sketch after these steps).

Step 4. We should avoid creating duplicate content pages as much as possible, such as / = /latest (homepage & latest) or /guidelines = /faq.

Step 5. We also need to avoid indexing thin pages such as safe mode.
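For illustration, here is a minimal sketch of what Steps 2 and 3 could look like on one of the thin pages identified in Step 1, e.g. a search results URL. The exact markup is an assumption on my part, not what Discourse currently outputs:

<!DOCTYPE html>
<html><head>
<!-- hypothetical: emitted only on thin pages such as /search?q=term or /safe-mode -->
<meta name="robots" content="noindex, follow" />
(…)
</head>
<body>(…)</body>
</html>

The noindex, follow combination keeps such a page out of the index while still letting crawlers follow the links on it.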

Thanks & Regards,
Gulshan


#2

I am facing this issue as well, though there is no solution yet.


(Stephen) #3

There are already several topics on this matter that are easily identified by a simple search:

Maybe familiarise yourself with those and help continue the discussion there? You’ve already participated in this topic in one other thread as it is; do we need the duplicates?


(Ultra Noob) #4

I have gone through those topics over the past month. This time the issue is shown more concretely, with practical references, so I have started again.


(Stephen) #5

So you’ve decided to tackle the topic of duplicate search entries by creating a duplicate thread? Interesting.


#6

The replies below are even more interesting; there has been no solution until now.

Feb 22, 2016

Jan 30, 2017


(Evgeny) #7

I think you can use this to create a plugin, where you can add everything you want.

Or completely redefine the file, which is a bad idea.


(Ultra Noob) #8

The purpose is to noindex thin pages. I doubt robots.txt will help. Ideally, using a noindex meta robots tag would be the fix. For now, I have implemented a workaround and will update here if it works for me.
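For reference, a minimal sketch of one possible form such a workaround could take (this is only an illustration, e.g. via a theme’s </head> customization, not necessarily what I implemented). Note that a tag added client-side is only seen by crawlers that render JavaScript, so a server-side tag or X-Robots-Tag header is more reliable:

<script>
// Hypothetical sketch: add a noindex meta tag on search result URLs only.
if (window.location.pathname.indexOf("/search") === 0) {
  var meta = document.createElement("meta");
  meta.name = "robots";
  meta.content = "noindex";
  document.head.appendChild(meta);
}
</script>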


(Robert McIntosh) #10

Despite earlier suggestions, I do see that there is a conversation to be had here about possible SEO improvements beyond tags. However, these do not look like critical issues at this stage, so it is something we should discuss and clarify so we can give it a priority and be clear about what we are trying to achieve.

@Gulshan_Kumar we appreciate the research, but if you would like to clarify with some constructive suggestions and expand on the descriptions, maybe others can comment and add their thoughts.


(Ultra Noob) #12

This research is an effort to tackle web spam. I believe we should focus on indexing only quality pages in search engines and avoid accidentally indexing thin pages that add little to no value, such as “Search results” pages.

Google doesn’t recommend thin content; they want to show quality information in the search results. If we keep indexing low-quality pages, it may invite a “manual action” type of penalty.

Ref: Google

  1. Official Google Webmaster Central Blog: How we fought webspam - Webspam Report 2017

  2. Manual Actions Report - Search Console Help

Thanks for taking the time to read my response.


(Mittineague) #13

That page is already in the robots.txt
https://meta.discourse.org/robots.txt

I have not seen any posts here from members who have gotten a manual action. Have you?


(Ultra Noob) #14

Nope!

The rule below

Disallow: /search
Disallow: /search/

doesn’t cover the actual search results URL, which carries query parameters:
https://meta.discourse.org/search?context=topic&context_id=97878&skip_context=true

It should be something like this:

Disallow: /search?*

BTW, robots.txt is not an ideal solution for noindexing.

To properly noindex any webpage:

  1. Do not block it with the robots.txt file. Here’s why
  2. Use the noindex directive. Period.
<!DOCTYPE html>
<html><head>
<meta name="robots" content="noindex" />
(…)
</head>
<body>(…)</body>
</html>

Or, the X-Robots-Tag can be used as an element of the HTTP header response for a given URL.
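A minimal illustration of what such a response could look like (other headers omitted):

HTTP/1.1 200 OK
(…)
X-Robots-Tag: noindex
(…)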

There is no other way than using the noindex directive.

Not yet. But I feel it’s safer to index only quality pages rather than thin pages.

That will also look better in the SERPs. Nobody likes web spam.


(Mittineague) #15

I don’t know about every search engine, but for Google the two are equivalent:

https://developers.google.com/search/reference/robots_txt

/fish*
Equivalent to /fish. The trailing wildcard is ignored.

Matches:

  • /fish
  • /fish.html
  • /fish/salmon.html
  • /fishheads
  • /fishheads/yummy.html
  • /fish.php?id=anything

i.e. it’s a “begins with”, not an “is exactly”.
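Applying the same prefix matching to the search example from earlier in this topic (my own illustration, following the matching rules in the linked page), the existing rule already covers the parameterised URL:

Disallow: /search

Matches:

  • /search
  • /search/
  • /search?context=topic&context_id=97878&skip_context=true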


(Jeff Atwood) #16

@david what are your thoughts? It might be good to add the login and safe mode URL paths to the robots.txt exclusions, at least.


(Ultra Noob) #17

Have a look at this screenshot:

[Screenshot]

Info source: John, Google

Therefore, I don’t completely agree with the information in the linked page.

At least in my case, robots.txt is not working for the search results pages.

Why is it indexed with a full snippet if it is blocked by the robots.txt file?


(Felix Freiberger) #19

The proper reaction is a noindex tag. As I’ve mentioned elsewhere, robots.txt prevents crawling, not indexing – Google will still index the page if it sees links to this page, even though it cannot access it itself.


(Jeff Atwood) #20

And why should I or anyone else care about that distinction?


(Felix Freiberger) #21

Why are we listing files in robots.txt anyway? I thought we did that because we want Google and others to only list real content pages, not pages that we consider too risky (user profiles) or unimportant to show in search results.


(Jeremy M) #22

This is the same reason I created this thread about the /latest pages, which are thin content and shouldn’t be indexed:


(Mittineague) #23

There seem to be two perspectives here.

  1. include paths in robots.txt so search engines won’t crawl or index them while at the site.
  2. include paths in robots.txt so search engines won’t display them in SERPs.

As mentioned, the robots.txt file is more a suggestion than a rule. A path that can be found outside of the forum can be followed and indexed.

What would constitute a rule for something to not be displayed in SERPs would be either an HTTP header or a tag in the <head>.

Using robots.txt to limit what is displayed in SERPs works in general but is not a guarantee.

My personal feeling is that setting various Discourse content so that the everyone group does not have read permission works well in ensuring that content does not get displayed in SERPs.