I apologize if there is an easy solution to this problem but I have searched this forum to no avail. I am trying to optimize our new Discourse forum for SEO, and it seems that multiple pages containing duplicate content can be accessed at different URLs, thus hurting search engine ranking by “splitting” the traffic for each duplicated page. So imagine we have some content at:
Is there any built-in support to do this in Discourse? This issue seems to be really hurting our search engine rankings, as just about every category page has duplicate URLs at something like */l/latest?category_id=1&page=1. I do not mind doing some minor tinkering in the Ruby on Rails backend to get this done, but we would prefer not to dive into any complex hacks.
Google is usually smart enough to understand that query strings can produce identical content with a near infinite number of variables used in said query strings.
Where is your proof that this is “hurting rankings”? Do you have data? There is a lot of snake oil in the SEO “industry”.
As Jeff says though, for most sites I leave that setting at “Let Google decide how to interpret the url parameters” and that seems to work just fine.
However, this doesn’t apply when two pages have similar content but different URLs (not counting the URL parameters). For example, foo.com/category1 and foo.com/category1/latest will be seen as different pages, regardless of how you tweak the URL parameter settings in WMT. The OP is correct that it’d be best to specify a canonical URL for any two pages that have distinct URLs and identical or nearly identical content.
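For illustration, here is a minimal Python sketch of mapping duplicate listing URLs onto one canonical URL. Discourse itself is a Rails application, so this is not its implementation. The suffix list and the function names `canonical_url` and `canonical_link_tag` are assumptions made up for the example, and it assumes the query string does not select a different page of results.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit

# Hypothetical suffixes that render the same listing as the bare category URL.
# Longer suffixes come first so "/l/latest" is not cut down to "/l".
DUPLICATE_SUFFIXES = ("/l/latest", "/latest")


def canonical_url(url: str) -> str:
    """Return one preferred URL for a category listing (illustrative only)."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    for suffix in DUPLICATE_SUFFIXES:
        if path.endswith(suffix):
            path = path[: -len(suffix)]
            break
    # Assumption: the query string and fragment do not change the content,
    # so they are dropped. Paginated URLs need separate handling.
    return urlunsplit((parts.scheme, parts.netloc, path or "/", "", ""))


def canonical_link_tag(url: str) -> str:
    """Render the <link> element that would go in the page <head>."""
    return f'<link rel="canonical" href="{escape(canonical_url(url))}" />'


if __name__ == "__main__":
    for candidate in (
        "https://foo.com/category1",
        "https://foo.com/category1/latest",
        "https://foo.com/category1/l/latest?category_id=1&page=1",
    ):
        print(candidate, "->", canonical_link_tag(candidate))
```

All three example URLs end up declaring the same canonical, which is the signal that tells a crawler to treat them as one page.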
I have seen that the platform has a few SEO navigation and crawl issues. Here are the problems:
Point 1. You’ve implemented rel=‘next’/‘prev’, not as a meta tag but as an itemprop. You’ve put in a canonical only on Topics, but not properly on Categories. feature - Discourse Meta & feature - Discourse Meta both have the same canonical <link rel="canonical" href="https://meta.discourse.org/c/feature" />, whereas it should be different for each of the two pages.
Please cite specific words and sentences that are not being correctly met there, on that page.
Searchers commonly prefer to view a whole article or category on a single page. Therefore, if we think this is what the searcher is looking for, we try to show the View All page in search results. You can also add a rel=“canonical” link to the component pages to tell Google that the View All version is the version you want to appear in search results.
See above. “VIEW ALL”. That would be the category root page…
What you say is right, but the problem is that there is no “VIEW ALL” page in Discourse forums. I am citing examples from this very forum.
Let us take the Feature category. The URI is feature - Discourse Meta, and if you look at it with the user agent set to Googlebot/Bingbot/Slurp! you can see the page is broken into pagination: feature - Discourse Meta, then page=2, and so on up to page=72, containing 30 links each. But all of these pages have the same canonical, <link rel="canonical" href="https://meta.discourse.org/c/feature" />, so you are leaving out all the other 71 pages from being indexed.
You can even check out the Google Cache if you don’t believe me at feature - Discourse Meta
or site: command site:https://meta.discourse.org/c/feature - Google Search
But the same does not hold true for Topics pages where the canonical changes for each page. And this is the correct way to do this.
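The check described above (several distinct paginated pages declaring the same canonical) can be sketched as a small audit script. This is only an illustration using the Python standard library: the names `find_canonical` and `shared_canonicals` are invented here, the `?page=2` URL form is an assumption, and the HTML is passed in as strings rather than fetched, so no real crawl is implied.

```python
from collections import defaultdict
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        is_canonical = (attr.get("rel") or "").lower() == "canonical"
        if tag == "link" and is_canonical and self.canonical is None:
            self.canonical = attr.get("href")


def find_canonical(html: str):
    """Return the canonical URL declared in the HTML, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


def shared_canonicals(pages: dict) -> dict:
    """Group crawled URLs by declared canonical and keep only the
    canonicals that more than one URL points at."""
    groups = defaultdict(list)
    for url, html in pages.items():
        canonical = find_canonical(html)
        if canonical is not None:
            groups[canonical].append(url)
    return {c: urls for c, urls in groups.items() if len(urls) > 1}


if __name__ == "__main__":
    head = '<head><link rel="canonical" href="https://meta.discourse.org/c/feature" /></head>'
    pages = {
        "https://meta.discourse.org/c/feature": head,
        "https://meta.discourse.org/c/feature?page=2": head,  # assumed URL form
    }
    print(shared_canonicals(pages))
```

Any non-empty result lists pages that a crawler may fold into a single entry.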
And while you are at it, can we keep the same pagination style in both NOSCRIPT and PushState()?
Sorry to keep bothering you if you find this annoying.
Specifying a rel=canonical from page 2 (or any later page) to page 1 is not correct use of rel=canonical, as these are not duplicate pages. Using rel=canonical in this instance would result in the content on pages 2 and beyond not being indexed at all.
So your advice was incorrect. The correct thing to do, per Google webmaster guidelines, is not to render canonical at all on paginated content.
@techapj can you make sure that’s the case in every common scenario? It is definitely the case on /latest (homepage).
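To make the rule above concrete, here is a rough Python sketch of which `<head>` tags a paginated listing would emit if it follows that advice: no canonical at all, only rel="prev" and rel="next" as `<link>` elements. The function name `pagination_head_tags` and the `?page=N` URL scheme are assumptions for the example, not Discourse's actual code.

```python
from html import escape


def pagination_head_tags(base_url: str, page: int, last_page: int) -> list:
    """Return the <link> tags for one page of a paginated listing.

    No rel="canonical" is produced, so later pages are never declared
    duplicates of page 1.
    """
    if not 1 <= page <= last_page:
        raise ValueError("page must be between 1 and last_page")

    def page_url(n: int) -> str:
        # Assumption: page 1 lives at the bare URL, later pages use ?page=N.
        return base_url if n == 1 else f"{base_url}?page={n}"

    tags = []
    if page > 1:
        tags.append(f'<link rel="prev" href="{escape(page_url(page - 1))}" />')
    if page < last_page:
        tags.append(f'<link rel="next" href="{escape(page_url(page + 1))}" />')
    return tags


if __name__ == "__main__":
    for tag in pagination_head_tags("https://meta.discourse.org/c/feature", 2, 72):
        print(tag)
```

The first page gets only a "next" link and the last page only a "prev" link, so a crawler can walk the whole series in either direction.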
Do we want users entering Discourse sites on a category filter page?
Do we want users entering Discourse sites on a category filter page, on page 100?
Do we want users to get a hit on a “list” style page at the expense of hitting the right topic?
Do we want users entering on the top page? (Probably yes.)
Is a site map desirable to increase crawling efficiency?
Having the same canonical for all the pages of the non-topic lists heavily de-emphasises them as search results, which is desirable.
I wonder if we should even allow robots to index any of the filters except for latest.
You can get to every topic on the site through latest. The fact that we allow all this sliced and diced crawling makes crawling activity much less effective, as Google keeps on rediscovering the same content over and over.
We simply need to analyze our logs first and see how big the problem is. There is huge appeal in decreasing crawling load and increasing crawling efficiency: it makes all sites faster and better.
But we need to be ultra careful here not to cause any unwanted side effects that take months to rectify.
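One low-risk way to try out the idea of limiting robots to latest is to test a draft robots.txt offline before deploying anything. The sketch below uses Python's standard `urllib.robotparser`. The rules and the filter paths (`/top`, `/new`, `/unread`) are hypothetical examples, not Discourse's shipped robots.txt. Note also that robots.txt restricts crawling, which is not quite the same thing as controlling what gets indexed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft: crawlers may walk /latest, but not the other filters.
# Rules are matched by path prefix, so "/new" would also cover "/newest".
ROBOTS_TXT = """\
User-agent: *
Allow: /latest
Disallow: /top
Disallow: /new
Disallow: /unread
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawlable(parser: RobotFileParser, url: str, agent: str = "Googlebot") -> bool:
    """Report whether the given user agent may fetch the URL."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in (
        "https://forum.example.com/latest",
        "https://forum.example.com/top",
        "https://forum.example.com/t/some-topic/123",
    ):
        print(url, crawlable(rules, url))
```

Running a list of real URLs from the server logs through a check like this would show what a rule change blocks before it goes live.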