Google Search Indexing and Discourse

Hi everyone!

I’ve read the various posts about the Google crawler having no difficulty indexing a Discourse forum. My question is a bit different. Is each topic considered an indexable “page” by Google? The reason I ask is that a large share of the topics in our forum aren’t in Google’s database. This is confirmed by the Google Search Console data:

Only around 17k entries exist, while the forums have several hundred thousand topics (maybe millions?). The robots.txt errors are for pages that legitimately shouldn’t be indexed. It looks like the crawler isn’t visiting all of the older topics as thoroughly as it should.

Is there a setting that I need to toggle to ensure more of the older topics get indexed in a timely manner? For items above the fold, the indexing and results from Google are quite good. This is only impacting topics that happen to go below the fold.

Thanks,
Kirupa

2 Likes

As an experiment, I loaded Meta in the crawler view using the Googlebot user-agent string. Then I went all the way to page 666 of our latest list, which holds topics last bumped in mid-2017, almost 3 years ago.
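For anyone who wants to repeat this experiment on their own forum, here is a minimal sketch of requesting a page with a Googlebot user-agent string. The user-agent value is one representative string Google publishes for its desktop crawler, the function names are made up for illustration, and the example URL is a placeholder; whether a given site serves a different view to crawlers is something you would have to verify yourself.

```python
import urllib.request

# One representative Googlebot user-agent string; Google uses several
# variants, so treat this as an assumption rather than the only value.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)


def build_crawler_request(url: str) -> urllib.request.Request:
    """Build a GET request that identifies itself as Googlebot."""
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})


def fetch_as_crawler(url: str, timeout: float = 10.0) -> str:
    """Fetch the page and return the HTML served to that user agent."""
    request = build_crawler_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_as_crawler("https://forum.example.com/latest")` (a placeholder URL) and saving the result lets you compare what a crawler receives against what a logged-in browser shows.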

One of the topics in the list is Deep-integration of Discourse within an Ionic App. I ran a Google search while not logged in, and searching for “integration discourse ionic” returns it as the very first result!!

Meta is a “small” instance with under 30k topics, but all of them appear to be indexed properly. Since it is an old domain and we are the #1 result for all things Discourse, we get enough “karma” with Googlebot that it spends sufficient time on our domain to crawl everything that is needed.

Did your forum go through a migration from old software to Discourse?

5 Likes

If you need to rush indexing, you could try the sitemap plugin.

Standard crawling will catch everything, but a sitemap could possibly make things get indexed faster.

Please post results if you do.

Also, can you post 5 examples of great unique content you have on your forum that is 100% missing from Google?

6 Likes

Perhaps Google also looks at how much traffic a topic gets (if there is a view counter), or at whether there are links to the topic that people actually click. Google may not visit pages that it considers “not interesting” to users. There is a common SEO trick to test this: put a link to the page on some other site and click through it. You do not need many click-throughs, only a few. This is usually enough to get Google interested. Google goes where people go.

On large sites, it’s not enough for Google to know that a page exists. It needs more signals: activity, click-throughs, views, etc.

1 Like

@Falco - yes, the forums did go through a migration from vBulletin, but that was towards the end of 2014. I have removed any public links to the old forums, so there isn’t a risk of duplicate content causing the search indexing to be bad.

@sam - yes, here are a few examples:

All of these posts are ones that I have posted about on Twitter or a public Facebook page at some point in the last three years, so they aren’t buried and hidden forever.

Regarding the sitemap plugin, let me give that a shot. I’ll post whatever data I am able to find. Thanks everyone for taking the time to help :slight_smile:

Cheers,
Kirupa

1 Like

That is my third result for “js using generators animate example”.

This may have been a bad example for me to post today, because I manually submitted that one for indexing a few hours ago as a test. This is what one of my forum admins saw for this search term 7 hours ago:

You are correct that it is one of the top results right now. I wonder if the manual indexing had something to do with it.

EDIT: I just set up the Sitemap plug-in and will submit the sitemap to Google to index!

1 Like

Hi @kirupa,

FYI, when Google indexes two sites in the same domain with similar content (kirupa.com in your example), the usual outcome is not really a “penalty” per se but more of a “canonical selection”: Google’s algorithm selects one of the pages as canonical, and that page ranks higher in the search results. (Google may even drop the page it determines is not canonical from the index.)

Google has been quite clear that the idea of a “duplicate content penalty” is, for the most part, a myth. It’s really about “canonicalization” and “selection”:

If your site contains multiple pages with largely identical content, there are a number of ways you can indicate your preferred URL to Google. (This is called “canonicalization”.) More information about canonicalization. (Ref 1)

For example, if you keep your old site up along with your new site, you can use the link canonical tag to tell Google your new site is the canonical site and Google will then prioritize your new site.

A better solution is to allow search engines to crawl these URLs, but mark them as duplicates by using the rel="canonical" link element, the URL parameter handling tool, or 301 redirects. In cases where duplicate content leads to us crawling too much of your website, you can also adjust the crawl rate setting in Search Console. (Ref 1)

Example:

<link rel="canonical" href="https://forum.kirupa.com/t/js-tip-of-the-day-using-generators-to-animate/643058" />

@kirupa, you also asked:

Is each topic considered an indexable “page” by Google? The reason I ask is that a large part of the topics in our forum aren’t in Google’s database.

For a great (but a bit dated) discussion on Google and infinite scroll, I recommend the Official Google Webmaster Central Blog, (Ref 2):

@kirupa, one way to check (in practice, not in theory) is to use GSC and look at its “screenshot” of how Google renders your page. This is easily done with the “check mobile friendly” function in GSC, for example. If you take a very long post in Discourse, you can see how much of that page Google reads and indexes. There are a lot of opinions about infinite scroll and how Google indexes these pages. You can use GSC to check your pages and see for yourself.

According to Google’s Martin Splitt (See Reference 3), on April 14 2020:

Splitt provided the example of a news website that relies on infinite scroll (also referred to as “lazy loading”) to load new content.

That means the web page, in this case the home page, does not load additional content until a visitor scrolls to the bottom of the screen.

Splitt explains why that’s a problem: “What does Googlebot not do? It doesn’t scroll.”

What Googlebot does is land on a page and crawl what is immediately visible.

According to what is stated by Splitt, Googlebot cannot crawl content that loads only after a page is scrolled.

As mentioned, @kirupa, you can check your own pages using tools in GSC which will show you a snap-shot of how Google views (and indexes) your pages.

According to Google’s Splitt in April 2020: “Googlebot doesn’t scroll.” (paraphrasing)

Regarding the topic question of “Google search indexing and Discourse”, every site owner can easily use GSC to determine how Googlebot indexes a particular page.

My recommendation, and I hope this helps in some small way, is to use GSC (Google Search Console) to check your own pages if you have any questions about how Googlebot indexes them.

References:

  1. Avoid creating duplicate content - Search Console Help

  2. Official Google Webmaster Central Blog: Infinite scroll search-friendly recommendations

  3. https://www.searchenginejournal.com/google-infinite-scroll-lazy-loading/363184/

5 Likes

Thanks for the really great response @neounix! I will go through and follow your suggestions shortly :slight_smile:

Un-hiding the old forums (kirupa.com/forum) and having the canonical meta tag on the new/active forum seems like a good idea. I will experiment with that this week.

In the interim, I submitted a sitemap with around 300k entries to Google Search Console.
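A note for anyone building a sitemap of this size by hand: the sitemaps.org protocol caps a single sitemap file at 50,000 URLs, so roughly 300k entries need at least six files tied together by a sitemap index. I have not checked how the sitemap plugin splits its output, so the sketch below is only an illustration of the protocol, with made-up function names and placeholder URLs.

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50_000  # per-file limit from the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split the URL list into groups that each fit in one sitemap file."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def render_sitemap(urls):
    """Render one <urlset> document for a single chunk of URLs."""
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<urlset xmlns="{SITEMAP_NS}">\n{entries}</urlset>\n'
    )


def render_sitemap_index(sitemap_urls):
    """Render the <sitemapindex> document that lists every sitemap file."""
    entries = "".join(
        f"  <sitemap><loc>{escape(u)}</loc></sitemap>\n" for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n{entries}</sitemapindex>\n'
    )
```

With this layout, only the index URL needs to be submitted in Search Console; the individual files are discovered from it.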

2 Likes

Dear @kirupa,

You are welcome.

FYI.

Discourse forums already add the canonical tag to topic pages.

Here is a link from your forum, and the source showing it for one of your examples (above):

You can see that your Discourse page already has a canonical tag.
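If you would rather check this with a script than by reading the page source, here is a minimal sketch that pulls the canonical URL out of an HTML document using only the Python standard library. The class and function names are hypothetical, and it simply returns the first `rel="canonical"` link it finds.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, so split before testing.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attributes.get("href")


def find_canonical(html: str):
    """Return the canonical URL declared in the HTML, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

Feeding it the source of a topic page should return the topic URL; a result of `None` means the page declares no canonical link.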

One “trick” (unsupported but doable) is to add that same tag at your “old forums” (pointing to the new forums) or to at least make sure your old forums do not have a canonical tag.

However, to be honest, getting the correct Discourse topic id into the database of your old forums requires some work (we did it for other reasons, so I know from our own experience that it is doable; we currently use this info in both forums).

There is a post custom fields database table in Discourse which contains the mapping from your old forum (topic and post ids), and you could (if you wanted) dump that data from Discourse and add it to your old forums.

Then you could (if you wanted to; I am not recommending one way or the other) easily create a canonical tag in your old forums that points to your new Discourse forums, depending on your SEO goals and how you wish to approach this.

Some people prefer to 301 redirect the old forum pages. That’s all up to you and how you want to manage things! Keep in mind that if you want to 301 redirect, you will also need the mappings between the Discourse topic (and post) ids and your old forum topic and post ids.
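To make the mapping step concrete, here is a rough sketch of how a redirect target could be looked up once the id mapping has been exported. Everything specific in it is an assumption for illustration: the old URL shape (`showthread.php?t=<id>`), the `topic_map` dictionary, the default base URL, and the idea that a bare `/t/<id>` path resolves to the topic on the new forum. Check each of those against your own setup before relying on it.

```python
from urllib.parse import parse_qs, urlparse


def redirect_target(old_url, topic_map, new_base="https://forum.example.com"):
    """Return the new-forum URL to 301 to, or None if there is no mapping.

    topic_map is a hypothetical dict of old topic id -> Discourse topic id,
    e.g. built from the exported custom-field data.
    """
    query = parse_qs(urlparse(old_url).query)
    old_ids = query.get("t")  # assumed old-forum query parameter
    if not old_ids or not old_ids[0].isdigit():
        return None
    new_id = topic_map.get(int(old_ids[0]))
    if new_id is None:
        return None
    return f"{new_base}/t/{new_id}"
```

The old forum's web server (or a small script in front of it) would then answer with a 301 and this URL in the `Location` header, and fall back to its normal page when the function returns `None`.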

Hope this quick follow up helps @kirupa.

Best wishes and enjoy!

2 Likes