Hi @kirupa,
FYI, when Google indexes two sites with similar content in the same domain (kirupa.com in your example), the usual outcome is not really a “penalty” per se. It is more like a “canonical selection”: Google’s algorithm selects one of the pages as canonical, and that page will rank higher in the search results. (Google may even drop the page it determines is not canonical from the index.)
Google has been quite clear that the idea of a “duplicate content penalty” is, for the most part, a myth. It’s really about “canonicalization” and “selection”:
If your site contains multiple pages with largely identical content, there are a number of ways you can indicate your preferred URL to Google. (This is called “canonicalization”.) More information about canonicalization. (Ref 1)
For example, if you keep your old site up along with your new site, you can use the rel="canonical" link tag to tell Google that your new site is the canonical one. Google treats this as a strong hint and will then generally prioritize your new site.
A better solution is to allow search engines to crawl these URLs, but mark them as duplicates by using the rel="canonical" link element, the URL parameter handling tool, or 301 redirects. In cases where duplicate content leads to us crawling too much of your website, you can also adjust the crawl rate setting in Search Console. (Ref 1)
Example:
<link rel="canonical" href="https://forum.kirupa.com/t/js-tip-of-the-day-using-generators-to-animate/643058" />
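If you want to confirm which canonical URL a page actually declares, here is a minimal sketch in Python using only the standard library. The names `CanonicalFinder` and `find_canonicals` are made up for illustration, and the sample HTML is just the tag above wrapped in a bare page. It only reads the tag; it says nothing about which URL Google ends up selecting.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> tag in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel can hold several space-separated tokens, so compare per token.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr_map.get("href"):
            self.canonicals.append(attr_map["href"])


def find_canonicals(html_text):
    """Return the list of canonical URLs declared in html_text."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    finder.close()
    return finder.canonicals


# Hypothetical sample page containing the example tag from this post.
sample = """<html><head>
<link rel="canonical" href="https://forum.kirupa.com/t/js-tip-of-the-day-using-generators-to-animate/643058" />
</head><body></body></html>"""

print(find_canonicals(sample))
```

A page should normally declare exactly one canonical URL, so a result with zero or several entries is worth a closer look.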
@kirupa, you also asked:
Is each topic considered an indexable “page” by Google? The reason I ask is that a large part of the topics in our forum aren’t in Google’s database.
For a great (but a bit dated) discussion of Google and infinite scroll, I recommend the Official Google Webmaster Central Blog post on search-friendly infinite scroll (Ref 2).
@kirupa, one way to check (in practice, not in theory) is to use GSC and look at its “screenshot” of how Google renders your page. This is easily done with, for example, the “check mobile friendly” function in GSC; if you take a very long post in Discourse, you can see how much of that page Google reads and indexes. There are a lot of opinions about infinite scroll and how Google indexes these pages, but you can use GSC to check your pages and see for yourself.
According to Google’s Martin Splitt (See Reference 3), on April 14 2020:
Splitt provided the example of a news website that relies on infinite scroll (also referred to as “lazy loading”) to load new content.
That means the web page, in this case the home page, does not load additional content until a visitor scrolls to the bottom of the screen.
Splitt explains why that’s a problem: “What does Googlebot not do? It doesn’t scroll.”
What Googlebot does is land on a page and crawl what is immediately visible.
According to what is stated by Splitt, Googlebot cannot crawl content that loads only after a page is scrolled.
As mentioned, @kirupa, you can check your own pages using the tools in GSC, which will show you a snapshot of how Google views (and indexes) your pages.
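As a rough complement to the GSC check, you can also test whether a given phrase is present in the HTML a page returns before any scrolling happens. This is only a sketch under some assumptions: the helper names are hypothetical, the sample HTML is invented, and the code does not execute JavaScript the way Googlebot's renderer does. Treat it as a quick proxy, not a replacement for what GSC shows.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Gathers text content, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def phrase_in_initial_html(html_text, phrase):
    """Return True if phrase occurs in the text of the initially served HTML."""
    extractor = TextExtractor()
    extractor.feed(html_text)
    extractor.close()
    # Normalize whitespace and case so line breaks in the markup do not matter.
    text = " ".join(" ".join(extractor.chunks).split()).lower()
    needle = " ".join(phrase.split()).lower()
    return needle in text


# Invented example: one post is in the initial HTML, later posts load on scroll.
initial_html = """<html><body>
<div class="post">First reply: generators can drive animation.</div>
<script>loadMorePostsOnScroll();</script>
</body></html>"""

print(phrase_in_initial_html(initial_html, "generators can drive animation"))
print(phrase_in_initial_html(initial_html, "a reply loaded only after scrolling"))
```

In real use you would save the page source from your browser (or fetch it) and pass in a sentence from a post far down a long topic.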
In short, per Google’s Splitt in April 2020 (paraphrasing): “Googlebot doesn’t scroll.”
Regarding the topic question of “Google search indexing and Discourse”, my recommendation, and I hope this helps in some small way, is to use GSC (Google Search Console): every site owner can easily use it to determine how Googlebot indexes a particular page.
References:
1. How to Specify a Canonical with rel="canonical" and Other Methods | Google Search Central | Documentation | Google for Developers
2. Official Google Webmaster Central Blog: Infinite scroll search-friendly recommendations
3. Google’s Martin Splitt Explains Why Infinite Scroll Causes SEO Problems