There is a bug in the Discourse app: the <link rel="canonical"> metadata element in the <head> section of the Discourse DOM is not updated during in-app navigation.
When a browser client first loads the application, the <link rel="canonical" href=""> element is set according to that initial page load. But when the user then clicks around in the app (normal user behavior) without manually reloading the page, the <link rel="canonical"> link does not update.
I have tested this bug and reproduced it on the meta site:
Fig 1. Enter meta from the home page; the canonical link is correct, as is the title element.
This bug could adversely affect SEO: when Google indexes the site, if Googlebot is not “hard reloading” every page, the canonical information will be incorrect for each page (as in the image sequences above).
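To make the symptom concrete, here is a minimal sketch of how one could flag stale canonical links from a click-through session. It assumes you have already recorded (page URL, canonical href) pairs by hand from the browser inspector; the `forum.example.com` URLs and the helper names `normalize` and `stale_canonicals` are hypothetical, and the comparison is deliberately simplistic (it ignores query strings and fragments).

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to (scheme, host, path) for a rough comparison.

    Assumption: query strings, fragments and trailing slashes are ignored.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return (parts.scheme.lower(), parts.netloc.lower(), path)


def stale_canonicals(observations):
    """Return the page URLs whose recorded canonical href points elsewhere."""
    return [
        page
        for page, canonical in observations
        if normalize(page) != normalize(canonical)
    ]


# Hypothetical session: load the home page, then click into a topic
# without reloading. The canonical href stays at the home page.
observations = [
    ("https://forum.example.com/", "https://forum.example.com/"),
    ("https://forum.example.com/t/some-topic/123", "https://forum.example.com/"),
]

for page in stale_canonicals(observations):
    print("stale canonical on", page)
```

In this made-up session only the topic page is reported, which matches the behavior described above: the first page load is correct and every later in-app navigation keeps the old value.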
Reproducibility
I have reproduced this bug consistently on both the meta site and our site.
Notes
I have seen these kinds of JavaScript single-page application (SPA) lifecycle issues before with other web frameworks (not only Ember), where DOM elements are not updated because of how the framework's lifecycle hooks run.
We serve a completely different document to crawlers, because not every crawler can run JavaScript and we want Discourse to be accessible to those clients too. Even though they receive reduced functionality, they can consume all content.
Now I realize that some earlier discussions about SPA, “infinite scroll” and other SEO-related issues were completely wrong, since the SPA is not served to Googlebot.
This changes my approach to some custom code I wrote recently, and now I know to check using the Googlebot UA in the console.
Thanks so much for that, @Falco ! Much appreciated.
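For anyone who wants to script the same check rather than switch the UA in the browser console, here is a rough sketch using only the Python standard library. It requests a page with a Googlebot-style User-Agent header and pulls out the canonical link. The helper names `extract_canonical` and `fetch_as` and the example URL are my own; I am assuming the site picks the crawler view from the User-Agent header alone.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# User-Agent string published by Google for Googlebot.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr.get("href")


def extract_canonical(html):
    """Return the canonical href found in an HTML string, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


def fetch_as(url, user_agent):
    """Fetch a URL with the given User-Agent and return the decoded body."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # Hypothetical URL; replace with a topic on your own site.
    page = "https://forum.example.com/t/some-topic/123"
    print(extract_canonical(fetch_as(page, GOOGLEBOT_UA)))
```

Running `fetch_as` a second time with a normal browser UA and comparing the two results should show whether the crawler document and the user-facing app disagree about the canonical link.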
Question:
What is the best way to add a single custom JavaScript file to the HTML that is rendered to Googlebot?
Is there a “standard way” to modify the HTML served to bots?
The reason I ask is that we have some custom code, created in a plugin I wrote, that is meant for bots. When I checked using the Googlebot UA in the console (thanks again for telling me that I need to do that), none of that custom plugin code was present in the HTML served to Googlebot.
In the interim, since I cannot accomplish what I want in a (Handlebars-based) plugin for the HTML served to crawlers, we have decided to simply strip the canonical tags out of Discourse. This is only a partial solution until I can figure out how to modify the canonical tag with some JavaScript for web crawlers.
Discourse provides a nice mechanism for these kinds of changes in the container yml files, so that is what I have done today.
I am very grateful to Discourse meta for pointing out that the Discourse app served to (identified) crawlers is not the same as the pages served to users.
Please note that I am not recommending this “interim solution” to other Discourse sys admins. I am simply sharing what I have decided to do, at this time, and how I did it (until we come up with a more interesting solution).