What Googlebot sees when crawling Discourse

networkaaron · December 20, 2021, 8:05pm

TL:DR, it’s semi-humanly possible to replicate Googlebot crawling Discourse. Here’s how to start…

Become one with Googlebot

Open up an incognito window (always start fresh)
Open up DevTools
Open up Network Conditions in DevTools
Uncheck ‘Use browser default’
In the select menu choose Googlebot Smartphone
Then go to https://meta.discourse.org (it looks way different; not a biggie because your Googlebot and bots don’t care)
Go to View > Developer > View Source
Copy it and paste it to some .html file

Good job! You have created the file Humans reference to see what Googlebot has crawled and cached.

Googlebot’s job is done. Now it’s time to render the cached file in a browser.

Become one with Chrome

Open up the terminal and run npx http-server
Navigate to the file
Open up Chrome DevTools
In the Elements panel, right-click <html... and select Copy outerHTML.
That is the content that will be indexed, not cached, indexed

In summary, Googlebot retrieves the HTML and Chrome renders it. The rendered HTML is gold. Ensure your valuable content and links are appearing there.

david · December 20, 2021, 8:35pm

What’s the purpose of the steps under “Become one with chrome”?

Couldn’t you do the “Copy outerHTML” step in place of step 7 in the first list?

j127 · December 22, 2021, 1:50am

I think you can also fetch it with curl:

curl -s https://meta.discourse.org/ > page.html

(It will contain the "crawler" classes.)

Then open the page.html file in a browser.

Or to inspect the code in an editor:

curl -s https://meta.discourse.org/ | vim -

networkaaron · January 13, 2022, 2:22pm

The cached HTML is rendered in Chrome (headless). When rendered, supplemental copy and links may be introduced via JavaScript, in the dom. Google will take the information it renders into consideration for indexing.

This is how Googlebot gets content from JavaScript-heavy applications. Go to Google and search for something you know renders content only with JavaScript > click the 3 dotted icon > click the Cached button > click View source > copy it and render it in Chrome to see what content appears in the dom.

Note: Update any relative paths (CSS and JS resources) to absolutes before rendering it in Chrome ^^

networkaaron · January 13, 2022, 2:39pm

Using curl makes it easier, nice!

Make sure to include the Googlebot user agent string, e.g., Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html). The server may send Googlebot different HTML.

j127 · January 13, 2022, 9:13pm

I think it’s the same output, but it doesn’t hurt to add the user agent. I’m not sure about Chrome, but in Firefox you can right-click on the request in the network tab and choose “copy as curl” for a complete set of headers that will mimic a browser request.

Topic		Replies	Views
Googlebot is getting non-javascript version of the site Dev	16	1500	March 9, 2024
How public Discourse sites are indexed by search engines like Google Site Management reference	0	12610	February 6, 2013
Disable or bypass feature detect for Googlebot (while serving JS app to crawlers) Support unsupported-install	8	3194	June 14, 2022
Can we have a conversation about SEO? Dev	3	802	April 4, 2022
No content on homepage for Googlebot Bug	7	1866	March 16, 2016

What Googlebot sees when crawling Discourse

Related topics