What Googlebot sees when crawling Discourse

TL;DR: it’s semi-humanly possible to replicate how Googlebot crawls Discourse. Here’s how to start…

Become one with Googlebot

  1. Open up an incognito window (always start fresh)
  2. Open up DevTools
  3. Open up Network Conditions in DevTools
  4. Uncheck ‘Use browser default’
  5. In the select menu choose Googlebot Smartphone
  6. Then go to https://meta.discourse.org (it looks way different; not a biggie, because you’re Googlebot and bots don’t care)
  7. Go to View > Developer > View Source
  8. Copy the source and paste it into an .html file

Good job! You have created the file that humans can reference to see what Googlebot has crawled and cached.

Googlebot’s job is done. Now it’s time to render the cached file in a browser.

Become one with Chrome

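  1. Open up a terminal in the folder that contains the .html file and run npx http-server
  2. Navigate to the file in Chrome (http-server prints the local address it is serving on)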
  3. Open up Chrome DevTools
  4. In the Elements panel, right-click <html... and select Copy > Copy outerHTML.
  5. That is the content that will be indexed (not cached, indexed)

In summary, Googlebot retrieves the HTML and Chrome renders it. The rendered HTML is gold. Ensure your valuable content and links are appearing there.
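One way to check which links only show up after rendering is to diff the two files. This is a rough sketch, not a definitive tool: it assumes you saved the View Source output as raw.html and the Copy outerHTML output as rendered.html (both file names are made up for this example), and the helper names are hypothetical. It only looks at double-quoted href attributes.

```shell
# list_links FILE (hypothetical helper)
# Prints the unique double-quoted href values found in FILE, sorted.
list_links() {
  grep -oE 'href="[^"]*"' "$1" | sed -E 's/^href="(.*)"$/\1/' | LC_ALL=C sort -u
}

# links_added_by_rendering RAW RENDERED (hypothetical helper)
# Prints the links that appear in RENDERED but not in RAW.
links_added_by_rendering() {
  raw_links=$(mktemp)
  rendered_links=$(mktemp)
  list_links "$1" > "$raw_links"
  list_links "$2" > "$rendered_links"
  # comm -13 keeps only the lines that are unique to the second file.
  LC_ALL=C comm -13 "$raw_links" "$rendered_links"
  rm -f "$raw_links" "$rendered_links"
}
```

Running `links_added_by_rendering raw.html rendered.html` then lists the links that JavaScript introduced. An empty result suggests the links were already in the HTML Googlebot fetched.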

What’s the purpose of the steps under “Become one with chrome”?

Couldn’t you do the “Copy outerHTML” step in place of step 7 in the first list?

I think you can also fetch it with curl:

curl -s https://meta.discourse.org/ > page.html

(It will contain the "crawler" classes.)

Then open the page.html file in a browser.

Or to inspect the code in an editor:

curl -s https://meta.discourse.org/ | vim -
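To confirm that the saved page really contains the "crawler" classes, a quick count may be enough. This is only a sketch: the helper name is hypothetical, and it assumes the file was saved as page.html as in the curl command above.

```shell
# count_crawler_lines FILE (hypothetical helper)
# Prints how many lines of FILE mention "crawler".
# This counts lines, not occurrences, so minified HTML may report a low number.
count_crawler_lines() {
  grep -c 'crawler' "$1"
}
```

For example, `count_crawler_lines page.html` prints 0 if the crawler markup is absent.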

The cached HTML is rendered in Chrome (headless). When it is rendered, JavaScript may add supplemental copy and links to the DOM. Google takes the rendered information into consideration for indexing.

This is how Googlebot gets content from JavaScript-heavy applications. Go to Google and search for something you know renders content only with JavaScript > click the three-dot icon > click the Cached button > click View source > copy it and render it in Chrome to see what content appears in the DOM.

Note: Update any relative paths (CSS and JS resources) to absolute URLs before rendering it in Chrome ^^
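Updating the paths by hand gets tedious, so here is a minimal sed sketch. The helper name is hypothetical, and it makes several assumptions: it only handles double-quoted href and src attributes that start with a single slash, and the origin must not contain # or & characters. It does not touch srcset, CSS url() values, or a bare href="/".

```shell
# absolutize_paths ORIGIN < in.html > out.html (hypothetical helper)
# Rewrites root-relative href/src values such as "/assets/app.js" to
# ORIGIN/assets/app.js. Protocol-relative ("//cdn...") and already-absolute
# URLs are left alone.
absolutize_paths() {
  origin=$1
  sed -E "s#(href|src)=\"/([^/\"][^\"]*)\"#\1=\"${origin}/\2\"#g"
}
```

For example, `absolutize_paths https://meta.discourse.org < page.html > page-absolute.html` (file names assumed), then open page-absolute.html in Chrome.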

Using curl makes it easier, nice!

Make sure to include the Googlebot user agent string, e.g., Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html). The server may send Googlebot different HTML.
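With curl, that could look roughly like the sketch below. The user agent string is the one quoted above; the variable and function names are made up for this example.

```shell
# Googlebot Smartphone user agent string, copied from the post above.
GOOGLEBOT_UA='Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

# fetch_as_googlebot URL OUTFILE (hypothetical helper)
# -s hides the progress meter, -L follows redirects,
# -A sets the User-Agent header, -o writes the response body to OUTFILE.
fetch_as_googlebot() {
  curl -sL -A "$GOOGLEBOT_UA" "$1" -o "$2"
}
```

For example, `fetch_as_googlebot https://meta.discourse.org/ page.html`. Note that this only changes the header. A server that verifies Googlebot by IP address could still respond differently than it does to the real crawler.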

I think it’s the same output, but it doesn’t hurt to add the user agent. I’m not sure about Chrome, but in Firefox you can right-click on the request in the Network tab and choose “Copy as cURL” for a complete set of headers that will mimic a browser request.