How public Discourse sites are indexed by search engines like Google

:bookmark: This reference explains how public Discourse sites work with search engines like Google, and how the platform ensures content is properly indexed even though it is a JavaScript application.

:person_raising_hand: Required user level: All users

Search engine indexing of Discourse sites

Discourse is built as a JavaScript application, but is specifically designed to ensure search engines can properly crawl and index all content on public sites.

How Discourse supports search engine crawlers

While Discourse uses modern JavaScript for its interactive features, it implements several techniques to ensure search engines can properly index all content:

Dedicated crawler layout

Discourse automatically detects search engine bots by their user agent using the CrawlerDetection module. When a crawler is detected, Discourse serves a completely separate server-rendered HTML layout (crawler.html.erb) instead of the normal JavaScript application. This crawler layout includes:

  1. Full HTML-rendered topic content and topic lists — no JavaScript required
  2. Schema.org structured data markup (e.g., DiscussionForumPosting, ItemList, BreadcrumbList) to help search engines understand your content
  3. Proper pagination with rel="prev" and rel="next" links to allow complete crawling
  4. Last-Modified headers on topic pages to signal content freshness

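The crawler/application split above hinges on user-agent detection. Discourse's actual CrawlerDetection module is Ruby and maintains a much larger pattern list; the following is a minimal Python sketch of the idea, with an illustrative subset of bot patterns and hypothetical function names:

```python
import re

# Illustrative subset of crawler user-agent patterns; the real
# CrawlerDetection module (Ruby) maintains a larger, configurable list.
CRAWLER_PATTERNS = re.compile(
    r"googlebot|bingbot|duckduckbot|baiduspider|yandexbot",
    re.IGNORECASE,
)

def is_crawler(user_agent: str) -> bool:
    """Return True if the user agent looks like a known search engine bot."""
    return bool(CRAWLER_PATTERNS.search(user_agent or ""))

def layout_for(user_agent: str) -> str:
    """Pick which layout to render, mirroring the crawler/application split."""
    return "crawler.html.erb" if is_crawler(user_agent) else "application.html.erb"
```

A request identifying as `Googlebot/2.1` would be routed to the server-rendered crawler layout, while an ordinary Chrome user agent gets the normal JavaScript application.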
No-JavaScript fallback

For regular browsers that have JavaScript disabled, Discourse also includes a <noscript> tag in the standard application layout. This contains rendered topic lists and topic content, ensuring the site remains accessible even without JavaScript.

Robots.txt and indexing controls

Discourse provides several settings to control how search engines interact with your site:

allow_index_in_robots_txt

This site setting (enabled by default) controls whether your site’s robots.txt permits crawling. When disabled, robots.txt disallows all crawlers and an X-Robots-Tag: noindex header is added to every response.
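
The effect of this setting can be sketched as follows. This is a simplified Python illustration, not Discourse's Ruby code; the real generated robots.txt lists many more paths when indexing is allowed:

```python
def robots_txt(allow_index: bool) -> str:
    """Generate a robots.txt body depending on the indexing setting."""
    if allow_index:
        # Simplified: the real default file disallows several more paths.
        return "User-agent: *\nDisallow: /admin/\n"
    # Indexing disabled: block every crawler from every path.
    return "User-agent: *\nDisallow: /\n"

def response_headers(allow_index: bool) -> dict:
    """Attach the noindex header to responses when indexing is disabled."""
    headers = {"Content-Type": "text/html"}
    if not allow_index:
        headers["X-Robots-Tag"] = "noindex"
    return headers
```

With indexing disabled, a well-behaved crawler is turned away twice: once by the disallow-all robots.txt, and again by the noindex header on any page it fetches anyway.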

Crawler allowlist and blocklist

You can control which crawlers are allowed to access your site using:

  • allowed_crawler_user_agents — when set, only the listed crawlers are permitted; all others are blocked via robots.txt
  • blocked_crawler_user_agents — when set, the listed crawlers are blocked while all others are allowed
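
A rough sketch of how these two settings could translate into robots.txt rules, again in illustrative Python rather than Discourse's actual Ruby implementation:

```python
def crawler_rules(allowed: list[str], blocked: list[str]) -> str:
    """Hypothetical translation of the allowlist/blocklist settings
    into robots.txt directives (simplified)."""
    lines = []
    if allowed:
        # Allowlist mode: only the listed crawlers may crawl;
        # everyone else is disallowed entirely.
        for ua in allowed:
            lines += [f"User-agent: {ua}", "Disallow:", ""]
        lines += ["User-agent: *", "Disallow: /"]
    else:
        # Blocklist mode: listed crawlers are shut out; the rest may crawl.
        for ua in blocked:
            lines += [f"User-agent: {ua}", "Disallow: /", ""]
        lines += ["User-agent: *", "Disallow:"]
    return "\n".join(lines)
```

Note that an empty `Disallow:` directive means "nothing is disallowed", i.e. full access, while `Disallow: /` blocks the whole site.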

Custom robots.txt

Admins can fully customize the robots.txt file at /admin/customize/robots. A customized robots.txt overrides the default generated one entirely.

Default disallowed paths

By default, Discourse blocks crawlers from paths that aren’t useful for indexing, such as /admin/, /auth/, /email/, /session, /search, and others. Googlebot gets a more permissive configuration, only blocking the core admin/auth paths.
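
You can check how such rules are interpreted using Python's standard-library robots.txt parser. The file below is an illustrative subset of Discourse's defaults, and the hostname is a placeholder:

```python
from urllib.robotparser import RobotFileParser

# Illustrative subset of the default disallowed paths.
ROBOTS = """\
User-agent: *
Disallow: /admin/
Disallow: /auth/
Disallow: /email/
Disallow: /session
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS.splitlines())

# Topic pages remain crawlable; admin pages are blocked.
topic_ok = parser.can_fetch("TestBot", "https://forum.example/t/welcome/1")
admin_ok = parser.can_fetch("TestBot", "https://forum.example/admin/users")
```

Here `topic_ok` is true and `admin_ok` is false: the crawler may index discussion content but is kept out of paths that carry no indexable value.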

Viewing the crawler version

You can view how search engines see your Discourse site by:

  • Installing a JavaScript-disabling browser plugin in Chrome or Firefox
  • Using Chrome DevTools to disable JavaScript (see Google’s DevTools documentation)
  • Changing your browser’s user agent string to a known crawler (e.g., Googlebot) to see the dedicated crawler layout
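
The last option can also be done from a script: build an HTTP request that identifies as Googlebot, and a public Discourse site will respond with the server-rendered crawler layout. A minimal sketch using Python's standard library (the URL is a placeholder; the request is constructed but not sent here):

```python
from urllib.request import Request

# Googlebot's published desktop user-agent string.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)

# Sending this request to a public Discourse site with
# urllib.request.urlopen(req) would return the crawler HTML layout.
req = Request("https://forum.example/", headers={"User-Agent": GOOGLEBOT_UA})
```

Comparing the HTML returned for this request against a normal browser request makes the difference between the crawler layout and the JavaScript application easy to see.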

Last edited by @jessii 2025-05-21T22:32:48Z
