This reference explains how public Discourse sites work with search engines like Google and how the platform ensures content is properly indexed even as a JavaScript application.
Required user level: All users
Search engine indexing of Discourse sites
Discourse is built as a JavaScript application, but is specifically designed to ensure search engines can properly crawl and index all content on public sites.
How Discourse supports search engine crawlers
While Discourse uses modern JavaScript for its interactive features, it implements several techniques to ensure search engines can properly index all content:
Dedicated crawler layout
Discourse automatically detects search engine bots by their user agent using the CrawlerDetection module. When a crawler is detected, Discourse serves a completely separate server-rendered HTML layout (crawler.html.erb) instead of the normal JavaScript application. This crawler layout includes:
- Full HTML-rendered topic content and topic lists — no JavaScript required
- Schema.org structured data markup (e.g., `DiscussionForumPosting`, `ItemList`, `BreadcrumbList`) to help search engines understand your content
- Proper pagination with `rel="prev"` and `rel="next"` links to allow complete crawling
- `Last-Modified` headers on topic pages to signal content freshness
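The detection step above can be sketched as simple user-agent matching. This is a minimal illustration only; Discourse's actual CrawlerDetection module maintains its own, much more complete pattern list, and the crawler names and function names below are hypothetical:

```python
import re

# Hypothetical subset of crawler user-agent substrings; the real list
# in Discourse is longer and adjustable via site settings.
CRAWLER_PATTERN = re.compile(
    r"Googlebot|Bingbot|DuckDuckBot|YandexBot", re.IGNORECASE
)

def is_crawler(user_agent: str) -> bool:
    """Return True when the User-Agent header looks like a known crawler."""
    return bool(CRAWLER_PATTERN.search(user_agent or ""))

def layout_for(user_agent: str) -> str:
    """Pick the server-rendered crawler layout for bots, the JS app otherwise."""
    return "crawler.html.erb" if is_crawler(user_agent) else "application.html.erb"
```

The key design point is that the decision happens per request on the server, so crawlers never have to execute any JavaScript to see the content.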
No-JavaScript fallback
For regular browsers that have JavaScript disabled, Discourse also includes a <noscript> tag in the standard application layout. This contains rendered topic lists and topic content, ensuring the site remains accessible even without JavaScript.
Robots.txt and indexing controls
Discourse provides several settings to control how search engines interact with your site:
allow_index_in_robots_txt
This site setting (enabled by default) controls whether your site’s robots.txt permits crawling. When disabled, the robots.txt will disallow all crawlers and an X-Robots-Tag: noindex header is added to all responses.
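With the setting disabled, the generated robots.txt effectively reduces to a blanket disallow, roughly:

```text
User-agent: *
Disallow: /
```

and, as noted above, every response additionally carries the `X-Robots-Tag: noindex` header, which tells search engines to drop pages even if they were already crawled.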
Crawler allowlist and blocklist
You can control which crawlers are allowed to access your site using:
- `allowed_crawler_user_agents`: when set, only the listed crawlers are permitted; all others are blocked via robots.txt
- `blocked_crawler_user_agents`: when set, the listed crawlers are blocked while all others are allowed
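As an illustration of the allowlist behavior, if `allowed_crawler_user_agents` contained only `Googlebot`, the generated robots.txt would take roughly this shape (a sketch of the semantics described above, not the literal file Discourse emits):

```text
User-agent: Googlebot
Disallow: /admin/

User-agent: *
Disallow: /
```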
Custom robots.txt
Admins can fully customize the robots.txt file at /admin/customize/robots. A customized robots.txt overrides the default generated one entirely.
Default disallowed paths
By default, Discourse blocks crawlers from paths that aren't useful for indexing, such as /admin/, /auth/, /email/, /session, /search, and others. Googlebot receives a more permissive configuration that blocks only the core admin and auth paths.
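In the default robots.txt, these rules look roughly like the excerpt below (the real file lists additional paths and a separate, more permissive `User-agent: Googlebot` section):

```text
User-agent: *
Disallow: /admin/
Disallow: /auth/
Disallow: /email/
Disallow: /session
Disallow: /search
```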
Viewing the crawler version
You can view how search engines see your Discourse site by:
- Installing a JavaScript-disabling browser plugin in Chrome or Firefox
- Using Chrome DevTools to disable JavaScript (Google’s instructions)
- Changing your browser's user agent string to a known crawler (e.g., `Googlebot`) to see the dedicated crawler layout
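The user-agent approach can also be scripted. The sketch below builds a request that would receive the crawler layout when sent; the forum URL and topic path are placeholders, not a real site:

```python
from urllib.request import Request

# Placeholder URL; substitute your own forum and topic path.
req = Request(
    "https://forum.example.com/t/example-topic/123",
    headers={"User-Agent": "Googlebot"},
)

# Opening this request with urllib.request.urlopen(req) would return the
# server-rendered crawler layout instead of the JavaScript application shell,
# because the server sees a crawler user agent.
```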
Last edited by @jessii 2025-05-21T22:32:48Z