This reference explains how public Discourse sites work with search engines like Google and how the platform ensures content is properly indexed even as a JavaScript application.
Required user level: All users
Search engine indexing of Discourse sites
Discourse is built as a JavaScript application, but is specifically designed to ensure search engines can properly crawl and index all content on public sites.
How Discourse supports search engine crawlers
While Discourse uses modern JavaScript for its interactive features, it implements several techniques to ensure search engines can properly index all content:
Dedicated crawler layout
Discourse automatically detects search engine bots by their user agent using the CrawlerDetection module. When a crawler is detected, Discourse serves a completely separate server-rendered HTML layout (crawler.html.erb) instead of the normal JavaScript application. This crawler layout includes:
- Full HTML-rendered topic content and topic lists — no JavaScript required
- Schema.org structured data markup (e.g., `DiscussionForumPosting`, `ItemList`, `BreadcrumbList`) to help search engines understand your content
- Proper pagination with `rel="prev"` and `rel="next"` links to allow complete crawling
- `Last-Modified` headers on topic pages to signal content freshness
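The detection step above can be sketched as simple user-agent matching. This is a minimal illustration only; Discourse's actual CrawlerDetection module maintains its own, much more complete pattern list, and the crawler names and function names below are hypothetical:

```python
import re

# Hypothetical subset of crawler user-agent substrings; the real list
# in Discourse is longer and adjustable via site settings.
CRAWLER_PATTERN = re.compile(
    r"Googlebot|Bingbot|DuckDuckBot|YandexBot", re.IGNORECASE
)

def is_crawler(user_agent: str) -> bool:
    """Return True when the User-Agent header looks like a known crawler."""
    return bool(CRAWLER_PATTERN.search(user_agent or ""))

def layout_for(user_agent: str) -> str:
    """Pick the server-rendered crawler layout for bots, the JS app otherwise."""
    return "crawler.html.erb" if is_crawler(user_agent) else "application.html.erb"
```

The key design point is that the decision happens per request on the server, so crawlers never have to execute any JavaScript to see the content.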
No-JavaScript fallback
For regular browsers that have JavaScript disabled, Discourse also includes a <noscript> tag in the standard application layout. This contains rendered topic lists and topic content, ensuring the site remains accessible even without JavaScript.
Robots.txt and indexing controls
Discourse provides several settings to control how search engines interact with your site:
allow_index_in_robots_txt
This site setting (enabled by default) controls whether your site’s robots.txt permits crawling. When disabled, the robots.txt will disallow all crawlers and an X-Robots-Tag: noindex header is added to all responses.
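With the setting disabled, the generated robots.txt effectively reduces to a blanket disallow, roughly:

```text
User-agent: *
Disallow: /
```

and, as noted above, every response additionally carries the `X-Robots-Tag: noindex` header, which tells search engines to drop pages even if they were already crawled.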
Crawler allowlist and blocklist
You can control which crawlers are allowed to access your site using:
- `allowed_crawler_user_agents`: when set, only the listed crawlers are permitted; all others are blocked via robots.txt
- `blocked_crawler_user_agents`: when set, the listed crawlers are blocked while all others are allowed
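As an illustration of the allowlist behavior, if `allowed_crawler_user_agents` contained only `Googlebot`, the generated robots.txt would take roughly this shape (a sketch of the semantics described above, not the literal file Discourse emits):

```text
User-agent: Googlebot
Disallow: /admin/

User-agent: *
Disallow: /
```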
Custom robots.txt
Admins can fully customize the robots.txt file at /admin/customize/robots. A customized robots.txt overrides the default generated one entirely.
Default disallowed paths
By default, Discourse blocks crawlers from paths that aren't useful for indexing, such as /admin/, /auth/, /email/, /session, /search, and others. Googlebot receives a more permissive configuration that blocks only the core admin and auth paths.
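In the default robots.txt, these rules look roughly like the excerpt below (the real file lists additional paths and a separate, more permissive `User-agent: Googlebot` section):

```text
User-agent: *
Disallow: /admin/
Disallow: /auth/
Disallow: /email/
Disallow: /session
Disallow: /search
```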
Viewing the crawler version
You can view how search engines see your Discourse site by:
- Installing a JavaScript-disabling browser plugin in Chrome or Firefox
- Using Chrome DevTools to disable JavaScript (Google’s instructions)
- Changing your browser's user agent string to a known crawler (e.g., `Googlebot`) to see the dedicated crawler layout
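The user-agent approach can also be scripted. The sketch below builds a request that would receive the crawler layout when sent; the forum URL and topic path are placeholders, not a real site:

```python
from urllib.request import Request

# Placeholder URL; substitute your own forum and topic path.
req = Request(
    "https://forum.example.com/t/example-topic/123",
    headers={"User-Agent": "Googlebot"},
)

# Opening this request with urllib.request.urlopen(req) would return the
# server-rendered crawler layout instead of the JavaScript application shell,
# because the server sees a crawler user agent.
```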
Last edited by @jessii 2025-05-21T22:32:48Z