Discourse SEO overview (sitemap / robots.txt)

Discourse · October 18, 2023, 12:56pm

Discourse has many SEO features that work straight out of the box. With our sensible defaults, community managers can focus on cultivating a community rather than on optimizing for search engines. That said, there are some things you can change, some things you should know, and some general tips and tricks below.

Static view for search engines

Discourse has a static HTML view with no JavaScript to help web crawlers index your site faster. The content of the dynamic and static views is identical, and nothing is omitted or stripped out when the site is crawled by search engines.

Here’s a comparison of what a user sees and what a search engine sees:

[Screenshots: the topic list and a topic, each shown in the user view and the crawler view]

Previewing the crawler view

You can inspect what Discourse serves to crawlers without disabling JavaScript in two ways:

?print URL parameter — Append ?print to any Discourse URL (e.g. https://yourforum.com/t/some-topic/123?print). This forces the crawler layout for any browser, regardless of user-agent. It is the simplest way to audit what search engines see.
Discourse-Render: crawler request header — Send this HTTP header with any request to force the crawler layout. Useful for programmatic auditing or testing with curl:
```
curl -H "Discourse-Render: crawler" https://yourforum.com/t/some-topic/123
```
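
For scripted audits, both options can be prepared in a few lines of Python. This is a minimal sketch using only the standard library; the helper names are illustrative and the forum URL is the placeholder from above. The request is built but not sent.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request


def with_print_param(url: str) -> str:
    """Append the bare ?print flag, keeping any existing query string."""
    parts = urlsplit(url)
    query = f"{parts.query}&print" if parts.query else "print"
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


def crawler_request(url: str) -> Request:
    """Build (but do not send) a request that asks for the crawler layout."""
    return Request(url, headers={"Discourse-Render": "crawler"})


topic = "https://yourforum.com/t/some-topic/123"
print(with_print_param(topic))
request = crawler_request(topic)
# Passing `request` to urllib.request.urlopen() would fetch the crawler view.
```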

Meta tags

In Discourse, the generic meta tags essential for SEO are auto-generated from the content present on the page. The title tag, for instance, is derived from the site or topic title, and the description is generated from the content of the first post. Per-page customization of metadata is limited. To alter these values, you need to adjust the settings or content fields from which they are generated:

The Title, Description and Short site description site settings
The category names
The posts’ titles and content
And so on

Social media meta tags

Discourse automatically generates Open Graph and Twitter (X) Card meta tags for rich social sharing:

Open Graph tags

og:site_name, og:type, og:url, og:title, og:description
og:image - Configurable via opengraph_image setting
og:article:section - Category breadcrumbs
og:article:tag - Topic tags

X (formerly Twitter) card tags

twitter:card - “summary” or “summary_large_image”
twitter:title, twitter:description, twitter:image
twitter:label1/data1 - Reading time estimate
twitter:label2/data2 - Like count

Configuration settings

opengraph_image - Default OG image for social sharing
x_summary_large_image - X (formerly Twitter) large card image
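
To check which of these tags a page actually emits, a small audit script can help. The sketch below parses an HTML string you have already fetched and lists which of the tags named above are missing. The expected list is taken from this section; some tags (for example the image tags) may legitimately be absent on a given page, and the helper names are illustrative.

```python
from html.parser import HTMLParser

# Tag names taken from the lists above; adjust to what you care about.
EXPECTED_TAGS = [
    "og:site_name", "og:type", "og:url", "og:title", "og:description",
    "og:image", "twitter:card", "twitter:title", "twitter:description",
    "twitter:image",
]


class MetaTagCollector(HTMLParser):
    """Collect <meta property=...> and <meta name=...> values from a page."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        if key and attr.get("content") is not None:
            # Keep the first occurrence of each tag.
            self.tags.setdefault(key, attr["content"])


def missing_social_tags(html: str) -> list:
    """Return the expected tag names that do not appear in the HTML."""
    collector = MetaTagCollector()
    collector.feed(html)
    return [name for name in EXPECTED_TAGS if name not in collector.tags]
```

Feeding it the HTML of the crawler view shows what a social network's scraper is likely to receive.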

URL structure and encoding

Non-Latin characters and URLs

Discourse, by default, strips out non-Latin characters from topic URLs when the locale is set to EN. To avoid this, you can change the locale to the primary non-Latin language or change the slug generation method setting from ASCII to encoded.
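
The difference between the two modes can be illustrated roughly as follows. These two helpers are hypothetical and are not Discourse's slug generator; they only show why an ASCII-only slug loses non-Latin titles while a percent-encoded slug keeps them.

```python
import re
import unicodedata
from urllib.parse import quote


def ascii_slug(title: str) -> str:
    """Illustrative ASCII mode: drop anything that has no ASCII form."""
    normalized = unicodedata.normalize("NFKD", title)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


def encoded_slug(title: str) -> str:
    """Illustrative encoded mode: keep the characters, percent-encode them."""
    hyphenated = re.sub(r"\s+", "-", title.strip().lower())
    return quote(hyphenated, safe="-")


print(ascii_slug("Café au lait"))  # accents are folded to plain letters
print(ascii_slug("日本語"))          # nothing survives, so the slug is empty
print(encoded_slug("日本語"))        # characters survive as %XX escapes
```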

International SEO and hreflang tags

For multilingual communities, Discourse supports hreflang tags, helping search engines serve the correct language version to users. To enable this, you must configure these three settings (Admin → Settings → Localization):

content_localization_enabled: Enables the content localization system.
content_localization_supported_locales: Defines supported locale codes (e.g., en|es|fr).
content_localization_crawler_param: Required to generate <link rel="alternate" hreflang="..."> tags and rewrite internal links for crawlers with the tl locale parameter (e.g., /t/example-topic/123?tl=es).

With these settings enabled, Discourse generates alternate links for each supported language, helping Google and other search engines serve the correct language version to users.
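
As a sketch of what to expect in the page head, the following builds one alternate link per supported locale using the tl parameter described above. The exact markup Discourse emits may differ (attribute order, an x-default entry, and so on); treat this as an assumption to compare against your own crawler view.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def hreflang_links(topic_url: str, supported_locales: str) -> list:
    """Build <link rel="alternate"> tags from a pipe-separated locale list."""
    parts = urlsplit(topic_url)
    # Drop any existing tl parameter so each link carries exactly one.
    base_query = [(k, v) for k, v in parse_qsl(parts.query) if k != "tl"]
    links = []
    for locale in supported_locales.split("|"):
        locale = locale.strip()
        if not locale:
            continue
        query = urlencode(base_query + [("tl", locale)])
        href = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
        links.append(
            f'<link rel="alternate" hreflang="{escape(locale)}" href="{escape(href)}">'
        )
    return links


for link in hreflang_links("https://yourforum.com/t/example-topic/123", "en|es|fr"):
    print(link)
```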

Sub-folder vs. subdomain setup

Discourse favors subdomains over subfolders because subdomain setups are technically simpler. Google doesn’t really have a preference between the two^[1], but Discourse strongly recommends avoiding subfolder setups unless you have a deep technical understanding of them.

If you run Discourse in a subfolder (e.g. example.com/forum/), your root-level robots.txt is managed by your main web server, not Discourse. To help configure it correctly, Discourse exposes a JSON endpoint at /robots-builder.json that returns the exact disallow rules Discourse would generate, scoped to your subfolder path. You can use this as input to a script that writes your server-level robots.txt.
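
A script along these lines could turn that JSON into robots.txt text. The JSON shape used here (an "agents" list whose entries have "name" and "disallow" keys) is an assumption and is not confirmed by this guide, and the sample paths are made up; inspect the real output of /robots-builder.json on your site and adapt the keys before relying on it.

```python
import json


def render_robots_txt(builder: dict) -> str:
    """Render User-agent/Disallow groups from the (assumed) builder JSON."""
    lines = []
    for agent in builder.get("agents", []):
        lines.append(f"User-agent: {agent['name']}")
        for path in agent.get("disallow", []):
            lines.append(f"Disallow: {path}")
        lines.append("")  # blank line between groups
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical sample in the assumed shape, scoped to a /forum subfolder.
sample = json.loads(
    '{"agents": [{"name": "*", "disallow": ["/forum/admin/", "/forum/auth/"]}]}'
)
print(render_robots_txt(sample), end="")
```

The rendered text can then be merged into the robots.txt that your main web server serves at the root of the domain.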

Canonicalization

Google is keen on indexing canonical versions of pages. In Discourse, for a topic with multiple replies, the canonical link (the first post) is handed over to Google, which then makes the call on indexing. Topics longer than 20 posts will be paginated, each page being a canonical link containing up to 20 posts.

For example, the canonical tag for the last reply on this topic will be https://meta.discourse.org/t/try-out-the-new-sidebar-and-notification-menus/238821?page=12.
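
The page arithmetic can be sketched as below. It assumes 20 posts per page as stated above and that the first page carries no page parameter; it takes the post's position in the topic, which can differ from its post number if earlier posts were deleted.

```python
POSTS_PER_PAGE = 20  # page size stated in this guide


def canonical_topic_url(topic_url: str, post_position: int,
                        per_page: int = POSTS_PER_PAGE) -> str:
    """Return the canonical page URL for the post at a 1-based position."""
    if post_position < 1:
        raise ValueError("post_position is 1-based")
    page = (post_position + per_page - 1) // per_page  # ceiling division
    # Assumption: page 1 is the bare topic URL without ?page=1.
    return topic_url if page == 1 else f"{topic_url}?page={page}"


print(canonical_topic_url("https://yourforum.com/t/slug/123", 45))
```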

`allow_indexing_non_canonical_urls` setting

The hidden site setting allow_indexing_non_canonical_urls (default: true) controls whether non-canonical URL variants receive an X-Robots-Tag: noindex header. When set to false, any request to a URL that differs from the canonical URL (e.g. a direct post link like /t/slug/123/45 when the canonical is /t/slug/123?page=3) will be served with a noindex header.

Embedded topics

When topics are embedded on external websites, you can use the embed_set_canonical_url setting to point the canonical URL to the original embed location. This prevents duplicate content issues when the same topic appears both on your forum and the embedding site.

Schema markup

Discourse uses extensive schema.org markup to help search engines understand your content. Each content type includes rich structured data that appears in search results and helps with discoverability.

Topics

Topics use the DiscussionForumPosting schema to represent the main discussion thread, including:

headline - The topic title
datePublished - When the topic was created
articleSection - The category name
keywords - All topic tags
publisher - Your site/organization information
author - The original poster’s information

Posts and replies

Individual posts within topics use the Comment schema, including:

author - Post author with name and profile link
text - The post content
datePublished and dateModified - Creation and edit timestamps
interactionStatistic - Like counts using InteractionCounter with LikeAction type

Breadcrumbs

Category navigation uses the BreadcrumbList schema that appears in search results, showing the category hierarchy path. Each breadcrumb includes:

itemListElement - Individual category links
position - Order in the hierarchy
Category colors for visual distinction

Categories and topic lists

Category pages and topic list views use the ItemList schema:

Ordered with ItemListOrderDescending for chronological sorting
Includes position metadata for each item
Helps search engines understand content structure and hierarchy

Homepage

The site homepage includes JSON-LD structured data for WebSite with SearchAction, which enables the Google search box to appear directly in search results for your site.
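
For orientation, WebSite markup with a SearchAction generally looks like the object built below. This is a generic schema.org sketch, not a copy of what Discourse emits: the property set and the /search?q= URL template are assumptions, so compare it with the JSON-LD on your own homepage.

```python
import json


def website_json_ld(site_url: str, site_name: str) -> dict:
    """Build a generic schema.org WebSite object with a SearchAction."""
    base = site_url.rstrip("/")
    return {
        "@context": "https://schema.org",
        "@type": "WebSite",
        "url": base,
        "name": site_name,
        "potentialAction": {
            "@type": "SearchAction",
            # Assumed search URL template; the placeholder name is arbitrary.
            "target": f"{base}/search?q={{search_term_string}}",
            "query-input": "required name=search_term_string",
        },
    }


print(json.dumps(website_json_ld("https://yourforum.com/", "Your Forum"), indent=2))
```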

About page

Your About page uses the AboutPage schema with Organization information, helping search engines understand your community’s identity and purpose.

Sitemap

Discourse incorporates a sitemap index located at /sitemap.xml which is enabled by default via the enable sitemap setting. This facilitates better indexing by search engines. There are other sitemaps as well:

Recent Sitemap (/sitemap_recent.xml) - Topics bumped in the last 3 days (cached for 1 hour)
News Sitemap (/news.xml) - Topics bumped in the last 72 hours in Google News format (cached for 5 minutes), useful for news-oriented communities
Paginated Sitemaps - Full catalog of topics split into pages (cached for 24 hours)

Sitemaps are automatically regenerated hourly by a scheduled job and include last modified timestamps. You can configure the number of topics per sitemap page via the sitemap_page_size setting (default: 10,000).
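
To see what the index lists, the XML can be parsed with the standard library. The sketch below follows the sitemaps.org index format (sitemapindex, sitemap, loc, lastmod) and works on XML text you have already downloaded; the function names are illustrative. The second helper estimates how many paginated sitemaps a site needs for a given page size.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_entries(index_xml: str) -> list:
    """Return (loc, lastmod) pairs from a sitemap index document."""
    root = ET.fromstring(index_xml)
    entries = []
    for node in root.findall("sm:sitemap", SITEMAP_NS):
        loc = node.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = node.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        entries.append((loc, lastmod))
    return entries


def sitemap_page_count(topic_count: int, page_size: int = 10_000) -> int:
    """Number of paginated sitemaps needed for topic_count topics."""
    return (topic_count + page_size - 1) // page_size  # ceiling division
```

For example, 25,000 topics at the default page size would be split across three paginated sitemaps.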

Web crawlers

Web crawlers, also known as robots or bots, are essential for indexing web pages and making your content discoverable. Discourse uses sophisticated crawler detection to serve optimized content and manage bot traffic effectively.

Crawler detection

Discourse automatically detects and handles various types of crawlers, including:

Search engines: Googlebot, Bingbot, DuckDuckBot, and others
Social media: Facebookbot, Twitterbot, LinkedInBot, Discordbot
AI crawlers: GPTBot, ClaudeBot, Anthropic-AI, BrightBot
Archive services: Wayback Machine, Archive.org
Monitoring services: Lighthouse, Google Inspection Tool

When a crawler is detected, Discourse serves optimized content and adds special response headers:

X-Discourse-Crawler-View: true - Indicates crawler-optimized content
Last-Modified headers - Enables efficient re-crawling

Managing crawler traffic

Some crawlers can be overly enthusiastic, hitting your forum with many requests. Discourse provides several settings to manage crawler behavior:

blocked_crawler_user_agents - Completely block specific crawlers (default blocklist includes: mauibot, semrushbot, ahrefsbot, blexbot, seo spider)
slow_down_crawler_user_agents - Rate limit crawlers rather than block them (default includes AI crawlers like GPTBot, ClaudeBot)
allowed_crawler_user_agents - Danger: setting this will block ALL crawlers not explicitly listed, including Googlebot. Only use this if you intend to restrict crawling to a specific set of bots. It takes full precedence over blocked_crawler_user_agents. Leave blank unless you have a specific reason to use it.
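
The precedence between these settings can be modeled as follows. This is a simplified sketch of the rules described above, not Discourse's implementation; case-insensitive substring matching on the user agent is an assumption, and requests that are not crawlers at all are out of scope.

```python
def crawler_decision(user_agent: str, allowed=(), blocked=(), slowed=()) -> str:
    """Return "allow", "slow", or "block" for a crawler's user agent."""
    ua = user_agent.lower()

    def matches(patterns) -> bool:
        return any(p.lower() in ua for p in patterns if p)

    # A non-empty allowlist takes full precedence: unlisted crawlers are blocked.
    if allowed:
        return "allow" if matches(allowed) else "block"
    if matches(blocked):
        return "block"
    if matches(slowed):
        return "slow"
    return "allow"


googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(crawler_decision(googlebot, allowed=["bingbot"]))
```

With an allowlist that names only bingbot, the example prints "block" for Googlebot, which is why the setting is best left blank.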

Automatic noindex headers

Beyond the global allow_index_in_robots_txt setting, Discourse automatically adds X-Robots-Tag: noindex response headers to several page types:

Search pages — always noindexed (search results are not useful as indexed pages)
Badge pages — always noindexed
Group pages — always noindexed
User profile pages — always noindexed
Tag filter/intersection pages — noindexed (tag index and show pages are exempt)
RSS/Atom feeds — always noindexed

These are applied at the controller level and are not configurable. They ensure that ephemeral, low-value, or duplicate-content pages do not pollute search engine indexes.
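
When auditing, you can check a response for this header with a helper like the one below. It is a minimal sketch that takes the response headers as a plain dictionary and ignores user-agent-scoped directives (such as a value prefixed with a bot name).

```python
def is_noindexed(headers: dict) -> bool:
    """True if an X-Robots-Tag header contains the noindex directive."""
    for name, value in headers.items():
        if name.lower() != "x-robots-tag":
            continue
        directives = [d.strip().lower() for d in value.split(",")]
        if "noindex" in directives:
            return True
    return False


print(is_noindexed({"Content-Type": "text/html", "X-Robots-Tag": "noindex"}))
```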

Crawler analytics

Administrators can monitor crawler activity at Admin → Reports → Web Crawlers to see which bots are accessing your site and how frequently.

Additional SEO features

AI crawler guidance through `llms.txt`

Discourse supports the llms.txt convention, which provides a standard way to guide LLM-based crawlers (like those used by AI assistants) about your site’s content and structure.

To enable it:

Prepare a .txt or .md file (max 512 KB) describing your site for LLM crawlers
Upload it via the llms_txt site setting (Admin → Settings → Security)
It will be served at https://yourforum.com/llms.txt

The llms.txt file is distinct from robots.txt and is specifically designed to guide LLM-based crawlers. Rather than providing access rules, it provides semantic context, describing what your site is, what content it contains, and how an AI assistant should understand and use it. Think of robots.txt as telling bots ‘where they can go’, and llms.txt as telling AI ‘what it’s looking at and why it matters’.
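
Before uploading, the file can be checked against the limits listed above. This is a small sketch; it assumes that "512 KB" means 512 × 1024 bytes, and the function name is illustrative.

```python
from pathlib import PurePath

MAX_BYTES = 512 * 1024  # assumption: 512 KB interpreted as 512 * 1024 bytes
ALLOWED_SUFFIXES = {".txt", ".md"}


def llms_txt_problems(filename: str, content: bytes) -> list:
    """Return a list of human-readable problems; empty means it looks fine."""
    problems = []
    if PurePath(filename).suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append("file must be .txt or .md")
    if len(content) > MAX_BYTES:
        problems.append(f"file is {len(content)} bytes; limit is {MAX_BYTES}")
    if not content.strip():
        problems.append("file is empty")
    return problems


print(llms_txt_problems("llms.md", b"# Your Forum\n\nA community about ...\n"))
```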

OpenSearch integration

Discourse provides an OpenSearch description at /opensearch.xml, enabling browser search integration. Users can add your forum directly to their browser’s search engine list.

RSS feeds

Each topic has an RSS feed available at /t/{slug}/{id}.rss, which includes:

Topic title and description
All posts/replies with author information
Publication dates and last update times
Categories and tags

Note: RSS feed paths (/t/*/*.rss, /c/*.rss) are disallowed in robots.txt for the general * wildcard agent to prevent duplicate content indexing by most crawlers. Googlebot is explicitly exempted from this restriction and can access RSS feeds.

Google site verification

Use the google_site_verification_token setting to add Google Search Console verification meta tags without editing theme templates.

Progressive web app (PWA)

Discourse generates a Web App Manifest at /manifest.json for mobile app discovery and installation prompts.

Migrations and URL redirections

The permalink feature redirects old URLs to preserve SEO, prevent “Page Not Found” errors, and give search engines the right metadata for easier indexing.

If your community site is migrated to Discourse by our team, the URL redirections are included unless there are valid reasons not to do so.

If you are using one of the existing importer scripts, you should ensure that the script handles this^[2]. You can manually add permalinks from your admin panel, in Customize → Permalinks.

De-indexing methods

To get pages out of Google’s index, you can either remove the content or block access to the page. Depending on your needs, you can make your whole site private^[3]. You can exclude topics by deleting them or putting them in restricted categories. Hidden topics aren’t indexed by default, but they can be if there’s a public link somewhere that points to them.

To take pages out of search results quickly, use the Removals tool in Google Search Console. Google documents these removals as temporary, so for a lasting removal, combine the tool with one of the methods above. Learn more at “Remove information on your website from Google” in Search Console Help.

↩︎
Looking for the permalink string in the import script should give you this info. ↩︎
Look for the login required setting. ↩︎

Last edited by @MarkDoerr 2026-07-10T01:34:18Z

Last checked by @MarkDoerr 2026-03-19T23:39:55Z
