Indexing Discourse Community Content in Glean AI

Our company recently started using Glean for internal knowledge management. We'd like to index our Discourse Community, but we seem to be running into the following error message:

The limitations of Glean’s website crawler connector include the following:

  1. Access Restrictions: The crawler can struggle with websites that enforce strict access policies or sit behind authentication walls it cannot get past, even though it supports several authentication schemes (e.g., Basic, Bearer, NTLMv2) and cookies.
  2. Dynamic Content Limitation: By default, the crawler does not index pages that require JavaScript to render; Client-Side Rendering (CSR) must be enabled explicitly, which adds setup steps to the integration.
  3. Crawl Frequency and Load Management: While Glean allows for configurable crawl frequencies, organizations may face challenges in managing the load on their servers, especially if multiple instances are active simultaneously. This can lead to issues with performance if not properly orchestrated.
  4. URL Management: The crawler uses regular expressions to match URLs, and misconfigured regex patterns can lead to fetch failures. It must also respect robots.txt files, which can block crawling of certain pages based on the website's rules (see the sketch below).
  5. Content Type Limitations: The crawler may have limitations in indexing specific content types or formats, such as certain interactive elements or files that are not directly supported by the system (like specific non-text formats) unless custom solutions are implemented.

These limitations can pose challenges for organizations looking to fully leverage the capabilities of Glean’s connector in capturing and indexing web-based information efficiently.
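For what it's worth, point 4 can be sanity-checked locally before pointing any crawler at the forum. Below is a minimal Python sketch, assuming a hypothetical forum domain, include/exclude patterns, and a "GleanBot" user agent; none of these are Glean's actual values.

```python
import re
import urllib.robotparser

# Hypothetical forum domain and URL patterns -- adjust for your site.
FORUM = "https://community.example.com"
INCLUDE = re.compile(r"^https://community\.example\.com/t/[^/]+/\d+")  # topic pages
EXCLUDE = re.compile(r"/u/|\?page=")  # user profiles, pagination

# Respect robots.txt the same way the crawler must.
rp = urllib.robotparser.RobotFileParser(FORUM + "/robots.txt")
rp.read()

candidates = [
    FORUM + "/t/welcome-to-our-community/1",
    FORUM + "/u/some_user",
    FORUM + "/t/welcome-to-our-community/1?page=2",
]

for url in candidates:
    allowed = rp.can_fetch("GleanBot", url)  # assumed user-agent name
    matched = bool(INCLUDE.search(url)) and not EXCLUDE.search(url)
    print(f"{url}\n  robots.txt allows: {allowed}, regex match: {matched}")
```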

Has anyone successfully indexed their Discourse with an AI provider, like Glean?

It's not a matter of AI, but of crawlers. And AFAIK, the answer is no, and yes. If a category is visible to everyone, it can be scraped; that's how Googlebot works. If a forum is behind a login, or the visibility of a category is limited by trust levels, scraping is impossible. And I really hope that will never be broken, because it is one of the most important security measures.

But sure, you can scrape such "hidden" content if

  • you set up a system where a bot can log in and read the content, or
  • you index the content from the inside, using Discourse AI connected to the model you want (or a similar system); see the sketch below
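For the second option, here is a minimal sketch of reading content "from the inside" over the Discourse REST API, assuming an API key generated in the admin panel (Admin → API). The base URL and credentials are placeholders.

```python
import requests

# Placeholders -- point these at your own forum and API key.
BASE_URL = "https://community.example.com"
HEADERS = {
    "Api-Key": "YOUR_API_KEY",
    "Api-Username": "system",
}

def iter_latest_topic_ids():
    """Yield topic ids from the latest-topics listing."""
    resp = requests.get(f"{BASE_URL}/latest.json", headers=HEADERS)
    resp.raise_for_status()
    for topic in resp.json()["topic_list"]["topics"]:
        yield topic["id"]

def fetch_topic(topic_id):
    """Fetch a topic, including ones only the API user can see."""
    resp = requests.get(f"{BASE_URL}/t/{topic_id}.json", headers=HEADERS)
    resp.raise_for_status()
    return resp.json()

for topic_id in iter_latest_topic_ids():
    topic = fetch_topic(topic_id)
    # Hand topic["post_stream"]["posts"] to your indexing pipeline here.
    print(topic["title"])
```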

If you set the crawler's user agent so that Discourse identifies it as a crawler bot, Discourse will render a basic HTML view, which is much easier to index.
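For example, a request that presents a known crawler user agent gets the pre-rendered view. "Googlebot" is used below only to demonstrate the behaviour; in practice you would use your crawler's real user agent.

```python
import requests

# Discourse serves a crawler-friendly, pre-rendered HTML page to requests
# whose User-Agent it recognizes as a bot -- no JavaScript app shell needed.
url = "https://community.example.com/t/welcome-to-our-community/1"
html = requests.get(url, headers={"User-Agent": "Googlebot"}).text
print(html[:500])
```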

Alternatively, add the crawler's user agent to the hidden site setting `crawler_user_agents`.
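That setting can be changed in the admin UI, or over the admin API. The sketch below assumes the `PUT /admin/site_settings/<name>` endpoint that the admin UI itself calls; verify it against your Discourse version before relying on it.

```python
import requests

# Placeholders -- use your own forum URL and an admin-scoped API key.
BASE_URL = "https://community.example.com"
HEADERS = {"Api-Key": "YOUR_ADMIN_API_KEY", "Api-Username": "system"}

setting = "crawler_user_agents"
# Pipe-separated list; "GleanBot" is an assumed user-agent name.
value = "Googlebot|Bingbot|GleanBot"

resp = requests.put(
    f"{BASE_URL}/admin/site_settings/{setting}",
    headers=HEADERS,
    data={setting: value},
)
resp.raise_for_status()
```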
