Indexing Discourse Community Content in Glean AI

Our company recently started using Glean for internal knowledge management. We'd like to index our Discourse Community, but we seem to be running into the following error message:

The limitations of Glean’s website crawler connector include the following:

  1. Access Restrictions: The crawler can struggle with websites that enforce strict access policies or sit behind authentication walls it cannot get past, even though it supports several authentication schemes (e.g., Basic, Bearer, NTLMv2) and cookies.
  2. Dynamic Content Limitation: By default, the crawler does not index pages that require JavaScript to render; Client-Side Rendering (CSR) must be enabled explicitly, which adds setup steps to the integration.
  3. Crawl Frequency and Load Management: While Glean allows for configurable crawl frequencies, organizations may face challenges in managing the load on their servers, especially if multiple instances are active simultaneously. This can lead to issues with performance if not properly orchestrated.
  4. URL Management: The crawler uses regular expressions to match URLs, and misconfigured regex patterns can lead to fetch failures. It must also respect robots.txt files, which can block crawling of certain pages based on the website's rules (see the sketch below).
  5. Content Type Limitations: The crawler may have limitations in indexing specific content types or formats, such as certain interactive elements or files that are not directly supported by the system (like specific non-text formats) unless custom solutions are implemented.

These limitations can pose challenges for organizations looking to fully leverage the capabilities of Glean’s connector in capturing and indexing web-based information efficiently.
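For what it's worth, point 4 can be sanity-checked locally before pointing any crawler at the forum. Below is a minimal Python sketch, assuming a hypothetical forum domain, include/exclude patterns, and a "GleanBot" user agent; none of these are Glean's actual values.

```python
import re
import urllib.robotparser

# Hypothetical forum domain and URL patterns -- adjust for your site.
FORUM = "https://community.example.com"
INCLUDE = re.compile(r"^https://community\.example\.com/t/[^/]+/\d+")  # topic pages
EXCLUDE = re.compile(r"/u/|\?page=")  # user profiles, pagination

# Respect robots.txt the same way the crawler must.
rp = urllib.robotparser.RobotFileParser(FORUM + "/robots.txt")
rp.read()

candidates = [
    FORUM + "/t/welcome-to-our-community/1",
    FORUM + "/u/some_user",
    FORUM + "/t/welcome-to-our-community/1?page=2",
]

for url in candidates:
    allowed = rp.can_fetch("GleanBot", url)  # assumed user-agent name
    matched = bool(INCLUDE.search(url)) and not EXCLUDE.search(url)
    print(f"{url}\n  robots.txt allows: {allowed}, regex match: {matched}")
```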

Has anyone successfully indexed their Discourse with an AI provider, like Glean?

It's not a matter of AI, but of crawlers. And AFAIK, the answer is no, and yes. If a category is visible to everyone, it can be scraped; that's how Googlebot works. If a forum is behind a login, or the visibility of a category is limited by trust levels, scraping is impossible. And I really hope that will never be broken, because it is one of the most important security measures.

But sure, you can scrape such "hidden" content if

  • you set up a system where a bot can log in and read the content, or
  • you index the content from the inside, using Discourse AI connected to the model you want (or a similar system); see the sketch below
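For the second option, here is a minimal sketch of reading content "from the inside" over the Discourse REST API, assuming an API key generated in the admin panel (Admin → API). The base URL and credentials are placeholders.

```python
import requests

# Placeholders -- point these at your own forum and API key.
BASE_URL = "https://community.example.com"
HEADERS = {
    "Api-Key": "YOUR_API_KEY",
    "Api-Username": "system",
}

def iter_latest_topic_ids():
    """Yield topic ids from the latest-topics listing."""
    resp = requests.get(f"{BASE_URL}/latest.json", headers=HEADERS)
    resp.raise_for_status()
    for topic in resp.json()["topic_list"]["topics"]:
        yield topic["id"]

def fetch_topic(topic_id):
    """Fetch a topic, including ones only the API user can see."""
    resp = requests.get(f"{BASE_URL}/t/{topic_id}.json", headers=HEADERS)
    resp.raise_for_status()
    return resp.json()

for topic_id in iter_latest_topic_ids():
    topic = fetch_topic(topic_id)
    # Hand topic["post_stream"]["posts"] to your indexing pipeline here.
    print(topic["title"])
```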

If you set the crawler's user agent so that Discourse identifies it as a crawler bot, Discourse will render a basic HTML view, which is much easier to index.
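For example, a request that presents a known crawler user agent gets the pre-rendered view. "Googlebot" is used below only to demonstrate the behaviour; in practice you would use your crawler's real user agent.

```python
import requests

# Discourse serves a crawler-friendly, pre-rendered HTML page to requests
# whose User-Agent it recognizes as a bot -- no JavaScript app shell needed.
url = "https://community.example.com/t/welcome-to-our-community/1"
html = requests.get(url, headers={"User-Agent": "Googlebot"}).text
print(html[:500])
```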

Alternatively, add the crawler's user agent to the hidden site setting `crawler_user_agents`.
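That setting can be changed in the admin UI, or over the admin API. The sketch below assumes the `PUT /admin/site_settings/<name>` endpoint that the admin UI itself calls; verify it against your Discourse version before relying on it.

```python
import requests

# Placeholders -- use your own forum URL and an admin-scoped API key.
BASE_URL = "https://community.example.com"
HEADERS = {"Api-Key": "YOUR_ADMIN_API_KEY", "Api-Username": "system"}

setting = "crawler_user_agents"
# Pipe-separated list; "GleanBot" is an assumed user-agent name.
value = "Googlebot|Bingbot|GleanBot"

resp = requests.put(
    f"{BASE_URL}/admin/site_settings/{setting}",
    headers=HEADERS,
    data={setting: value},
)
resp.raise_for_status()
```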
