Our company recently started using Glean for internal knowledge management. We’d like to index our Discourse community, but we keep running into this error message:
The limitations of Glean’s website crawler connector include the following:
- Access Restrictions: The crawler may struggle with websites that have strict access policies or are behind authentication walls that it cannot breach effectively, despite supporting various authentication schemes (e.g., Basic, Bearer, NTLMv2) and cookies.
- Dynamic Content Limitation: By default, the crawler does not index dynamically rendered web pages that require JavaScript unless specific configurations (like enabling Client-Side Rendering (CSR)) are set. This necessitates additional setup actions that might complicate the integration process.
- Crawl Frequency and Load Management: While Glean allows for configurable crawl frequencies, organizations may face challenges in managing the load on their servers, especially if multiple instances are active simultaneously. This can lead to issues with performance if not properly orchestrated.
- URL Management: The crawler uses regular expressions to match URLs; configuring these regex patterns incorrectly can lead to fetch failures. Moreover, it must respect robots.txt files, which can restrict its crawling of certain pages based on the website’s rules.
- Content Type Limitations: The crawler may have limitations in indexing specific content types or formats, such as certain interactive elements or files that are not directly supported by the system (like specific non-text formats) unless custom solutions are implemented.
These limitations can pose challenges for organizations looking to fully leverage the capabilities of Glean’s connector in capturing and indexing web-based information efficiently.
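In case it helps anyone narrow this down, here is a minimal Python sketch of how the URL Management point could be checked by hand: whether a given URL is allowed by a set of robots.txt rules and whether it matches a crawl regex. Everything specific here is an assumption for illustration: the `community.example.com` host, the sample robots.txt rules, the regex, and the `check_url` helper are all hypothetical, and none of it is Glean's API or our real configuration.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, NOT copied from a real Discourse site.
SAMPLE_ROBOTS = [
    "User-agent: *",
    "Disallow: /admin/",
    "Disallow: /u/",
]

# Hypothetical crawl pattern: only topic (/t/) and category (/c/) pages.
SAMPLE_PATTERN = r"https://community\.example\.com/(t|c)/.*"


def check_url(url, robots_lines, url_pattern, user_agent="*"):
    """Report whether a URL passes the robots.txt rules and the crawl regex."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from a list of lines, no network call
    return {
        "allowed_by_robots": parser.can_fetch(user_agent, url),
        "matches_pattern": re.fullmatch(url_pattern, url) is not None,
    }
```

Calling `check_url("https://community.example.com/t/welcome/1", SAMPLE_ROBOTS, SAMPLE_PATTERN)` should report both checks as passing, while a URL under `/admin/` should fail both. To test against a live site, `RobotFileParser` can instead be pointed at the site's robots.txt with `set_url()` and `read()`.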
Has anyone successfully indexed their Discourse with an AI provider, like Glean?