Generic rules in "robots.txt" not picked up by Googlebot

Dear Discourse Team,

first things first: we want to salute you for the spirit and endurance in conceiving and maintaining Discourse. It is always a pleasure for us and our community members to use.

We have summarized some minor observations made today and hope you will find them useful. On the other hand, we will also be happy to hear back about any misunderstandings on our side.

With kind regards,
Andreas.


Introduction

While investigating the behavior of our hosted Discourse instance at https://community.crate.io/ with respect to its robots.txt definition file [1], we discovered that Googlebot might not honor the settings as intended.

Evaluation

Let’s exercise this using the robots.txt on Meta [2] and my profile URL https://meta.discourse.org/u/amotl/.

The difference can quickly be spotted by comparing the outcomes from this free robots.txt validator:


[Screenshot: validator result when evaluating with Googlebot]

[Screenshot: validator result when evaluating with MSNBot]

Research

On this very topic, we believe we have already found the answer:

Thoughts

So, we are inclined to conclude that Googlebot currently ignores the rules defined within the User-agent: * section [3] completely, and only honors the rules within the User-agent: Googlebot section [4].
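
To make this group-selection behavior concrete, here is a minimal sketch using Python's standard-library robots.txt parser. The inline rules are a shortened excerpt of [3] and [4], not the complete file, and the parser only illustrates group selection; it does not replicate Google's wildcard handling.

    from urllib import robotparser

    # A shortened, illustrative excerpt of the two groups quoted in [3] and [4].
    ROBOTS_TXT = """\
    User-agent: *
    Disallow: /u

    User-agent: Googlebot
    Disallow: /email/
    """

    parser = robotparser.RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())

    # Googlebot matches its own group, which does not disallow /u ...
    print(parser.can_fetch("Googlebot", "https://meta.discourse.org/u/amotl"))  # True
    # ... while any other crawler falls back to the generic group and is blocked.
    print(parser.can_fetch("MSNBot", "https://meta.discourse.org/u/amotl"))     # False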


  1. https://community.crate.io/robots.txt ↩︎

  2. https://meta.discourse.org/robots.txt ↩︎

  3. robots.txt, section User-agent: *

    User-agent: *
    Disallow: /admin/
    Disallow: /auth/
    Disallow: /assets/browser-update*.js
    Disallow: /email/
    Disallow: /session
    Disallow: /user-api-key
    Disallow: /*?api_key*
    Disallow: /*?*api_key*
    Disallow: /badges
    Disallow: /u
    Disallow: /my
    Disallow: /search
    Disallow: /tag
    Disallow: /g
    Disallow: /t/*/*.rss
    Disallow: /c/*.rss
    
    ↩︎
  4. robots.txt, section User-agent: Googlebot

    User-agent: Googlebot
    Disallow: /auth/
    Disallow: /assets/browser-update*.js
    Disallow: /email/
    Disallow: /session
    Disallow: /*?api_key*
    Disallow: /*?*api_key*
    
    ↩︎

If I understood it right, the answer is not that easy. If a single user has sent a link using Gmail, Googlebot doesn’t honor robots.txt. The same thing happens if a link is shared somewhere (backlinks) where it looks to Googlebot like a normal everyday link.

Again — robots.txt is just a request.

There is another “but”, too: quite a lot of bots identify themselves as Googlebot, and the real one can be identified only via its IP address.
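
As a side note, that IP check can be done with the reverse/forward DNS lookup which Google documents for verifying Googlebot. A minimal sketch in Python; the address below is merely an example from a published Googlebot range:

    import socket

    def is_real_googlebot(ip: str) -> bool:
        # Reverse lookup: a genuine Googlebot address resolves to a host
        # under googlebot.com or google.com.
        try:
            host = socket.gethostbyaddr(ip)[0]
        except OSError:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward confirmation: the hostname must resolve back to the same address.
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except OSError:
            return False

    print(is_real_googlebot("66.249.66.1"))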


This is correct and intentionally implemented this way.

Therefore, Googlebot receives an extra HTTP header X-Robots-Tag: noindex for pages which really should not be indexed. See:


For your own domains, you may use Google Search Console → Inspect URL.

Then try to add a user-profile URL to the index, e.g. https://www.example.com/u/jacob
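
For a quick check outside of Search Console, the header can also be read directly; a minimal sketch, assuming network access and using the profile URL from above as an example:

    import urllib.request

    def x_robots_tag(url: str) -> str | None:
        # Issue a HEAD request and return the X-Robots-Tag response header, if any.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request) as response:
            return response.headers.get("X-Robots-Tag")

    # A user profile page is expected to answer with "noindex".
    print(x_robots_tag("https://meta.discourse.org/u/amotl"))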


Dear Ayke,

Confirmed.

$ http https://meta.discourse.org/u/amotl --print hH | grep -i robot
X-Robots-Tag: noindex

Thank you very much for your quick response and explanation and for referencing the corresponding patch.

With kind regards,
Andreas.


I moved this to #support, thank you for your delightfully written bug report here.

It has taken us so much time to fine-tune our rules to keep Google happy. X-Robots-Tag: noindex is unevenly supported, but it is an industry standard. The issue with just banning crawling was that, for some reason and under certain conditions, pages could find themselves in the Google index, and then there was no easy way to remove them because crawling was banned; a bit of a chicken-meets-egg problem.
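
To illustrate the pattern in a few lines (a hypothetical sketch, not Discourse’s actual implementation; Flask is used only as an example framework): the profile path stays crawlable, but every response carries X-Robots-Tag: noindex, so Google can still fetch a stray page and then drop it from its index.

    from flask import Flask, make_response

    app = Flask(__name__)

    @app.route("/u/<username>")
    def user_profile(username):
        # The page is served normally, so it remains reachable for crawlers ...
        response = make_response(f"Profile of {username}")
        # ... but indexing is explicitly refused via the response header.
        response.headers["X-Robots-Tag"] = "noindex"
        return response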


Dear Sam,

thank you very much for sharing more details about this matter. As always, I am amazed by the level of awesomeness you are pouring into every detail of Discourse.

I have to admit I haven’t been aware of this until now. So, thanks again!

Great. Thanks. If this has become a widely accepted standard now, is there hope that some of the free robots.txt validators might also start honoring it in the future?

Are you aware of any which already implement corresponding additional header checks on top of reading the robots.txt today, like Google Search Console’s Inspect URL does? That would probably help people avoid the same confusion we ran into.

With kind regards,
Andreas.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.