Did Google change how it handles robots.txt on Discourse sites?

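jackjjw · May 11, 2020, 15:37

My forum site has been live for a few weeks now, and I have submitted the URL to Google. I got a "noindex" warning earlier, but it only seemed to apply to profile pages, which is fine.

Nothing is showing up on Google yet, though. Is there anything I need to do on the forum side, or do I just wait for Google to crawl it?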

satonotdead · May 11, 2020, 16:30

Maybe you could try Google Search Console: https://search.google.com/search-console/

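jackjjw · May 12, 2020, 07:00

It looks like it is reporting that the post pages are blocked by robots.txt, which is not something I set up intentionally. Is there a setting in Discourse I need to change to unblock them? Thanks.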

sam · May 12, 2020, 07:11

There is a site setting for this. Search your site settings for allow index in robots txt; it should be enabled (it is enabled by default).

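jackjjw · May 12, 2020, 07:13

Thanks Sam, that setting is checked. Is that right?

Sorry, I got a bit mixed up: it looks like the blocked URLs are the ones for the RSS feeds.

I guess I just need to wait for Google to update or crawl the site.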

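sam · May 12, 2020, 07:22

Yes, this problem keeps recurring and keeps generating support requests.

Googlebot is kind of annoying. You cannot explicitly tell it in robots.txt not to index something. We are working on a fix to appease Googlebot, but it will take a while to roll out fully.

We tell Googlebot in robots.txt: "Hey... do not index any .rss pages on this site."
Googlebot finds a link somewhere to an .rss file on the site.
Googlebot then complains to the site admin that there is an .rss file on the site, but that it cannot work out what to do with the link because it is not allowed to index the page. Sometimes it even includes this kind of content in search results.
The site admin then complains on Meta.

Our general solution is to allow Googlebot to crawl every page on the site, and to steer it in the right direction with canonical links and indexing hints in HTTP headers.

I am working on this with @jomaxro, and we have made some good progress.

(FYI @codinghorror)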

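jackjjw · May 12, 2020, 07:29

Thanks for the update, Sam. That all makes sense, and I fully understand the difficulty. I am no SEO expert, but I have run larger sites and worked with SEO teams. On forums this is often really tricky!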

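jomaxro · May 12, 2020, 16:59

To be clear, this has nothing to do with discussion forums as such. It has to do with the...interesting...way Google handles robots.txt. See Robots.txt Introduction and Guide | Google Search Central | Documentation | Google for Developers

A page that's disallowed in robots.txt can still be indexed if linked to from other sites
While Google won't crawl or index the content blocked by a robots.txt file, we might still find and index a disallowed URL if it is linked from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the page can still appear in Google search results. To properly prevent your URL from appearing in Google search results, password-protect the files on your server, use the noindex meta tag or response header, or remove the page entirely.

We have long listed the pages we do not want indexed in the default robots.txt file of every Discourse site. That used to work well. But at some unknown point in the past it stopped being enough, and Google decided to index pages linked from elsewhere even when they were disallowed in robots.txt.

So, earlier this year, we started testing noindex headers on certain pages. That would have worked well, but now we have a conflict between robots.txt and the headers. See Block Search Indexing with noindex | Google Search Central | Documentation | Google for Developers

Important! For the noindex directive to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.

Which brings us to today. We are testing removing certain pages from robots.txt. We have to be careful, because these changes are all based on Google's documentation. We are confident they are fine for Googlebot, but we also need to check the other major crawlers to make sure we do not cause problems there.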
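
The conflict between robots.txt and the noindex header can be written out as a small decision function. This is only a toy model of the two Google passages quoted in this post, not Google's actual logic; the function name, its parameters, and the outcome strings are made up for illustration.

```python
from typing import Optional


def search_outcome(blocked_by_robots_txt: bool,
                   x_robots_tag: Optional[str],
                   linked_from_elsewhere: bool) -> str:
    """Toy model (hypothetical) of the behaviour described in the quoted docs."""
    if blocked_by_robots_txt:
        # The crawler never fetches the page, so a noindex header goes unseen.
        if linked_from_elsewhere:
            return "may appear (URL only)"
        return "not indexed"
    # The page is fetched, so the response header can be honoured.
    if x_robots_tag is not None and "noindex" in x_robots_tag.lower():
        return "not indexed"
    return "indexed"


# Blocked in robots.txt AND sending noindex: the header is never seen.
print(search_outcome(True, "noindex", linked_from_elsewhere=True))
# Crawlable and sending noindex: the header is seen and respected.
print(search_outcome(False, "noindex", linked_from_elsewhere=True))
```

In this model the only way to reliably keep a linked page out of results is the second case, which is why entries have to come out of robots.txt before the header can do its job.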

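codinghorror · May 13, 2020, 07:04

[quote="jomaxro, post:8, topic:151064"]
That used to work well. But at some unknown point in the past it stopped being enough, and Google decided to index pages linked from elsewhere even when they were disallowed in robots.txt.[/quote]
Quoting for emphasis. Google's behavior changed here, ours did not, so it will take some time to adapt.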

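jackjjw · June 25, 2020, 13:28

Hi Jeff, that all makes sense to me. I just want to double-check: could I have accidentally hidden the forum pages somewhere in my Google settings? The home page and category pages currently show up in Google, but none of the forum pages do, and it has been that way for a few months. This is my site: https://community.jackwallington.com/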

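codinghorror · June 25, 2020, 17:20

I believe we have made all the adjustments on our end to accommodate the recent change in Google's behavior... maybe @jomaxro can confirm? You should make sure you are on the latest version of Discourse.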

jomaxro · June 25, 2020, 17:27

I am not sure, I will need to check. I think we made some manual changes to robots.txt during testing (Meta only).

jomaxro · June 25, 2020, 17:32

Looking at discourse/app/controllers/robots_txt_controller.rb at main · discourse/discourse · GitHub, the changes were local (Meta only). I will fix that. We still have some long-running tests in progress, but I am confident about this.

jomaxro · June 25, 2020, 19:15

The necessary changes have been made in
https://github.com/discourse/discourse/commit/b52143feff8c32f21ed53033b6a0a65ee45dce0e

jackjjw · June 25, 2020, 19:31

Could I have set noindex somewhere on the post pages? Although Google says it now ignores that.

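jomaxro · June 25, 2020, 19:40

Unless you have installed a plugin that adds it, I cannot think of any way such a header would be added. Google does not ignore noindex headers. Google ignores robots.txt when other sites link to your pages. When crawling, though, Google does honor robots.txt, which is why the commit above removed entries from robots.txt in favor of the noindex headers added earlier.

I would suggest signing up for Google Search Console so you can see for yourself what Google sees. There may be some other issue keeping topics from being indexed.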

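jackjjw · June 25, 2020, 19:54

Thanks Joshua. Google Search Console looks fine and shows all the threads as listed. Strangely, when I search for the threads, the thread pages do not show up, but the home page and category pages do.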

sam · December 22, 2020, 05:17

I am going to revert this change and apply this condition explicitly for Googlebot.

Googlebot is a very smart crawler, but many other crawlers are not so smart.

jomaxro · December 22, 2020, 05:22

Good point. Note that there is also a follow-up commit that needs to be reverted.

sam · December 22, 2020, 06:01

I created this PR to address it:

github.com/discourse/discourse

FEATURE: explicitly ban outlier traffic sources in robots.txt (#11553)

master ← crawl-less

merged 09:51PM - 22 Dec 20 UTC

SamSaffron

+27 -9

Googlebot handles no-index headers very elegantly. It advises to leave as many routes as possible open and uses headers for high fidelity rules regarding indexes. Discourse adds special `x-robot-tags` noindex headers to users, badges, groups, search and tag routes. Following up on b52143feff8c32f2 we now have it so Googlebot gets special handling. Rest of the crawlers get a far more aggressive disallow list to protect against excessive crawling.

Google will keep its special rules, while we offer stronger protection against other, simpler bots. The default robots.txt file now looks like this:

# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
#
User-agent: mauibot
Disallow: /


User-agent: semrushbot
Disallow: /


User-agent: ahrefsbot
Disallow: /


User-agent: blexbot
Disallow: /


User-agent: seo spider
Disallow: /


User-agent: *
Disallow: /admin/
Disallow: /auth/
Disallow: /assets/browser-update*.js
Disallow: /email/
Disallow: /session
Disallow: /user-api-key
Disallow: /*?api_key*
Disallow: /*?*api_key*
Disallow: /badges
Disallow: /u
Disallow: /my
Disallow: /search
Disallow: /tags
Disallow: /g
Disallow: /t/*/*.rss
Disallow: /tags/*.rss
Disallow: /c/*.rss


User-agent: Googlebot
Disallow: /admin/
Disallow: /auth/
Disallow: /assets/browser-update*.js
Disallow: /email/
Disallow: /session
Disallow: /user-api-key
Disallow: /*?api_key*
Disallow: /*?*api_key*
