I need to edit the robots.txt file - where is it?

Correct me if I am wrong, but Latest is the default display, though not the default link, right? This has to do with the actual /latest link.

We have every single page of latest in the index. The content is like quicksand, and there is nothing on the homepage that is “site specific” and not quicksand, which is a big problem:

We absolutely do not want people landing on page 2, 3, etc… page 1 maybe, but the content on page 1 keeps on changing.

This URL for example https://meta.discourse.org/latest?no_definitions=true&no_subcategories=false&page=2 is stored in the Google index.

I am hesitant to change stuff, though, because I do not know how Google will deal with us adding “don’t store in index” directives here. Also, people never land on these pages anyway, because Google automatically detects they are rubbish and does not send people there.

If there is anything super positive here, I guess it would be having a wonderful “HTML off” homepage that has useful enough content that search engines would send people to the page.

For example, it would be super nice if a search for discourse community discussions ranked meta.discourse.org first because we had a nice front page.

A simple fix we can make here that can give us lots of mileage is a nicer expansion of pinned posts:

They are stable content, we can expand that:

In fact we can even expand it a bit further for crawler views. Additionally we could list all the categories on the home page as well in the crawler view… there is a bunch of stuff we can do.

Hello!
This is my file:

# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
#
User-agent: *
Disallow: /auth/cas
Disallow: /auth/facebook/callback
Disallow: /auth/twitter/callback
Disallow: /auth/google/callback
Disallow: /auth/yahoo/callback
Disallow: /auth/github/callback
Disallow: /auth/cas/callback
Disallow: /assets/browser-update*.js
Disallow: /users/
Disallow: /u/
Disallow: /my/
Disallow: /badges/
Disallow: /search
Disallow: /search/
Disallow: /tags
Disallow: /tags/
Disallow: /email/
Disallow: /session
Disallow: /session/
Disallow: /admin
Disallow: /admin/
Disallow: /user-api-key
Disallow: /user-api-key/
Disallow: /*?api_key*
Disallow: /*?*api_key*
Disallow: /groups
Disallow: /groups/
Disallow: /t/*/*.rss
Disallow: /tags/*.rss
Disallow: /c/*.rss


User-agent: mauibot
Disallow: /


User-agent: bingbot
Crawl-delay: 60
Disallow: /auth/cas
Disallow: /auth/facebook/callback
Disallow: /auth/twitter/callback
Disallow: /auth/google/callback
Disallow: /auth/yahoo/callback
Disallow: /auth/github/callback
Disallow: /auth/cas/callback
Disallow: /assets/browser-update*.js
Disallow: /users/
Disallow: /u/
Disallow: /my/
Disallow: /badges/
Disallow: /search
Disallow: /search/
Disallow: /tags
Disallow: /tags/
Disallow: /email/
Disallow: /session
Disallow: /session/
Disallow: /admin
Disallow: /admin/
Disallow: /user-api-key
Disallow: /user-api-key/
Disallow: /*?api_key*
Disallow: /*?*api_key*
Disallow: /groups
Disallow: /groups/
Disallow: /t/*/*.rss
Disallow: /tags/*.rss
Disallow: /c/*.rss

I read the tutorials above, but I still do not understand the answer to the question “Need to edit robots.txt file - where is it?”. Looking forward to receiving help from the community.

This is the content I want to update:

# See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
#
User-agent: *
Disallow: /auth/cas
Disallow: /auth/facebook/callback
Disallow: /auth/twitter/callback
Disallow: /auth/google/callback
Disallow: /auth/yahoo/callback
Disallow: /auth/github/callback
Disallow: /auth/cas/callback
Disallow: /assets/browser-update*.js
Disallow: /users/
Disallow: /u/
Disallow: /badges/
Disallow: /search
Disallow: /search/
Disallow: /tags
Disallow: /tags/

Thanks all

I think you can override the file in your own plugin.

My archive directory is this

How do I override the file in my own plugin?

Thanks

You will want to read the plugin development topics and then read this
https://meta.discourse.org/t/how-to-block-all-crawlers-but-googles/62431/4?u=cpradio

I really do not want to block the Google search engine; I just want to change the content in the robots.txt file.

Why can I not find a directory like /discourse/app/views on my website?

There is no robots.txt text file per se. It is generated by a Ruby controller.

You really need to read some of the dev topics, it explains all of that and more. The plugin should be trivial, to be honest. Or you can post something in marketplace with a budget to see if someone will build it for you.

If that is added, could it be made into an overridable setting? I clicked on this link in the newsletter, because getting user pages indexed is also something we need. We’re hoping to add additional information to them and eventually redirect the old (indexed) user pages to the Discourse ones.

I was just noticing this problem on one of my Discourse sites. The way to block those dynamic URLs from bots while still allowing search engines to crawl /latest is this:

Disallow: /latest?

That will only block the dynamic ones, but not /latest, so search engines would still be able to see the latest content. I tested the rule in Google’s Webmaster Tools and it works.
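The reason this works is that robots.txt rules match by path prefix, so `/latest?` catches any URL with a query string while leaving the bare `/latest` path alone. A minimal Ruby sketch of the prefix-matching idea (plain prefixes only; real crawlers also support `*` and `$` wildcards):

```ruby
# Minimal sketch of robots.txt Disallow matching by path prefix.
# Real crawlers additionally honor '*' and '$' wildcards.
def blocked?(path, disallow_rules)
  disallow_rules.any? { |rule| path.start_with?(rule) }
end

rules = ["/latest?"]

# The dynamic, paginated URL is blocked...
puts blocked?("/latest?no_definitions=true&page=2", rules)  # true
# ...but the plain /latest page is still crawlable.
puts blocked?("/latest", rules)                             # false
```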

Here’s an example of some of the dynamic URLs that are getting crawled on my site:

https://gist.githubusercontent.com/j127/d329c15dab45369b03321cad40448734/raw/300aa579b1386087b903da6aa52c52ff5d95828c/latest.txt

Is it possible to add that one line to robots.txt?

(Edit: I looked more closely at the file, and I wouldn’t use noindex there, at least on that dynamic rule. I’m pretty sure that Google has recommended not to use noindex in robots.txt though it was several years ago.)

These days you can block or rate-limit bad web crawlers via site settings, which indirectly edits the robots.txt file, but we still do not offer arbitrary editing.

That said, I think we should offer it… @eviltrout, can this be put in scope for 2.4? It answers a lot of requests, many of which we do not agree with, but my attitude is: “own the consequences; if you feel it is necessary, go for it :skull_and_crossbones:”

Can we at least explicitly list editing the robots.txt file as outside the scope of community support?

By the way, anyone can easily add extra rules with a simple plugin that uses the “robots_txt_index” connector template. For example: app/views/connectors/robots_txt_index/sitemap.html.erb
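As a rough sketch of that connector approach: a plugin could ship a template at a path like app/views/connectors/robots_txt_index/extra-rules.html.erb (the filename here is arbitrary), whose rendered output is appended to the generated robots.txt. For example:

```erb
<%# Hypothetical connector template; its output is appended to robots.txt %>
Disallow: /latest?
```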

Here is how I think it should work:

  • Add a new URL in the admin area that is not directly linked. For example /admin/customize/robots

    • Display a <textarea> containing the current robots.txt content.

    • If the user has not edited it before, pre-fill it based on the whitelist/blacklist settings.

    • When an admin clicks Save Changes, the content should be saved to the database and replace the forum’s existing robots.txt content.
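The save-and-replace step above could be modelled roughly like this. This is a hypothetical, simplified Ruby sketch; the class and method names are made up for illustration and are not Discourse’s actual internals:

```ruby
# Hypothetical sketch: serve a database-stored override when present,
# otherwise fall back to the generated default content.
class RobotsTxtContent
  def initialize(store)
    @store = store # any key/value store standing in for the database
  end

  # Content served at /robots.txt: the admin's override wins if set.
  def current
    @store["robots_txt_override"] || default_content
  end

  # What "Save Changes" would do: persist the override.
  def save_override(text)
    @store["robots_txt_override"] = text
  end

  # Reverting deletes the override so site settings take effect again.
  def revert!
    @store.delete("robots_txt_override")
  end

  def default_content
    "User-agent: *\nDisallow: /admin/"
  end
end

robots = RobotsTxtContent.new({})
puts robots.current # generated default
robots.save_override("User-agent: *\nDisallow: /")
puts robots.current # the stored override replaces the default
robots.revert!
puts robots.current # back to the generated default
```

The key design point is that the override is all-or-nothing: while it is set, the generated default (and the site settings that feed it) is ignored entirely.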

I strongly object to this proposal, because it puts an obscure and dangerous feature in a prominent place in the UI.

I think, for now, the path for customizing robots.txt should be something users type in themselves. If they need this feature, they should find the path by searching Google or Meta.

That is why I hid it behind “advanced editing”, but if we are making the UI too complicated, I can simplify it further (will edit that post).

I have created a PR for this:

Screenshot:

Looks good! Please make sure the revert button uses the correct icon, matching the one used by the revert feature in site settings. Also, we consistently use the word “reset”, so you can reuse that copy directly instead of creating a new translation.


Also, we need to add warnings to the handful of site settings that modify robots.txt, because a manual edit will override those settings.

PR merged! :tada:

If you update to the latest tests-passed version, you can customize robots.txt at /admin/customize/robots. The page is not linked anywhere in the UI; you will need to copy and paste the URL into your browser manually.

Note: if you override the file, any subsequent changes to site settings (e.g. “whitelisted crawler user agents”) will not be reflected in the robots.txt file (the settings will save normally, but the changes will not take effect). You can revert to the default version, at which point site settings will affect the file again.

If an override exists and an admin visits /robots.txt, they will see a comment at the top of the file noting the override, with links to modify the file or reset it to the default version.