Needing to edit robots.txt file - where is it?

jerdog · August 8, 2018, 12:38 AM

Correct me if I am wrong, but Latest is the default display, not the default link, right? This has to do with the actual /latest link.

sam · August 8, 2018, 12:46 AM

We have every single page of latest in the index. The content is like quicksand, and there is nothing on the homepage that is “site specific” and not quicksand, which is a big problem:

We absolutely do not want people landing on page 2, 3, etc. Page 1 maybe, but the content on page 1 keeps changing.

This URL for example https://meta.discourse.org/latest?no_definitions=true&no_subcategories=false&page=2 is stored in the Google index.

I am reluctant to change stuff though, because I do not know how the big Google will deal with us adding “don't store in index” directives here. Also, people never land on these pages anyway, because Google automatically detects they are rubbish and does not send people there.

If there is anything super positive here, I guess it would be having a wonderful “HTML off” homepage that has useful enough content that search engines would send people to the page.

For example, it would be super nice if discourse community discussions ranked meta.discourse.org first cause we had a nice front page.

A simple fix here we can make that can give us lots of mileage is nice expansion of pinned posts:

They are stable content, we can expand that:

In fact we can even expand it a bit further for crawler views. Additionally we could list all the categories on the home page as well in the crawler view… there is a bunch of stuff we can do.

Pham_Quyet_Nghi · August 8, 2018, 1:41 AM

Hello!
This is my file:

# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
#
User-agent: *
Disallow: /auth/cas
Disallow: /auth/facebook/callback
Disallow: /auth/twitter/callback
Disallow: /auth/google/callback
Disallow: /auth/yahoo/callback
Disallow: /auth/github/callback
Disallow: /auth/cas/callback
Disallow: /assets/browser-update*.js
Disallow: /users/
Disallow: /u/
Disallow: /my/
Disallow: /badges/
Disallow: /search
Disallow: /search/
Disallow: /tags
Disallow: /tags/
Disallow: /email/
Disallow: /session
Disallow: /session/
Disallow: /admin
Disallow: /admin/
Disallow: /user-api-key
Disallow: /user-api-key/
Disallow: /*?api_key*
Disallow: /*?*api_key*
Disallow: /groups
Disallow: /groups/
Disallow: /t/*/*.rss
Disallow: /tags/*.rss
Disallow: /c/*.rss


User-agent: mauibot
Disallow: /


User-agent: bingbot
Crawl-delay: 60
Disallow: /auth/cas
Disallow: /auth/facebook/callback
Disallow: /auth/twitter/callback
Disallow: /auth/google/callback
Disallow: /auth/yahoo/callback
Disallow: /auth/github/callback
Disallow: /auth/cas/callback
Disallow: /assets/browser-update*.js
Disallow: /users/
Disallow: /u/
Disallow: /my/
Disallow: /badges/
Disallow: /search
Disallow: /search/
Disallow: /tags
Disallow: /tags/
Disallow: /email/
Disallow: /session
Disallow: /session/
Disallow: /admin
Disallow: /admin/
Disallow: /user-api-key
Disallow: /user-api-key/
Disallow: /*?api_key*
Disallow: /*?*api_key*
Disallow: /groups
Disallow: /groups/
Disallow: /t/*/*.rss
Disallow: /tags/*.rss
Disallow: /c/*.rss

I read the tutorials above, but I still do not understand the answer to the question “Needing to edit robots.txt file - where is it?”. I look forward to help from the community.

This is the content I want to update it to:

# See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
#
User-agent: *
Disallow: /auth/cas
Disallow: /auth/facebook/callback
Disallow: /auth/twitter/callback
Disallow: /auth/google/callback
Disallow: /auth/yahoo/callback
Disallow: /auth/github/callback
Disallow: /auth/cas/callback
Disallow: /assets/browser-update*.js
Disallow: /users/
Disallow: /u/
Disallow: /badges/
Disallow: /search
Disallow: /search/
Disallow: /tags
Disallow: /tags/

Thanks all

Stranik · August 8, 2018, 6:58 AM

I think you can override the file in your own plugin.

github.com/discourse/discourse

app/views/robots_txt/index.erb

main

<%= @robots_info[:header] %>
<% if Discourse.base_path.present? %>
# This robots.txt file is not used. Please append the content below in the robots.txt file located at the root
<% end %>
#
<% @robots_info[:agents].each do |agent| %>
User-agent: <%= agent[:name] %>
<% agent[:disallow].each do |path| %>
Disallow: <%= path %>
<% end %>


<% end %>

<%- if SiteSetting.enable_sitemap? && !SiteSetting.login_required? %>
Sitemap: <%= request.protocol %><%= request.host_with_port %>/sitemap.xml
<% end %>

<%= server_plugin_outlet "robots_txt_index" %>
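To see what the loop in that template produces, here is a rough standalone sketch using Ruby's stdlib ERB. It is not the Discourse view itself: the Discourse-specific parts (`Discourse.base_path`, `SiteSetting`, `server_plugin_outlet`) are left out, and `SAMPLE_ROBOTS_INFO` and `render_robots` are made-up names with sample data shaped like `@robots_info`.

```ruby
require "erb"

# Hypothetical sample data, shaped like @robots_info in the template above.
SAMPLE_ROBOTS_INFO = {
  header: "# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file",
  agents: [
    { name: "*", disallow: ["/admin/", "/search"] },
    { name: "mauibot", disallow: ["/"] },
  ],
}.freeze

# Simplified copy of the template: only the header and the per-agent loop.
ROBOTS_TEMPLATE = <<~'TEMPLATE'
  <%= robots_info[:header] %>
  #
  <% robots_info[:agents].each do |agent| -%>
  User-agent: <%= agent[:name] %>
  <% agent[:disallow].each do |path| -%>
  Disallow: <%= path %>
  <% end -%>

  <% end -%>
TEMPLATE

# Render the simplified template; trim_mode "-" enables the "-%>" trimming tags.
def render_robots(robots_info)
  ERB.new(ROBOTS_TEMPLATE, trim_mode: "-").result_with_hash(robots_info: robots_info)
end

puts render_robots(SAMPLE_ROBOTS_INFO)
```

Each entry in `agents` becomes one `User-agent` group followed by its `Disallow` lines, which is why the generated file repeats the same list of paths for every agent.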

Pham_Quyet_Nghi · August 11, 2018, 2:05 AM

My archive directory is this:

How do I override the file in my own plugin?

Thanks

cpradio · August 11, 2018, 2:15 AM

You will want to read the plugin development topics and then read this
https://meta.discourse.org/t/how-to-block-all-crawlers-but-googles/62431/4?u=cpradio

Pham_Quyet_Nghi · August 11, 2018, 2:26 AM

I really do not want to block the Google search engine. I only want to change the content of the robots.txt file.

Why can I not find the directory /discourse/app/views on my website?

Mittineague · August 11, 2018, 3:32 AM

There is no robots.txt text file per se. It is generated by a Ruby controller:

github.com/discourse/discourse

app/controllers/robots_txt_controller.rb

main

# frozen_string_literal: true

class RobotsTxtController < ApplicationController
  layout false
  skip_before_action :preload_json,
                     :check_xhr,
                     :redirect_to_login_if_required,
                     :redirect_to_profile_if_required

  OVERRIDDEN_HEADER = "# This robots.txt file has been customized at /admin/customize/robots\n"

  # NOTE: order is important!
  DISALLOWED_PATHS = %w[
    /admin/
    /auth/
    /assets/js/browser-update*.js
    /email/
    /session
    /user-api-key
    /*?api_key*

This file has been truncated.

cpradio · August 11, 2018, 3:42 AM

You really need to read some of the Development topics; they explain all of that and more. The plugin should be trivial, to be honest. Or you can post something in Marketplace with a budget to see if someone will build it for you.

j127 · August 16, 2018, 5:53 PM

If that is added, could it be made into an overridable setting? I clicked on this link in the newsletter, because getting user pages indexed is also something we need. We’re hoping to add additional information to them and eventually redirect the old (indexed) user pages to the Discourse ones.

j127 · April 13, 2019, 5:08 PM

I was just noticing this problem on one of my Discourse sites. The way to block those dynamic URLs from bots while still allowing search engines to crawl /latest is this:

Disallow: /latest?

That will only block the dynamic ones, but not /latest, so search engines would still be able to see the latest content. I tested the rule in Google’s Webmaster Tools and it works.
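For anyone wondering why that rule leaves /latest alone, here is a rough Ruby sketch of the matching as I understand it from Google's robots.txt documentation: a rule is a prefix match against the path plus query string, `*` matches any run of characters, and a trailing `$` anchors the end. `rule_matches?` is a made-up helper for illustration, not part of Discourse or of any crawler.

```ruby
# Simplified model of robots.txt rule matching (an assumption-laden sketch,
# not a real crawler's implementation).
def rule_matches?(rule, path)
  anchored = rule.end_with?("$")
  body = anchored ? rule[0...-1] : rule
  # Escape everything literally, then turn each "*" into ".*".
  pattern = body.split("*", -1).map { |part| Regexp.escape(part) }.join(".*")
  Regexp.new("\\A" + pattern + (anchored ? "\\z" : "")).match?(path)
end

puts rule_matches?("/latest?", "/latest")
# prints false: the bare page stays crawlable
puts rule_matches?("/latest?", "/latest?no_definitions=true&no_subcategories=false&page=2")
# prints true: the dynamic URL is blocked
```

The `?` in the rule is a literal character, so only URLs that carry a query string after /latest match it.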

Here’s an example of some of the dynamic URLs that are getting crawled on my site:

https://gist.githubusercontent.com/j127/d329c15dab45369b03321cad40448734/raw/300aa579b1386087b903da6aa52c52ff5d95828c/latest.txt

Is it possible to add that one line to robots.txt?

(Edit: I looked more closely at the file, and I wouldn’t use noindex there, at least on that dynamic rule. I’m pretty sure that Google has recommended not to use noindex in robots.txt though it was several years ago.)

codinghorror · July 9, 2019, 11:30 PM

You can now block or throttle abusive web crawlers via site settings, which modifies robots.txt indirectly, but we do not yet offer free-form editing of it.

I think we should do that. @eviltrout can you scope this feature for 2.4? It answers a lot of requests, many of which we do not agree with, but my position is “it is up to you, do whatever you want if you feel the need.”

Stephen · July 10, 2019, 2:24 AM

Can we at least consider editing robots.txt to be entirely outside the scope of community support?

vinothkannans · July 10, 2019, 5:17 AM

FYI, anyone can easily add extra rules with a simple plugin that uses the “robots_txt_index” plugin outlet. For example: app/views/connectors/robots_txt_index/sitemap.html.erb

eviltrout · July 10, 2019, 7:25 PM

Here is how I think it should work:

Add a new URL to the admin section that is not directly linked. For example /admin/customize/robots
- Show a <textarea> containing the current robots.txt content.
- If they have not edited it before, pre-fill it with content based on the whitelist/blacklist.
- When the admin presses the Save Changes button, it should be saved to the database and replace the current robots.txt content for this forum.

codinghorror · July 10, 2019, 7:31 PM

I am strongly opposed to this proposal, because it gives an obscure and dangerous option high priority in the UI.

I think the route for customizing robots.txt should be a custom one that is typed in manually for now. If users want it, they should search Google or Meta to find the route.

eviltrout · July 10, 2019, 7:44 PM

That is why I hid it behind “advanced editing”, but if we are leaving out the UI, I can simplify it further (I will edit this post).

Osama · July 11, 2019, 8:44 PM

I have created a pull request (PR) for this:

github.com/discourse/discourse

FEATURE: Allow customization of robots.txt (#7884)

master ← OsamaSayegh:customize-robots-txt

merged 05:47PM - 15 Jul 19 UTC

OsamaSayegh

+282 -7

This allows admins to customize/override the content of the robots.txt file at …/admin/customize/robots. That page is not linked to anywhere in the UI -- admins have to manually type the URL to access that page.

Meta topic: https://meta.discourse.org/t/needing-to-edit-robots-txt-file-where-is-it/93879?u=osama

Screenshots: ![image](https://user-images.githubusercontent.com/17474474/61083151-b5174800-a433-11e9-890e-283221046b2b.png) ![image](https://user-images.githubusercontent.com/17474474/61083312-16d7b200-a434-11e9-9cae-dc3a9982c049.png)

@eviltrout does it make sense to prepend a comment to robots.txt that says something along the lines of "this robots.txt file has been customized at /admin/customize/robots" **if** the file is customized? It might help with figuring out why certain things are in the file and how to remove/change them?

Screenshots:

codinghorror · July 12, 2019, 12:14 AM

Looks good! Make sure the revert button uses the correct icon, the same one we use for revert in site settings. Also, we just use the word “reset”, so you can reuse that text instead of creating a new translation.

We also need some warnings about the small set of site settings that modify robots.txt, which will be overridden if you edit it manually, and so on.

Osama · July 15, 2019, 6:44 PM

The PR has just been merged:

If you update to the latest tests-passed version, you will be able to customize robots.txt at /admin/customize/robots. This page is not linked anywhere in the UI, so you will need to copy the URL and paste it into your browser manually.

Note: if you write an override, later changes to the site settings that modify robots.txt (such as whitelisted crawler user agents, etc.) will not apply to the file (the settings will be saved correctly, but the change will not be reflected in robots.txt). You can restore the default version, and site settings will then start applying to the file again.

If there are overrides and an admin views the file at /robots.txt, they will see a comment at the top indicating that there are overrides, with links that let them edit the file or reset it to the default version.
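To make the precedence described in the note concrete, here is a small plain-Ruby model of it. This is only a sketch of the behaviour, not the code from the PR: `RobotsTxtSketch`, `blocked_agents`, `save_override` and `reset_to_default` are hypothetical names, the generated default is cut down to one rule, and the header string is the one from the controller excerpt earlier in the topic.

```ruby
OVERRIDDEN_HEADER = "# This robots.txt file has been customized at /admin/customize/robots\n"

# Hypothetical in-memory model: an override, when present, shadows the file
# generated from site settings until it is reset.
class RobotsTxtSketch
  attr_accessor :blocked_agents # stands in for the crawler site settings

  def initialize
    @blocked_agents = []
    @override = nil # nil means the file was never customized
  end

  def save_override(text)
    @override = text
  end

  def reset_to_default
    @override = nil
  end

  # Admins viewing an overridden file also see the header comment.
  def render(admin: false)
    return generated if @override.nil?
    admin ? OVERRIDDEN_HEADER + @override : @override
  end

  private

  # Heavily simplified stand-in for the settings-driven default file.
  def generated
    lines = ["User-agent: *", "Disallow: /admin/", ""]
    @blocked_agents.each { |name| lines.push("User-agent: #{name}", "Disallow: /", "") }
    lines.join("\n")
  end
end

site = RobotsTxtSketch.new
site.save_override("User-agent: *\nDisallow: /latest?\n")
site.blocked_agents = ["mauibot"] # saved, but not reflected while overridden
puts site.render(admin: true)
site.reset_to_default
puts site.render # the setting applies again
```

While the override exists, the `blocked_agents` change is stored but does not show up in the output; after the reset, the generated file picks it up again.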
