Why are semrushbot and ahrefsbot blocked by default?

Jamie_Liu1 · July 14, 2020 08:57

I was reviewing the coverage report in Google Search Console and found that many of our forum pages are blocked by robots.txt, so I checked the robots.txt file. There I found that semrushbot and ahrefsbot are blocked by default.

I know these are two widely used SEO tools, so why block their bots?

neounix · July 14, 2020 09:03

Because those bots are "resource vampires" that give sites very little value relative to the resources they consume.

You can of course customize the Discourse robots.txt file and allow them if you wish, but we blocked these bots on our sites long before Discourse was released, and we continue to keep them blocked.

Note (edited):

I forgot to mention that many of these "resource vampires" do not respect robots.txt and have to be blocked at the HTTP user-agent level. Generally speaking, we block these "disrespectful resource vampires" with mod_rewrite at the reverse proxy (one of many good reasons to run behind a reverse proxy, by the way).
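The post does not show the mod_rewrite rules themselves. As a rough illustration of the same idea (rejecting requests by User-Agent before they reach the application), here is a minimal Ruby sketch written as a Rack-style middleware. This is a swapped-in technique, not the reverse-proxy setup described above; the class name `BlockUserAgents`, the pattern list, and the 403 response are all assumptions made for the example.

```ruby
# Sketch only: a Rack-style middleware that rejects requests whose
# User-Agent contains any of the given substrings (case-insensitive).
# BlockUserAgents is a hypothetical name, not part of Discourse or Rack.
class BlockUserAgents
  def initialize(app, patterns)
    @app = app
    # Escape each pattern so it is matched literally, ignoring case.
    @matcher = Regexp.union(patterns.map { |p| /#{Regexp.escape(p)}/i })
  end

  def call(env)
    agent = env["HTTP_USER_AGENT"].to_s
    if @matcher.match?(agent)
      # Blocked agents get a plain 403 and never reach the wrapped app.
      [403, { "content-type" => "text/plain" }, ["Forbidden\n"]]
    else
      @app.call(env)
    end
  end
end

# Stand-in downstream app so the sketch runs on its own.
app = ->(_env) { [200, { "content-type" => "text/plain" }, ["ok\n"]] }
guarded = BlockUserAgents.new(app, %w[semrushbot ahrefsbot])

# Illustrative user-agent string, not copied from real traffic.
status, = guarded.call({ "HTTP_USER_AGENT" => "Mozilla/5.0 (compatible; AhrefsBot/7.0)" })
puts status
```

Unlike a robots.txt rule, a check like this does not depend on the bot cooperating, which is the point made in the note above.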

Jamie_Liu1 · July 14, 2020 09:29

Thank you very much for the information!

I have run into another problem, and perhaps you can share your view on it as well.

I know Discourse blocks user pages by default, but my Google Search Console coverage report still shows some user pages as indexed. Google flags this as a problem, since none of these pages should be indexed.

Thanks!

osioke · July 14, 2020 12:35

This was fixed recently with

Can you update your Discourse and check again?

Jamie_Liu1 · July 15, 2020 02:14

@osioke Thanks for your reply! I think our installed version already includes this fix, doesn't it? I noticed it was merged in January.

Could you confirm whether I need to update to the latest version to get it?

osioke · July 15, 2020 07:03

It doesn't hurt to update, in my opinion, but yes, that fix should be in your installed version. I would recommend updating and checking again, unless you have some other reason not to update.

codinghorror · July 15, 2020 21:41

Because they're terrible? They add a lot of server load with no discernible benefit, and our customers have pageview limits on their plans.

Jamie_Liu1 · July 16, 2020 02:13

Sounds good. We are updating now. Hopefully things work after the update. I will let you know how it goes. Thanks!

trying2survive · December 2, 2020 15:30

Just to clarify, is there no way to unblock semrushbot and seo spider? We need them for SEO audits. I tried removing both from /admin/customize/robots (and also tried Allow:), but we get a 429 error in Screaming Frog. Or is this 429 error a separate issue? We would really appreciate your feedback.

Johani · December 2, 2020 16:34

429 errors mean those crawlers are being rate limited. Discourse has some rate limiting enabled by default to prevent abuse. You can read more about this here.
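For anyone scripting their own crawl, a 429 is a signal to slow down, not a permanent block. Here is a minimal Ruby sketch of how a client might decide how long to wait. It assumes the server includes a `Retry-After` header expressed in seconds, which RFC 6585 allows but does not require. The helper name `retry_delay` and the 30-second fallback are hypothetical choices for this example.

```ruby
# Sketch only: decide how long a client should wait after a response.
# retry_delay is a hypothetical helper; the 30-second default is an
# arbitrary assumption, not a Discourse setting.
def retry_delay(status, headers, default_seconds: 30)
  # Only 429 (Too Many Requests) asks the client to back off.
  return nil unless status == 429

  raw = headers["Retry-After"] || headers["retry-after"]
  # Parse a base-10 integer number of seconds; nil if absent or not numeric.
  seconds = raw.nil? ? nil : Integer(raw.to_s.strip, 10, exception: false)

  (seconds.nil? || seconds.negative?) ? default_seconds : seconds
end

p retry_delay(429, { "Retry-After" => "12" })
p retry_delay(429, {})
p retry_delay(200, {})
```

A crawler that sleeps for the returned number of seconds before retrying, and lowers its request rate overall, is less likely to keep hitting the limit.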

neounix · December 3, 2020 09:35

Note: you can also change this setting in the admin UI.

Did you try this from the Rails console (using your own container name)?

# docker exec -it socket-only bash
root@socket-only:/# rails c
[1] pry(main)> SiteSetting.blocked_crawler_user_agents
=> "mauibot|semrushbot|ahrefsbot|blexbot|seo spider"
[2] pry(main)> SiteSetting.blocked_crawler_user_agents = ""
=> ""
[3] pry(main)> SiteSetting.blocked_crawler_user_agents
=> ""
[4] pry(main)>

See also:

github.com/discourse/discourse

config/site_settings.yml

d1d87b6fa

# Available options:
#
# default            - The default value of the setting. For upload site settings, use the id of the upload seeded in db/fixtures/010_uploads.rb.
# client             - Set to true if the javascript should have access to this setting's value.
# refresh            - Set to true if clients should refresh when the setting is changed.
# min                - For a string setting, the minimum length. For an integer setting, the minimum value.
# max                - For a string setting, the maximum length. For an integer setting, the maximum value.
# regex              - A regex that the value must match.
# validator          - The name of the class that will be use to validate the value of the setting.
# allow_any          - For choice settings allow items not specified in the choice list (default true)
# secret             - Set to true if input type should be password and value needs to be scrubbed from logs (default false).
# enum               - The setting has a fixed set of allowed values, and only one can be chosen.
#                      Set to the class name that defines the set.
# locale_default     - A hash which overrides according to `SiteSetting.default_locale`.
#                      The key should be as the same as possible value of default_locale.
#
#
# type: email    - Must be a valid email address.
# type: username - Must match the username of an existing user.
# type: list     - A list of values, chosen from a set of valid values defined in the choices option.

[file truncated]

See also:

  def self.allow_crawler?(user_agent)
    return true if SiteSetting.allowed_crawler_user_agents.blank? &&
      SiteSetting.blocked_crawler_user_agents.blank?
...
...

github.com/discourse/discourse

lib/crawler_detection.rb

e0d923225

# frozen_string_literal: true

module CrawlerDetection
  WAYBACK_MACHINE_URL = "archive.org"

  def self.to_matcher(string, type: nil)
    escaped = string.split('|').map { |agent| Regexp.escape(agent) }.join('|')

    if type == :real && Rails.env == "test"
      # we need this bypass so we properly render views
      escaped << "|Rails Testing"
    end

    Regexp.new(escaped, Regexp::IGNORECASE)
  end

  def self.crawler?(user_agent, via_header = nil)
    return true if user_agent.nil? || user_agent&.include?(WAYBACK_MACHINE_URL) || via_header&.include?(WAYBACK_MACHINE_URL)

    # this is done to avoid regenerating regexes

[file truncated]

You can see in the code that if you set both of these site settings to blank, no crawlers are blocked:

SiteSetting.allowed_crawler_user_agents
SiteSetting.blocked_crawler_user_agents
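To make the matching concrete, here is a minimal standalone Ruby sketch. `to_matcher` follows the excerpt from lib/crawler_detection.rb above, without the Rails test bypass. `blocked_agent?` is a hypothetical helper written for this example, not a Discourse method, and the user-agent strings are illustrative. The blocked list is the value printed in the console session above.

```ruby
# Sketch only: how a pipe-separated list becomes a case-insensitive regex.
def to_matcher(list)
  escaped = list.split("|").map { |agent| Regexp.escape(agent) }.join("|")
  Regexp.new(escaped, Regexp::IGNORECASE)
end

# Hypothetical helper (not part of Discourse).
def blocked_agent?(user_agent, blocked_list)
  # An empty list would compile to an empty regex, which matches every
  # string, so a blank setting has to be treated as "block nothing".
  return false if blocked_list.strip.empty?

  to_matcher(blocked_list).match?(user_agent)
end

# Value shown by SiteSetting.blocked_crawler_user_agents in the session above.
SAMPLE_BLOCKED = "mauibot|semrushbot|ahrefsbot|blexbot|seo spider"

p blocked_agent?("Mozilla/5.0 (compatible; SemrushBot/7~bl)", SAMPLE_BLOCKED)
p blocked_agent?("Screaming Frog SEO Spider/14.0", SAMPLE_BLOCKED)
p blocked_agent?("Screaming Frog SEO Spider/14.0", "")
```

Because the match is a case-insensitive substring test, the "seo spider" entry catches any user agent containing that phrase. The explicit blank check matters for the same reason `allow_crawler?` returns early when both settings are blank.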

I recommend that you do not change this, because the bots Discourse blocks by default in core do not respect robots.txt. Still, it's your site and you can do as you like. There is a good reason they are blocked in core.

That said, Discourse gives you the option to "unblock" them through your SiteSettings in the UI.
