Then you are lucky! I have attached a ResearchGate paper entitled “Virtualized Cyberspace - Visualizing Patterns & Anomalies for Cognitive Cyber Situational Awareness” showing parts of the issue I described!
Also, FWIW, here is our “short” list of partial User Agent strings for bots which crawl our sites and DO NOT respect robots.txt (updated):
AddThis|OPPO A33|Mb2345Browser|UCBrowser|MQQBrowser|MicroMessenger|LieBaoFast|Clickagy|DotBot|Linespider|Applebot|Ask Jeeves|Baiduspider|ADmantX|Spinn3r|rogerbot|YesupBot|ValueClick|Twitterbot|FriendFeedBot|Squider|ContextAd|Voyager|Chattertrap|YandexBot|bingbot|Virtual Reach NewsclipCollector|FlipboardProxy|Flipboard|proximic|YahooFeedSeeker|Xenu|TwitterFeed|GrapeshotCrawler|NewsGatorOnline|Sosospider|OpenISearch|discobot|EasouSpider|FeedDemon|YottaaMonitor|CacheSystem|UnwindFetchor|JikeSpider|Konqueror|Superfeedr|Nachobot|percbotspider|WeSEE:Search|Cliqzbot|Exabot|Wget|TweetedTimes|YoudaoBot|stumbleupon|omgili|BoardReader|Gigabot|trendictionbot|InAGist|DoCoMo|PaperLiBot|YisouSpider|TweetmemeBot|libwww-perl|YandexDirect|CrystalSemanticsBot|httrack|msnbot-UDiscovery|MaxPointCrawler|CrystalSemanticsBot|W3C_Validator|magpie-crawler|Flipboard|flipboa|PostRank|Chrome-Lighthouse|Summify|Sogou|archive.org| UptimeRobot|robot|A6-Indexer|ShowyouBot|crawler|Genieo|Apache-HttpClient|curl|Technoratibot|Feedbin|SensikaBot|SiteExplorer|Digg|Yahoo Pipes|QuerySeekerSpider|Alamofire|AhrefsBot|SeznamBot|Kraken|BomboraBot
The list above is only partial and has not been updated in a long time, so it is not “perfect” and is quite “rusty”…
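For anyone who wants to try a list like this against their own logs, here is a minimal sketch of how a pipe-delimited list of partial User Agent strings could be turned into a matcher. It uses only a short excerpt of the list above, and the function names (`build_matcher`, `is_listed_bot`) and the case-insensitive matching are my assumptions for illustration, not our production code.

```python
import re

# Short excerpt of the partial User Agent strings from the list above.
PARTIAL_UA_LIST = "AhrefsBot|DotBot|Baiduspider|YandexBot|libwww-perl|curl|Wget"


def build_matcher(pipe_list: str) -> re.Pattern:
    """Compile a pipe-delimited list of partial UA strings into one regex.

    Entries are stripped of stray whitespace, de-duplicated, and escaped so
    that characters like "." or ":" are matched literally.
    Case-insensitive matching is an assumption, not part of the original list.
    """
    parts = []
    for raw in pipe_list.split("|"):
        token = raw.strip()
        if token and token not in parts:
            parts.append(token)
    return re.compile("|".join(re.escape(p) for p in parts), re.IGNORECASE)


def is_listed_bot(user_agent: str, matcher: re.Pattern) -> bool:
    """Return True if any listed partial string occurs in the User Agent."""
    return matcher.search(user_agent) is not None


MATCHER = build_matcher(PARTIAL_UA_LIST)
```

Keep in mind that substring matching like this only catches bots that announce themselves honestly, and that very generic entries in the full list (such as `robot` or `crawler`) will match a wide range of User Agents.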
Our direct experience over two decades, including writing a lot of bot detection and visualization code (and publishing a number of papers, videos and presentations on this topic), is that only a handful of bots respect robots.txt, and those that do respect the directives are from big companies like Google, Bing (Microsoft), etc.
The most aggressive bots fake their User Agent string so that they appear to be “non-bot” User Agents.
Furthermore, the most aggressive offenders are bots from China, Russia and Korea, and we have plugin code for our legacy forums which detects these rogue bots based on honeypot techniques and other behavioral patterns. You can see some of the results in the paper attached, which has nice colorful pictures of bots in cyberspace to enjoy.
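To give a rough idea of what a honeypot check can look like (this is a simplified sketch, not our actual plugin code): assume a trap URL that is disallowed in robots.txt and only reachable through a link hidden from human visitors. Any client that requests it has either ignored robots.txt or never read it. The path `/trap-do-not-follow/` and the `(ip, path, user_agent)` request format are hypothetical.

```python
from collections import defaultdict

# Hypothetical trap path: disallowed in robots.txt and linked only from
# markup that human visitors never see.
TRAP_PATH = "/trap-do-not-follow/"


def find_trap_hits(requests):
    """Return {ip: set of User Agents} for clients that fetched the trap.

    `requests` is an iterable of (ip, path, user_agent) tuples, e.g. parsed
    from a web server access log (the parsing itself is not shown here).
    Offenders are keyed by IP address rather than by User Agent, because
    the User Agent string may be faked.
    """
    offenders = defaultdict(set)
    for ip, path, user_agent in requests:
        if path.startswith(TRAP_PATH):
            offenders[ip].add(user_agent)
    return dict(offenders)
```

A real detector would combine a signal like this with other behavioral patterns (request rate, crawl order and so on), since a single trap hit is fairly weak evidence on its own.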
For example, from our experience and direct cybersecurity visualization research, none of the bots listed in the Discourse OOTB robots.txt respect robots.txt, including DotBot, semrushbot, and ahrefsbot (we had a very big problem with ahrefsbot, which is highlighted in another presentation, see illustration):
User-agent: seo spider
In the distant past, we listed these same bots (plus many more!) in robots.txt, and we still do, and we found that “just about zero” of the bots listed above respect the robots.txt directives.
You are very lucky if your hosted sites have a different experience!
We have done extensive testing and written a lot of visualization code, and we know for a fact, from peer-reviewed research, that most bots do not respect robots.txt and that only a handful of “top tech company” bots do.
Although this paper we wrote (below) does not list all bots, it does give you an example of how extensively we have tested and written code (in the Unity gaming engine and on LAMP platforms) in this area:
I have also attached this paper as a reference, so there is no need to download it from ResearchGate.
Virtualized_Cyberspace_-_Visualizing_Patterns_Anom.pdf (2.0 MB)
PS: I plan to port a lot of my legacy LAMP bot detection code to Rails in 2021, if I have time!
(also attached below)
anomalies_cyberspace_v01.pdf (3.3 MB)
Example graphic from presentation, showing over 200 Chinese Baidu Bots disguised as regular users (using a normal user UA string, not a “bot string”), pulling a site from Brazil IP addresses (not China).