FYI
For those who keep an eye on their site’s pageviews: on 02/07/2022 our site saw about 4,000 pageviews from the bot MegaIndex.ru. It definitely stood out.
Thanks for the info.
I was not asking a question, but pointing it out so others can keep an eye open. It appears to be a new crawler that doesn’t spread its hits out over time. Maybe this was the first time it saw our site, so it was crawling all the pages, but if these massive one-day hits continue I will investigate more.
Thanks for the heads up. These badly written bots / web indexers / web spiders can really crush a server!
Noticed it as well. It’s the bot with the most pageviews on my instance, followed by Seekport (35K pageviews in a day) and MJ12bot. I sometimes get DoSed because of them. Cloudflare’s anti-bot feature helped me limit most of these bots without much monitoring.
Is it possible to slow down all crawlers – effectively adding a robots.txt `crawl-delay`?
No. Quite few bots follow robots.txt at all, and even fewer obey the delay.
That’s a shame. It would be a good feature for Discourse.
Out of interest, does the existing system (allowing you to block every crawler but only add a crawl delay for a finite list) work via robots.txt `disallow` and `crawl-delay`?
That’s a different matter entirely. Though, personally, I have found `crawl-delay` on another site to be effective.
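For reference, a minimal robots.txt using both directives could look like the sketch below. The 10-second delay is an arbitrary example value, and `Crawl-delay` is a non-standard directive: some crawlers have honored it, while Googlebot ignores it entirely.

```text
# Ask compliant crawlers to wait 10 seconds between requests
# (example value; Crawl-delay is non-standard and widely ignored)
User-agent: *
Crawl-delay: 10

# Refuse one specific crawler outright (bot name from this thread)
User-agent: SeekportBot
Disallow: /
```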
Only with white-hat bots, and there are not too many of those. All the others (the ratio of good to bad is more or less 1:100) don’t care what you have or haven’t in robots.txt. The “best” ones read it just to find out where a sysadmin/webmaster doesn’t want them to look, and they head in that direction right away.
(Really, <grin> is acting as an HTML tag; Discourse should not use bare < > for that, IMO.)
SEO bots are really badly behaved ones. But the majority are script-kiddie bots reporting a fake user agent.
One can stop plenty of bots entirely, but that should be done at the server level, not the app level.
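As a sketch of what such a server-level block can look like, here is a minimal nginx example that rejects matching user agents before requests ever reach the app. The bot names are taken from this thread, and the `map` directive belongs in the `http` context:

```nginx
# Flag requests whose User-Agent matches known abusive crawlers
# (case-insensitive regex match; extend the list as needed).
map $http_user_agent $blocked_bot {
    default       0;
    ~*MJ12bot     1;
    ~*MegaIndex   1;
    ~*SeekportBot 1;
}

server {
    listen 80;

    # Reject flagged bots with 403 before proxying to the app.
    if ($blocked_bot) {
        return 403;
    }
}
```

Faked user agents will still slip through a filter like this, which is why IP checks and rate limiting also come up in this thread.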
That’s all by the by. My experience has been different and I would like Discourse to allow crawl-delay to be set without having to name individual crawlers.
I have a spike of crawlers too.
How can I identify which crawler(s) are responsible for the pageviews?
It’s one of the built-in reports on the reports page.
Thanks, found it.
| User Agent | Pageviews |
| --- | --- |
| Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 5514 |
| Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/) | 5212 |
| Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 1427 |
| Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot) | 872 |
So these peaks are from MJ12bot and the Nexus 5X user agent, which is a legit Googlebot after checking its IP in the nginx logs.
Any idea why they would do such pageviews? MJ12bot seems legit as well (at least, that’s what my Google searches say…). Note that the forum is online, but requires a login to see the content. It will be open publicly in a few days.
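For anyone who wants to reproduce that IP check, Google’s documented verification is a reverse DNS lookup followed by a forward confirmation. A minimal Python sketch (the sample address is just an illustration from a published Googlebot range):

```python
import socket

def is_real_googlebot(ip: str) -> bool:
    """Check a claimed Googlebot IP the way Google documents it:
    reverse DNS must end in googlebot.com or google.com, and the
    forward lookup of that hostname must return the same IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)          # reverse DNS
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]  # forward confirmation
    except OSError:
        return False

# Example usage; 66.249.66.1 sits in a published Googlebot range.
print(is_real_googlebot("66.249.66.1"))
```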
I sometimes see crawler peaks on my forums, but they last only one or two days and then go away for a long time.
Examples:
Check the IPs. That user agent is one of the most commonly faked ones, too. Plus it is totally useless to you, like all so-called SEO bots.
I know next to nothing about crawlers. Aren’t Google’s official crawlers useful for SEO? Sorry if I’m starting to be off-topic.
As I am the one who started the topic I don’t see your question as off-topic. My post was an FYI and you are just trying to better understand the details of the information.
While I am not an SEO expert, if you want people to find your site using a search engine then you need to allow the search engine’s crawler to crawl your website to build and update its indexes.
The problem is that some crawlers do not lead users to a site. If that is the case and you don’t want excessive page hits, you would ask them not to crawl your site using robots.txt. However, bad crawlers will ignore robots.txt, and then you have to use firewall rules and such. The problem then becomes the age-old one: if someone wants access to an open site (no login), it is hard to block them, because they change their identity each time. If one requires login, that often cuts down on the number of people who will sign up.
With regards to the original post, I have not seen another massive one-day increase in page views due to MegaIndex or another crawler since the reported outlier.
Update: 08/13/2022
The bot visited our site again on 08/04/2022 (crawler site).
Report: Consolidated Pageviews
Report: Web Crawler User Agents
Report: Top Traffic Source
Clearly letting the bot MegaIndex.ru/2.0 index the site does not appear to be generating traffic to the site.
Note: AFAIK yandex.ru is different from Megaindex.ru.
For blocking crawlers there is robots.txt, which as noted can be edited at
https://<Discourse site>/admin/customize/robots
but not all crawlers will honor robots.txt.
robots.txt is not for stopping bots; it is a guideline for well-behaved bots. Bad ones should be stopped at the server level, not the app level. That is one of the biggest reasons my Discourse sits behind a reverse proxy.
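As an illustration, a reverse proxy in front of Discourse can throttle per-IP request rates so a fast crawler cannot hammer the app. A minimal nginx sketch; the zone name, rate, burst, and upstream name are all assumptions, not a tested Discourse config:

```nginx
# In the http {} context: allow each client IP roughly 2 requests/second,
# tracked in a 10 MB shared-memory zone (all values are example figures).
limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

server {
    listen 80;

    location / {
        # Absorb short bursts of up to 20 requests; reject the rest with 503.
        limit_req zone=perip burst=20 nodelay;
        proxy_pass http://discourse;   # hypothetical upstream name
    }
}
```

Requests beyond the rate and burst allowance get an error response instead of consuming app workers.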
On 12/22/2022, https://bot.seekport.com, a bot previously unknown to me, did an inordinate number of pageviews.
Semi-regular crawler activity spikes are a usual thing. We ourselves divide those by:
Based on our experience, there is no need to take special care to protect yourself from being crawled unless you don’t want your information used for any purpose, or you experience severe server load because of it. In the end, if your forum/project is public, there will always be a way to gather your public data for any purpose.