I was not asking a question, just pointing it out so others keep an eye open. It appears to be a new crawler that doesn't spread its hits out over time. Maybe this was the first time it saw our site, so it was crawling every page, but if these massive one-day hit counts continue I will investigate further.
Noticed it as well. It's the bot that generates the most pageviews on my instance, and right behind it come Seekport (35K pageviews in a day) and mj12bot. I sometimes get effectively DoSed because of them. Cloudflare's anti-bot feature helped me limit most of these bots without much monitoring.
That’s a shame. It would be a good feature for Discourse.
Out of interest, does the existing system (allowing you to block every crawler but only add a crawl delay for a finite list) work via robots.txt disallow and crawl-delay?
That’s a different matter entirely. Though, personally, I have found crawl-delay on another site to be effective.
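For reference, a minimal sketch of the kind of robots.txt I mean; the delay value and bot name are only placeholders, and some major crawlers (Googlebot included) ignore Crawl-delay entirely:

```
# Ask all compliant crawlers to wait between requests (value is a placeholder)
User-agent: *
Crawl-delay: 10

# Block one specific crawler outright (example bot name)
User-agent: MJ12bot
Disallow: /
```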
Only with white-hat bots, and there are not many of those. All the others (the ratio of good to bad is roughly 1:100) don't care what you do or don't have in robots.txt. The "best" ones read it just to find out what the sysadmin/webmaster doesn't want to show, and head in that direction right away.
(Really, <grin> gets treated as an HTML tag; Discourse should not interpret bare < > that way, IMO.)
SEO bots are really badly behaved ones. But the majority are script-kiddie bots reporting fake user agents.
One can stop plenty of bots completely, but that should be done at the server level, not the app level.
That’s all by the by. My experience has been different and I would like Discourse to allow crawl-delay to be set without having to name individual crawlers.
PetalBot user agent from the crawler report, with 872 pageviews:
Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)
So these peaks are from MJ12bot and from the "Nexus 5X Build" user agent, which is a legit Googlebot after checking its IP in the nginx logs.
Any idea why they would generate so many pageviews? MJ12bot seems legit as well (at least, that's what my Google searches say…). Note that the forum is online, but requires a login to see the content. It will be open publicly in a few days.
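For anyone who wants to repeat the check against their own nginx logs: a common way to confirm a Googlebot IP is a reverse DNS lookup, followed by a forward lookup on the returned name (it should resolve back to the same IP within googlebot.com or google.com). A rough sketch with an illustrative IP:

```
$ host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

$ host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
```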
I sometimes see crawler peaks on my forums, but they last only one or two days and then go away for a long time.
As I am the one who started the topic, I don't see your question as off-topic. My post was an FYI, and you are just trying to better understand the details.
While I am not an SEO expert, if you want people to find your site using a search engine then you need to allow the search engine's crawler to crawl your website to build and update its indexes.
The problem is that some crawlers do not lead users to a site. If that is the case and you don't want excessive page hits, you would ask them not to crawl your site via robots.txt. However, bad search engines will ignore robots.txt, and one then has to use firewall rules and the like. That runs into the age-old problem: if someone wants to access an open site (no login), they are hard to block because they change their identity each time. If one requires a login instead, that often cuts down on the number of people who will sign up.
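For the "firewall rules and such" part, one option if nginx sits in front of Discourse is per-IP rate limiting, which at least caps how hard any single address can hit the site even when the user agent is faked. A rough sketch only; the zone name, rate, and burst values are placeholders:

```
# In the http {} block: track clients by IP, allow roughly 1 request/second each
limit_req_zone $binary_remote_addr zone=crawlers:10m rate=1r/s;

# In the server {} / location {} block that proxies to Discourse
location / {
    limit_req zone=crawlers burst=20 nodelay;   # absorb short bursts, reject the rest with 503
    proxy_pass http://discourse;                # assumes an upstream named "discourse"
}
```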
With regards to the original post, I have not seen another massive one-day increase in page views due to MegaIndex or any other crawler since the reported outlier.
Letting the bot MegaIndex.ru/2.0 index the site does not appear to be generating any traffic to the site.
Note: AFAIK yandex.ru is different from Megaindex.ru.
For blocking crawlers there is robots.txt, which as noted can be customized at
https://<Discourse site>/admin/customize/robots
but not all crawlers will honor robots.txt.
As noted above by IAmGav there are other crawler settings.
robots.txt is not for stopping bots; it is a guideline for well-behaved bots. The bad ones should be stopped at the server level. That is one of the biggest reasons why my Discourse is behind a reverse proxy.
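As an illustration of what that can look like at the proxy, a hypothetical nginx snippet that rejects requests by user agent; the bot list is just an example and it assumes nginx is the reverse proxy in front of Discourse:

```
# In the http {} block: flag known crawler user agents (case-insensitive regex)
map $http_user_agent $blocked_bot {
    default        0;
    "~*mj12bot"    1;
    "~*megaindex"  1;
}

# In the server {} block that proxies to Discourse
if ($blocked_bot) {
    return 403;
}
```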
Semi-regular crawler activity spikes are a usual thing. We ourselves divide them into:
Regular crawlers by legitimate search engines
Irregular crawlers by new/custom search engines
Targeted crawlers by competitors or other “researchers” who may use your crawled data for their own purposes.
Based on our experience, there is no need to protect yourself from being crawled unless you don't want your information used elsewhere, or you experience severe server load because of the crawling. In the end, if your forum/project is public, there will always be a way to gather your public data for any purpose.
I had two spikes, on the 8th and 18th of January, both times from Yandex, the Russian web crawler. Both times attempted crawls more than doubled. The biggest snoop over time is PetalBot from PetalSearch.com; it had between 4x and 6x the number of scans of Yandex and the other bots.