MegaIndex bot did about 4,000 pageviews on one day

FYI

For those who keep an eye on their site's pageviews: on 02/07/2022 our site saw about 4,000 pageviews from the bot MegaIndex.ru. It definitely stood out.

[image: pageviews report showing the MegaIndex.ru spike]

7 Likes

You could either block it or slow it down.
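Discourse ships site settings for both options. A sketch, assuming recent setting names (check Admin → Settings → search "crawler" on your instance):

```text
slow_down_crawler_user_agents: megaindex   # throttle crawlers whose user agent matches
slow_down_crawler_rate: 60                 # seconds to ask them to wait between requests
blocked_crawler_user_agents: megaindex     # or block matching crawlers outright
```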

5 Likes

Thanks for the info.

I was not asking a question, but pointing it out so others keep an eye open. It appears to be a new crawler that doesn't spread its hits out over time. Maybe this was the first time it saw our site, so it was crawling all pages, but if it keeps up these massive one-day bursts I will investigate more.

4 Likes

Thanks for the heads up. These badly written bots / web indexers / web spiders can really crush a server!

13 Likes

Noticed it as well. It's the bot that generates the most pageviews on my instance, followed by Seekport (35K pageviews in a day) and mj12bot. I sometimes get DoSed because of them. Cloudflare's anti-bot feature helped me limit most of these bots without much monitoring.

5 Likes

Is it possible to slow down all crawlers – effectively adding a robots.txt crawl-delay?
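For reference, the directive I have in mind looks like this (a sketch; Crawl-delay is non-standard, so support varies by crawler):

```text
User-agent: *
Crawl-delay: 10   # ask compliant crawlers to wait 10 seconds between requests
```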

2 Likes

No. Quite few crawlers follow robots.txt at all, and even fewer obey a crawl delay.

2 Likes

That’s a shame. It would be a good feature for Discourse.

Out of interest, does the existing system (allowing you to block every crawler but only add a crawl delay for a finite list) work via robots.txt disallow and crawl-delay?

That’s a different matter entirely. Though, personally, I have found crawl-delay on another site to be effective.

2 Likes

Only with white-hat bots, and there are not too many of those. All the others (the ratio of good to bad is more or less 1:100) don't care what you do or don't have in robots.txt. The craftiest ones read it just to find out where a sysadmin/webmaster doesn't want them to look, and head in that direction right away.

(Really, “<grin>” gets treated as an HTML tag :thinking: Discourse should not use bare < > for that, IMO)

SEO bots are the really badly behaved ones. But the majority are script kiddies' bots announcing fake user agents.

One can stop plenty of bots entirely, but that should be done at the server level, not the app level.

2 Likes

That’s all by the by. My experience has been different and I would like Discourse to allow crawl-delay to be set without having to name individual crawlers.

2 Likes

I have a spike of crawlers too.

[image: pageviews graph showing a crawler spike]

How can I identify which crawler(s) are responsible for these pageviews?

4 Likes

It’s one of the built-in reports on the reports page.
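(I believe it's the “Web Crawler User Agents” report, which on recent versions lives at /admin/reports/web_crawlers, though the path may vary by version.)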

3 Likes

Thanks, found it.

| User Agent | Pageviews |
| --- | --- |
| Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 5514 |
| Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/) | 5212 |
| Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 1427 |
| Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot) | 872 |

So these peaks are from MJ12bot and the Nexus 5X user agent, which is a legit Googlebot after checking its IP in the nginx logs.

Any idea why they would generate so many pageviews? MJ12bot seems legit as well (at least, that's what my Google searches say…). Note that the forum is online but requires a login to see the content. It will be open publicly in a few days.

I sometimes see crawler peaks on my forums, but they last only one or two days and then they go away for a long time.

Examples: [images: pageview graphs of the crawler spikes]

3 Likes

Check the IPs. That user agent is one of the most commonly faked ones, too. Plus it is totally useless to you, like all so-called SEO bots.
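If you want to verify, a forward-confirmed reverse DNS lookup is the usual way to tell real Googlebot traffic from impostors. A minimal Python sketch (the IP is a hypothetical example pulled from an access log):

```python
import socket

ip = "66.249.66.1"  # hypothetical address taken from an nginx access log

# Reverse lookup: genuine Googlebot hosts resolve under googlebot.com or google.com
hostname = socket.gethostbyaddr(ip)[0]

# Forward-confirm: the returned hostname must resolve back to the same IP
forward_ips = socket.gethostbyname_ex(hostname)[2]

is_googlebot = hostname.endswith((".googlebot.com", ".google.com")) and ip in forward_ips
print(hostname, is_googlebot)
```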

3 Likes

I know next to nothing about crawlers. Aren't Google's official crawlers useful for SEO? Sorry if I'm going off-topic.

3 Likes

As I am the one who started the topic, I don't see your question as off-topic. My post was an FYI, and you are just trying to better understand the details.

While I am not an SEO expert: if you want people to find your site using a search engine, then you need to allow the search engine's crawler to crawl your website so it can build and update its indexes.

The problem is that some crawlers do not actually lead users to a site. If that is the case and you don't want the excessive page hits, you ask them not to crawl your site via robots.txt. However, bad crawlers will ignore robots.txt, and then one has to resort to firewall rules and the like. That runs into the age-old problem: if someone wants access to an open (no-login) site, they are hard to block, because they change their identity each time. And if one requires a login, that often cuts down on the number of people who will sign up.
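For example, a robots.txt entry asking one crawler to stay away entirely might look like this (the user-agent token is my reading of the bot's UA string; only well-behaved bots will honor it):

```text
User-agent: MegaIndex.ru
Disallow: /
```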

With regard to the original post, I have not seen another massive one-day increase in pageviews from MegaIndex or another crawler since the reported outlier.

2 Likes

Update: 08/13/2022

The bot visited our site again on 08/04/2022 (crawler site)

Report: Consolidated Pageviews

[image: Consolidated Pageviews report]

Report: Web Crawler User Agents

Report: Top Traffic Source

Letting the bot MegaIndex.ru/2.0 index the site is clearly not generating any traffic to the site.
Note: AFAIK yandex.ru is different from MegaIndex.ru.


For blocking crawlers there is robots.txt, which can be customized at

https://<Discourse site>/admin/customize/robots

but, as noted above, not all crawlers will honor robots.txt. :slightly_frowning_face:


As noted above by IAmGav, there are other crawler settings.

4 Likes

robots.txt is not for stopping bots; it is a guideline for well-behaved bots. Bad ones should be stopped at the server level. That is one of the biggest reasons why my Discourse sits behind a reverse proxy.
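For example, with nginx in front of Discourse, something along these lines drops known-bad crawlers before they ever reach the app (a sketch; the user-agent substrings are just examples):

```nginx
# in the http {} block: flag requests whose User-Agent matches known-bad crawlers
map $http_user_agent $blocked_crawler {
    default                           0;
    ~*(MJ12bot|MegaIndex|SeekportBot) 1;
}

# in the server {} block that proxies to Discourse
if ($blocked_crawler) {
    return 403;   # reject before the request hits the app
}
```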

4 Likes

On 12/22/2022, https://bot.seekport.com, a bot previously unknown to me, did an inordinate number of pageviews.

2 Likes

Semi-regular crawler activity spikes are a usual thing. We ourselves divide them into:

  • Regular crawlers by legitimate search engines
  • Irregular crawlers by new/custom search engines
  • Targeted crawlers run by competitors or other “researchers” who may use your crawled data for their own purposes.

Based on our experience, there is no need to protect yourself from being crawled unless you don't want your information used for some purpose, or you experience severe server load because of it. In the end, if your forum/project is public, there will always be a way to gather your public data for any purpose :slight_smile:

3 Likes