FYI
For those who keep an eye on their site’s pageviews: on 02/07/2022 our site saw about 4,000 pageviews from the bot MegaIndex.ru. It definitely stood out.
Thanks for the info.
I was not asking a question, but pointing it out so others can keep an eye open. It appears to be a new crawler that doesn’t spread its hits out over time. Maybe this was the first time it saw our site, so it was crawling all the pages, but if these massive one-day hits continue I will investigate more.
Thanks for the heads up. These badly written bots / web indexers / web spiders can really crush a server!
Noticed it as well. It’s the bot with the most pageviews on my instance, followed by Seekport (35K pageviews in a day) and MJ12bot. I sometimes get DoSed because of them. Cloudflare’s anti-bot feature helped me limit most of these bots without much monitoring.
Is it possible to slow down all crawlers – effectively adding a robots.txt `crawl-delay`?
No. Quite few bots follow robots.txt at all, and even fewer obey the delay.
That’s a shame. It would be a good feature for Discourse.
Out of interest, does the existing system (allowing you to block every crawler but only add a crawl delay for a finite list) work via robots.txt `disallow` and `crawl-delay`?
That’s a different matter entirely. Though, personally, I have found `crawl-delay` on another site to be effective.
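For reference, a minimal robots.txt using both directives could look like the sketch below. The 10-second delay is an arbitrary example value, and `Crawl-delay` is a non-standard directive: some crawlers have honored it, while Googlebot ignores it entirely.

```text
# Ask compliant crawlers to wait 10 seconds between requests
# (example value; Crawl-delay is non-standard and widely ignored)
User-agent: *
Crawl-delay: 10

# Refuse one specific crawler outright (bot name from this thread)
User-agent: SeekportBot
Disallow: /
```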
Only with white-hat bots, and there are not too many of those. All the others (the ratio of good to bad is more or less 1:100) don’t care what you have or haven’t in robots.txt. The “best” ones read it just to find out where a sysadmin/webmaster doesn’t want them to look, and they head in that direction right away.
(Really, <grin> is acting as an HTML tag; Discourse should not use bare < > for that, IMO.)
SEO bots are really badly behaved ones. But the majority are script-kiddie bots reporting a fake user agent.
One can stop plenty of bots entirely, but that should be done at the server level, not the app level.
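As a sketch of what such a server-level block can look like, here is a minimal nginx example that rejects matching user agents before requests ever reach the app. The bot names are taken from this thread, and the `map` directive belongs in the `http` context:

```nginx
# Flag requests whose User-Agent matches known abusive crawlers
# (case-insensitive regex match; extend the list as needed).
map $http_user_agent $blocked_bot {
    default       0;
    ~*MJ12bot     1;
    ~*MegaIndex   1;
    ~*SeekportBot 1;
}

server {
    listen 80;

    # Reject flagged bots with 403 before proxying to the app.
    if ($blocked_bot) {
        return 403;
    }
}
```

Faked user agents will still slip through a filter like this, which is why IP checks and rate limiting also come up in this thread.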
That’s all by the by. My experience has been different and I would like Discourse to allow crawl-delay to be set without having to name individual crawlers.
I have a spike of crawlers too.
How can I identify which crawler(s) are responsible for the pageviews?
It’s one of the built-in reports on the reports page.
Thanks, found it.
| User Agent | Pageviews |
| --- | --- |
| Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 5514 |
| Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/) | 5212 |
| Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 1427 |
| Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot) | 872 |
So these peaks are from MJ12bot and the Nexus 5X user agent, which is a legit Googlebot after checking its IP in the nginx logs.
Any idea why they would do such pageviews? MJ12bot seems legit as well (at least, that’s what my Google searches say…). Note that the forum is online, but requires a login to see the content. It will be open publicly in a few days.
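For anyone who wants to reproduce that IP check, Google’s documented verification is a reverse DNS lookup followed by a forward confirmation. A minimal Python sketch (the sample address is just an illustration from a published Googlebot range):

```python
import socket

def is_real_googlebot(ip: str) -> bool:
    """Check a claimed Googlebot IP the way Google documents it:
    reverse DNS must end in googlebot.com or google.com, and the
    forward lookup of that hostname must return the same IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)          # reverse DNS
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]  # forward confirmation
    except OSError:
        return False

# Example usage; 66.249.66.1 sits in a published Googlebot range.
print(is_real_googlebot("66.249.66.1"))
```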
I sometimes see crawler peaks on my forums, but they last only one or two days and then go away for a long time.
Examples:
Check the IPs. That user agent is one of the most commonly faked ones, too. Plus it is totally useless to you, like all so-called SEO bots.
I know next to nothing about crawlers. Aren’t Google’s official crawlers useful for SEO? Sorry if I’m starting to be off-topic.
As I am the one who started the topic I don’t see your question as off-topic. My post was an FYI and you are just trying to better understand the details of the information.
While I am not an SEO expert, if you want people to find your site using a search engine then you need to allow the search engine’s crawler to crawl your website to build and update its indexes.
The problem is that some crawlers do not lead users to a site. If that is the case and you don’t want excessive page hits, you would ask them not to crawl your site using robots.txt. However, bad crawlers will ignore robots.txt, and then you have to use firewall rules and such. The problem then becomes the age-old one: if someone wants access to an open site (no login), it is hard to block them, because they change their identity each time. If one requires login, that often cuts down on the number of people who will sign up.
With regards to the original post, I have not seen another massive one-day increase in page views due to MegaIndex or another crawler since the reported outlier.
Update: 08/13/2022
The bot visited our site again on 08/04/2022 (crawler site).
Report: Consolidated Pageviews
Report: Web Crawler User Agents
Report: Top Traffic Source
Clearly letting the bot MegaIndex.ru/2.0 index the site does not appear to be generating traffic to the site.
Note: AFAIK yandex.ru is different from Megaindex.ru.
For blocking crawlers there is robots.txt, which as noted can be edited at
https://<Discourse site>/admin/customize/robots
but not all crawlers will honor robots.txt.
robots.txt is not for stopping bots; it is a guideline for well-behaved bots. Bad ones should be stopped at the server level, not the app level. That is one of the biggest reasons my Discourse sits behind a reverse proxy.
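As an illustration, a reverse proxy in front of Discourse can throttle per-IP request rates so a fast crawler cannot hammer the app. A minimal nginx sketch; the zone name, rate, burst, and upstream name are all assumptions, not a tested Discourse config:

```nginx
# In the http {} context: allow each client IP roughly 2 requests/second,
# tracked in a 10 MB shared-memory zone (all values are example figures).
limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

server {
    listen 80;

    location / {
        # Absorb short bursts of up to 20 requests; reject the rest with 503.
        limit_req zone=perip burst=20 nodelay;
        proxy_pass http://discourse;   # hypothetical upstream name
    }
}
```

Requests beyond the rate and burst allowance get an error response instead of consuming app workers.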
On 12/22/2022, https://bot.seekport.com, a bot previously unknown to me, did an inordinate number of pageviews.
Semi-regular crawler activity spikes are a usual thing. We ourselves divide those by:
Based on our experience, there is no need to take special care to protect yourself from being crawled unless you don’t want your information used for any purpose, or you experience severe server load because of it. In the end, if your forum/project is public, there will always be a way to gather your public data for any purpose.