Why am I getting lots of searches from Chinese websites?

there are way more than that in the list, any ideas?

5 Likes

I saw this in the logs of one of our clients today so this is more than a coincidence.

EDIT: no, I think it is a coincidence; searching for ymwears .cn shows more complaints about referral spam, for example these (over one year old): Relevanssi shows weird search queries on my page | WordPress.org and Block specific referrer or agent to enter url | WordPress.org

2 Likes

I had a client complaining about these last month. I blocked a few IPs and considered configuring fail2ban to block IPs that searched for some of those URLs, but never actually did anything about it. I looked into blocking by geographic region, but it seemed that you’d need a $20/month database to accomplish that.

6 Likes

Interesting. Are you guys aware of any solution that might work without having to mess around with the server itself…

@pfaffman @RGJ

2 Likes

Referrer spam is a pretty big issue which even the big guys (e.g. Google Analytics) are not fighting 100% successfully. For now all I can think of is removing these entries manually.

Since these sites are apparently - at least partially - the same on multiple independent Discourse sites (given the fact that our screenshots are showing pretty much the same list) maybe a (dynamic) blacklist would be an idea? @codinghorror do you have a suggestion?

5 Likes

We have seen, addressed and mitigated this issue at large scale for years, and have found that the more reliable way (over the past few years) to block rogue bots is to block based on the user agent (UA) string (sometimes in combination with geoip information).

We have blocked hundreds of millions of Chinese bot hits over the years, and we have rarely found that blocking IP addresses works over time nearly as well as blocking clients based on UA strings.

Here is a snippet of one piece of code we use on one of our sites as an example:

$user_agents = explode('|', $string_of_bad_user_agents, -1); // assumes the list ends with a trailing '|'; -1 drops the empty element it creates
$hide_content_useragent = $_SERVER['HTTP_USER_AGENT'] ?? '';
$IS_A_BAD_BOT = FALSE;

foreach ($user_agents as $hcag) {
    $hcag = trim($hcag); // trim() returns the trimmed string; it does not modify $hcag in place
    if ($hcag === '') {
        continue;
    }
    // each entry is treated as a case-insensitive regex fragment matched against the UA string
    if (preg_match("/$hcag/i", $hide_content_useragent)) {
        $IS_A_BAD_BOT = TRUE;
        break;
    }
}
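
To show how the flag might then be used (a minimal sketch, not our exact production code), the request can simply be denied once $IS_A_BAD_BOT is set:

// Minimal sketch: deny the request when the UA matched a bad-bot pattern.
// A plain 403 and early exit is just one option; logging, rate limiting
// or serving reduced content are others.
if ($IS_A_BAD_BOT) {
    http_response_code(403);
    exit('Forbidden');
}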

Most (not all) rogue bots use UA strings which can be fairly easily identified and blocked (in this era; not sure about the future as things evolve), so we abandoned the method of trying to block rogue bots based on IP addresses years ago. The reason we abandoned blocking based on IPs is that many countries, like China, Russia, N. Korea, and many more, now run their bot farms from servers in other countries. IP addresses are not a good indicator of actual origin or intent. In addition, by blocking massive IP address blocks, good addresses can be blocked, denying access to legitimate users.

For example, China runs huge server farms of bots out of Brazil and other countries geographically closer to the US, both to disguise their origin and to retrieve data faster (shorter network path).

Sometimes the WHOIS data will match back to a Chinese, N. Korean or Russian (for example) physical address, but other times it will not, and will instead show a local, in-country physical address. We have seen a lot of rogue Chinese bots registered to Brazilian companies (over the past few years) where we could see (and confirm) that the user agent strings matched rogue bots out of China. In addition, when we do Google searches on those UA strings, we see that others have also identified many of the same UA strings as Chinese (for example).

In summary, while many people immediately go to blocking IP addresses to stop rogue bot activity, most sophisticated bot farms are very good at running their bots out of other countries. It’s easy to set up a VPS in most countries, and of course the closer the bot is to the targeted country, the faster it can scrape data. A VPS can come and go in minutes, and bot software can be deployed very quickly in just about any VPS data center globally.

For the past few years, blocking based on the UA string has proven to be the more reliable method (sometimes in combination with geoip info, sometimes not); but of course spammers, bot masters, and their agents are also beginning to disguise their UA strings, just as they have disguised their IP addresses for many years.
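
As a rough sketch of the “combo” approach (this assumes the PECL geoip extension, or any equivalent geoip library, and an example country list; it is not our exact code), the UA verdict can be combined with a country lookup:

// Sketch: combine the UA verdict with a geoip country lookup,
// e.g. to apply stricter handling rather than an outright country-wide block.
$client_ip  = $_SERVER['REMOTE_ADDR'] ?? '';
$country    = ($client_ip !== '') ? @geoip_country_code_by_name($client_ip) : FALSE;
$IS_SUSPECT = $IS_A_BAD_BOT && in_array($country, array('CN', 'RU'), TRUE);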

Hope this helps.

Cheers & Happy Bot Hunting!

6 Likes

Yes, I absolutely agree that IP blocking is not effective.

User agent blocking tends to work pretty well, except when the spammers are constantly changing it.

That’s why my thoughts went to just blacklisting the actual URL that is being referrer spammed.

It just “feels better” because we’re not blocking something based on an underlying assumption (i.e. “this user agent supplies a bad referrer so we don’t trust it”) but we are actually blocking what we want to block (“we see this website being referrer spammed on more Discourse sites, let’s not put it in our database”). At least this is harder to circumvent.
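
To make that concrete (a purely hypothetical sketch, not existing Discourse code; the variable names and the blocklist are made up), the incoming search term could be checked against a shared blacklist before it is written to the search log, along the lines of the PHP snippet earlier in the thread:

// Hypothetical sketch: keep known referrer-spammed terms out of the search log.
// $spam_terms would come from a (possibly shared, dynamic) blacklist.
$spam_terms = array('ymwears.cn');
$query = trim($incoming_search_query); // $incoming_search_query is assumed to hold the raw search term

$log_this_search = TRUE;
foreach ($spam_terms as $term) {
    if (stripos($query, $term) !== FALSE) {
        $log_this_search = FALSE; // still serve the search, just keep it out of the report
        break;
    }
}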

4 Likes

Good thoughts.

There is no one-size-fits-all way to stop rogue and malicious bots; each site must evaluate which controls work best for it.

On a similar note…

Sites which rely mainly on blacklists and spam or rogue-bot databases can also have problems. Let’s say someone does not like www.our-arch-rival.com because that site is a competitor (or simply made them angry or offended them). Some people will then submit www.our-arch-rival.com to a blacklist or DB, and other sites will end up filtering a legitimate site. This kind of collateral damage is a “bad consequence” of the blacklist-DB method.

Then, of course, advocates of blacklists will say, “you can go to the blacklisting sites, submit a report and ask to be removed”, but for many busy, long-standing sites, that can be a time killer.

Generally, it is important to analyze the problem and create a mitigation strategy based on the scenario because every “adversary” is different. It’s the old “Know Your Enemy” from Sun Tzu and the Art of War. Every situation is a bit different in the real-world and unfortunately, it does take analysis skills for sys admins to create optimal mitigation strategies against malicious or unwanted cyber activity.

This is also a good reason to run Discourse behind a reverse proxy because the reverse proxy is a good place for analyzing, classifying and controlling malicious activity before this traffic hits the Discourse app.

It can be a full-time job in the year 2020 trying to control and mitigate rogue bots and other malicious activity in cyberspace. As soon as admins come up with one good detection and mitigation strategy, the spammers and scrapers will adjust and find ways around it. I tend to advise people to oversize their servers to ensure they have enough headroom, since these kinds of problems in cyberspace are only going to get worse over time, not better.

Ready Player One!

3 Likes

Which is another reason to stay away from IP blacklisting: the spammers will know you are taking measures.

5 Likes

I think I can block most spammers through Cloudflare, but I’m not sure what to put in the rules for the browser user agent.

@neounix what do you mean by “UA strings”? and how can they be used in Cloudflare firewall rules?

2 Likes

But this isn’t even referrer spam, is it? It’s just that they are doing a search for that URL, so it’s not actually doing anything, is it? Do I totally misunderstand what that report is? It’s not available to anyone but admins, right?

5 Likes

Think you’re right @pfaffman, the report seems to cover just the searches made on the forum. It includes the CTR as well, which wouldn’t make sense if it were a referrer report.

1 Like

No, technically this is not referrer spam, but I’m not sure there is a word for this exact kind of abuse. I think it is very close to referrer spam, just targeting the search query report instead.

Referrer spam never actually does anything; it is only meant to show up in reports.

1 Like

@Yassine_Yousfi

Here you go…

In HTTP, the User-Agent string is often used for content negotiation, where the origin server selects suitable content or operating parameters for the response. For example, the User-Agent string might be used by a web server to choose variants based on the known capabilities of a particular version of client software. The concept of content tailoring is built into the HTTP standard in RFC 1945 “for the sake of tailoring responses to avoid particular user agent limitations.”

The User-Agent string is one of the criteria by which web crawlers may be excluded from accessing certain parts of a website using the Robots Exclusion Standard (robots.txt file).

As with many other HTTP request headers, the information in the “User-Agent” string contributes to the information that the client sends to the server, since the string can vary considerably from user to user.

Reference: “User agent”, Wikipedia.

@Yassine_Yousfi, there are myriad references on the Internet about HTTP user agent (UA) strings and how to use them in various ways, including as a sensor for detecting bots and other malicious cyber activity.
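
For the Cloudflare firewall rules part of your question: at the time of writing, the Cloudflare rules language exposes the UA string as the http.user_agent field, so a blocking expression can look something like this (the bot names below are placeholders; use the strings you actually see in your own web_crawlers report or server logs):

(http.user_agent contains "SomeBadBot") or (http.user_agent contains "AnotherBadBot")

with the rule action set to Block or Challenge.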

Happy Bot Hunting!

Notes:

  1. You can see the Discourse view of bot user agents here (some UA strings are truncated):
     https://discourse.your-great-domain.com/admin/reports/web_crawlers

  2. No detection algorithm can detect all bots with 100% accuracy.

  3. You can also get the UA strings from your web server log files and other methods.

4 Likes