Pageviews from anonymous users have exploded, but Google Analytics shows no traffic growth. How can I find out where the increase comes from?

In the last two weeks, the pageviews (PV) from anonymous users have exploded. However, Google Analytics (GA) tells a different story. As you can see, GA even showed a slight dip. I love seeing the growth, but I would like to learn more about where the sudden anonymous-user PV are coming from.

Is there a way to see the referring sites for anonymous users?

I found this earlier post: Is the info Top Referred Topics/ Top Traffic Sources stored in a table in the database? - #9 by simon. Is this the right step to take?

1 Like

Hi @zhenniwu

This is not “growth”. Your site was more than likely visited by a rogue bot that does not follow any robots.txt rules and is already flagged by Google as a “rogue bot”, so its traffic is filtered from the GA stats.

There is little-to-nothing you can do about this unless you are willing to invest a lot of time and energy into bot-detection and blocking, which is mostly a waste of time (for most people).

This is just “life on the net as we know it” and nothing to pay much attention to, for the most part.

2 Likes

@neounix Thank you so much for your pointers! They confirmed our suspicion. We’re due for a Discourse update, and we hope that will rate-limit the rogue bots. We’ll continue monitoring the PV that come from anonymous users.

Btw, do you know if there is a way to confirm whether it’s a bot or not? Thank you so much for helping us!

1 Like

Hey @zhenniwu

You have already confirmed it is a bot by looking at the behavior and realizing it is an anomaly.

Detecting bots is easy when the user agent (UA) string of the client claims “I AM A BOT” in one way or the other. However, “rogue bots” do not declare they are bots in their UA strings and so we must detect bots and similar activity based on their behavior.

You can write code to automate this if you want; however, detecting all classes of bots is non-trivial because there are many different behavior characteristics of bots; not only by hit rate (as you are seeing).
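As a minimal sketch of the hit-rate part of that idea (not a complete bot detector), the following flags clients that exceed a request budget inside a sliding time window. The function name, the 60-second window, the 120-hit threshold, and the sample IP addresses are all illustrative assumptions; real thresholds depend on the site.

```python
from collections import defaultdict, deque


def flag_high_rate_clients(requests, window_seconds=60, max_hits=120):
    """Return the set of client IPs that exceed max_hits within any sliding window.

    `requests` is an iterable of (timestamp_seconds, client_ip) tuples,
    sorted by timestamp. Thresholds are illustrative assumptions.
    """
    recent = defaultdict(deque)  # client_ip -> timestamps still inside the window
    flagged = set()
    for ts, ip in requests:
        window = recent[ip]
        window.append(ts)
        # Drop hits that have fallen out of the sliding window.
        while window and ts - window[0] >= window_seconds:
            window.popleft()
        if len(window) > max_hits:
            flagged.add(ip)
    return flagged


# Hypothetical traffic: one client at 10 requests/second, one at 1 request per 30 s.
sample = [(t * 0.1, "203.0.113.7") for t in range(500)]
sample += [(t * 30.0, "198.51.100.4") for t in range(5)]
sample.sort()

print(flag_high_rate_clients(sample))
```

Hit rate is only one behavioral signal; a slow crawler spread over many IP addresses would pass this check untouched.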

Before you start building a detection solution, you must ask yourself “what are you trying to accomplish by detecting them?”.

Why do you care? @zhenniwu

OBTW, here is a July 2017 article by an editor at Research Gate on this very topic. Enjoy!

https://www.researchgate.net/blog/post/researchers-render-cyberspace-in-3d-like-a-video-game-to-make-identifying-threats-easier

2 Likes

Do you think this is the robots change we made for Google also @sam?

2 Likes

Could certainly be the case, but the only way to know for sure is to see the actual traffic.
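One way to look at the actual traffic is to tally user agents, referrers, and client IPs straight from the web server's access log. The sketch below assumes the log uses the common “combined” format (as written by nginx and Apache by default); the `summarize` helper and the sample lines are hypothetical, and the log location on a given install is not assumed here.

```python
import re
from collections import Counter

# Pattern for the "combined" access log format (an assumption about the server config).
COMBINED = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize(lines, top=5):
    """Count the most frequent user agents, referrers, and client IPs."""
    agents, referrers, ips = Counter(), Counter(), Counter()
    for line in lines:
        match = COMBINED.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agents[match["agent"]] += 1
        referrers[match["referrer"]] += 1
        ips[match["ip"]] += 1
    return {
        "agents": agents.most_common(top),
        "referrers": referrers.most_common(top),
        "ips": ips.most_common(top),
    }


# Hypothetical sample lines, for illustration only.
SAMPLE = [
    '203.0.113.7 - - [10/Dec/2020:10:00:00 +0000] "GET /t/a/1 HTTP/1.1" 200 512 "-" "ExampleCrawler/1.0"',
    '203.0.113.7 - - [10/Dec/2020:10:00:01 +0000] "GET /t/b/2 HTTP/1.1" 200 512 "-" "ExampleCrawler/1.0"',
    '203.0.113.7 - - [10/Dec/2020:10:00:02 +0000] "GET /t/c/3 HTTP/1.1" 200 512 "-" "ExampleCrawler/1.0"',
    '198.51.100.4 - - [10/Dec/2020:10:00:03 +0000] "GET /latest HTTP/1.1" 200 900 "https://www.google.com/" "Mozilla/5.0"',
]

for name, rows in summarize(SAMPLE).items():
    print(name, rows)
```

A spike dominated by a single user agent or a handful of IPs with an empty referrer (`-`) points toward automated traffic rather than new human visitors.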

2 Likes

The vast majority of bots do not respect robots.txt.

In fact, many rogue bots read robots.txt to gain information where admins do not want bots to visit, and then they attempt to get info from those areas!

In other words, robots.txt is not effective for controlling the behavior of 99.9% (just pick a big percentage) of bots on the net, and it can also expose info about “sensitive” areas of a site.

1 Like

This is not true based on 7 years of our hosting business. I am sure rogue bots are out there but they are far from common.

3 Likes

I see. So this explosion might not be rogue bots.

@codinghorror @sam, we’re happy to provide our data for you to do any analysis and debugging. Just let me know what you need, and I’ll send it over to you. Thank you in advance!

2 Likes

Hi Jeff!

Then you are lucky! I have attached a ResearchGate paper entitled “Virtualized Cyberspace - Visualizing Patterns & Anomalies for Cognitive Cyber Situational Awareness” showing parts of the issue I described!

Also, FWIW, here is our “short” list of partial User Agent strings which DO NOT respect robots.txt and which crawl our sites (updated):

AddThis|OPPO A33|Mb2345Browser|UCBrowser|MQQBrowser|MicroMessenger|LieBaoFast|Clickagy|DotBot|Linespider|Applebot|Ask Jeeves|Baiduspider|ADmantX|Spinn3r|rogerbot|YesupBot|ValueClick|Twitterbot|FriendFeedBot|Squider|ContextAd|Voyager|Chattertrap|YandexBot|bingbot|Virtual Reach NewsclipCollector|FlipboardProxy|Flipboard|proximic|YahooFeedSeeker|Xenu|TwitterFeed|GrapeshotCrawler|NewsGatorOnline|Sosospider|OpenISearch|discobot|EasouSpider|FeedDemon|YottaaMonitor|CacheSystem|UnwindFetchor|JikeSpider|Konqueror|Superfeedr|Nachobot|percbotspider|WeSEE:Search|Cliqzbot|Exabot|Wget|TweetedTimes|YoudaoBot|stumbleupon|omgili|BoardReader|Gigabot|trendictionbot|InAGist|DoCoMo|PaperLiBot|YisouSpider|TweetmemeBot|libwww-perl|YandexDirect|CrystalSemanticsBot|httrack|msnbot-UDiscovery|MaxPointCrawler|CrystalSemanticsBot|W3C_Validator|magpie-crawler|Flipboard|flipboa|PostRank|Chrome-Lighthouse|Summify|Sogou|archive.org|UptimeRobot|robot|A6-Indexer|ShowyouBot|crawler|Genieo|Apache-HttpClient|curl|Technoratibot|Feedbin|SensikaBot|SiteExplorer|Digg|Yahoo Pipes|QuerySeekerSpider|Alamofire|AhrefsBot|SeznamBot|Kraken|BomboraBot

The list above is only partial and has not been updated in a long time, so it is not “perfect” and is quite “rusty”… :slight_smile:
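A pipe-separated list like the one above can be dropped into a regular expression as an alternation. The sketch below uses only a handful of entries copied from the list (the full list works the same way); the function name, the case-insensitive matching, and the test User Agent strings are assumptions for illustration.

```python
import re

# A few entries copied from the list above; case-insensitive matching is an assumption.
UA_PATTERN = re.compile(
    r"DotBot|AhrefsBot|Baiduspider|bingbot|YandexBot|crawler|curl|Wget",
    re.IGNORECASE,
)


def looks_like_listed_bot(user_agent):
    """Return True if the User Agent contains any of the listed substrings."""
    return bool(UA_PATTERN.search(user_agent or ""))


print(looks_like_listed_bot("Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"))
print(looks_like_listed_bot("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/87.0 Safari/537.36"))
```

Substring matching only catches clients that identify themselves; a bot that sends a regular browser User Agent string will not match any entry.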

Our direct experience over two decades, including writing a lot of bot detection and visualization code (and publishing a number of papers, videos and presentations on this topic), is that only a handful of bots respect robots.txt and those who do respect the directives are from big companies like Google, Bing (Microsoft), etc.

The most aggressive bots fake their User Agent string so that it looks like a “non-bot” User Agent string.

Furthermore, the most aggressive offenders are bots from China, Russia and Korea, and we have plugin code for our legacy forums which detects these rogue bots based on honey pot techniques and other behavioral patterns. You can see some of the results in the attached paper, which has nice colorful pictures of bots in cyberspace to enjoy.

For example, from our experience and direct cybersecurity visualization research, all the bots listed in the Discourse OOTB robots.txt do not respect robots.txt, including DotBot, semrushbot, and ahrefsbot (we had a very big problem with ahrefsbot, which is highlighted in another presentation, see illustration):

User-agent: DotBot
Disallow: /

User-agent: mauibot
Disallow: /

User-agent: semrushbot
Disallow: /

User-agent: ahrefsbot
Disallow: /

User-agent: blexbot
Disallow: /

User-agent: seo spider
Disallow: /

In the distant past we started listing these same bots (plus many more!) in robots.txt, and we still do. We found that “just about zero” of the bots listed above respect the robots.txt directives.
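For context on why directives like these depend on the crawler's cooperation: the check happens on the client side, and nothing on the server enforces it. A minimal sketch using Python's standard `urllib.robotparser` shows what a compliant crawler does before fetching a page. The two rules are copied from the block above; the `allowed` helper and the `forum.example.com` URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Two of the rules quoted above.
ROBOTS_TXT = """\
User-agent: DotBot
Disallow: /

User-agent: ahrefsbot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(bot_name, url="https://forum.example.com/latest"):
    """What a well-behaved crawler asks itself before requesting `url`.

    `bot_name` is the crawler's product token (e.g. "AhrefsBot"),
    not a full browser-style User Agent string.
    """
    return parser.can_fetch(bot_name, url)


print(allowed("AhrefsBot"))  # listed with Disallow: /
print(allowed("Googlebot"))  # not listed in this excerpt
```

A crawler that never runs a check like this, or ignores the answer, can still request every URL on the site.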

You are very lucky if your hosted sites have a different experience!

We have done extensive testing and written a lot of visualization code and we know for a fact, from peer-reviewed research, that most bots do not respect robots.txt and mostly only a handful of “top tech company” bots do respect it.

Although this paper we wrote (below) does not list all bots, it does give you an example of how extensively we have tested and written code (in the Unity gaming engine and on LAMP platforms) in this area:

https://www.researchgate.net/publication/320008976_Virtualized_Cyberspace_-_Visualizing_Patterns_Anomalies_for_Cognitive_Cyber_Situational_Awareness

Have also attached this paper as a reference so no need to download from ResearchGate.

Enjoy!

Virtualized_Cyberspace_-_Visualizing_Patterns_Anom.pdf (2.0 MB)

PS: I plan to port a lot of my legacy LAMP bot detection code to Rails in 2021, if I have time!

See also:

https://www.researchgate.net/publication/314356740_Patterns_Anomalies_in_Cyberspace

(also attached below)

anomalies_cyberspace_v01.pdf (3.3 MB)

Example graphic from presentation, showing over 200 Chinese Baidu Bots disguised as regular users (using a normal user UA string, not a “bot string”), pulling a site from Brazil IP addresses (not China).

1 Like

Curious to see what the changes are. Is there a commit/CL?

1 Like

Yep, and our customers would be absolutely screaming bloody murder if this was the case, because they are effectively charged per pageview. Rogue bots doing excessive pageviews are something that cost them money, and would cause them to leave our hosting platform. That’s why we throttled Bing so heavily, for example – feel free to have a search if you’re curious.

So yes, our 7+ years of hosting experience so far has demonstrated that rogue webcrawler / bots, while they do exist, aren’t a significant problem.

(I’d say the same for Stack Overflow, which is a top 100 web property that I co-founded.)

4 Likes

Hey Jeff!

Great conversation!

I was on a conference call with the CFO of one of the largest technical ad networks headquartered in NYC not long ago, and he told me they (and their advertisers) considered bot traffic (rogue and otherwise) one of their top concerns, and that they spend a lot of money on this very topic (distinguishing legitimate user traffic from bot traffic).

So you are indeed very lucky if your websites are not experiencing the same issues that plague Wall Street and their advertisers who fight this constantly.

Many businesses I have worked with in cybersecurity and anti-fraud over the past two decades have had just the opposite experience to the one you describe, to be candid.

Well done, Jeff!

1 Like

OBTW, you might find this of interest. It is “dated” (five years ago) but the problem has not gotten “better” since 2015:

Quote from CSOonline above (2015):

“Good bots” accounted for 36 percent of traffic this year, up from 21 percent last year. “Bad bots” were responsible for 23 percent of traffic this year, down slightly from 24 percent last year – not because volumes were down, Essaid repeated, but because the number of “good bots” rose dramatically. Human traffic was just 41 percent, down from 55 percent last year.

The company defines “bad bots” as those that don’t respect “robots.txt” files and don’t provide value to the sites they visit.

I will try to find some references which are closer to the year 2020 and post back, since 2015 is a bit dated at five years ago!

My experience with cybersec customers is that the “bad bot” traffic numbers are much higher in 2020 than in the 2015 CSO report above, so anyone who does not have a “bad bot” problem is very fortunate indeed! We have written a lot of “detect and classify bad bots” code over the last decade, and it’s a pain as bot programmers get more “tricky” :slight_smile: and are good at changing their UA strings (along with bot timing and behavior) to look like legitimate human traffic (long before CloudFlare existed).

It is really good to hear from Jeff that Discourse sites are basically immune to this “bad bot” traffic and do not need complex bot detection code to mitigate the problems others struggle with.

Is all Discourse hosting behind CloudFlare? CloudFlare is designed to protect against this.

1 Like

As I said, we’d literally be out of business if what you described (wildly pervasive rogue bots pulling down millions of pages per second) was true, so I guess it’s a kind of miracle! I’m not sure how to explain this conflict between what you believe to be true, and the actual business realities I’ve experienced at Stack Overflow (2008-2012) and Discourse (2012-today).

On the other hand, ad networks and bots are a very different conversation, since bots that pretend to be users and click ads are a way to print “free” money for the bot writers.

Perhaps the difference is that most of our customers don’t rely on ads? And even at Stack Overflow, display ads were a small part of the business. Might be a good idea to keep that crucial difference in mind when you think about this.

5 Likes

Hi Jeff,

FYI, it is fairly common knowledge, not my personal opinion, that bot traffic exceeds human traffic on the Internet.

It is also common knowledge, not my personal opinion, that a large percentage of bot traffic comes from bots which do not respect robots.txt. Some estimate at least half; my experience is that it is “site and subject dependent”.

I’m glad you have a different experience at the companies you have founded and built and very happy for you.

On the other hand, the fact on the Internet is that bot traffic in 2020 is somewhere between 55 and 60 percent of all traffic, and perhaps half of that bot traffic comes from bots which do not respect robots.txt. Some research will put the figure for “bad bots” as low as 35 percent of all traffic, some higher, depending on the research. I’m not making this up; it is well documented.

If you have research papers or statistics, beyond your experience hosting at Discourse or at your prior company, showing that “bad bot” traffic is minuscule, I would be very pleased to read them, because personally I have never seen any research paper or referenced article which states that “bad bot” traffic is as minuscule as you are stating here.

I apologize if not agreeing with you displeases you. I have provided references and can provide more references (not my opinion), if you are open to the facts about Internet traffic.

Otherwise, I’ll stop posting on this topic so as not to annoy you :slight_smile: as I don’t want to be disagreeable with you over something you have a strong opinion about, on a forum where I have no admin power :slight_smile:

Happy Holiday Season!

1 Like

Maybe for the ad networks who are locked in mortal combat with bots and fake clicks for advertising dollars. But at Stack Overflow and Discourse? It is largely a non issue.

If you enjoy arguing hypotheticals based on theories, by all means knock yourself out. Spend all day every day theorizing to your heart’s content. I hope this theorizing brings you great joy and happiness in your life! In the meantime we have businesses to run, so I prefer to make decisions based on the actual data we have gathered at our actual businesses. I guess I am a little crazy that way. Sorry if you find that bothersome or perplexing.

Have a wonderful rest of your day!

4 Likes

Hmm… I might be missing something, but the research that you linked above does not actually seem to show general trends across the web.

It seems to be focused on displaying traffic to a site in a way that makes spotting and quantifying… questionable… traffic a fairly simple visual exercise, which seems interesting in itself. However, there is no indication of what sites were represented, nor even what types of site. It’s hard to evaluate whether the instances shown are representative of the web as a whole.

Note: I’m not questioning whether Bot traffic is huge in general, nor whether there are a lot of “bad” bots… but the (googleable) statistics seem to have a bit of a spread from the search result that you screenshot.

What might be more useful would be a statistical analysis of what kinds of sites tend to be aggressively targeted by what type of bots. (I’d expect, for example, that FaceBook and similar platforms attract a disproportionately large amount of attention from a certain segment of these bots. Another segment probably goes after ad-heavy sites pretty exclusively.)

3 Likes

Hi Jeff,

If you want to paint me as a “nutty theorist” who has no idea about network operations on the Internet, then so be it; but nothing can be further from the truth, as anyone who knows me, already knows :slight_smile:

The OP had a spike. It was more-than-likely caused by a bot. I think we can agree on that :slight_smile:

Have a great day, Jeff and a Fantastic Holiday!

Also, thank you for introducing me to Ruby on Rails. If it were not for you and Discourse, I would not be writing Ruby code every day (outside of Discourse), and that was the best technical thing that happened to me in 2020! I just love Ruby.

Thank you again, Jeff!

1 Like

Hi @Sailsman63

I provided some supplemental references in a number of areas; and have not posted, nor claimed to post, a detailed work or survey of all Internet traffic in every operational scenario.

In my view, any engineer with reasonable research and analytical skills who spends at least 60 minutes researching on the Internet will (1) find many references to operational reports (not theory) on what percentage of network traffic on the Internet is attributed to bots and (2) find a number of references which also quantify how much of that traffic comes from “bad bots” that do not respect robots.txt.

This is not “theory” or “my idea”. It is a well-established fact, and it is not hidden from anyone who cares to look into it. From an operations point of view, we see the same thing every day when we analyze log files and process traffic behavior on web sites, for example by setting up honey pots that only bots can find (normal human users never go there), so that only bots end up there.

I have set up many “honey links” on web sites and have trapped many bots in my day, so this is not something I just made up “out of the blue”, LOL :). Others on the net have done the same (it is a common cybersecurity technique); it’s not just me, I promise you :slight_smile:
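To make the “honey link” idea concrete, here is a minimal sketch, not the plugin code mentioned earlier in the thread. It assumes a page contains a link that human visitors cannot see or reach, and that the requests have already been parsed out of the server logs. The trap path, the helper name, and the sample hits are all hypothetical.

```python
# Hypothetical URL that is linked invisibly and never shown to human visitors.
TRAP_PATHS = {"/trap/do-not-follow"}


def clients_hitting_traps(hits, trap_paths=TRAP_PATHS):
    """Map each client IP that requested a trap path to the User Agents it used.

    `hits` is an iterable of (client_ip, path, user_agent) tuples.
    """
    caught = {}
    for ip, path, agent in hits:
        if path in trap_paths:
            caught.setdefault(ip, set()).add(agent)
    return caught


# Hypothetical parsed log entries, for illustration only.
hits = [
    ("203.0.113.7", "/latest", "Mozilla/5.0"),
    ("203.0.113.7", "/trap/do-not-follow", "Mozilla/5.0"),
    ("198.51.100.4", "/latest", "Mozilla/5.0"),
]

print(clients_hitting_traps(hits))
```

Because the trap is keyed on behavior rather than on the User Agent string, it also catches bots that present themselves as regular browsers.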

Have a nice day!

1 Like