OpenAI has created a web crawler named GPTBot.
As a Discourse admin, I checked the <site>/admin/reports/web_crawlers report and have not seen it yet.
Curious if others have seen it in the wild.
I have (and just blocked it).
Note… I have seen a misguided sentiment out there of:
Just block it
This is a one-way relationship
I feel this is missing one important point. Having OpenAI crawl meta.discourse.org has been highly beneficial for CDCK. When you ask GPT-4 Discourse questions, it has at least a fighting chance of answering them.
It is a two-way relationship:
You give OpenAI access to data
OpenAI burns forests training the LLM on your data, which can result in value for you.
Also related: How to prevent community content from being used to train LLMs like ChatGPT?
We see some GPTBot access across our fleets, maybe 20-40x less traffic than we see from Googlebot.
Anyone uncomfortable with it can block it directly in the Discourse UI, but the bot appears to be very well behaved compared to some bad ones we have seen.
For those wanting to identify some of the bad ones, as some of us find them we note them in this post.
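For anyone weighing the block option mentioned above, here is a minimal sketch of how a robots.txt rule aimed at the GPTBot user agent gets evaluated, using Python's standard urllib.robotparser. The robots.txt content and the forum.example.com URL are illustrative assumptions, not what Discourse itself generates, and a rule like this only stops crawlers that choose to honor robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: deny GPTBot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://forum.example.com/latest"  # placeholder site
    for agent in ("GPTBot", "Googlebot"):
        print(agent, "allowed" if is_allowed(agent, url) else "blocked")
```

Badly behaved bots tend to ignore robots.txt, so for those a user-agent or IP block on the server side is the more reliable route.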
Yes, this was my first time using the crawler report too, and lo and behold, there it was.
My take is it appeared in August, and it’s the largest crawler of them all.
Here is an example of a 24-hour period and the kind of ratio:
#1 ChatGPT 18K pageviews
#2 mj12bot 1.8K pageviews
…
#4 Google 1.7K pageviews
This Discourse deployment was switched to login_required specifically to stop the crawler dead before it gets at the content, so it must only be hitting the login page to clock up those hits, right?
Could it use a user?
I assume that is technically possible but not likely, and if so I would expect such a user to suddenly have a really high post read count.
Right now it looks to be close to 100K pageviews, far in excess of the next highest, which is at less than half that.
The ChatGPT crawler is a monster.
Is your #3 unidentified? I have one of those as well. It only shows as “—” in the list. It’s also #3 on my list, but pageviews from bots are a lot fewer on my login required private forum.
No, well, yes, kind of. I couldn’t read it as it was truncated, but I think it is an AppleWebKit crawler. I’d need to export the data to read the full entry.
Since then I have blocked virtually all crawlers, even though mine, like yours, is a login_required private forum. Crawlers have dropped to 20 so far today, compared to nearly 14,000 a few days ago!
On your dashboard, admin/reports/web_crawlers will show web crawlers for the past 30 days. Hovering over each crawler temporarily shows its full description without having to export the list. To view the past day instead, change the range using the calendar on the upper right and click Refresh.
So far in the past 24 hours I had 3 crawlers (the 1st is the worst):
PetalBot - petalsearch.com/bot/petalbot - 4 views
GPTBot - OpenAI Platform - 3 views
— - (no description) - 1 view
Over the course of 30 days, PetalBot crawls the most, followed by Yandex.
I see it now; it’s about 15 lines down. I added “—” as a crawler to the block list. It’s very low compared to the most egregious, but let’s see what happens.
I have almost 50 listings since January, but amazingly ChatGPT, in under two weeks or so, has more than double the pageviews of the second-highest bot for the entire period from January up to today. At 7-8K a day, ChatGPT would reach almost 3 million pageviews over a full year if the rate held.
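To sanity-check that yearly figure, here is the arithmetic as a quick sketch. It assumes the 7K to 8K daily rate quoted above stays flat for 365 days, which is only an extrapolation.

```python
DAYS_PER_YEAR = 365


def yearly_pageviews(daily: int) -> int:
    """Project a constant daily pageview rate over a full year."""
    return daily * DAYS_PER_YEAR


low = yearly_pageviews(7_000)   # lower end of the observed daily rate
high = yearly_pageviews(8_000)  # upper end of the observed daily rate

if __name__ == "__main__":
    print(f"{low:,} to {high:,} pageviews per year")
    # prints: 2,555,000 to 2,920,000 pageviews per year
```

So "almost 3 million" matches the upper end of the range.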
Just added Grammarly to the block list!
If anyone is interested, here’s the range of IPs GPTBot (OpenAI) uses as published on their website. They have 9 IPs listed.
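If you want to check whether an address from your logs falls inside a published list like that, a rough sketch with Python's standard ipaddress module could look like this. The ranges below are placeholder documentation networks (RFC 5737), not OpenAI's actual ranges; substitute the ones from their published list.

```python
import ipaddress

# Placeholder CIDR ranges for illustration only (RFC 5737 documentation
# networks). These are NOT the real GPTBot ranges.
PLACEHOLDER_RANGES = ["192.0.2.0/28", "198.51.100.16/28"]


def in_ranges(ip: str, ranges: list[str] = PLACEHOLDER_RANGES) -> bool:
    """Return True if ip falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)


if __name__ == "__main__":
    for ip in ("192.0.2.7", "192.0.2.200", "198.51.100.20"):
        print(ip, in_ranges(ip))
```

Matching on IP as well as user agent helps spot requests that only claim to be GPTBot.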