The audience for forums is changing. Some of your readers aren't actually readers anymore - not in the traditional sense. They're agents reading on someone's behalf, summarizing your content into an answer for a person who might never click through or become an actual member. Whether you run a developer support community, a customer forum, or a fan club, your knowledge is being pulled into AI answers right now.
Genuine question: why should I let AI crawlers flood my server? Of course, the article makes it clear that the choice will always be mine, but, from a commercial perspective and using Reddit as an example of how they handle AI scraping, what would the benefits be here?
Recently, I saw that Google is going to create personalized pages based on users’ history, meaning fewer clicks for webmasters and more money for Alphabet. So, again, what’s the point here?
Currently, I allow search engines and cache indexers, such as the Wayback Machine, to read and cache my content. Beyond that, all I can see here is handing my users’ content to Alphabet and others to monetize without my community benefiting in any way, not to mention legal issues like LGPD in my country or GDPR in Europe.
Arguably one of the best features of Discourse.
I cringe every time I look for a data-only URL on some other website and find out there isn’t one.
It would be great if you could always include links to your sources for statements like this. It’d help readers verify the data.
It depends on the purpose of your forum. If it’s a brand or support forum, for example, your goal might be to just get people an answer as fast as possible, and having the content incorporated into AI training could be beneficial. Hopefully, if it’s truly an unsolved issue, people will still make their way to your site to ask about it, but this is still challenging if they’re going to AI first.
In a more social context, AI scrapers are almost entirely useless, because you want people in your community interacting with each other. That might be a good case for trying to block them completely.
From my own professional point of view, working with AI and SEO, the impact and importance of llms.txt has not been proven. Recently, Google came out and said they’re neither using nor supporting it. That doesn’t mean other agents won’t, but it’s one nuance I thought I’d share.
I just don’t, honestly. Personal opinion, but LLMs have always been prohibited from visiting my websites and always will be. I don’t enjoy donating my hard work, whether that be text or code, to scrapers, especially those of OpenAI or Anthropic.
Obviously this is all just personal preference, but this entire AI craze would be over once people stopped allowing these companies to steal their websites’ content. Maybe the latest Google update that people are so against will knock some sense into website owners, who will now no longer get any hits to their sites.
Unfortunately, there’s no foolproof way to block LLM scrapers if your site’s content is publicly accessible. Many of them will ignore robots.txt and even try to appear to be a human visitor (using different user agents and IP addresses) to circumvent blocks. Hopefully some sort of legal regulation can put guardrails on the situation, because it appears many people would like a choice of whether or not their content is used this way!
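For anyone who still wants to state their preference in robots.txt, here is a minimal sketch of how you might generate the rules and check what a well-behaved crawler would conclude from them. It uses Python's standard `urllib.robotparser`. The user-agent tokens in `AI_CRAWLERS`, the helper names, and the sample topic path are my own assumptions for illustration, so check each vendor's documentation for the tokens they currently publish. As noted above, this only expresses a request; it does nothing against scrapers that ignore robots.txt or spoof their user agent.

```python
from urllib.robotparser import RobotFileParser

# Assumption: example crawler tokens only; verify against each vendor's docs.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]


def build_robots_txt(blocked_agents):
    """Return robots.txt text that disallows the given agents and allows everyone else."""
    lines = []
    for agent in blocked_agents:
        # One group per blocked agent, separated by a blank line.
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    # Default group for all other crawlers.
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, agent, url="/t/some-topic/123"):
    """Report whether a crawler that honors robots.txt would fetch `url` (hypothetical path)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


robots = build_robots_txt(AI_CRAWLERS)

if __name__ == "__main__":
    print(robots)
    for agent in AI_CRAWLERS + ["Mozilla/5.0"]:
        print(agent, "allowed" if is_allowed(robots, agent) else "blocked")
```

Anything stronger than this (rate limiting, IP or user-agent filtering at the proxy or CDN) has to happen outside robots.txt, and even then it is best effort rather than a guarantee.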