Since you are running in a sub-folder, you're on your own to generate an appropriate robots.txt file; the one Discourse generates lives at community/forum/robots.txt. (Though it seems your point is that robots.txt doesn't matter anyway when external sites link to a profile?)
You could also enable the `hide user profiles from public` site setting. That will "Disable user cards, user profiles and user directory for anonymous users", which would keep Google away from them.
My bad @rbrlortie, I didn't realize this was a subfolder install, so that is a different animal deserving of its own topic.
My response is pretty much what @pfaffman said, above ↑
Since Discourse does not control the top level of the website, it has no control over robots.txt in this scenario. You'll need to generate that file yourself.
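As a rough sketch (the rules and paths below are assumptions; copy the real Disallow lines from whatever your instance actually serves at community/forum/robots.txt), a hand-maintained robots.txt at the site root might look like:

```text
# Root robots.txt, maintained by hand because Discourse only controls
# community/forum/robots.txt in a subfolder install.
User-agent: *
# Mirror the rules Discourse generates for its own paths, prefixed with
# the subfolder (example paths only; check your generated file):
Disallow: /community/forum/u/
Disallow: /community/forum/my/
Disallow: /community/forum/auth/
# ...plus any rules for the rest of your site.
```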
“One thing maybe to keep in mind here is that if these pages are blocked by robots.txt, then it could theoretically happen that someone randomly links to one of these pages. And if they do that then it could happen that we index this URL without any content because it's blocked by robots.txt. So we wouldn’t know that you don’t want to have these pages actually indexed.
Whereas if they’re not blocked by robots.txt you can put a noindex meta tag on those pages. And if anyone happens to link to them, and we happen to crawl that link and think “maybe there’s something useful here” then we would know that these pages don’t need to be indexed and we can just skip them from indexing completely.
So, in that regard, if you have anything on these pages that you don’t want to have indexed then don’t disallow them, use noindex instead.”
Since the default Discourse behavior is to attempt to hide those pages from crawlers, in my eyes the feature is broken.
The pages listed in the default Discourse robots.txt should have `<meta name="robots" content="noindex">` present.
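If it helps, here's a rough way to check any given page for that tag (and for an equivalent X-Robots-Tag header); the profile URL below is only a placeholder:

```python
# Rough check for "noindex" on a page: looks at the X-Robots-Tag response
# header and at <meta name="robots"> tags in the returned HTML.
import re
import urllib.request

# Placeholder URL; point it at one of your own profile pages.
url = "https://example.com/community/forum/u/some-user"

with urllib.request.urlopen(url) as resp:
    x_robots = resp.headers.get("X-Robots-Tag", "")
    html = resp.read().decode("utf-8", errors="replace")

# Naive pattern: assumes name="robots" appears before content="..." in the tag.
meta_noindex = re.search(
    r'<meta[^>]+name=["\']robots["\'][^>]*content=["\'][^"\']*noindex',
    html,
    re.IGNORECASE,
)

print("X-Robots-Tag:", x_robots or "(none)")
print("meta robots noindex:", "present" if meta_noindex else "absent")
```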
I think the concern is that, because Google gives each site a finite amount of crawl time, people want noindex so that none of those resources are squandered crawling and erroring on pages they didn’t want indexed in the first place.
From that perspective alone it begins to make sense. I’ve seen how long it can take for Google to crawl and discover all the content on large, freshly indexed sites. It can be a matter of days for Google to crawl it all, and much, much longer before it figures out where it should be checking more frequently.
The logic doesn’t make sense; robots.txt is specifically there to exclude content from web crawlers. That is its entire purpose for existing.
After arriving at a website but before spidering it, the search crawler will look for a robots.txt file. If it finds one, the crawler will read that file first before continuing through the page. Because the robots.txt file contains information about how the search engine should crawl, the information found there will instruct further crawler action on this particular site. If the robots.txt file does not contain any directives that disallow a user-agent’s activity (or if the site doesn’t have a robots.txt file), it will proceed to crawl other information on the site.
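For what it's worth, Python's standard library exposes exactly that lookup, so you can sanity-check how a crawler would interpret a given robots.txt (the URLs below are placeholders):

```python
# Ask the question a crawler asks before fetching a page:
# "does robots.txt allow this user agent to fetch this URL?"
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()  # fetch and parse robots.txt, just as a crawler does first

for url in (
    "https://example.com/u/some-user",       # typically disallowed by Discourse
    "https://example.com/t/some-topic/123",  # typically allowed
):
    verdict = "allowed" if rp.can_fetch("Googlebot", url) else "disallowed"
    print(url, "->", verdict)
```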
What Google does is respect that (sort of) but crawl it anyway in a “just in case” manner. So I’d expect this “just in case” crawling is already a lower priority than crawling, y’know, what isn’t explicitly excluded from web crawlers…
Ultimately, the NoIndex directive in Robots.txt is pretty effective. It worked in 11 out of 12 cases we tested. It might work for your site, and because of how it’s implemented it gives you a path to prevent crawling of a page AND also have it removed from the index.
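For reference, that unofficial directive looks like the sketch below (paths are placeholders). It was never part of the robots.txt standard, and Google has since dropped support for it, so treat it as an experiment rather than a guarantee:

```text
# Unofficial "Noindex" robots.txt directive, as tested in that study.
# Not part of the robots.txt standard; Google no longer honors it.
User-agent: *
Noindex: /u/
Noindex: /my/
```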