Disable or bypass feature detect for Googlebot (while serving JS app to crawlers)

I’ve been trying to get Discourse to fully serve the JS app to Googlebot - getting very close.

Courtesy of @pfaffman: by executing the code below in the rails console, I was able to get the JS app to show up when using Chrome and spoofing the user agent to Googlebot or Googlebot Smartphone.

SiteSetting.non_crawler_user_agents = "trident|webkit|gecko|chrome|safari|msie|opera|goanna|discourse" + "|rss|bot|spider|crawler|facebook|archive|wayback|ping|monitor|lighthouse"
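(Note the added `|` between the two lists; without it the concatenation produces `discourserss`, which breaks both the `discourse` and `rss` tokens.) As I understand it, Discourse treats a user agent as a crawler when it matches the crawler list but not the non-crawler list, so merging the crawler tokens into `non_crawler_user_agents` whitelists bots for the JS app. A plain-Ruby sketch of that idea (no Discourse required; the precedence rule is my assumption about how the settings interact, so verify against your instance):

```ruby
# Sketch: why merging the two lists serves the JS app to bots.
# Assumption: a UA matching non_crawler_user_agents is served the app,
# even if it also matches crawler_user_agents.

non_crawler = "trident|webkit|gecko|chrome|safari|msie|opera|goanna|discourse" \
              "|rss|bot|spider|crawler|facebook|archive|wayback|ping|monitor|lighthouse"
crawler     = "rss|bot|spider|crawler|facebook|archive|wayback|ping|monitor|lighthouse"

def matches?(list, user_agent)
  Regexp.new(list, Regexp::IGNORECASE).match?(user_agent)
end

googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

puts matches?(crawler, googlebot)      # true: it still looks like a bot
puts matches?(non_crawler, googlebot)  # true: but it is now whitelisted for the app
```

With the stock `non_crawler_user_agents` value, the second check would be false and Googlebot would get the crawler view.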

However, when I test with the Google mobile-friendly tool (or URL Inspection in Google Search Console), it gives me a blank screenshot with the below HTML

Bing is similar, but with them it shows the content. I think Bing shows the content because its crawler is not “mobile”. Relevant post here by @sam

According to @david and this post, it would seem the “feature detect” is the culprit

I’m cautiously optimistic there’s a simple workaround. Every 10-20 tries when using the tool, Googlebot will render the app properly.

My theory is that since Googlebot is notorious for not downloading every resource when accessing a page, the specific JS file (the one with the feature detect that causes the revert) sometimes doesn’t get loaded, and hence the page looks good.

So, in conclusion, how would someone go about disabling the feature detect for Googlebot (or, if it’s easier, for all crawlers/bots)?

Edit Just in case I’m off with the terminology: when “feature detect” is mentioned on meta, is that referring to browser detection? (Perhaps with files like browser-detect.js and other dependencies.)

Or is “feature detect” a broad phrase for what Discourse does when it tries to understand the technology that’s trying to access the app.

Is there a reason why you want to serve the JS version to Googlebot? Google probably won’t be able to find paginated list views, including the paginated home page and topics that have more than a certain number of posts. In the bot view, the topic lists are crawlable, but Googlebot probably isn’t going to trigger the endless scrolling.

Glad you asked. Yes, I think it’s the reason for my Google “soft penalty”.

Allow me to elaborate.

We had a very sloppy site update around Sept/Oct 2019, and the main site tanked right then and there.

We never recovered, even though the site has never been better as far as SEO goes. Sure, it’s not perfect, but we’re light years ahead of some of the competition. Sites that use our many-years-old images and text outrank us by pages: we’re on the 3rd page and they’re perhaps at the top of the 2nd.

I’ve been through countless SEO blogs, videos, and posts, and even had some back and forth with John Mueller (on Reddit).

The most I got out of him was that it could be “quality issues”. We have improved the main site dramatically since Jan 1 of this year; not even a blip in organic traffic.

Discourse: I had it installed back in 2013 and forgot about it. I would barely check its traffic.

If you look at the main site analytics, you’ll see a sharp drop towards the end of the chart. This is when I started working on Discourse.

When trying prerender.io on Discourse, the rank for the main site was all over the place, sometimes jumping 10-15 spots overnight, then back. (I have since stopped using prerender, as they couldn’t render the main menu, login, etc.)

From what I read online, this is a sign Google doesn’t know where to place you. They say just a little “more” and you’re on the good side of the algorithm.

Nothing we’ve done in the last 3 years has triggered these fluctuations in the SERPs.

(Messing with Google disavow tool, cleaning up code, clean URLs, site structure, internal linking, social, content, etc.)

You might make the argument: why didn’t Google penalize you in 2018? (You had Discourse on the subdomain then too.)

Well, I think it was a multitude of factors unique to the site, its history, and its link profile that caused it to tank in late 2019. It seems that Google reshuffled the site’s rank and perhaps gave the Discourse URLs more weight than it had previously.

And the thing is… I love Discourse. Especially now that I’ve been on meta more: all these cool plugins and features I had no idea existed. Wiki, subscription payments, table of contents, and now chat!

So moving away from Discourse is not really an option; too much is invested at this point.

I did consider this, and I’m willing to take my chances. I know it won’t be perfect, but from what I read and watch, Google has gotten really good at understanding JS as of late.

They even deprecated the AJAX crawling scheme:

Times have changed. Today, as long as you’re not blocking Googlebot from crawling your JavaScript or CSS files, we are generally able to render and understand your web pages like modern browsers.

Side note: Discourse has a setting for AJAX crawling; I guess that has to eventually go.


So the plan is to serve the app to Google, do my best to fix any SEO issues that may arise and enjoy the spike in traffic.

I can then report the results on meta and make the case that Discourse should consider optimizing the JS for Google.

For example, maybe something like this (from google blog) would help with the pagination and scrolling concerns.

And keep the non-crawler version for old browsers.

If I may add… :smirk:

Before I ever brought up serving the JS version to Google here, I was tinkering with it.

I tested sending the JS version to Google around the beginning of April or so, using the Google mobile tool. I remember it returning a result most of the time (even if it looked broken).

I thought it might be this commit; I made the code edits, rebooted, and saw the same behavior.

Perhaps someone remembers a PR or commit in the past couple months that may have altered browser and/or crawler detection?

Edit Sorry for all the updates; the more info the better, amirite?

While trying prerender last month, Google ended up adding 2000 URLs to the forum’s Coverage report (mostly these URLs).

They were all served in 0.005 seconds; prerender had the URLs cached and ready for Googlebot to access, so it took them all quickly.

Point is, perhaps the crawler got “very used” to the no-JS version and committed resources to get those 2k pages.

So now it’s accessing the site in this manner until it figures things out (and needs to access it with JS more). Just a theory.

Were you working on something that changed the way Discourse gets crawled, like trying to use prerender on it, back then?

If you check your landing pages report in GA, does it give any clues about which part of the site was affected?


For the main site, if John Mueller suggested that there are quality issues, I’d go through their quality docs and ask whether any of them apply.

From a quick look, you have redirect chains from the 2019 site that might be longer than Google can crawl.

One candidate for a sudden penalty is that URLs from the 2019 site have 5 redirects, but Google says to keep it “less than 5” or they might not follow the redirects. That might have made it appear to Google that the old pages disappeared from the Web.

Example:

curl -sSL -D - http://flynumber.com/virtual-phone-number/united-states_alexandria_1-318 -o /dev/null -H 'User-Agent: header_bot'

The redirects are probably the easiest to fix. Instead of doing them like this:

/a/b/c/d/e/final-destination

I’d do it like this:

/a/final-destination
/b/final-destination
/c/final-destination
/d/final-destination
/e/final-destination
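That flattening advice can be sketched in a few lines: precompute each legacy URL’s final destination once, then serve single-hop 301s. A minimal Ruby sketch, with made-up paths (not FlyNumber’s real routes):

```ruby
# Flatten a redirect chain: instead of a -> b -> c -> d -> e -> final (many hops),
# precompute every legacy path's final destination and 301 each one directly.
# Paths are illustrative only.

CHAIN = {
  "/a" => "/b",
  "/b" => "/c",
  "/c" => "/d",
  "/d" => "/e",
  "/e" => "/final-destination",
}

def final_destination(path, chain)
  seen = []
  while chain.key?(path)
    raise "redirect loop at #{path}" if seen.include?(path)
    seen << path
    path = chain[path]
  end
  path
end

# Build the flat, single-hop map once; serve each entry as a direct 301.
FLAT = CHAIN.keys.to_h { |src| [src, final_destination(src, CHAIN)] }

FLAT.each { |src, dest| puts "#{src} -> 301 -> #{dest}" }
```

Every legacy path then reaches the final URL in one hop, well under Google’s “less than 5” limit.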

(It looks like you also have doorway pages, and automated synonymizing, but I’d try to fix the simpler problem first, since that might be enough.)

Thanks, Josh, appreciate the feedback here.

Nice catch. While the redirect setup was poorly executed and took many months, Google seems to have figured out which pages mean what.

In other words, eventually I started to see more and more of the pages I 301’ed to ranking for the keywords used on the old pages.

This makes a lot of sense, and I’ll see how I can get that implemented. Presently, Search Console doesn’t show the crawler getting 301s too often. It seems that when the rank gets better, they follow more 301s. Correlation without causation, perhaps.


It’s totally not a knock on Discourse; I’m just not easily convinced by “thousands of Discourse users have great organic traffic”.

Google is not really going to tell us either.

We must always remember Google is an algorithm; they’re not looking at this through human eyes.

While both versions share similar content, and Google knows it’s not malicious cloaking, they still have to adjust rank.

One version looks way better, works better, and gives some sense of internal link structure. The other is a glorified RSS feed.

Google has no idea I have this slick forum that works on all [modern] devices, truly encourages discourse, and is one of the coolest things the internet has ever created.

I always like to use the “Powered by Discourse” do-follow link in the crawler version (just because it’s easy).

Again, I know it’s not malicious, but you must look at it through Google’s eyes: you, FlyNumber (not https://community.cloudflare.com/), are giving us this crawler version with an external link you are not showing regular browsers.

I could totally see the algorithm picking up on what’s going on and ignoring the external link for the Cloudflare domain (as it’s such an authority).

It’s not like what Google applies to Cloudflare will apply to me.

“Did someone pay you for this external link you show bots (but don’t want to show regular users)?” is more about how they may look at the site. I’m not saying it’s this, but it’s a possibility you’ll want to eliminate.

In simplest terms, the crawler version doesn’t have a menu or any real structure.

That’s the content the algorithm thinks you want to serve to end-users.

From a very general perspective, I can’t see the algorithm rewarding that.

Maybe it’s time we start considering a real overhaul of the crawler version. At least add the main menu and suggested topics at the bottom.

Interesting update: Google has added “JSON” to the “file types” in Crawl Stats for my Discourse instance. “JavaScript” is a separate “file type”.

I will follow closely, but I’d still love for this to render properly in the Google tool.

I’m starting to think my logic was flawed from the beginning. It would explain why no one responded; perhaps nothing is wrong.

Here’s a fresh article on how it’s normal for Google to show a white page in the screenshot

I can see the “crawled” HTML for the home page now. This is the indexed version, not from “Live test”, and it shows the full page. Keep in mind, Google figured this out while being served the full JS app.

What’s interesting is that they went down to about the 27th post on the home page as far as indexing. So the endless-scroll thing is something Google understands.

Not sure if it helped, but I unchecked the AJAX setting in the admin settings. That setting caused Google to find URLs like the one below (and serve them the crawler version); now that it’s unchecked, that URL shows the JS version:

https://discuss.flynumber.com/t/japan-phone-numbers-disconnect-notice/2351?_escaped_fragment_=
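For anyone who prefers the rails console to the admin UI, the same toggle can likely be flipped there. The setting name below is an assumption based on the admin UI label (search your admin settings for “escaped” to confirm; the name may differ by Discourse version):

```ruby
# Rails console sketch: disable the deprecated _escaped_fragment_ (AJAX crawling)
# support. Setting name is assumed; verify it exists on your instance first.
SiteSetting.enable_escaped_fragments = false
```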

Now all I need to figure out is how to clean up those extra canonical URLs Discourse creates for the user pages.