Noscript tag and some search engines


(Anton Batenev) #1

Yandex is the biggest search engine in Russia, but it doesn't index content inside the HTML noscript tag. Because of that, the forum isn't indexed by its crawler: only topic headers are picked up, not the topics themselves. Some local search engines in other countries may follow the same indexing policy. Here's an example of the poor indexing of this meta forum: no content snippets.

Right now I can do an ugly global hack in app/views/layouts/application.html.erb and replace the noscript tag with a div:

<style type="text/css">#noscript { display:none; }</style>
<noscript>
  <style type="text/css">#noscript { display:block; }</style>
</noscript>

<section id='main'>
  <div id='noscript'>
     ...
  </div>
</section>

But I'd like to apply this hack only for the one User-Agent that contains the +http://yandex.com/bots string, and leave the original, good template for normal search engines like Google :trollface:. Is it possible?
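
For illustration only (and note the cloaking caveats in the replies below), a minimal sketch of that user-agent switch in the layout could look like this; the check string is the one quoted above, everything else is hypothetical:

<section id='main'>
  <% if request.user_agent.to_s.include?("+http://yandex.com/bots") %>
    <div id='noscript'>
       ...
    </div>
  <% else %>
    <noscript>
       ...
    </noscript>
  <% end %>
</section>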


(Alejandro Petroff) #2

Generally it's a bad idea to serve different content to users and search engines (i.e. to use "cloaking"). Search engines normally dislike this practice. It's better to solve the issue at a global level. And Yandex is indeed the biggest search engine not only in Russia but also in Ukraine, Kazakhstan, some East European countries, etc.


(Alejandro Petroff) #3

Here's a translation of the Yandex webmaster documentation, "Indexing AJAX sites":

Source: Индексирование AJAX-сайтов (Indexing AJAX sites)

Indexing AJAX sites

Yandex's robot can index an AJAX site if the structure of the site follows certain rules.

Each indexable AJAX page must have an HTML version. To signal to the robot that an HTML version exists, include an exclamation mark in the page URL:

http://www.example.ru/#blog → http://www.example.ru/#!blog

When the robot finds a link containing the "#!" combination, it requests the HTML version.

The HTML version of each AJAX page must be available at the address where the "#!" combination is replaced with the parameter "?_escaped_fragment_=". For example, the HTML version of the page above is located at http://www.example.ru/?_escaped_fragment_=blog.

In every link it finds, the robot replaces the "#!" combination with the "?_escaped_fragment_=" parameter and requests the modified address (links containing "#!" can also be used in the sitemap).

The HTML version of the home page must be available with the "?_escaped_fragment_=" parameter appended, for example: http://www.example.ru/?_escaped_fragment_=. Note that the parameter value must be empty.

To tell the robot about the HTML version of the home page, include the meta tag <meta name="fragment" content="!"> in the page code.

This meta tag can be used on any AJAX page. For example, if a page is available at http://www.example.ru/blog and contains this meta tag, the robot will index the HTML version of the page at http://www.example.ru/blog?_escaped_fragment_=.

The meta tag must not be placed in the HTML version of the document: in that case, the page will not be indexed.
The link in the search results will take the user to the AJAX version of the page.
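
To make the URL mapping concrete, here is a small illustrative Ruby helper (not part of the Yandex documentation) that performs the rewrite a crawler is described as doing:

require "cgi"

# Rewrite an AJAX "#!" URL into the address a crawler requests
# under the _escaped_fragment_ scheme.
def escaped_fragment_url(url)
  base, fragment = url.split("#!", 2)
  return url if fragment.nil? # no "#!", nothing to rewrite
  separator = base.include?("?") ? "&" : "?"
  "#{base}#{separator}_escaped_fragment_=#{CGI.escape(fragment)}"
end

escaped_fragment_url("http://www.example.ru/#!blog")
# => "http://www.example.ru/?_escaped_fragment_=blog"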


(Jeff Atwood) #5

We do not and will not use hash bangs, which are essentially deprecated on modern HTML5 sites.

So that “solution” will not work.

Better to swap out noscript for that user agent; better still if Yandex starts following the leader, Google, in this area. :smile:


(Anton Batenev) #6

OK, but what do you think about a site setting that enables this workaround only (and only!) if the site owner really (and really!) needs it? I made a pull request to show what I mean:


(Alejandro Petroff) #7

Nice work. Unfortunately we webmasters are under the dictatorship of search engines and their rules, which are sometimes very strange. Forum user content, i.e. the text, is our main treasure…


(Robin Ward) #8

Wouldn’t this mean sacrificing Google support? They have strict anti-cloaking laws.

Does Google not matter to people who want their sites to be indexed by Yandex?


(Anton Batenev) #9

This solution is not cloaking: the content is always identical for crawlers and users. But when the option is enabled, it adds a meta tag to the HTML head section:

<meta name="fragment" content="!">

If a crawler knows what this meta tag means, it uses that information and fetches the static page by adding ?_escaped_fragment_= to the URL. If a crawler doesn't know this meta tag, it simply ignores it.

See Google's "agreement between web servers and search engine crawlers" (the AJAX crawling specification) for more information.
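
On the serving side, the idea is simply to return the plain HTML version whenever a request arrives with that parameter. A rough sketch, with hypothetical view and layout names (this is not the code from the pull request):

class TopicsController < ApplicationController
  def show
    @topic = Topic.find(params[:id])

    if params.key?(:_escaped_fragment_)
      # Crawler following the meta tag: serve the pre-rendered HTML version.
      render "topics/show_static", layout: "crawler"
    else
      # Regular visitors get the normal JavaScript application.
      render "topics/show"
    end
  end
end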

Yandex's market share in Russia is about 60% (Google has about 30%), so the priorities are clear. Also, Google knows about this meta tag, and I think it will not be upset.


(Robin Ward) #10

Oh I see! I apologize that I didn’t understand this at first.

This is actually quite a good solution. I remember wanting to try something like this before we went ahead with the current method.

If Google works properly with this approach too, maybe we should consider not putting the content into the body at all. It would lower our HTML payload, as we'd only have to render that content when requested by a search engine.

This seems like a really good idea to me. Thoughts @sam?


(Sam Saffron) #11

Personally, I still like having the noscript solution around and enabled by default. I am open to adding a switch to disable it if people want to save the 5-8% page weight (tested on this page: 10547 vs 11701 bytes gzipped).

The noscript solution works for crawlers that have no idea about that meta tag, and it gives us something at least for users with JS disabled (screen readers and so on).


(Robin Ward) #12

Screen readers these days can crawl our site fine :smile:

Additionally, @codinghorror previously told me that the only search engine we should really care about is Google. Should we really default to sending 5-8% more for other search engines that nobody is using? Even Bing supports the AJAX crawling API.

Also, for logged in users who aren’t cached, that ERB rendering just won’t happen. Surely that’s got to shave off some rendering time on the server side too.

Assuming this all works as advertised, I'm cool with making <noscript> serving an option, as long as we default it to off.


(Sam Saffron) #13

The first part of the video is showing off how cool it is that you don’t need JS to see the site. I don’t want to lose that cool.

Anyway, I will defer to @codinghorror here on the setting of the default; I strongly want noscript support on by default.


(Anton Batenev) #14

Agreed. The default noscript behavior now works fine in most cases (especially when the user has a browser without JS support, like lynx; it looks really cool). And I don't think you need to break the current noscript support by default.

The workaround above doesn't break this default behavior; it just transparently adds a new capability for old crawlers. If you agree to add it to master, we can test it in our production environment and provide results (or fix the code if something is wrong with Yandex indexing).

You can mark this setting as experimental (like SSL support) and potentially dangerous (like force_hostname) for other users who don't need it.


(Sam Saffron) #15

Sure, I would love to include it. I left some notes on the PR.


(Jeff Atwood) #16

Yeah I concur, I think crawlers that don’t properly handle <noscript> are broken, personally…


(Jeff Atwood) #17

OK some changes coming here.

  1. We will serve a strictly plain-vanilla HTML page to detected, whitelisted web crawlers, with the exact same content, just using 1998-era HTML: no <noscript>, no data islands, nothing else that would interfere with indexing.

  2. In our non-vanilla HTML, the regular content served to every other user agent, we will include the <meta> tag that newer crawlers (perhaps unknown to our whitelist) can take advantage of.

So @abbat, your PR will be an always-on feature now, though I personally would prefer to add Yandex to the crawler user agent whitelist. I think the top 10 web crawlers worldwide should be in that whitelist.

@neil here’s the official Google spec for the user agent:

https://support.google.com/webmasters/answer/1061943?hl=en


(Robin Ward) #18

I’ve just added a commit and tests for this.

https://github.com/discourse/discourse/commit/d95887c57d52c6e2b5d1004b06029a0062bf6b24

I would love for us to add more entries to the whitelist in the CrawlerDetector component.
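
For reference, a whitelist-style detector might look roughly like this (a sketch only; the module name and bot list are illustrative, not the code in the commit above):

module CrawlerDetection
  # Illustrative whitelist of well-known crawler user-agent fragments.
  CRAWLERS = /Googlebot|bingbot|Baiduspider|YandexBot|DuckDuckBot|Slurp|facebookexternalhit|Twitterbot/i

  # True when the user agent matches one of the whitelisted crawlers.
  def self.crawler?(user_agent)
    !!(user_agent =~ CRAWLERS)
  end
end

CrawlerDetection.crawler?("Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)")
# => true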


(Vikhyat Korrapati) #19

The crawler whitelisting from the 16th page of the Bustle Ember.js NYC talk looks pretty good to me. I especially liked the idea of checking for a URL inside parentheses, since it turns out a very large number of crawlers have that. The only problem with their implementation is that it doesn't detect Facebook's crawler, which contains a URL with the http(s):// prefix.

This is the code on that slide:

I can create a PR to add Twitterbot, curl, and a check for a URL to the whitelist, if you think that makes sense?
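
I don't have the slide handy, but the idea could be expressed along these lines (an illustrative sketch, not the Bustle code): most crawlers advertise a URL in parentheses in their user agent, so a generic check plus a few explicit names covers a lot of ground.

# Sketch only: generic URL-in-parentheses check plus explicit additions.
# Example crawler UA: "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
def looks_like_crawler?(user_agent)
  ua = user_agent.to_s
  !!(ua =~ /\([^)]*https?:\/\/[^)]*\)/ ||        # a URL inside parentheses
     ua =~ /Twitterbot|facebookexternalhit|curl/i) # explicit additions
end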


(Robin Ward) #20

That’s very interesting.

I do wonder if it’s a bad idea to give Facebook/Twitter the simple version of the page, because surely they are crawling you for the opengraph/oembed that is present in the default layout?

Or maybe we should move those to the “crawler” layout too and just serve up the simpler version of the page.


(Vikhyat Korrapati) #21

In the case of sharing a topic on Facebook it generates a story like this one:

The text in that comes from the og:description tag, but the image is taken from one of the replies to the post rather than from the og:image tag, which is the poster's avatar. Stories generated via clicks on a "Like" button default to the og:image tag, but that doesn't really matter since Discourse doesn't have any.

So, at least for the case of Facebook, I think it would be better to move the og meta tags to the crawler layout and serve it the simple HTML.
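
Concretely, that could mean a crawler layout along these lines (a hedged sketch; the instance variables and helpers are hypothetical, not Discourse's actual templates):

<%# Crawler layout carrying the Open Graph tags, so Facebook and Twitter
    get the same metadata from the simple HTML version of the page. %>
<!DOCTYPE html>
<html>
  <head>
    <title><%= @topic.title %></title>
    <meta property="og:title" content="<%= @topic.title %>">
    <meta property="og:description" content="<%= @topic.excerpt %>">
    <meta property="og:image" content="<%= @topic.image_url %>">
  </head>
  <body>
    <%= yield %>
  </body>
</html>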