Google scans deep pages (SEO ISSUE?)


(Spooky) #1

Google is crawling and indexing deep page URLs even though there is a canonical tag pointing to the URL without the page number. Is there anything I can do to stop the deep linking and make the links appear without the page number?

For example:

Google crawled the page:

http://domain.com/topic/5/2

Although it has canonical url of:

http://domain.com/topic/5

I want only links to:

http://domain.com/topic/5

There are around 4 pages where the URL is indexed with a page number instead of the original page. I have no idea how Google detects those pages, because I haven’t linked to them at all.

I used the “site:” operator in Google to list the indexed pages, and there I saw the pages with the page numbers indexed. The reason might be that Google scrolls the JavaScript page to load more content, and that’s how it discovers the pages.

If Google sees it as a single page via the noscript tag, I can’t understand how it finds out about the inner pages.

Any help would be appreciated. Thanks.


(Spooky) #2

I found out that even if I am logged out, there are links to the last page of the topic. The link exists in a span tag with class ‘posts’. For example:

   <div itemprop='itemListElement' itemscope itemtype='http://schema.org/ListItem'>
      <meta itemprop='url' content='http://meta.discourse.org/t/bootstrap-failure-due-to-lack-of-locales-in-container/46714'>
      <a href='/t/bootstrap-failure-due-to-lack-of-locales-in-container/46714' itemprop='item'>
        <span itemprop='name'>Bootstrap failure due to lack of locales in container</span>
      </a>
      
        <span class='category'>[<a href='/c/installation'>installation</a>]</span>
      <span class='posts' title='posts'>(<a href="/t/bootstrap-failure-due-to-lack-of-locales-in-container/46714/7">7</a>)</span>
    </div>

This is how Google finds out about these page links and indexes them. Is there an option to not list those links, so Google won’t crawl them?


(Felix Freiberger) #3

I’m not sure which links you are talking about – it looks like Google is only indexing links with page numbers as it should:


(Mittineague) #4

I just looked at a few pages with JavaScript turned off and I couldn’t find mark-up like that.

What page and where on the page exactly are you seeing that?


(Spooky) #5

Maybe I explained it wrong. The number added to the URL is not the page, but the post number inside the topic. So sorry if I confused you.

What this means is that Google crawls URLs with the number that represents the post number at the end of the URL. I wanted to know if it’s possible for the links to inner posts in the topic not to appear in the page, so Google will only crawl the main URL.
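To make the URL structure concrete, here is a rough Python sketch that splits a topic URL into the main URL and the trailing post number. It assumes the shape `/t/<slug>/<topic_id>[/<post_number>]`, which is only inferred from the example links in this thread; `strip_post_number` is a made-up helper name, not something Discourse provides.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed path shape, inferred from the links quoted in this thread:
#   /t/<slug>/<topic_id>[/<post_number>]
TOPIC_PATH = re.compile(
    r"^/t/(?P<slug>[^/]+)/(?P<topic_id>\d+)(?:/(?P<post_number>\d+))?/?$"
)


def strip_post_number(url):
    """Return (url_without_post_number, post_number_or_None).

    URLs that do not match the assumed topic shape are returned unchanged.
    Any query string or fragment on a matching URL is dropped.
    """
    parts = urlsplit(url)
    match = TOPIC_PATH.match(parts.path)
    if match is None:
        return url, None
    path = "/t/{}/{}".format(match.group("slug"), match.group("topic_id"))
    post_number = match.group("post_number")
    main_url = urlunsplit((parts.scheme, parts.netloc, path, "", ""))
    return main_url, (int(post_number) if post_number else None)


if __name__ == "__main__":
    crawled = "/t/bootstrap-failure-due-to-lack-of-locales-in-container/46714/7"
    print(strip_post_number(crawled))
```

Running a list of indexed URLs through a helper like this would show which results carry a post number and which main URL each one belongs to.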


(Felix Freiberger) #6

Well – it looks like Google does not crawl these links with post numbers (because they are not shown to Google).
Can you point to a concrete search demonstrating your problem?


(Spooky) #7

Here’s how Google indexes the forum:

https://www.google.com/#q=site:forum.vrgamesfor.com&safe=off&start=10

You can see that it indexes some pages with the post number at the end of the URL and some without.


(Spooky) #8

I used Google Webmaster tools to crawl the main URL, and it does see a link as follows:

<span class='posts' title='posts'>(<a href="/t/can-the-screen-brightness-be-adjusted-for-playstation-vr-headset/69/5">5</a>)</span>

Which means that the inner post link is visible to Google Crawler.


(Felix Freiberger) #9

Indeed, there are some links like http://forum.vrgamesfor.com/t/ps4-neo-whats-the-benefits-for-vr/90/4 in there.

I’m not sure where Google got them from (most likely, Google found that link on an external site). The page does contain <link rel="canonical" href="http://forum.vrgamesfor.com/t/ps4-neo-whats-the-benefits-for-vr/90">, however, so that should be fine.
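For anyone who wants to verify the canonical tag on a page themselves, a minimal Python sketch follows. It only uses the standard library's `html.parser`; `CanonicalFinder`, `find_canonical` and `is_canonical` are illustrative names, and the sample HTML is a cut-down stand-in for the real page, not a copy of it.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel_values = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attrs.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    finder.close()
    return finder.canonical


def is_canonical(url, html):
    """True if the page declares `url` itself as canonical (ignoring a trailing slash)."""
    canonical = find_canonical(html)
    return canonical is not None and canonical.rstrip("/") == url.rstrip("/")


if __name__ == "__main__":
    # Cut-down stand-in for the page source; only the canonical tag matters here.
    page = (
        '<html><head><link rel="canonical" '
        'href="http://forum.vrgamesfor.com/t/ps4-neo-whats-the-benefits-for-vr/90">'
        "</head><body></body></html>"
    )
    print(find_canonical(page))
    print(is_canonical("http://forum.vrgamesfor.com/t/ps4-neo-whats-the-benefits-for-vr/90/4", page))
```

A post-number URL whose page declares the plain topic URL as canonical comes back as not canonical, which is the situation described above.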


(Mittineague) #10

I’m guessing it got those links from somewhere else.

Please provide an exact location where you’re seeing links like that here while not logged in and with JavaScript off.


(Felix Freiberger) #11

It looks like this is where these links are hiding:


(Spooky) #12

Even meta.discourse.org has these types of links:

<span class='posts' title='posts'>(<a href="/t/our-latest-experimental-branch-es6-modules-text-rendering/46815/5">5</a>)</span>

So it’s not just my forum. At first I thought it came from a plugin, but then I saw the same on this forum.


(Spooky) #13

Yes Felix, these are the links with the last post number. If the Google Webmaster tools crawler sees them, they will be indexed. I know that there is a canonical tag, but as you can see, it isn’t always obeyed, and I would just prefer these links not to be there, because they confuse the crawler and create different URLs for the same topic. I really just want regular page links without the post number at the end. Any suggestions?

I have the following plugins installed:

  • discourse-details
  • discourse-solved
  • Spoiler Alert!
  • discourse-topic-previews
  • docker_manager
  • lazyYT
  • poll

(Spooky) #14

The discourse-topic-previews plugin has these lines:

        $excerpt.on('click.topic-excerpt', () => {
          var topic = this.get('topic'),
              url = '/t/' + topic.slug + '/' + topic.id;
          if (topic.topic_post_id) {
            url += '/' + topic.topic_post_id
          }
          DiscourseURL.routeTo(url)
        })

So it might be due to the plugin, what do you say?

I’m currently removing that plugin and will test things further.

Update: I’ve removed that plugin, but the links are still there. It’s something native to the app, because the element with class ‘posts’ that contains those links is added by Discourse. Google does see it (tested with Webmaster tools), so it crawls these links.

The question is why this tag is added in the first place when the user is not logged in:

<span class='posts' title='posts'>

Am I missing something, anybody with any suggestions?


(Spooky) #15

home page of: forum.vrgamesfor.com

Check the source code and you’ll see this:

<span class='posts' title='posts'>(<a href="/t/can-the-screen-brightness-be-adjusted-for-playstation-vr-headset/69/5">5</a>)</span>

meta.discourse.org has the same code when JavaScript is disabled. So I assume that this is something that needs to be fixed, or at least that admins should be allowed to disable the link in the admin settings.


(Mittineague) #16

All very interesting.

But again, I can’t find them. Help a fellow out:

WHERE in Discourse are you seeing those?


(Spooky) #17

[quote=“Mittineague, post:16, topic:46735”]
WHERE in Discourse are you seeing those?
[/quote]

URL: https://meta.discourse.org

Where: Inside the noscript tag

Example:

<span class='posts' title='posts'>(<a href="/t/our-latest-experimental-branch-es6-modules-text-rendering/46815/5">5</a>)</span>

Just search for all the instances of the string class=‘posts’ and you’ll find the links.

Note: I visit the page as an anonymous user (not logged in). The same goes for my own forum. Google crawls these links. I tested in Google Webmaster tools.

You can also see these links easily if you browse the page with JavaScript disabled in Chrome.
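If you would rather not search the source by hand, here is a small Python sketch that lists every link sitting inside a span with class ‘posts’. It is only an illustration using the standard library's `html.parser`; the names `PostsLinkFinder` and `find_posts_links` are made up, the sample markup is pieced together from the snippets quoted in this topic, and the code does not restrict itself to the noscript tag, so feed it either the whole page source or just the noscript part.

```python
from html.parser import HTMLParser


class PostsLinkFinder(HTMLParser):
    """Collect the href of every <a> nested inside <span class="posts">."""

    def __init__(self):
        super().__init__()
        # One entry per currently open <span>: True if it has class "posts".
        self.span_stack = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span":
            classes = (attrs.get("class") or "").split()
            self.span_stack.append("posts" in classes)
        elif tag == "a" and any(self.span_stack) and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "span" and self.span_stack:
            self.span_stack.pop()


def find_posts_links(html):
    """Return the hrefs found inside spans with class "posts", in page order."""
    finder = PostsLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.links


if __name__ == "__main__":
    # Sample markup assembled from the snippets quoted in this topic.
    sample = (
        "<noscript><div>"
        "<span class='category'>[<a href='/c/installation'>installation</a>]</span>"
        "<span class='posts' title='posts'>("
        '<a href="/t/our-latest-experimental-branch-es6-modules-text-rendering/46815/5">5</a>'
        ")</span>"
        "</div></noscript>"
    )
    for link in find_posts_links(sample):
        print(link)
```

The category link is skipped and only the link ending in the post number is reported, which is the one the crawler ends up following.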


(Mittineague) #18

Thanks. The Latest pages topic list.

It does seem odd that some would be only topic links, others topic pages, and others topic posts.


(Spooky) #19

Can anyone from Discourse help with this one? Thanks.


(Neil Lalonde) #20

The search results look good to me. Are you asking that only the first post in a topic be crawled? Ignore all replies?