Sitemap.xml for Google Webmaster


(Glenn Drake) #1

Does anybody know how I can generate a sitemap.xml file that I can feed to Google Webmaster?

Would be great to have a text input (for the file name) and a button in the settings area, to loop all the posts and generate an xml file. Here’s one I conveniently prepared earlier.

We might be talking about plugin territory here, not sure?


(Anton Batenev) #2

If you have enable_escaped_fragments enabled, you can try this snippet:

# TODO: limit to 10485760 bytes
# TODO: limit to 50000 urls
# use index file http://www.sitemaps.org/ru/protocol.html#index

desc 'Generate topics sitemap to stdout'
task 'sitemap:full' => :environment do
  sitemap_header
  sitemap_topics
  sitemap_footer
end

def sitemap_header
  puts '<?xml version="1.0" encoding="UTF-8"?>'
  puts '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
end

def sitemap_footer
  puts '</urlset>'
end

def sitemap_topics
  posts_per_page = SiteSetting.posts_per_page

  topics = Topic.listable_topics.visible.secured
  topics.each { |topic|
    posts = topic.posts.by_post_number

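    index = 0
    count = posts.count
    last  = nil

    # Skip topics with no posts: posts[-1] would be nil and post.updated_at would raise.
    next if count.zero?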

    begin
      post_index = index + posts_per_page - 1
      if post_index >= count
        post_index = count - 1
      end

      post = posts[post_index]

      loc = topic.url
      if index > 0
        loc += "?page=#{index / posts_per_page + 1}"
      end

      last = post.updated_at.iso8601

      puts ' <url>'
      puts '  <loc>' + loc + '</loc>'
      puts '  <lastmod>' + last + '</lastmod>'
      puts ' </url>'

      index += posts_per_page
    end while index < count

    puts ' <url>'
    puts '  <loc>' + topic.url + '.rss</loc>'
    puts '  <lastmod>' + last + '</lastmod>'
    puts ' </url>'
  }
end

Put it in lib/tasks/sitemap.rake and run it as:

RAILS_ENV="production" bundle exec rake sitemap:full > public/sitemap.xml
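The TODO comments in the task mention a 50000-URL limit per file and a sitemap index. A minimal sketch of that part, in plain Ruby outside of Discourse, might look like the following. The `sitemap_N.xml` file naming and the `base_url` argument are assumptions made up for illustration, not something the task above produces.

```ruby
require 'time'

MAX_URLS_PER_SITEMAP = 50_000 # per-file URL limit named in the TODO above

# Splits a flat list of URLs into groups that each fit in one sitemap file.
def sitemap_chunks(urls, limit = MAX_URLS_PER_SITEMAP)
  urls.each_slice(limit).to_a
end

# Builds the XML for a sitemap index that points at the chunk files.
# ASSUMPTION: chunk files are published as <base_url>/sitemap_1.xml, sitemap_2.xml, ...
def sitemap_index(base_url, chunk_count, lastmod = Time.now.utc)
  lines = ['<?xml version="1.0" encoding="UTF-8"?>',
           '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
  (1..chunk_count).each do |n|
    lines << ' <sitemap>'
    lines << "  <loc>#{base_url}/sitemap_#{n}.xml</loc>"
    lines << "  <lastmod>#{lastmod.iso8601}</lastmod>"
    lines << ' </sitemap>'
  end
  lines << '</sitemapindex>'
  lines.join("\n")
end
```

Each chunk would be written to its own file with the same header and footer as the task uses, and the index file is what gets submitted to the search engine.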


(Glenn Drake) #3

Thanks, I will give this a go.

I’ve also realised that you can call the API directly from /latest.json?page=1. I could actually write an external application to paginate through the API and it seems that everything I need to build the sitemap.xml is returned in the JSON.


(Sam Saffron) #4

We are open to a PR that adds sitemap support to core provided it is cached for at least an hour and size is culled to something sane.
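To make the two conditions concrete, here is a rough sketch in plain Ruby of "cached for at least an hour" and "culled to something sane". It is not how core does or would do it; `TimedCache` and `cull` are hypothetical names, and inside Discourse one would presumably use the application's own cache instead of an in-memory object.

```ruby
# Rebuilds its value at most once per ttl_seconds; otherwise returns the cached copy.
# The clock is injectable so the behaviour can be checked without waiting.
class TimedCache
  def initialize(ttl_seconds, clock: -> { Time.now })
    @ttl = ttl_seconds
    @clock = clock
    @value = nil
    @built_at = nil
  end

  def fetch
    now = @clock.call
    if @built_at.nil? || now - @built_at >= @ttl
      @value = yield
      @built_at = now
    end
    @value
  end
end

# Keeps only the max_entries most recently modified entries, newest first.
# Each entry is a hash with a comparable :lastmod value (a Time, for example).
def cull(entries, max_entries)
  entries.sort_by { |e| e[:lastmod] }.last(max_entries).reverse
end
```

A request handler could then call something like `cache.fetch { build_sitemap }` with a `TimedCache.new(3600)`, where `build_sitemap` stands in for whatever generates the XML.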


(Attila Mihaly Balazs) #5

@sam - could you please give the link to the PR? I tried to search GH without luck and it’s a feature I’m interested in too and would like to follow the state of the PR.

Thanks.


(Jonathan Sandlund) #6

What service did you use for yours? I’m having a lot of trouble getting my discourse instance picked up by search engines. Hopefully this will help.


(Glenn Drake) #7

I built a simple .NET application that polls the following endpoint and iterates through the page numbers.

http://community.quickfile.co.uk/latest.json?page=0

I then deserialized the response using JSON.NET and pulled out the bits needed to generate the sitemap XML. As my forum is using a sub-domain I could get away with creating the sitemap.xml file in the root of my main application. After submitting the URI for the sitemap to Google Webmaster, within 2 days everything was indexed.
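For anyone who wants to try the same approach without .NET, a rough Ruby equivalent could look like this. It is a sketch, not the application described above. It assumes the JSON keeps its topics under `topic_list` and `topics`, with `id`, `slug` and `last_posted_at` on each topic, and that an empty page means there is nothing further to fetch; check both against your own instance.

```ruby
require 'json'
require 'net/http'
require 'uri'
require 'cgi'

# Turns one parsed /latest.json page into [loc, lastmod] pairs.
# ASSUMPTION: topics are under "topic_list" => "topics" and carry
# "id", "slug" and "last_posted_at".
def sitemap_entries(page, base_url)
  topics = page.fetch('topic_list', {}).fetch('topics', [])
  topics.map do |t|
    ["#{base_url}/t/#{t['slug']}/#{t['id']}", t['last_posted_at']]
  end
end

# Renders the collected pairs as a sitemap document.
def sitemap_xml(entries)
  urls = entries.map do |loc, lastmod|
    lines = [' <url>', "  <loc>#{CGI.escapeHTML(loc)}</loc>"]
    lines << "  <lastmod>#{lastmod}</lastmod>" if lastmod
    lines << ' </url>'
    lines.join("\n")
  end
  ['<?xml version="1.0" encoding="UTF-8"?>',
   '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
   *urls,
   '</urlset>'].join("\n")
end

# Fetches page 0, 1, 2, ... until a page has no topics.
# max_pages is a safety cap (hypothetical default) so the loop always ends.
def crawl_latest(base_url, max_pages: 1000)
  entries = []
  (0...max_pages).each do |n|
    body = Net::HTTP.get(URI("#{base_url}/latest.json?page=#{n}"))
    batch = sitemap_entries(JSON.parse(body), base_url)
    break if batch.empty?
    entries.concat(batch)
  end
  entries
end
```

Writing `sitemap_xml(crawl_latest(base_url))` to a file gives something that can be submitted the same way.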


(Kevin P. Fleming) #8

If generating a site-wide sitemap.xml is going to be required to get Discourse forum content indexed by search engines, that’s definitely a concern. I suppose it could be automated on the server where the Discourse instance lives, but it’s going to be an incredibly common thing to need.


(Jeff Atwood) #9

You really do not “need” a sitemap.xml file. It is a nice to have in some circumstances, but if the Google (and others) webspiders can’t crawl your forum properly by default, you have much deeper problems.

Answer from someone at Google:

The consensus seems to be that you can get new pages indexed slightly faster with a sitemap.xml file, but that’s about it.


#10

Hey guys,

in my view, sitemaps are an important piece for indexing any new site.

(recommended reading: Learn about sitemaps - Search Console Help)

I think it will help a lot with getting your new topics discovered on a daily basis, and not only by Google; think also of Bing or the next search engine. Also, there is a reason why you still need (and Google suggests) to add a sitemap in Webmaster Tools: it helps discover new links.

I’m doing a test right now:

One of my forums has been online since early this week; it has many new topics and still is not indexed (even though I’ve registered the new site in Google Webmaster Tools and Google Analytics).

Now I’ve added a sitemap (using the code provided here).

My guess is that tomorrow morning, it will be indexed, just because I submitted the sitemap.

Will let you know! :smile:


(Kane York) #11

In order to do a real experiment, you would need to do a side-by-side comparison of two nearly-identical sites; create (and submit) the sitemap for one and not the other.


(Jeff Atwood) #12

Read this statement very, very closely:

If Google can’t successfully crawl your site to find a link, but is able to find it in the sitemap it gives the sitemap link no weight and will not index it!

So to the extent that sitemap would be masking regular crawl errors, I don’t recommend that. Fix the default Google crawling first using regular HTML web pages and once you’ve done that, you can perhaps get a slight boost in new link indexing speed by using a sitemap.

It’s basically a micro-optimization, but sitemaps are in no way a substitute for making sure Google can properly and completely crawl your site naturally.


#13

Just to update on this:
My experiment did not work. Google only indexed 2 URLs, and they were not related to the sitemap that I uploaded.

I don’t know at the moment what the best way to get the forum indexed is; maybe I need to add links on other sites and wait for organic growth.

P.S.: This is the site that I’m indexing, and it has many inbound links (reddit, some from bing, etc.). http://cryptocurrenciestalk.com


(Glenn Drake) #14

Patience :smile:

It’s very easy to get all OCD on Google Webmaster


(Khoa Nguyen) #15

Hello. I’m new to Rails. Has anyone done this? :)


(Emma Fu) #16

The newest version of Discourse does not use the SiteSetting.posts_per_page parameter. Instead it uses TopicView.chunk_size.

Also, if you ever run “./launcher rebuild app”, the code will be gone.
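Since the page size moves between versions, one option is to keep the pagination arithmetic separate from wherever the number comes from. The helper below is a hypothetical sketch (`topic_page_urls` is not part of Discourse) that mirrors the URL scheme of the rake task earlier in the thread; the caller passes in whichever value applies, for example TopicView.chunk_size.

```ruby
# Returns the URL of every page of a topic, given how many posts fit on a page.
# Page 1 is the bare topic URL; later pages get ?page=N, as in the rake task.
def topic_page_urls(topic_url, post_count, chunk_size)
  return [] if post_count <= 0

  page_count = (post_count + chunk_size - 1) / chunk_size # ceiling division
  (1..page_count).map do |page|
    page == 1 ? topic_url : "#{topic_url}?page=#{page}"
  end
end
```

With 45 posts and a chunk size of 20 this gives three URLs: the topic URL itself, then `?page=2` and `?page=3`.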


(Emma Fu) #17

I submitted sitemap.xml 6 days ago… Google has only indexed 40% of the pages, even though I followed Google’s template format.

http://www.heartemma.com/sitemap.xml


(Khoa Nguyen) #18

That is very normal. You just have to wait.


(Pugwash) #19

I originally opened this post (different user). I haven’t submitted a sitemap in over a year and our Discourse community continues to get picked up on Google. I’ve seen new topics appear on Google within 24 hours, so even without a sitemap Discourse is still highly crawlable. Maintaining a current sitemap (at least in my experience) made zero difference.

Also new domains (which appears to be the case with you - no records on web.archive.org) can take months to get any traction on Google. Just keep ploughing in the content and eventually you’ll get indexed, sitemap or no sitemap.


(Anton) #20

How do I make some pages have less weight without a sitemap?
With sitemap, I’d use the <priority> tag.
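For reference, the sitemap route mentioned here could be sketched as below. This is an illustration only; `url_entry` is a hypothetical helper. In the sitemaps.org protocol, `<priority>` is a value from 0.0 to 1.0 (default 0.5) that ranks a URL relative to other URLs on the same site, so it is a hint, not a guarantee.

```ruby
require 'cgi'

# Builds one <url> entry. lastmod and priority are optional;
# priority must lie between 0.0 and 1.0 inclusive.
def url_entry(loc, lastmod: nil, priority: nil)
  if priority && !(0.0..1.0).cover?(priority)
    raise ArgumentError, 'priority must be between 0.0 and 1.0'
  end

  lines = [' <url>', "  <loc>#{CGI.escapeHTML(loc)}</loc>"]
  lines << "  <lastmod>#{lastmod}</lastmod>" if lastmod
  lines << format('  <priority>%.1f</priority>', priority) if priority
  lines << ' </url>'
  lines.join("\n")
end
```

A low-value page could then be emitted with something like `url_entry(url, priority: 0.1)`, while pages without an explicit priority simply omit the tag.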