Discourse Sitemap Plugin

(Daniel Marquard) #21

Sorry about that. With the fix added on your repo, I’ll just remove this one.


Is there an option to exclude some categories so they are not listed on the sitemap?

(Richard - DiscourseHosting.com) #23

Nope (except for making the categories non-public).
PRs are welcome!

(Tomek) #24

Hi, I have installed your plugin, but my https://forum.dobreprogramy.pl/sitemap.xml returns a 502 Bad Gateway error ;( I have 3m+ posts on my forum. How can I diagnose this?

PS https://forum.dobreprogramy.pl/newssitemap.xml works fine.

(Mittineague) #25

I’m fairly sure that for that many URLs you will need a sitemap index XML file pointing at multiple sitemap files.
Or limit the map to (I think it’s) 50K.
At least if you want all of them.
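For reference, the sitemaps.org protocol caps each sitemap file at 50,000 URLs, and larger sites publish a sitemap index that points at the individual files. It looks roughly like this (the URLs here are illustrative, not from any real forum):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://forum.example.com/sitemap_1.xml</loc>
    <lastmod>2017-01-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://forum.example.com/sitemap_2.xml</loc>
  </sitemap>
</sitemapindex>
```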

(Tomek) #26

Is it possible? I don’t see any option for this.

(Richard - DiscourseHosting.com) #27

See the last bullet in the first post:

> Ideas for improvements:
> Make it honor the 50,000 URL limit if there are that many topics

PRs are welcome! :slight_smile:

(Tomek) #28

OK, so with a large number of posts this plugin simply does not work yet ;(

BTW, with newssitemap.xml on the other hand I get errors with the language value:

(Jeff Atwood) #29

It sounds like there are severe performance problems with this plugin, in the way that it iterates across all posts constantly – it would be good to get a community PR to fix it up.

(Jeff Atwood) #30

@vinothkannans if you are available I would be happy to include this in the other work you are doing; having someone with a solid background in Discourse plugins take a look might help.

(Vinoth Kannan) #31

Sure, will do :+1:

As per the discussion above, I am going to implement the 50K URL limit on the sitemap file.

(Sam Saffron) #32

Technically, my objections here are:

  • There is no built-in extensibility to robots.txt generation; this needs to be added so the plugin does not overwrite the file completely (instead, the plugin should inject its instructions). Otherwise, each time we touch robots.txt the plugin needs to be updated.

  • The way it iterates using the object model is inefficient for so many records; instead this should be carefully plucking only the data it needs, or even possibly using raw SQL.

  • Nothing is cached, which is risky; instead … something should be cached. Perhaps the sitemap should be regenerated daily, or something along those lines.

  • A sitemap is most useful for gigantic sites, so it should use a sitemap index and multiple sitemap files (splitting the site into 50K chunks).
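The last point, splitting the URL list into 50K chunks under a sitemap index, can be sketched in plain Ruby. This is an illustration of the technique only, not the plugin's actual code; `build_sitemaps` and its parameters are hypothetical names:

```ruby
# Sketch: split a list of topic URLs into sitemap files of at most
# 50,000 URLs each (the sitemaps.org limit) and build a sitemap index
# that points at them. Illustrative only, not the plugin's real code.

SITEMAP_URL_LIMIT = 50_000

# Returns [index_xml, [sitemap_xml, sitemap_xml, ...]]
def build_sitemaps(base_url, urls, limit: SITEMAP_URL_LIMIT)
  chunks = urls.each_slice(limit).to_a

  # One <urlset> document per chunk of up to `limit` URLs.
  sitemaps = chunks.map do |chunk|
    entries = chunk.map { |u| "  <url><loc>#{u}</loc></url>" }.join("\n")
    <<~XML
      <?xml version="1.0" encoding="UTF-8"?>
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      #{entries}
      </urlset>
    XML
  end

  # The index lists the location of every generated sitemap file.
  index_entries = (1..sitemaps.size).map do |i|
    "  <sitemap><loc>#{base_url}/sitemap_#{i}.xml</loc></sitemap>"
  end.join("\n")

  index = <<~XML
    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    #{index_entries}
    </sitemapindex>
  XML

  [index, sitemaps]
end
```

In a real implementation the chunk contents would come from a cheap query (e.g. plucking topic IDs and slugs rather than loading full ActiveRecord objects), and the generated files would be cached and regenerated periodically rather than built per request, per the caching objection above.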

(Vinoth Kannan) #33

To fix this issue I created a simple PR in Discourse core.

(Mitchell Krog) #34

I only pointed Google at my site last week, with no sitemap given to Google since I am not using this plugin, and all 99 pages and category pages are indexed and appearing in search results.

(Tomek) #35

With big forums you need a sitemap for Google to index new posts/topics more quickly. Without a sitemap the crawler can get busy indexing old posts, and sometimes it’s difficult to get new ones indexed. A sitemap is also very helpful during migrations: I have just moved a forum with well over 3m posts, and having a sitemap would speed things up during reindexing.

(Jeff Atwood) #36

It may marginally speed it up, but I invite you to turn off JS and set your user agent to Google webcrawler and see how Discourse looks. :wink:

There is in fact a very comprehensive and simple-to-crawl sitemap: it’s called “turn off JS (or don’t) and set your browser’s user agent to the Google webcrawler”.

Again: do NOT take my word for it. I am a notorious liar. Try it yourself and verify.

(Sam Saffron) #37

I think the pain point is the back catalog on very busy, big sites.

Our pages are a moving target: page 1234 today may be page 1237 tomorrow. Google can have trouble working through a 10,000-page site to find all content, because it will not do so in one go. It only indexes some of the pages, and picking up where it left off risks content holes and content overlap.

So the biggest pain point is the time it takes to backfill a giant site into the search engine, because Google has to be inefficient about it. For regular updates, Google can quite easily just look at N pages on latest, provided there are no “no bump” edits being done (for example, back-catalog wiki edits).

(Tomek) #38

That is exactly the point I was trying to make. I think a sitemap is the only way to tell Google which content to index faster (new topics) and which topics do not change that often and do not need to be reindexed. We had this issue with IPB and it was solved with a proper sitemap.

(Mitchell Krog) #39

I’ve always been a believer in sitemaps, as it’s been drummed into our heads, and I do think there may be some merit to having one for a big forum. But this is the first site I’ve ever developed without giving Google a sitemap, and it gobbled up the entire site perfectly. I think their crawler has grown beyond needing sitemaps. It’s certainly something I will monitor as my yet-tiny site hopefully grows into a monster of its own. I have bookmarked this topic for future reference.

(Richard - DiscourseHosting.com) #40

We’re going to change the Sitemap plugin so it can leverage this server-side outlet. Thanks!

The reason why the sitemap plugin was built is that a ‘news sitemap’ is required for Google News.
While we were at it, we wrote the ‘normal’ sitemap code as well.