Discourse Sitemap Plugin

(Vinoth Kannan) #41

I am also making the changes to the Sitemap plugin and will create a new PR soon. Currently I am working on the 4 points recommended by @sam in the previous post.

5 Likes
(Sam Saffron) #42

The PR was merged and the new server outlet was added to beta and stable, so now all the code in the plugin that overwrites robots.txt in bulk can be amended without worrying about backwards compatibility.

5 Likes
(Vinoth Kannan) #43

@codinghorror I just finished all 4 changes from @sam’s post and created a PR.

5 Likes
(Tomek) #44

After recent updates my https://forum.dobreprogramy.pl/newssitemap.xml returns 404.

(Vinoth Kannan) #45

Sorry, I forgot to mention it. The news sitemap location has changed to /sitemap/news.xml.

So your URL should be https://forum.dobreprogramy.pl/sitemap/news.xml

3 Likes
(Tomek) #46

Great, this works - thanks.

BTW, the news sitemap returns an error in Google Webmaster Tools:

This XML tag has an invalid value.
Parent tag: publication
Tag: language
Value: pl_PL

(Vinoth Kannan) #47

https://support.google.com/news/publisher/answer/74288?hl=en

The <language> is the language of your publication. It should be an ISO 639 Language Code (either 2 or 3 letters). Exception: For Chinese, please use zh-cn for Simplified Chinese or zh-tw for Traditional Chinese.

So as per Google’s documentation, the language should be pl instead of pl_PL. But in Discourse we use language codes with a country suffix for many languages, so we have to remove the suffix during sitemap generation.

(cpradio) #48

Not quite. I just fixed this for RSS feeds; you simply need to replace the _ with - and it will be valid. Give me a sec, I’ll find my commit, which may help.

Here is the PR I did for RSS

And the conversation around that commit

and

2 Likes
(Vinoth Kannan) #49

But they say the language code should be 2 or 3 letters, so I guess pl-PL is also not valid for sitemaps.

Edit: Anyway I will try your suggestion too :thumbsup: thank you

(cpradio) #50

Does their webmaster tool let you copy and paste in a sitemap? If so, taking an existing one and putting pl-PL as the language could quickly tell you if it’ll validate.

Also, those exceptions seem like they would permit the use of the hyphenated approach, but I very well could be wrong.

(Mittineague) #51

That language code is two letters. The other two letters are the country code.

i.e. The language is Polish, the country is Poland

(cpradio) #52

FYI, https://www.google.com/schemas/sitemap-news/0.9/sitemap-news.xsd reports the language element should be:

Language of the publication. It should be an ISO 639 Language Code (either 2 or 3 letters); see: ISO 639-2 Language Code List - Codes for the representation of names of languages (Library of Congress) Exception: For Chinese, please use zh-cn for Simplified Chinese or zh-tw for Traditional Chinese. Required.

Which is the same ISO spec RSS supports, so I’d be really surprised if it doesn’t work using a hyphen.

(Mittineague) #53

I think the confusion is about what is meant by “exception”. Most are combinations of language-country. Chinese is more like language-variant.

This would be a bit analogous to russian-latin and russian-cyrillic or german-low and german-high

(Richard - DiscourseHosting.com) #54

That is not totally correct. You should read LL-CC as “Language (ISO 639) LL as it is spoken in Country CC”.
The language code is NOT just the first two letters, since the spelling can differ.

For instance, en-us: color and en-gb: colour. So en-us and en-gb are two different languages; the language is not defined by the first two letters alone.

You can’t always reduce things like pl-pl to pl, because for some European languages there is a difference (there is nl, nl-be and nl-nl: Dutch, Dutch as spoken in Belgium, Dutch as spoken in the Netherlands, and another example is pt, pt-br and pt-pt: Portuguese, Portuguese as spoken in Brazil and Portuguese as spoken in Portugal).

Now there is a difference between RSS and sitemaps.

RSS does NOT use bare ISO 639 codes. The list of supported codes for RSS is here: RSS Language Codes. You can see that there are many LL-CC type codes.

For sitemaps, like @vinothkannans said, it’s always two letters, and the only exceptions are zh-cn and zh-tw.

1 Like
(Mittineague) #55

lol, it seems I’m the one that’s confused :blush:

So unlike the RSS fix of replacing underscores with hyphens, a sitemap’s publication tag needs only the part of the code up to the underscore, except for Chinese.

A bit more logic but not that much more I guess.
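
A rough sketch of that logic, in Python purely for illustration. The plugin itself is Ruby, and the function names here are made up, so treat this as an assumption about the shape of the fix rather than the actual code:

```python
# Hypothetical helpers; not taken from the plugin or from Discourse core.

# The only region-qualified codes the news sitemap documentation allows.
CHINESE_EXCEPTIONS = {"zh-cn", "zh-tw"}


def rss_language(locale: str) -> str:
    """RSS accepts LL-CC codes, so only the separator changes."""
    return locale.replace("_", "-")


def news_sitemap_language(locale: str) -> str:
    """Return a value for the news sitemap <language> tag."""
    normalized = locale.replace("_", "-").lower()
    if normalized in CHINESE_EXCEPTIONS:
        # Chinese keeps its suffix: zh-cn (Simplified) or zh-tw (Traditional).
        return normalized
    # Everything else keeps only the language part before the separator.
    return normalized.split("-", 1)[0]
```

With this, pl_PL would become pl-PL for RSS and pl for the news sitemap, while zh_CN would become zh-cn in the sitemap.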

1 Like
(Vinoth Kannan) #56

While checking a sample news sitemap with the online validator below, it accepts neither the pl_PL nor the pl-PL language code. It only accepts the language code pl, without the country suffix.

http://tools.seochat.com/tools/site-validator/

Update: @RGJ created a PR to fix this issue.

7 Likes
(Richard - DiscourseHosting.com) #57

I’ve merged the PR, thank you!

1 Like
(Veer) #58

The generated sitemaps should be cached. My site has 1 million posts, and it gives a 502 error when the sitemap is regenerated again and again.
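
For illustration, the kind of caching meant here could look roughly like this. It is a generic Python sketch, not the plugin's actual code (which is Ruby); `SitemapCache`, `fetch`, the cache key, and the TTL are all assumptions:

```python
import time
from typing import Callable, Dict, Tuple


class SitemapCache:
    """Hypothetical in-memory cache that reuses generated XML for a fixed time."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        # key -> (time the entry was stored, generated XML)
        self._entries: Dict[str, Tuple[float, str]] = {}

    def fetch(self, key: str, build: Callable[[], str]) -> str:
        """Return the cached XML for key, rebuilding it only once it has expired."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        xml = build()  # the expensive step, e.g. querying a million posts
        self._entries[key] = (now, xml)
        return xml
```

Repeated requests within the TTL would then be served from memory instead of hitting the database each time.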

(Sam Saffron) #59

@vinothkannans can you investigate the issue reported by @veer? I thought you added caching.

(Vinoth Kannan) #60

It already has a cache. Is there any problem below?