Sitemap: `lastmod` for sitemaps wrong

In the main-sitemap the lastmod date for underlying sitemaps is wrong:

E.g. see https://meta.discourse.org/sitemap.xml
The dates for sitemap_2.xml to sitemap_5.xml are all the same, ‘2024-03-14T14:02:32Z’, which is exactly ‘3 days ago’.

<sitemapindex>
    <sitemap>
        <loc>https://meta.discourse.org/sitemap_recent.xml</loc>
        <lastmod>2024-03-17T14:02:29Z</lastmod>
    </sitemap>
    <sitemap>
        <loc>https://meta.discourse.org/sitemap_1.xml</loc>
        <lastmod>2024-03-17T14:02:29Z</lastmod>
    </sitemap>
    <sitemap>
        <loc>https://meta.discourse.org/sitemap_2.xml</loc>
        <lastmod>2024-03-14T14:02:32Z</lastmod>
    </sitemap>
    <sitemap>
        <loc>https://meta.discourse.org/sitemap_3.xml</loc>
        <lastmod>2024-03-14T14:02:32Z</lastmod>
    </sitemap>
    <sitemap>
        <loc>https://meta.discourse.org/sitemap_4.xml</loc>
        <lastmod>2024-03-14T14:02:32Z</lastmod>
    </sitemap>
    <sitemap>
        <loc>https://meta.discourse.org/sitemap_5.xml</loc>
        <lastmod>2024-03-14T14:02:32Z</lastmod>
    </sitemap>
</sitemapindex>

Technical issue:

Somehow 3.days.ago is used for sitemap_[2-5].xml as sitemap.last_posted_topic might not return a valid value.
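
To illustrate the suspected mechanism, here is a minimal plain-Ruby sketch, not the actual Discourse code. `fallback_lastmod` and `THREE_DAYS` are hypothetical names. The assumption is that a missing (nil) timestamp gets replaced by a "three days ago" default, which would explain why all affected chunks show the same date.

```ruby
require "time"

THREE_DAYS = 3 * 24 * 60 * 60 # seconds; stands in for Rails' 3.days

# Hypothetical helper: returns the given timestamp, or "three days before now"
# when no timestamp is available (nil).
def fallback_lastmod(last_posted_at, now: Time.now.utc)
  last_posted_at || (now - THREE_DAYS)
end

now = Time.utc(2024, 3, 17, 14, 2, 32)
puts fallback_lastmod(nil, now: now).iso8601                           # => 2024-03-14T14:02:32Z
puts fallback_lastmod(Time.utc(2024, 3, 17, 9, 0, 0), now: now).iso8601 # => 2024-03-17T09:00:00Z
```

With a nil input the result is always "now minus three days", regardless of which chunk is asked about.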

Another point: last_posted_topic should also take bumped_at into account.

Compare with how lastmod is set in the topic sitemaps sitemap_[1-5].xml themselves:

Untested pseudo-code:

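    def last_posted_topic
      # Newer of the latest bump and the latest update; compact drops nil
      # values so max does not raise when one of them is missing.
      [sitemap_topics.maximum(:bumped_at), sitemap_topics.maximum(:updated_at)].compact.max
    end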

I am worried that an optimisation here complicates stuff enormously for very little benefit.

Think it through …

Say there are 6 chunks on meta. If a topic from the last chunk is touched… the entire chunk becomes invalid: you have to remove the topic from there and put it in the front chunk.

Optimising here is a little pointless for a site that sees any kind of activity, and the dates inside the chunk on the actual topics are fine.

It’s not about moving topics into different sitemap-chunks. The topics can stay in the sitemap-chunk they are already in.
(The topic-to-sitemap-chunk mapping is arbitrary anyway, as the db select-statement with limit has no order defined.)

The bug report is that the lastmod date of each sitemap-chunk should represent the lastmod date of the latest topic that the sitemap-chunk contains.
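
A minimal plain-Ruby sketch of that rule, under stated assumptions: `Topic` is a hypothetical stand-in for a topic row and `newest_topic_lastmod` is an illustrative name, neither is taken from the Discourse code base. It considers both timestamps on every topic and tolerates missing values.

```ruby
require "time"

# Hypothetical stand-in for a topic row; only the two timestamps matter here.
Topic = Struct.new(:id, :bumped_at, :updated_at)

# Newest timestamp found on any topic in the chunk, or nil for an empty chunk.
def newest_topic_lastmod(topics)
  topics.flat_map { |t| [t.bumped_at, t.updated_at] }.compact.max
end

chunk = [
  Topic.new(1, Time.utc(2024, 3, 10), Time.utc(2024, 3, 12)),
  Topic.new(2, Time.utc(2024, 3, 19, 9, 17, 46), nil), # bumped, updated_at missing
]
puts newest_topic_lastmod(chunk).iso8601 # => 2024-03-19T09:17:46Z
p newest_topic_lastmod([])               # => nil
```

The nil result for an empty chunk is exactly the case where a caller would need to decide on a fallback date.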

The flow for Google should be:

  1. Load sitemap.xml
    → Check lastmod of sitemap-chunks and queue sitemap-chunks which need an update
    (lastmod date is newer than last time downloaded)

  2. Load queued sitemap-chunks sitemap_[1-5].xml
    → Check lastmod of topic-urls and queue topic-urls which need an update
    (lastmod date is newer than last time downloaded)

  3. Load queued topic-urls.

If in sitemap.xml the lastmod of the sitemap-chunks is wrong:
→ Google does not queue changed sitemap-chunks (step 1)
→ Google does not update changed sitemap-chunks in a timely manner (step 2)
→ Google does not update changed topics in a timely manner (step 3)
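
The queueing in step 1 can be sketched as follows. This models a hypothetical crawler, not Google's actual behaviour; `urls_to_refetch` and the last-download time of 2024-03-18 are assumptions made for illustration, while the lastmod values are the ones currently advertised for sitemap_1.xml and sitemap_2.xml.

```ruby
require "time"

# Returns the URLs whose advertised lastmod is newer than our last download
# (or that we have never downloaded).
def urls_to_refetch(advertised, last_downloaded)
  advertised.select do |url, lastmod|
    seen = last_downloaded[url]
    seen.nil? || Time.iso8601(lastmod) > Time.iso8601(seen)
  end.keys
end

index = {
  "sitemap_1.xml" => "2024-03-19T12:50:09Z",
  "sitemap_2.xml" => "2024-03-16T12:59:17Z", # stale: a topic inside changed on 2024-03-19
}
last_downloaded = {
  "sitemap_1.xml" => "2024-03-18T00:00:00Z",
  "sitemap_2.xml" => "2024-03-18T00:00:00Z",
}

p urls_to_refetch(index, last_downloaded) # => ["sitemap_1.xml"]
```

sitemap_2.xml is skipped even though it contains a topic that changed after the last download, so steps 2 and 3 never run for it.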

Right now https://meta.discourse.org/sitemap.xml looks like this:

  • https://meta.discourse.org/sitemap_1.xml
    lastmod: 2024-03-19T12:50:09Z
    All topics inside have older or same date? :github_check:

    • Latest topic: https://meta.discourse.org/t/creating-a-stickypost-for-forum-threads/299967
      lastmod: 2024-03-19T11:03:38Z
  • https://meta.discourse.org/sitemap_2.xml
    lastmod: 2024-03-16T12:59:17Z
    All topics inside have older or same date? :x:

    • Latest topic: https://meta.discourse.org/t/launcher-rebuild-app-error-bootstrap-failed-with-exit-code-125/299538
      lastmod: 2024-03-19T09:17:46Z
  • https://meta.discourse.org/sitemap_3.xml
    lastmod: 2024-03-16T12:59:17Z
    All topics inside have older or same date? :x:

    • Latest topic: https://meta.discourse.org/t/configure-direct-delivery-incoming-email-for-self-hosted-sites/49487
      lastmod: 2024-03-18T18:16:26Z
  • https://meta.discourse.org/sitemap_4.xml
    lastmod: 2024-03-16T12:59:17Z
    All topics inside have older or same date? :x:

    • Latest topic: https://meta.discourse.org/t/video-thumbnails-issue/263595
      lastmod: 2024-03-19T00:00:20Z
  • https://meta.discourse.org/sitemap_5.xml
    lastmod: 2024-03-16T12:59:17Z
    All topics inside have older or same date? :x:

    • Latest topic: https://meta.discourse.org/t/daily-summary-9pm-utc/291850
      lastmod: 2024-03-18T21:14:49Z
  • https://meta.discourse.org/sitemap_recent.xml
    lastmod: 2024-03-19T13:03:41Z
    All topics inside have older or same date? :github_check:

    • Latest topic: https://meta.discourse.org/t/daily-summary-1pm-utc/291852
      lastmod: 2024-03-19T13:02:07Z
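
The check behind the list above can be reproduced with a few lines of plain Ruby. The timestamps are copied from the listing; `CHUNKS` and `stale_chunks` are illustrative names, and fetching or parsing the XML is deliberately left out.

```ruby
require "time"

# For each chunk: [lastmod advertised in sitemap.xml, lastmod of its latest topic].
CHUNKS = {
  "sitemap_1.xml"      => ["2024-03-19T12:50:09Z", "2024-03-19T11:03:38Z"],
  "sitemap_2.xml"      => ["2024-03-16T12:59:17Z", "2024-03-19T09:17:46Z"],
  "sitemap_3.xml"      => ["2024-03-16T12:59:17Z", "2024-03-18T18:16:26Z"],
  "sitemap_4.xml"      => ["2024-03-16T12:59:17Z", "2024-03-19T00:00:20Z"],
  "sitemap_5.xml"      => ["2024-03-16T12:59:17Z", "2024-03-18T21:14:49Z"],
  "sitemap_recent.xml" => ["2024-03-19T13:03:41Z", "2024-03-19T13:02:07Z"],
}

# Chunks whose advertised lastmod is older than their latest topic.
def stale_chunks(chunks)
  chunks.select { |_, (index_lastmod, latest_topic)|
    Time.iso8601(index_lastmod) < Time.iso8601(latest_topic)
  }.keys
end

p stale_chunks(CHUNKS)
# => ["sitemap_2.xml", "sitemap_3.xml", "sitemap_4.xml", "sitemap_5.xml"]
```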

Again, this is not strictly true … lastmod is meant to be the last date the sitemap was modified, not the max date of its topics.

If a topic dropped out of the sitemap section today and the last modified date in the chunk is a week ago… the chunk changed today. A topic dropped out of it today.

This is totally true.

So the very same logic results in:
If a topic in the sitemap section changed today and last modified in the chunk is today… the chunk changed today [note: not 3 days ago]. A topic in it changed today.

For your and my example above the implementation right now says:
sitemap-chunks sitemap_[2-5].xml changed 3 days ago. This is wrong. It should say ‘changed today’.

Here is the bigger picture behind all this:

sitemap_recent.xml:

  • Only includes all the changed topics from the last 3 days
  • Is renewed every 1 h (Internal Rails cache time of 1 h)
  • Has correct lastmod date in sitemap.xml

sitemap_[1-5].xml:

  • Includes each and every topic, including all the changed topics from the last 3 days
  • Is renewed every 24 h (Internal Rails cache time of 24 h)
  • sitemap_[2-5].xml have the wrong lastmod date of 3.days.ago in sitemap.xml

The wrong lastmod date for sitemap_[2-5].xml does not matter, as Google will get all recent topic changes via sitemap_recent.xml in a timely manner.
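
A rough plain-Ruby sketch of why that works out, assuming the 3-day window of sitemap_recent.xml is measured back from the moment it is generated. `RECENT_WINDOW` and `in_recent_sitemap?` are hypothetical names, not Discourse code.

```ruby
require "time"

RECENT_WINDOW = 3 * 24 * 60 * 60 # seconds; stands in for Rails' 3.days

# True when a topic changed recently enough to fall inside the 3-day window.
def in_recent_sitemap?(topic_lastmod, generated_at)
  topic_lastmod >= generated_at - RECENT_WINDOW
end

generated_at = Time.iso8601("2024-03-19T13:03:41Z")

# Latest topic of sitemap_2.xml, changed after that chunk's advertised lastmod:
p in_recent_sitemap?(Time.iso8601("2024-03-19T09:17:46Z"), generated_at) # => true
# A topic untouched for more than three days:
p in_recent_sitemap?(Time.iso8601("2024-03-10T00:00:00Z"), generated_at) # => false
```

Any topic that changed after a chunk's 3.days.ago lastmod falls inside this window, so a crawler reading sitemap_recent.xml still sees it.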