Accidentally upgraded, now some settings are lost/broken

Full disclosure: I have not done any manual maintenance on our self-hosted Discourse install in a long, long time, and somebody else set it up originally.

I needed to change my SMTP credentials, because SendGrid is requiring a move away from basic auth to API keys.

I found this document: Recommended Email Providers for Discourse

Which says:

To change the current email service, run ./discourse-setup as well (this will bring the forum offline for a few minutes while it gets rebuilt).

I ran that command, answered the questions as expected, and this kicked off about 2,000 lines of output to stdout, ending with:

Upgrade Complete
----------------
Optimizer statistics are not transferred by pg_upgrade so,
once you start the new server, consider running:
    ./analyze_new_cluster.sh

Running this script will delete the old cluster's data files:
    ./delete_old_cluster.sh
-------------------------------------------------------------------------------------
UPGRADE OF POSTGRES COMPLETE

Old 10 database is stored at /shared/postgres_data_old

To complete the upgrade, rebuild again using:

./launcher rebuild app
-------------------------------------------------------------------------------------

cfd4df26701b4b4cd4a4202f30a9c8165a1ba609c921bffc25f250f52fee6cbe

Now, I was not expecting this to “upgrade” anything. I only wanted to change the SMTP credentials. But the site didn’t come back up automatically, so I did what the text says “to complete the upgrade” and ran:

./launcher rebuild app

This triggered another ~8,000 lines of output to stdout. The site eventually came back up, BUT it does not look the same:

  • My logo was missing, replaced with “Discourse” logo.
  • User avatar images were broken. Eventually, these just started working again.
  • Images in posts and category logos were broken. These are still not working. I looked for images with matching filenames in the ./discourse/share directory AND on our S3 bucket (where the URL is expecting to find them) but they do not exist.
  • Posts from my “support” category, which I had hidden from the “latest” page, are now visible on that page again.
  • My “support” category is no longer visible on the “categories” page.
  • The “categories” page now is 2 columns, with categories on the left and “latest” on the right. I think it was just a list of categories, before.
  • The colour of + in my + New Topic button has changed from white to grey.

At this point I suspect some configuration has been lost, but all my posts are intact. I also suspect that Discourse has upgraded itself (from what version I don’t know, probably many, many versions back), and that some underlying default settings, CSS, templates, etc., have changed, causing the problems listed above.

Viewing the source, I can see that the version is now 2.6.0.beta6 which was apparently released only 7 days ago.

So my questions are:

  • Is it normal to have to completely upgrade the software like this in order to apply a simple settings change, like SMTP credentials?
  • How can I change settings or apply security updates without upgrading the whole software?
  • Where are my images, or why has their URL changed in some way as to make them no longer accessible?
  • Is there any way to rollback, without losing new posts since all this happened? I don’t even know what version we were running before. I do have Discourse backups on S3 (just a gzipped SQL dump).
  • Do I just need to go through all the settings and customisation manually to fix things like the + New Topic colour, and hidden/visible “support” category?

I thought that we had been keeping EBS snapshots of our EC2 volume where Discourse is hosted, but that turned out to not be the case. I’ve since enabled snapshots so we can rollback that way in future if necessary.

Thanks.

Yes. It would have been possible to destroy and start the container to apply the new SMTP settings, but you’d need to look carefully to see how to do that.

No. The security updates are part of the whole software. You could run the stable branch, but it sounds like you were several versions behind, so at least most of your issues would still have happened.

That’s unclear. They aren’t usually lost in an upgrade.

No.

[quote=“mrmachine, post:1, topic:171465”]
Do I just need to go through all the settings and customisation manually to fix things like the + New Topic colour, and hidden/visible “support” category?
[/quote]

Yes. It sounds like you were running a version several years out of date. Dozens of programmers have spent thousands of hours on Discourse since then, and they have made changes that broke some of your customisations.

But you still won’t be able to roll back without losing posts since the last backup.

You should upgrade at every beta release, or switch to the stable branch at the next stable release.

Just following up on this. We had been storing uploaded images on S3, and had not enabled the option to download remote images (the one meant to avoid broken links). I’m not sure if this is relevant, but the point is that images should have been stored on S3 and referenced with S3 URLs in Discourse.

After the upgrade, many, many images were not showing: category logos, and uploads by users in posts. Looking in S3, the referenced filenames did not exist.

Luckily, we have S3 versioning enabled and I was able to see in the S3 console that MANY referenced images had been deleted during/after the upgrade.

I adapted a Python script I found (https://stackoverflow.com/a/54613767/2829685) to iterate over all object versions in our S3 bucket and remove “delete markers” from the current version of any file with a last_modified timestamp equal to or greater than the date of the upgrade.
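For anyone in the same situation, here is a minimal sketch of that kind of undelete script. It is not the script from the linked answer, just an illustration of the approach. It assumes boto3 is installed, AWS credentials are already configured, and versioning is enabled on the bucket. The function names, the bucket name and the cutoff date are placeholders I made up.

```python
from datetime import datetime, timezone


def markers_to_remove(delete_markers, cutoff):
    """Select delete markers that are the current version of a key
    and were created at or after the cutoff time."""
    return [
        (m["Key"], m["VersionId"])
        for m in delete_markers
        if m["IsLatest"] and m["LastModified"] >= cutoff
    ]


def undelete(bucket, cutoff, dry_run=True):
    """Remove matching delete markers so the previous version of each
    object becomes current again. Returns the number of markers matched."""
    import boto3  # third-party; imported here so the selection logic above has no dependencies

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_object_versions")
    matched = 0
    for page in paginator.paginate(Bucket=bucket):
        # Pages without any delete markers have no "DeleteMarkers" entry.
        for key, version_id in markers_to_remove(page.get("DeleteMarkers", []), cutoff):
            if not dry_run:
                # Deleting the delete marker itself is what "undeletes" the object.
                s3.delete_object(Bucket=bucket, Key=key, VersionId=version_id)
            matched += 1
    return matched


# Example call (placeholder bucket and date; cutoff must be timezone-aware,
# because boto3 returns timezone-aware LastModified values):
# undelete("my-discourse-uploads", datetime(2020, 1, 1, tzinfo=timezone.utc), dry_run=True)
```

Running it with dry_run=True first only counts what would be restored, which seems like a sensible check before touching tens of thousands of objects.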

This took most of the day to run and undeleted around 45 thousand images. It looks like Discourse creates many thumbnails for each image.

Now our category logo and user-uploaded images in posts are back. But I have no idea how all these images could have possibly been deleted in S3 as part of the upgrade process.

Seems like this could be a very dangerous bug in the Discourse upgrade process? Although maybe it has already been fixed, as I was upgrading from such an old version.