Hosted to self-hosted migration: past uploads still reference discourse infra

Checked: Images lost when migrating to self-hosting, posts:rebake does not do anything good.

Problem
We followed the official instructions and created a Lightsail instance. From there we downloaded the database from the Discourse UI and restored it, which got us about 80% of the way there. The idea was to transition to the self-hosted instance whilst keeping the previous variant alive.

Once we had a live copy of the old forum, we began transitioning the images. To do so, we first cancelled our subscription in order to receive and migrate our images.

As new images would be uploaded to the self-hosted instance, we would only need to upload from the hosted instance prior to the transition date. This means that we never used the database dump that came with our images and cancellation; as we’d already done the migration, it was now expired.

I observe these behaviors related to this point in time.

  1. Referenced resources in the backup (the SQL dump, specifically) point to Discourse infrastructure
  2. Referenced resources created since the backup, for example new posts’ images, are properly referenced and found on our infrastructure

Consequently, if I reupload a resource which evaluates to the same hash, it will link to Discourse infra. For example: trying to fix the favicon by uploading the same file does not work. I can however upload any other random image, and it will work.

Current state
As I understand it, the upload://<X> short URL is base62-decoded (into a sha1?) to map it to a file under public/uploads. We have every one of those images:

The dump we were provided by the Discourse team contains a zip with default/original/1X and it currently can be seen in /var/www/discourse/public/uploads/default/original/1X. The latter folder now contains 329 items, the given dump contained 249 items—that sounds good to me.

That means that the data should be discoverable, even if I cannot directly find the upload in the folder. I am looking to understand this relation, so that I can somehow fix the mapping. Initially it seemed like a simple string substitution, and that did work for some images. Some have now, however, been replaced by a transparent.png, where before it was just an inaccessible image…

If the rebake failed then you should try a remap to search/replace all references to the Discourse infrastructure and replace them with relative links.

Thank you Richard!

To clarify, by "Replace a string in all posts"

using

rake posts:remap["find","replace","string",true]

do you mean

rake posts:remap[
  "https://cdck-file-uploads-europe1.s3.dualstack.eu-west-1.amazonaws.com/standard21/uploads/everviz/",
  "/uploads/default/"
]

The alternative to a relative replacement would be "https://forum.everviz.com/uploads/default/".

Is the relative link what you’re thinking of?

e: correction of relative url with /

Oneliner:

rake posts:remap["https://cdck-file-uploads-europe1.s3.dualstack.eu-west-1.amazonaws.com/standard21/uploads/everviz/","/uploads/default/"]

looks good to me! you’ll want to add a slash in front

/uploads/default/

Did you check include all uploads while taking the backup from your hosted site? If you were hosted with CDCK, there used to be a hidden setting they need to enable before you can take a backup with all the uploads included. I’m not sure if that has changed now but you definitely want to coordinate with your hosting provider before making the move to ensure you’re taking a complete backup (and not just a dump of SQL)

My hosting provider was Discourse, we were on a monthly plan. The hosted user interface says to contact team@discourse.com to get uploaded files. Their response was that I need to cancel the subscription in order to get the files.

But yes, as mentioned I received uploads/original/1X

It’s a good tip, but I might’ve done it already:

root@...:/var/www/discourse$ rake posts:remap["//cdck-file-uploads-europe1.s3.dualstack.eu-west-1.amazonaws.com/standard21/uploads/everviz/","/uploads/default/"]
Are you sure you want to replace all string occurrences of '//cdck-file-uploads-europe1.s3.dualstack.eu-west-1.amazonaws.com/standard21/uploads/everviz/' with '/uploads/default/'? (Y/n)
Y
Remapping
0 posts remapped!

The links used to be https://europe1.discourse-cdn.com/standard21/uploads/everviz/ in the hosted forum. This is of course the same stuff, only gated through the CDN. Let’s try remapping.

1 post remapped.

I find this image to be curious:

Of course, this was before running any of today's commands and before posting here. It was sent to the Discourse team before I ran any rake tasks and such.

Did you do that? They have to turn on a hidden setting that will download images from S3 and include them in your backup. A normal backup does not include the images but just links to them on their S3 buckets. Canceling a subscription triggers that automatically, I think, but I have had clients who got the setting turned on simply by asking. You should either cancel your subscription or ask again.

If you don’t want to do it that way then you will need to write a script that will download the images from S3 and update the Discourse database accordingly.

I did cancel and received the files. It seems, though, that the original backup of the Discourse database references the path in S3. Essentially, I have everything I need in /var/www/discourse/uploads/original/1X.

I used a manually downloaded SQL dump to populate the instance, not the one provided with the files. I was concerned that maybe the latter provided corrected paths to images, which I have now verified to not be the case.

To demonstrate:


![](upload://3Qa5S9sUTcc42dT4EFAbz5K0iJP.gif) = 1aec065017da50538fe5866ae91a6396185234e1.gif

https://forum.everviz.com/uploads/default/original/1X/1aec065017da50538fe5866ae91a6396185234e1.gif

http://cdck-file-uploads-europe1.s3.dualstack.eu-west-1.amazonaws.com/standard21/uploads/everviz/original/1X/1aec065017da50538fe5866ae91a6396185234e1.gif

<img src="https://forum.everviz.com/images/transparent.png" alt="" data-orig-src="upload://3Qa5S9sUTcc42dT4EFAbz5K0iJP.gif" role="presentation" width="1" height="1" style="aspect-ratio: 1 / 1;" loading="lazy">

The above is a special case where the previous reference to cdck… is just transparent.png. Regardless, you can open the link and see that it exists.

So I would expect to have problems.

In what I assume is a raw post that you included, with the database included with the files, I would expect the ![](upload://3Qa5S9sUTcc42dT4EFAbz5K0iJP.gif) to refer to your local storage, but if someone explicitly pasted in a link to an image on their bucket, then something would need to be done to fix it. If the image existed and you have the download-to-local setting on, then the image from the bucket would get downloaded (given that it met the setting criteria).

I’m not quite sure how the last <img in your sample could have been generated.

Download to local is enabled.

For the linked file, the ‘official’ goodbye dump does not include relative paths.

<img src="https://europe1.discourse-cdn.com/standard21/uploads/everviz/original/1X/1aec065017da50538fe5866ae91a6396185234e1.gif" alt="" data-base62-sha1="3Qa5S9sUTcc42dT4EFAbz5K0iJP" ...

This exact file reference also points to cdck… in some places

It sounds a little bit insane to me, but I could do a backup now, then replace references to the Discourse infra with the local path in the dumpfile itself, and restore that.

The last file might reference transparent.png because I have recooked the post, and the source file was not discoverable anymore in Discourse infra. I don’t think we’re looking at a complete data loss.

If your site is live, then you would just go in and fix stuff in rails, to the degree that it’s possible.

But that <img is a cooked post, right? Not the raw post?

The <img is from the database dump. I presume cooked. The raw post references the b62 as upload://

The current cooked is:

<img src="https://forum.everviz.com/images/transparent.png" alt="" data-orig-src="upload://3Qa5S9sUTcc42dT4EFAbz5K0iJP.gif" ...

Thus far I haven’t been very successful with the rake tasks for missing_uploads, remapping, and rebaking posts.

Thank you Jay for all your help!

The file that is referenced in the cooked post works. There’s no problem with that.

If you are looking in the cooked posts in the database dump, then you are looking in the wrong place.

You have a live site now, so you need to work from there.

What do you see in the raw post? After a rebake of that post, what does the cooked post show that is not what you expect?

Without knowing exactly what you did, and what’s in the posts (raw and cooked), there isn’t much way to help. Since you started with a database that is expected not to have the right data in it, this topic isn’t going to be useful to others.

I did what everyone told me not to do: meddle with the database and its dumpfile. Currently, almost everything works, except for some cases of:

<img src="https://forum.everviz.com/images/transparent.png"
alt="image" data-orig-src="upload://npqpp5O0wbL89nR9OXtP7Btu4hc.png"
width="517" height="90" style="aspect-ratio: 517 / 90;" loading="lazy">
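To list every affected image rather than inspecting posts one by one, the cooked HTML can be scanned for this placeholder pattern. The following is a rough Python sketch that runs outside Discourse. The function name `unresolved_short_urls` is made up for illustration, and the pattern assumes that `src` appears before `data-orig-src`, as it does in the snippet above.

```python
import re
from typing import List

# Matches <img> tags whose src is the transparent.png placeholder and captures
# the upload:// short URL that is kept in data-orig-src.
# Assumption: src comes before data-orig-src, as in the cooked HTML shown above.
PLACEHOLDER_IMG = re.compile(
    r'<img\b[^>]*\bsrc="[^"]*/images/transparent\.png"[^>]*'
    r'\bdata-orig-src="upload://([^"]+)"',
    re.IGNORECASE,
)


def unresolved_short_urls(cooked_html: str) -> List[str]:
    """Return the '<base62>.<ext>' part of every placeholder image found."""
    return PLACEHOLDER_IMG.findall(cooked_html)
```

Feeding it the cooked column of each post would give a list of short URLs that still need a matching row in the uploads table.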

Let’s decode the b62 and take its hex:

npqpp5O0wbL89nR9OXtP7Btu4hc = 0x a411c90267cafca7a1cbcd7c8f4f9b8db17e51ba
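The conversion above can be reproduced with a few lines of standalone Python. This is a minimal sketch, not Discourse's own code: the function name `short_url_to_sha1` is hypothetical, and the alphabet order (digits, then lowercase, then uppercase) is an assumption that happens to reproduce both examples in this thread.

```python
# Assumed alphabet order: digits, lowercase, uppercase. It is chosen because it
# reproduces the two short URL / sha1 pairs quoted in this thread.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def short_url_to_sha1(short_url: str) -> str:
    """Turn 'upload://<base62>.<ext>' (or a bare base62 token) into a hex sha1."""
    prefix = "upload://"
    if short_url.startswith(prefix):
        short_url = short_url[len(prefix):]
    token = short_url.split(".", 1)[0]  # drop the file extension, if any

    value = 0
    for ch in token:
        # ALPHABET.index raises ValueError on characters outside the alphabet.
        value = value * 62 + ALPHABET.index(ch)

    # A sha1 is 160 bits, i.e. 40 hex digits; pad with leading zeros.
    return format(value, "040x")


print(short_url_to_sha1("upload://npqpp5O0wbL89nR9OXtP7Btu4hc.png"))
# a411c90267cafca7a1cbcd7c8f4f9b8db17e51ba
```

The resulting hex string, plus the original extension, is the filename to look for under public/uploads.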

Now try to find it from /var/www/discourse/public/uploads:

find . -name '*a411c90267cafca7a1cbcd7c8f4f9b8db17e51ba*'
./default/original/1X/a411c90267cafca7a1cbcd7c8f4f9b8db17e51ba.png

Yes!


But why is it transparent.png in the post? I’ve done rake uploads:recover_from_tombstone and rake posts:rebake


How did I get here?

The url column of the uploads table would still show cdck as part of the source URL for images. I dropped into the database from inside the container:

postgres psql discourse

Then

UPDATE uploads
SET url = REPLACE(
           url, 
           '//cdck-file-uploads-europe1.s3.dualstack.eu-west-1.amazonaws.com/standard21/uploads/everviz/', 
           '/uploads/default/'
         )
WHERE url LIKE '//cdck-file-uploads-europe1.s3.dualstack.eu-west-1.amazonaws.com/standard21/uploads/everviz/%';

This showed promising results, where most original images and thumbnails would reappear.

One step further: modifying dumpfiles

The assumption is that Discourse is stateless* and the only thing we need to care about is what is inside the database. I was not eager to fiddle around with rake tasks or ruby for this, as I am not very familiar with either of those or Discourse internals. I just want results, fast.

*short of the public folder which contains our images. We can nevertheless confirm that we have everything we need.

So we download a copy of the database from the UI, then open it in VSCode and regressively substitute cdck (bucket) references and europe1 (bucket from CDN) references.

By regressively I mean that in some instances you would see ‘//…’ and in other cases you would see ‘https://…’. With a relative replacement path, you therefore need to match and replace the ‘https://…’ form first; replacing ‘//…’ first leaves stray https: prefixes across the file.
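Both forms can also be handled in a single pass with a pattern whose scheme is optional, which removes the need for a second substitution. Below is a sketch in Python under the assumption that the two hosts and the /standard21/uploads/everviz/ prefix quoted in this thread are the only ones present. The names `OLD_PREFIXES`, `LOCAL_PREFIX` and `rewrite_upload_urls` are illustrative.

```python
import re

# Hosts and path prefix are the ones quoted in this thread; adjust for your site.
# The scheme (http: or https:) is optional so protocol-relative '//host/...'
# references are rewritten by the same pattern.
OLD_PREFIXES = re.compile(
    r"(?:https?:)?//"
    r"(?:cdck-file-uploads-europe1\.s3\.dualstack\.eu-west-1\.amazonaws\.com"
    r"|europe1\.discourse-cdn\.com)"
    r"/standard21/uploads/everviz/"
)
LOCAL_PREFIX = "/uploads/default/"


def rewrite_upload_urls(text: str) -> str:
    """Replace bucket and CDN upload prefixes with the local relative path."""
    return OLD_PREFIXES.sub(LOCAL_PREFIX, text)
```

Running this over a copy of the dumpfile and diffing the result before restoring it is a cheap way to check that nothing unrelated was touched.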

Then reupload the modified dumpfile. Part of what made this all tricky is the base62 step, which makes it a little harder to go from a raw representation to the actual image URL.

Task complete

After double checking the size of the uploads table, I noticed that we were missing some hundred entries. I don’t know at which step they went missing. I merged them back in from an earlier database backup with a basic SQL join against a temporary table.
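The merge could look roughly like the sketch below. It uses Python's sqlite3 module as a stand-in so that it runs anywhere, whereas the real database is PostgreSQL. The two-column schema, the `old_uploads` temporary table name and the function name are assumptions made for illustration; the real uploads table has more columns that would need to be carried over as well.

```python
import sqlite3


def restore_missing_uploads(conn: sqlite3.Connection) -> int:
    """Copy rows from old_uploads that have no sha1 match in uploads.

    Assumes both tables exist with (sha1, url) columns, which is a simplified,
    hypothetical schema. Returns the number of rows added.
    """
    count_sql = "SELECT COUNT(*) FROM uploads"
    before = conn.execute(count_sql).fetchone()[0]

    # Anti-join: keep only old rows for which the LEFT JOIN found no partner.
    conn.execute(
        """
        INSERT INTO uploads (sha1, url)
        SELECT o.sha1, o.url
        FROM old_uploads AS o
        LEFT JOIN uploads AS u ON u.sha1 = o.sha1
        WHERE u.sha1 IS NULL
        """
    )
    conn.commit()

    return conn.execute(count_sql).fetchone()[0] - before
```

Matching on sha1 rather than on id is a design choice here: ids may differ between the two backups, while the hash identifies the file itself.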

As I might’ve referenced above, the URL that is requested for an image is that which is stored in the uploads table, url column. From the rails console I remapped these CDN references to our local domain with SQL over the uploads table.

Why not use the rake task

There are probably a few that are OK, and some composition of them would work. However, when you can observe the current behavior, you know what you want, and you know how to get there, I find the limitation to be arbitrary.

I want to thank the Discourse team and the volunteers here, who have all given me the pieces of information I needed to discover the solution, which did end up consisting of several steps.
