Cleaning up uploads and purging uploads from S3

:bookmark: This is a reference guide describing how orphaned and deleted uploads are automatically purged from a Discourse site. This guide applies to both self-hosted and hosted Discourse sites.

:person_raising_hand: Required user level: Administrator

Have you ever wondered what happens to files and images that were uploaded to a Discourse site but are no longer referenced, or how to remove uploads from a site? You’re in the right place!

You may need to delete files and images that were uploaded to Discourse but are no longer referenced. There isn’t a built-in way to do this from the user interface. However, Discourse has an automatic Sidekiq job, called clean up uploads, that is scheduled to remove orphaned and deleted uploads.

Orphaned and deleted uploads

:information_source: Orphaned uploads are files that have been uploaded to a Discourse site but are no longer referenced. An upload is considered orphaned if and only if it’s not referenced:

  • In the latest version of a post
  • In a draft
  • In a queued post
  • In a logo site setting
  • In a custom emoji
  • In a theme
  • In a user avatar/background/card image
  • In a category logo/background image
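
The orphan rule above can be sketched as a simple predicate. This is only an illustration of the definition, not how Discourse implements the check: the `Upload` class, the `is_orphaned` function, and the labels in `REFERENCE_SOURCES` are hypothetical names chosen to mirror the list.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the places listed above where an upload can be
# referenced. These are illustrative and are not Discourse identifiers.
REFERENCE_SOURCES = frozenset({
    "latest_post_version",
    "draft",
    "queued_post",
    "logo_site_setting",
    "custom_emoji",
    "theme",
    "user_image",      # avatar, background, or card image
    "category_image",  # category logo or background image
})


@dataclass
class Upload:
    """Hypothetical stand-in for an upload and the places that reference it."""
    upload_id: int
    referenced_in: set = field(default_factory=set)


def is_orphaned(upload: Upload) -> bool:
    """Return True only if none of the listed sources reference the upload."""
    unknown = upload.referenced_in - REFERENCE_SOURCES
    if unknown:
        raise ValueError(f"unknown reference sources: {sorted(unknown)}")
    return not upload.referenced_in
```

For example, `is_orphaned(Upload(1))` is true, while `is_orphaned(Upload(2, {"draft"}))` is false because a draft still references the upload.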

:information_source: Uploads are considered “deleted” when the topic/post they are contained within is deleted.

Cleaning up uploads

To fully remove an upload from Discourse, you’ll have to do one of the following:

  • Force the upload to become orphaned by removing any reference to the upload. This can be done by editing the upload link out of the post that it’s in, or any other places the upload may be referenced.
  • Delete any and all topics/posts containing the upload, causing the upload to be considered “deleted”. Note that deleting a post does not orphan its uploads: to orphan an image, you need to remove it from the post before deleting the post.

All orphaned and deleted uploads will then be removed from storage (after a grace period) once the clean up uploads job runs.

Site settings

The following site settings are available at example.discourse.com/admin/site_settings/category/files for modifying how Discourse automatically purges uploads.

  • clean up uploads: default true
  • clean orphan uploads grace period hours: default 48
  • purge deleted uploads grace period days: default 30

The clean up uploads setting enables or disables the automatic deletion of orphaned uploads. The clean orphan uploads grace period hours and purge deleted uploads grace period days settings control how long Discourse waits after an upload is detected as orphaned or deleted, respectively, before purging it and permanently removing it from the site.
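
To make the grace periods concrete, here is a minimal sketch of the timing arithmetic using the default values listed above. The function names, the `SETTINGS` dictionary, and the `detected_at` timestamp are assumptions made for illustration; which timestamp the real job measures from is defined in clean_up_uploads.rb and may differ.

```python
from datetime import datetime, timedelta, timezone

# Default values from the settings list above. The underscore-style keys
# are an assumption made for this sketch.
SETTINGS = {
    "clean_up_uploads": True,
    "clean_orphan_uploads_grace_period_hours": 48,
    "purge_deleted_uploads_grace_period_days": 30,
}


def grace_period(kind: str, settings: dict = SETTINGS) -> timedelta:
    """Return the grace period for an 'orphaned' or 'deleted' upload."""
    if kind == "orphaned":
        return timedelta(hours=settings["clean_orphan_uploads_grace_period_hours"])
    if kind == "deleted":
        return timedelta(days=settings["purge_deleted_uploads_grace_period_days"])
    raise ValueError(f"unknown upload kind: {kind!r}")


def is_purgeable(kind: str, detected_at: datetime, now: datetime,
                 settings: dict = SETTINGS) -> bool:
    """Return True once the grace period has fully elapsed.

    Assumption: the clock starts at `detected_at`, and the clean up uploads
    toggle only gates orphaned uploads, as the text above describes.
    """
    if kind == "orphaned" and not settings["clean_up_uploads"]:
        return False
    return now - detected_at >= grace_period(kind, settings)


detected = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(is_purgeable("orphaned", detected, detected + timedelta(hours=47)))
print(is_purgeable("orphaned", detected, detected + timedelta(hours=48)))
```

With the defaults, an orphaned upload in this model becomes eligible 48 hours after detection and a deleted upload 30 days after detection.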

Additional details about the clean up uploads job are available in the clean_up_uploads.rb file on GitHub.

Purging S3 uploads

:warning: The following section is only applicable to self-hosted Discourse sites.

:information_source: If you are currently hosted on our Enterprise Plan, please reach out to team@discourse.org if you have any questions about deleting uploads from your S3 storage.

Cleaning up orphaned and deleted uploads works similarly for both local and S3 storage. The only difference is that, on S3, the cleanup is handled automatically by S3 via a tombstone policy. See Managing your storage lifecycle for additional details about how this is handled on S3.

By default, the clean up uploads job includes S3 uploads. However, if you would like to disable this feature, you can uncheck the s3 configure tombstone policy site setting.

Last edited by @hugh 2024-07-26T01:29:25Z


Is this accurate? I think you need to remove the image from the post before deleting the post to orphan it.

Also, should this last one be purge deleted uploads grace period days?


Enabling “clean up uploads” sounds scary given the warning message. When an existing forum is converted to Discourse, this setting will be disabled. Not all import scripts properly register all the uploads in posts, so if you enable it, you might lose a lot of attachments.

With the following query you can check if uploads are properly referenced by the posts:

select p.post_id, u.id as upload_id
from (select id post_id, (regexp_matches(cooked, 'data-download-href=[^\s]+/default/([a-z0-9]+)', 'g'))[1] upload_sha from posts where raw like '%upload://%' order by created_at) as p 
join uploads u on u.sha1 = p.upload_sha
where not exists(select * from upload_references r where r.upload_id = u.id)

That should not return any rows if everything is correct. If you use this query in the Data Explorer plugin it will also neatly link to the posts which have unreferenced attachments.

If the above query does return results you can fix the missing upload references with the following query:

insert into upload_references(upload_id, target_type, target_id, created_at, updated_at)
select u.id, 'Post', p.post_id, u.created_at, u.updated_at 
from (select id post_id, (regexp_matches(cooked, 'data-download-href=[^\s]+/default/([a-z0-9]+)', 'g'))[1] upload_sha from posts where raw like '%upload://%' order by created_at) as p 
join uploads u on u.sha1 = p.upload_sha
on conflict do nothing;

You will need direct database access in order to make the corrective change.
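
To see what the regular expression in both queries extracts, here is a small Python sketch that applies the same pattern to a made-up post. The `sample_raw` and `sample_cooked` strings and the SHA1 value are hypothetical, and real cooked HTML may be laid out differently, so treat this as a way to test the pattern rather than a description of Discourse’s markup.

```python
import re

# Same pattern as in the SQL queries above: capture the hash that follows
# "/default/" inside a data-download-href attribute.
UPLOAD_SHA_PATTERN = re.compile(r"data-download-href=[^\s]+/default/([a-z0-9]+)")


def referenced_shas(raw: str, cooked: str) -> list:
    """Mimic the subquery: only posts whose raw text contains 'upload://'
    are scanned, and every captured hash in the cooked HTML is returned."""
    if "upload://" not in raw:
        return []
    return UPLOAD_SHA_PATTERN.findall(cooked)


# Hypothetical post content, invented for this example.
sample_raw = "![diagram|690x388](upload://hypothetical.png)"
sample_cooked = (
    '<a class="lightbox" href="/uploads/default/original/1X/example.png" '
    'data-download-href="/uploads/default/0123456789abcdef0123456789abcdef01234567" '
    'title="diagram.png">'
)

print(referenced_shas(sample_raw, sample_cooked))
# prints ['0123456789abcdef0123456789abcdef01234567']
```

The queries then join each captured hash against uploads.sha1, so any hash that finds an upload without a matching row in upload_references indicates a missing reference.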