Auto purge uploads from old deleted posts

Discourse already removes orphaned (unreferenced) uploads automatically. Why not expand this functionality to erase uploads from deleted posts as well? Only staff members can see deleted posts, and that is very useful. But is it really necessary to keep all the files indefinitely? As an administrator I really don't care about 2-year-old deleted pictures, some of which may even be outright against site guidelines.

It would be nice if Discourse automatically deleted these old files and put some kind of text block in place of each deleted file, so that it would be evident that a file deletion has taken place.

For example, all uploads older than a year that are referenced only in deleted posts would be automatically deleted.
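To make that rule concrete, here is a minimal PHP sketch of the selection logic. The row layout (keys `url`, `created_at`, `post_deleted_at`) and the function name `purgeCandidates` are made up for this example; they are not Discourse's schema or API. An upload counts as a candidate only if it is older than the cutoff and every post referencing it has been deleted.

```php
<?php
// Hypothetical input: one row per (upload, referencing post).
// 'post_deleted_at' is null while the referencing post is still live.
function purgeCandidates(array $rows, DateTimeImmutable $now, int $minAgeDays = 365): array
{
    $cutoff = $now->modify("-{$minAgeDays} days");
    $state = []; // url => true while the upload is still a candidate

    foreach ($rows as $row) {
        $url = $row['url'];
        $oldEnough = new DateTimeImmutable($row['created_at']) <= $cutoff;
        $postDeleted = $row['post_deleted_at'] !== null;
        // One live reference or a too-recent upload disqualifies the file.
        $state[$url] = ($state[$url] ?? true) && $oldEnough && $postDeleted;
    }

    // Keep only the URLs whose flag is still true.
    return array_keys(array_filter($state));
}

$rows = [
    ['url' => '/uploads/a.png', 'created_at' => '2022-01-01', 'post_deleted_at' => '2022-02-01'],
    ['url' => '/uploads/b.png', 'created_at' => '2022-01-01', 'post_deleted_at' => null],
];
// Only /uploads/a.png qualifies: b.png is still used by a live post.
print_r(purgeCandidates($rows, new DateTimeImmutable('2024-01-01')));
```

Avatars, profile banners and other non-post references are ignored here; a real implementation would have to treat any of those as a live reference too.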

I would really like this as well.

It seems like this feature is somewhat implemented:

Settings I use:
clean orphan uploads grace period hours: 1
purge deleted uploads grace period days: 1

However, a "deleted upload" does not cover the case of an image contained in a post that has been deleted. I believe the image/upload has to be edited out of the post before the post is deleted.

I can confirm that images won’t be deleted if the (only) post containing that specific image has been deleted: I have images that still exist in the database and on S3 from a post that was deleted in 2023 (with the image not being used in any other posts). It has never deleted them in earlier cases either.

So if a mod deletes a post due to having an uploaded image that is against the rules, to really delete it they need to edit it out of the topic/post first (and hope it doesn’t exist in any other posts). Otherwise it will exist on S3 indefinitely, at least from my understanding.

Some features that would be really great:

  • purge deleted uploads grace period days - Either have this setting include the case of an image being contained inside a deleted post, or add another setting for that case.

  • purge deleted uploads grace period days - Use hours instead of days. Copyright removal requests generally need extremely prompt action, within 24-48 hours, and 1 day is way too slow for that. Any CDN cache likely also needs to be purged manually after the deletion, making the timeline even tighter.

  • Being able to delete/purge an image from the dashboard. This would be less necessary if purge deleted uploads included images inside deleted posts, but there are still cases like the image being used as an avatar or profile banner, and it would also be more efficient for mods. Feature suggestion: Image removal/purge via web dashboard

  • Make image URLs searchable. This would let a moderator find all topics/posts that contain a specific image in order to delete those posts too, without needing to use SSH.

  • Ability to ban certain hashes from being uploaded would be a nice touch.

It would be nice because then processes like this could be handled by someone without SSH access and technical skills, especially given how quickly these need to be handled. It’s prohibitively expensive to have technical staff ready 24/7 to handle any case like this that comes up, including on holidays, on weekends, when someone is sick, etc. You can’t predict when one will occur, so you always need to be ready to handle it promptly. It’s an inescapable attribute of UGC.

I wonder why this isn't more widely discussed. This is a really big problem.

I have written a PHP script that uses a CSV file generated from the following SQL query, which lists all uploads and their references
(increase the limit if you have lots of uploads):

SELECT 
    uploads.original_filename,
    ROUND(uploads.filesize / 1000000.0, 2) AS size_in_mb,
    uploads.extension,
    uploads.created_at,
    uploads.url,
    upload_references.upload_id,
    upload_references.target_id,
    upload_references.target_type,
    upload_references.created_at,
    upload_references.updated_at
FROM upload_references
JOIN uploads ON uploads.id = upload_references.upload_id
ORDER BY upload_references.target_type
LIMIT 90000

The script filters for uploads that are referenced only by drafts (references that wrongly remain in the database, as I explained here). It outputs a space-separated string of all the file names. You can also modify the script to output the full path (remove the basename() call).

Then log in to your Discourse server over SSH and run the rm command on all the files.

  • One downside is that all images that remain in active drafts will be deleted too (but this can be limited by lowering delete drafts older than n days).
  • A second downside is that the incorrect database entries still remain; for that I should ask the devs for a fix.

If the incorrect entries are deleted, the issue should be properly fixed.

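<?php

// Read the CSV export of the SQL query above into memory.
$rows = array();
if (($open = fopen("test.csv", "r")) !== false) {
    while (($data = fgetcsv($open, 100000, ",")) !== false) {
        // Skip blank or short lines that lack the expected columns.
        if (count($data) >= 8) $rows[] = $data;
    }
    fclose($open);
}

// Column 4 is uploads.url, column 7 is upload_references.target_type.
$final = array();
foreach ($rows as $item) {
    if ($item[7] == "Draft") {
        $otherReferences = 0;
        foreach ($rows as $item_inside) {
            // Count references to the same upload that are not drafts.
            if (($item_inside[4] == $item[4]) && ($item_inside[7] != "Draft")) $otherReferences++;
        }
        // Only drafts reference this upload, so it can go on the list.
        if ($otherReferences == 0) array_push($final, $item[4]);
    }
}
$final_unique = array_unique($final);
//print_r($final_unique);

// Print the file names separated by spaces, ready to pass to rm.
foreach ($final_unique as $single) {
    echo basename($single) . " ";
}
?>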

The file test.csv containing the query results should be placed in the same directory as the script.
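For the rm step, pasting a raw space-separated list works as long as no name contains spaces or starts with a dash. A slightly more defensive variant could build the whole command with quoting, roughly like the sketch below. The function name `buildRmCommand` is hypothetical, and the sketch assumes you review the printed command before running it in the uploads directory.

```php
<?php
// Builds a single "rm" command line from a list of upload file names.
// Assumption: $paths is the list produced by the filter script.
function buildRmCommand(array $paths): string
{
    // Quote every name so spaces or shell metacharacters stay literal.
    $quoted = array_map('escapeshellarg', array_unique($paths));
    // "--" stops rm from treating a name that starts with "-" as an option.
    return $quoted ? 'rm -- ' . implode(' ', $quoted) : '';
}

// Print the command for review instead of executing it directly.
echo buildRmCommand(['abc123.png', 'my file.jpg']), PHP_EOL;
```

Printing the command rather than calling exec() keeps a manual check in the loop, which seems sensible for something that deletes files.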
If you have any problems, ask me!
