Modifying the database inside of a backup to remove a duplicate key tag so that it doesn't fail when restoring

Drew-ART · August 22, 2023, 8:01pm

andy@ubuntu-s-1vcpu-1gb-ams3-01:/var/discourse/shared/standalone/backups/default$ file public-happiness-2023-07-25-033857-v20220628031850.tar.gz
public-happiness-2023-07-25-033857-v20220628031850.tar.gz: gzip compressed data, was "public-happiness-2023-07-25-033857-v20220628031850.tar", last modified: Tue Aug  8 14:53:40 2023, max speed, from FAT filesystem (MS-DOS, OS/2, NT)

The full file contents looks fine, though its difficult to say as it’s a lot of information: Hastebin

pfaffman · August 22, 2023, 8:09pm

I think it has to do with the file paths. Have a look at

github.com

discourse/discourse/blob/main/lib/backup_restore/backuper.rb#L218-L238


      
          
          tar_filename = "#{@archive_basename}.tar"
          
          log "Making sure archive does not already exist..."
          Discourse::Utils.execute_command("rm", "-f", tar_filename)
          Discourse::Utils.execute_command("rm", "-f", "#{tar_filename}.gz")
          
          log "Creating empty archive..."
          Discourse::Utils.execute_command(
            "tar",
            "--create",
            "--file",
            tar_filename,
            "--files-from",
            "/dev/null",
          )
          
          log "Archiving data dump..."
          Discourse::Utils.execute_command(
            "tar",
            "--append",

Hmm. That might not be the issue. I mostly don’t trust 7zip to create a compatible tar file, but that may be irrational.

  -h, --dereference
              Follow symlinks; archive and dump the files they point to.

The answer might be in the file above. Actually, it’s probably in another file in the same directory.

Ed_S · August 22, 2023, 9:11pm

Thanks - right contents, but under the wrong names, I think. Hence the problem.
You have

-rwxrwxrwx 0/0        26927534 2023-08-08 14:37 public-happiness-2023-07-25-033857-v20220628031850/dump.sql.gz

but the original and any unmodified backup would have

-rwxrwxrwx 0/0        26927534 2023-08-08 14:37 dump.sql.gz

Edit: so you need to drive 7zip just a little differently in building that tar.gz file.

Drew-ART · August 25, 2023, 4:31pm

Thanks for all your help. I’ve unzipped the files, edited the duplicate tag again and then very carefully rezipped paying extra careful attention to the file name, and there is progress!

Now when restoring i see this error message, which appears to be far more common:

[2023-08-25 15:25:21] CREATE INDEX
[2023-08-25 15:25:21] ERROR:  could not create unique index "index_tags_on_lower_name"
[2023-08-25 15:25:21] DETAIL:  Key (lower(name::text))=(socialmedia) is duplicated.
[2023-08-25 15:25:21] EXCEPTION: psql failed: DETAIL:  Key (lower(name::text))=(socialmedia) is duplicated.

I’d guess that means that i have changed the tag successfully, but there are still some occurrences of the tag in posts in my database. The tag_id number says there should be a tag called socialmedia but instead its finding a tag called socialmedia2 which is causing a conflict.

this post and this one discuss fixes, but as i only have access to my backup via directly editing the code on my local machine i’m not able to use the mysql tools to help clean it up.

Fortunately in my database i’ve only got 38 instances of 'socialmedia' (as oppose to 50,000+ socialmedia occurrences). Assuming I was correct changing the one on line 395421 as i screenshotted above, then I can’t see how to tell which remaining ones are linked to the ‘socialmedia’ tag, and which to the tag i’ve edited to ‘socialmedia2’.

Here’s an example of a fairly short post using the socialmedia tag

9488	'/groups/communitybuilders':86 '/groups/socialmedia':84 '/groups/webdev':89 '1st':117 '2022':131 '6':125 'activ':61 'banner':113 'btw':143 'close':169 'comment':21 'communiti':47 'communitybuild':87 'concept':4A 'especi':28 'event':119 'excit':164 'feedback':8B 'final':166 'get':38,133 'github':94 'grow':6A,142 'hack':127 'hard':156 'help':96 'homepag':151 'host':124 'improv':11B 'join':71,106 'launch':41,118,126 'like':128 'link':110 'live':140,175 'lot':27 'love':1A,67 'marvelxi':152 'mean':25 'media':51 'member':62 'mention':93 'move':45 'much':15 'new':150 'one':72,107 'onto':53 'plan':121 'platform':7B,43,139 'pleas':5A 'project':137 'promot':97 're':33,36,56,161 'readi':39,172 'rhorho358':23 'right':63 'see':100,167 'site':176 'slight':76,177,179 'small':58 'smile':77,178,180 'social':50 'socialmedia':85 'stage':31 'suggest':10B 'sure':79 'take':17 'team':59,75,103 'thank':12 'think':147 'time':19 'use':108 'webdev':90 'websit':3A 'whether':80 'work':155 'would':66,82	Thank you so much for taking the time to comment here @R , it means a lot, especially in the st... has been working hard on it and we’re all very excited to finally see it close to being ready on the live site :slight_smile: :slight_smile:	en_GB	4	f

I might be on the wrong track here though as that does look like more tags at the start than a user is likely to use in a post. Also possible that ‘socialmedia’ isn’t a tag used in the above post, although it should have been.

pfaffman · August 26, 2023, 1:49am

I would restore the database by hand and try to add the indexes and fix up the issues Wyeth the database rather than a text file, but that’s hard too.

merefield · August 26, 2023, 6:02am

I don’t think that’s the immediate conclusion you can/should be drawing. The issue should be straightforward:

The reason that index won’t build is you have at least two entries in tags table which resolve to the same thing when you take the lower case of their name. That’s what the error message is telling you.

So I think you still need to find the related entries in that single table that clash when going through that transform.

github.com

discourse/discourse/blob/1209efb63c47c32b119ea130e4c7d411a836e3e4/app/models/tag.rb#L274


      
          #  created_at         :datetime         not null
          #  updated_at         :datetime         not null
          #  pm_topic_count     :integer          default(0), not null
          #  target_tag_id      :integer
          #  description        :string
          #  public_topic_count :integer          default(0), not null
          #  staff_topic_count  :integer          default(0), not null
          #
          # Indexes
          #
          #  index_tags_on_lower_name  (lower((name)::text)) UNIQUE
          #  index_tags_on_name        (name) UNIQUE
          #

Also, Posts aren’t tagged, Topics are.

Before you delete the duplicate(s) note their ids because you will need to delete the related rows from the topic_tags table too (something you could have taken care of quickly if you’d performed all this maintenance online by simply restarting the container btw instead of destroying the instance!!).

github.com

discourse/discourse/blob/1209efb63c47c32b119ea130e4c7d411a836e3e4/app/models/topic_tag.rb#L56


      
              end
            end
          end
          
          # == Schema Information
          #
          # Table name: topic_tags
          #
          #  id         :integer          not null, primary key
          #  topic_id   :integer          not null
          #  tag_id     :integer          not null
          #  created_at :datetime         not null
          #  updated_at :datetime         not null
          #
          # Indexes
          #
          #  index_topic_tags_on_topic_id_and_tag_id  (topic_id,tag_id) UNIQUE
          #

Drew-ART · August 30, 2023, 8:33am

Our site is back! Thanks to everyone for your help.

It seems i had solved it days ago, but was careless with reading the error message. There were two duplicated tags, ‘socialmedia’ and ‘social-media’. After fixing the 1st i didn’t notice that the error message had changed, given how similar both duplicate tags were.

Here’s the processes for fixing those two errors:

1. Finding the tag table and the duplicate tag

Download the backup to your operating system. This guide is for windows, but the process will be about the same on Linux
Extract all the zipped folders in it which should leave you with a dump.sql file and an uploads folder
Open the dump.sql file with a text editor, i used Visual Studio Code
Search for “COPY public.tags” to locate the tags table. It should be near the bottom and look like this:

Either browse through it manually, or copy paste your tags table into a separate document where you can use the search function on it to find your duplicate tag
Save your fixed dump.sql file as dump.sql

2. The order and names of files and folders has to be perfect in the re-zip.

After extraction you should have a dump.sql file and an uploads folder
Right click on dump.sql. select 7zip and ‘add to archive’.
Select gzip as the archive format, keeping the name of the file the same
Select the new dump.sql.gz file and the uploads file, then right click > 7zip > add to archive > archive format: tar. Make sure the name of the file is exactly the same as the original backup, it should look something like this: ‘public-happiness-2023-07-25-033857-v20220628031850’
Select your new .tar file > 7zip > add to archive > archive format: gzip. Make sure the name of the file is exactly the same as the original backup, it should look something like this: ‘public-happiness-2023-07-25-033857-v20220628031850’
Your final result should be a .tar.gz file with the same name as your original backup.
Upload to the admin area and restore your backup

Drew-ART · August 30, 2023, 8:39am

There is one more place the tag is/might be repeated and that is the search data table:

COPY public.tag_search_data (tag_id, search_data, raw_data, locale, version) FROM stdin;

I’m unsure if that also needs to be corrected or not.

pfaffman · August 30, 2023, 10:21am

So glad you finally got it fixed!

Topic		Replies	Views
SQL Error during import Installation	19	2279	May 3, 2015
Can't restore due to corrupt indexes (with some clues on how to deal with corrupt indexes) Installation	13	4623	February 5, 2020
Problems while updating from 3.0.0 to 3.0.1 Installation	16	1181	June 14, 2023
Problem with restore Discourse from backup (3.4, quite large DB) Support	18	193	January 14, 2025
Error restoring backup "key is duplicated" Installation	8	1781	August 31, 2017

Modifying the database inside of a backup to remove a duplicate key tag so that it doesn't fail when restoring

Related topics