> That’s more complicated.
Lord, did this turn out to be true!
I took a run at this. Now, understand our Discourse instance has something like 15-20 years’ worth of mailing list archives imported into it. So there are a lot of old posts that look like this…
…where you get a ton of links, and email replies that quote the entire previous email repeating those links, and email signatures with links to unrelated things, etc.
So at some point, the `topic_links` table screwed up, and started making new rows for links it already had:
So there might be a row like:
```
69534 | 18675 | 86333 | 6631 | http://lists.libsdl.org/listinfo.cgi/sdl-libsdl.org | lists.libsdl.org | f |  | 2017-03-26 12:13:06.365742 | 2017-03-26 12:13:06.365742 | f | 0 |  | SDL Info Page | 2017-03-26 13:18:08.491592 | f |
```
That row was created in 2017. But then I’ve got hundreds of dupes across all these posts, where the system came along in 2024 and added duplicate rows with id values something like 200000 higher.
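If you want to see how bad the damage is up front, a query along these lines should list the duplicate groups. I’m assuming here that the uniqueness the failing index enforces is per (topic_id, post_id, url); check your own schema before trusting it:

```sql
-- List (topic_id, post_id, url) groups that have more than one row,
-- along with the ids involved. Assumption: this matches the uniqueness
-- the failing index enforces; verify against your schema first.
SELECT topic_id, post_id, url,
       array_agg(id ORDER BY id) AS duplicate_ids
FROM topic_links
GROUP BY topic_id, post_id, url
HAVING count(*) > 1
ORDER BY topic_id, post_id;
```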
So it’s a lot to clean out.
Basically I’d wait for the reindexing to fail, see what it spit out, find the ID of the duplicate row, and adjust the URL string so it no longer duplicated…it looks like the system already deals with this by inserting `**` into the string semi-randomly, so I did the same. I was afraid to delete the rows outright, because I have no idea what references them elsewhere.
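So each fix was a one-row `UPDATE`. A hedged example, with a made-up id (269534) standing in for whatever the failed `REINDEX` actually reported:

```sql
-- Mangle the duplicate row's URL the same way the system seems to,
-- by inserting ** somewhere in the string. The id here is hypothetical;
-- use the one the failing REINDEX complained about.
UPDATE topic_links
SET url = 'http://lists.libsdl.org/**listinfo.cgi/sdl-libsdl.org'
WHERE id = 269534;
```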
It did seem to be confined to a handful of topics (presumably different posts on the same topics), because I was fixing the same handful of URLs over and over. It was tedious, and I don’t recommend it; I probably should have found a way to automate it, something like the sketch below.
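If I were automating it, it would probably look like this: keep the oldest row in each duplicate group and mangle the rest. This is an untested sketch (and it assumes the same (topic_id, post_id, url) uniqueness as above), so back up the database before trying anything like it:

```sql
-- Untested sketch: for every duplicate group, leave the lowest id alone
-- and insert ** right after the URL scheme on the newer rows.
-- Note: only touches http(s) URLs; anything else is left as a duplicate.
UPDATE topic_links
SET url = regexp_replace(url, '^(https?://)', '\1**')
WHERE id NOT IN (
  SELECT min(id)
  FROM topic_links
  GROUP BY topic_id, post_id, url
);
```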
It’s also worth noting that the `CONCURRENTLY` option on `REINDEX` will build a temporary index (with a `_ccnew*` string appended to the name), and if the process fails, it’ll leave it there. If you’re going to use `CONCURRENTLY` and have to clean up lots of things, expect to drop a LOT of temporary indices by the time you clean everything out. (Also, if, like me, you have never used Postgres directly, know that I had to issue a `\connect discourse;` command at the start before the `REINDEX` command could be used.)
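Finding the leftovers is straightforward once you’re connected: the abandoned `_ccnew` indexes are marked invalid in `pg_index`, so something like this (run from psql after the `\connect discourse;`) should list them, and then each one gets dropped by hand. The index name in the `DROP` is just an illustration:

```sql
-- List invalid leftover indexes from failed REINDEX CONCURRENTLY runs.
SELECT indexrelid::regclass AS leftover_index
FROM pg_index
WHERE NOT indisvalid
  AND indexrelid::regclass::text LIKE '%ccnew%';

-- Then drop each one (hypothetical name shown).
DROP INDEX CONCURRENTLY index_topic_links_on_url_ccnew1;
```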
I’m still in the process of doing this; it doesn’t seem to be making things worse, so I’m continuing to plug along on it.