Fix quotes after phpBB import

Thanks. And the count is now 12146?

Count from this query is 12146:

WHERE cooked LIKE '%[quote%'

Sorry, I should have been more clear. I meant this query.

WITH posts_with_bbcode_quotes AS (SELECT posts.id AS targeted_posts 
             FROM posts 
             WHERE posts.cooked ~* '\[quote[^]]*]'
             ORDER BY posts.id)
SELECT COUNT(DISTINCT(posts.id))  
FROM posts_with_bbcode_quotes, posts 
WHERE posts.id IN (posts_with_bbcode_quotes.targeted_posts)

This is the one I plan to use to workup the UPDATE query.

That returns a count of 12143

Gadzooks is this a bugger. I hate to proceed without knowing what the problem is so I’ll sleep on it in hopes it comes to me (sounds odd I admit, but there have been countless times that dreams and half-sleep have let me see solutions I couldn’t while in wakefulness)

On a positive note, the UPDATE query is coming along nicely and I should be able to test it with my limited number of posts. Hopefully worse case the query will work and leave only the three enigmatic posts unfixed for you.

1 Like

Ha! 0.0002470 failure rate. Oh the horror! :slight_smile:

3 Likes

I’ve put together this that looks OK with my contrived posts, but I would like you to run it against your posts, changing the LIMIT to something you think would be relatively sufficient.

WITH posts_with_bbcode_quotes AS (SELECT posts.id AS targeted_posts 
             FROM posts 
             WHERE posts.cooked ~* '\[quote[^]]*]'
             ORDER BY posts.id)
SELECT regexp_replace(posts.raw, '\[(/)?quote([^]]*)]', E'\\n[\\1quote\\2]\\n', 'g')
FROM posts_with_bbcode_quotes, posts 
WHERE posts.id IN (posts_with_bbcode_quotes.targeted_posts) 
LIMIT 5

For mine, without a LIMIT I get

All on one line Some text \n[quote]\nInline syntax\n[/quote]\n end text. Preview does not look good. 

Inline with leading text only Blah blah \n[quote]\nInline syntax\n[/quote]\n Preview Not OK

Inline with trailing text only \n[quote]\nInline syntax\n[/quote]\n and more Preview Not OK

Block starting after text Some text \n[quote]\n Block level \n[/quote]\n preview Not good

Block with text after \n[quote]\n Block level \n[/quote]\n hey now! preview not good

inside text on same line as opening tag \n[quote]\nBlock variant one \n[/quote]\n preview NOT good

inside text on same line as closing tag \n[quote]\n Block variant two\n[/quote]\n preview NOT OK

This is the eighth broken post. Two more bad syntax quotes This is \n[quote]\nthe first\n[/quote]\n And this is \n[quote]\nthe second\n[/quote]\n Ditto on the not.

Ninth post, First broken and with fake attr val, second with fake attr val but not. Broken \n[quote id="666xxx"]\nBlock level \n[/quote]\n I see it \n[quote name="testing"]\n now I \n[/quote]\n don't

Nested quotes, both non-conforming syntax, both with attributes. The outer \n[quote name="outer"]\n To quote the great Gatsby \n[quote name="inner"]\n Give me the money\n[/quote]\n\n[/quote]\n ditto Bad

Nested quotes, outer good, inner bad \n[quote]\n Shouldn't that be \n[quote]\n I have the money\n[/quote]\n instead? \n[/quote]\n Just saying.

With a limit of 50, it seems to work fine:

[quote] —>  \n[quote]\n
[/quote] —> \n[/quote]\n
[quote="Peter"] —> \n[quote="Peter"]\n
1 Like

BTW, I’ve set up a copy of the live instance on a separate DO droplet where I can test and restore backups if things don’t work right. The db size is ~400MB, so it’s rather quick.

2 Likes

I’m certainly glad I tested this. From how I understood the documentation newlines needed to be double escaped. Not so. They needed to be only single escapes else they were written as literal text. I am fairly certain there are better ways to know the number of affected rows, but I do not know how to do them. Anyway, it looks good to me. a couple before and afters:
newline1newline2newline3newline4

To recap caveats for those that might be interested in this query

  • It does not match against the lack of newlines around improperly formatted bbcode tags in raw post content
  • It relies on Discourse displaying the bbcode tags in the cooked content to identify problem posts
  • it does not take into account any posts where displaying of bbcode tags in the cooked content is desired (eg. in code fences)
  • it matches per post, not per bbcode tag. Therefore if a post contains both improper and proper syntax bbcode tags, all will be surrounded by newlines
  • it does not create edit revision histories
  • posts require a rebake for the changes to be effected in cooked
  • it can not be stressed strongly enough, Save a backup before running this query
  • the RETURNING raw line will show the number of affected rows after outputting all of the raw content. Though acceptable with my small sample set, this may be undesireable for larger UPDATEs
WITH posts_with_bbcode_quotes AS (SELECT posts.id AS targeted_posts 
             FROM posts 
             WHERE posts.cooked ~* '\[quote[^]]*]'
             ORDER BY posts.id)
UPDATE posts 
SET raw = regexp_replace(posts.raw, '\[(/)?quote([^]]*)]', E'\n[\\1quote\\2]\n', 'g')
FROM posts_with_bbcode_quotes 
WHERE posts.id IN (posts_with_bbcode_quotes.targeted_posts) 
RETURNING raw;
6 Likes

Run in the pg console, correct?

Regarding rebaking, would matching a pattern work, or is the pattern we’d be looking for already changed?

I did mine post by post using the post admin wrench to Rebuild HTML Obviously doing this for 11 posts is a lot different than doing it for thousands.

AFAIK the command would be
rake posts:rebake
without the match pattern part.

2 Likes

Running commentary:

12143 rows affected.

Next up: rebake…:timer_clock: (looks like 2 hr)

While the rebake task is cooking… what’s the proper way to avoid paged output in the pg console? I tried \pset pager off but still had to go through some pages.

I have no idea. But I would like to know both how to page and how to not page. There have been times I was interested in seeing everything and the output quickly scrolled by and other times I was less interested in seeing all the output but had to page which was a slight annoyance.

Perhaps when done you could run the “12146” WHERE cooked LIKE '%[quote%' query again and have a look at the three mysterious posts?

1 Like

Success! However, my computer went to sleep and the session had a broken pipe, so I think the rebake process was shut down at that point.

The WHERE cooked LIKE '%[quote%' query now returns 450.

I’ll rebake again and provide an update shortly.

1 Like

Run

tmux

Before you run the command. If you get disconnected, run

tmux attach

to reconnect

4 Likes

Rebaking is complete, but the count remains at 450. Looks like those are unpaired opening or closing bbcode quote tags.

They would have most likely have been unpaired in phpBB when a user deleted one of the tags while trimming the quoted text.

I don’t have an answer for the 3 mystery posts. There were 4 posts that failed rebaking.

I hadn’t thought of unpaired tags. 450 is quite a few, but I would likely slog through them in reverse order, manually fixing each. But then, I can be stubborn determined.

It would be possible to have script deal with them, but script would not be able to judge whether the unpaired tag should be removed, or if its partner should be added, and if so, in what particular place.

In the next step of the cleanup process I want to delete everything in a post below strings like ----Original message----, or Sent from my iPhone, etc. Some of the leftover tags will get removed in that process leaving even fewer for manual editing.

Thanks very much for your dogged effort @Mittineague !

I imported my forum from Mybb. I’ve had the same problem as you. rake posts method have solved most of the problem. But rake posts method can’t delete multiple value expressions like [color=#] or [font=] tags. I want to delete all color tags whatever the value is. How can we adapt this code for that? And where should we run this code?