Fix quotes after phpBB import

No. Is it necessary to remove the bbcode tags? In the posts where I manually just add a return to ensure the tags are on a line by themselves works as a fix.

[quote]
some text
[/quote]

produces

[quote]foobar[/quote] to <br><br>foobar<br><br> would lose the blockquote.

But, [quote]foobar[/quote] to <blockquote>foobar</blockquote> would be OK even if the attribute is lost.

1 Like

Now I’m a bit confused. Please correct me if I’m off base on any of this.

  1. posts from phpBB were imported to Discourse and became raw post content
  2. all of the bbcode tags are well-formed (paired and nested correctly)
  3. because the phpBB bbcode tags were not in proper Markdown syntax (inline instead of block level), they were not changed to Discourse blockquote tags and the cooked post content displays them as bbcode tags
  4. you don’t want to remove the bbcode tags from the cooked post content, you want to reformat the raw content to conform to Markdown syntax
  5. you want to keep the text content that is between the start and end bbcode tag pairs
  6. if necessary you are willing to do a rebake

or

  1. you want to replace the bbcode tags in the cooked post content with blockquote tags
  2. you want to keep the text content that is between the start and end bbcode tag pairs
  3. if necessary you are willing to repeat the process if the raw post content reappears for some reason
1 Like

The best solution is to fix the importer and import again. I’m guessing that this solution doesn’t help you, but for others finding this thread, make sure that your import is what you want before going live.

4 Likes

You are not wrong @Mittineague. Both 1-6 scenarios would fix the current state of quotes.

I agree with @pfaffman that ideally this problem would be handled by the import script.

If I’m correct, changing the cooked content runs the risk of the raw content replacing the cooked content at some point (eg. global rebake) which could be annoying. Changing the raw content might not update the cooked content and require running a global rebake.

I know that raw content can be accessed using URLs like domain/raw/topic_id but I do not know if the json / rss URLs use raw or cooked content let alone what plugins might be using.

In any case, short of fixing the import script and running it again I’m leaning towards changing the raw content as a better approach even if it means rebaking needs to be run.

The first thing I would do is run this query in Data Explorer and manually edit the 4 non-all-lowercase bbcode tags and make them lowercase and block level instead of inline, only to get them out of the way.

SELECT id AS post_id
FROM posts 
WHERE cooked LIKE '%[QUOTE%' 
OR cooked LIKE '%[Quote%' 

AFAIK, clicking on the Data Explorer id result should bring you to that post.

EDIT
I just tested and I needed to SELECT topic_id to get a link to the topics and alias as post_id to get the excerpts

SELECT topic_id 
  , id AS post_id
FROM posts 
WHERE cooked ~* '\[quote[^\]]*\]' 

2 Likes

Thanks @Mittineague. The counts now match after finding and manually fixing the 4 non-all-lowercase bbcode tags.

I’ve already moved, merged, and cleaned all sorts of stuff that was left over from the old phpBB forum, so re-importing is no longer practical.

I am absolutely OK with that approach. :ok_hand:

3 Likes

I created a topic in my localhost to test against. I hit a bit of a snag that might be a concern if you have posts where you want the bbcode quote tags to show inside of code fences unchanged.

Specifically, I had been working up some badge queries for “writer/essayist/novelist”, the idea being to grant them to members that made longer posts. The alpha queries were matching posts that had very little original text but with large amounts of quoted text. I didn’t want that so I put a test for [quote in the query. The query was put inside code fencing in workup topics. Because I use triple backticks out of habit I could use this in a WHERE to filter out those posts. But if any had been inside single backticks or in indented code it would have be more complex. IMHO adding newlines before and after the tags wouldn’t render the display indecipherable but the change is something you should be aware of. eg.

This won’t [quote]work[/quote]

would be changed to

This won’t
[quote]
work
[/quote]

Does your forum have posts that might include similar code examples?

2 Likes

No, I don’t recall seeing such a pattern, nor would it be a concern if it existed and was changed as you described above.

Good, you may not need the “```” AND line in this query.

WITH posts_with_bbcode_quotes AS (SELECT posts.id AS targeted_posts 
             FROM posts 
             WHERE posts.cooked ~* '\[quote[^\]]*\]' 
             AND posts.raw !~~ '%```%'
             ORDER BY posts.id)
SELECT COUNT(posts.id) 
FROM posts_with_bbcode_quotes, posts 
WHERE posts.id IN (posts_with_bbcode_quotes.targeted_posts)

It looks more verbose, but I’ve found that using qualified columns significantly speeds up query execution and I prefer readability over terseness. (ie. table.column vs. column, targeted_posts vs. pid).

It’s a bit redundant ATM, since it’s not an UPDATE yet, but the count should hopefully be the same.

The count with this query is 12143 whereas the count with the previous query is 12146.

However, the counts match when running the latest query in data explorer and pg console - 12143 for both.

You may have three posts that have both bbcode quote tags and three backtick code fences. Please run this and check the posts.

WITH posts_with_bbcode_quotes AS (SELECT posts.id AS targeted_posts 
             FROM posts 
             WHERE posts.cooked ~* '\[quote[^\]]*\]' 
             AND posts.raw ~ '%```%'
             ORDER BY posts.id)
SELECT posts.topic_id AS topic_id 
  , posts.cooked
FROM posts_with_bbcode_quotes, posts 
WHERE posts.id IN (posts_with_bbcode_quotes.targeted_posts) 
LIMIT 50

* there should only be 3, the LIMIT is just in case for safety sake.

1 Like

Zero with this latest one.

Hmmm. that means there aren’t any posts that have bbcode code tags and three backtick code fences.

The only other difference I can think of is that the regex changed from \[quote to \[quote[^\]]*\]

This would translate to

an opening square bracket followed by the text “quote”

and

an opening square bracket followed by the text “quote” followed by 0 or more characters that are not a closing square bracket followed by a closing square bracket

the “not” there to account for possible attributes / values.

If you try this is the count the same as the one with the “\]” in it?

WITH posts_with_bbcode_quotes AS (SELECT posts.id AS targeted_posts 
             FROM posts 
             WHERE posts.cooked ~* '\[quote[^\]]*'
             ORDER BY posts.id)
SELECT COUNT(posts.id) 
FROM posts_with_bbcode_quotes, posts 
WHERE posts.id IN (posts_with_bbcode_quotes.targeted_posts)
WHERE posts.cooked ~* '\[quote[^\]]*\]' 

returns 12143

WHERE posts.cooked ~* '\[quote[^\]]*\]' 
AND posts.raw ~ '%```%' 

returns 0

WHERE posts.cooked ~* '\[quote[^\]]*'

returns 12146

Well, it’s only three, but I am curious to know what it is about them and it would be good to know just in case not dealing with them might cause some negative effects.

Try this and see if you can spot what is different about those three.

WITH posts_with_bbcode_quotes AS (SELECT posts.id AS targeted_posts 
             FROM posts 
             WHERE posts.cooked ~* '\[quote[^\]]*' 
             AND posts.cooked !~~ '\[quote[^\]]*\]' 
             ORDER BY posts.id)
SELECT posts.topic_id AS topic_id 
  , posts.cooked 
FROM posts_with_bbcode_quotes, posts 
WHERE posts.id IN (posts_with_bbcode_quotes.targeted_posts)

This produces a list of many posts, not just the 3. From the query plan:

  CTE posts_with_bbcode_quotes
    ->  Sort  (cost=41813.41..41831.73 rows=7325 width=4)
          Sort Key: posts_1.id
          ->  Seq Scan on posts posts_1  (cost=0.00..41343.20 rows=7325 width=4)
                Filter: ((cooked ~* '\[quote[^\]]*'::text) AND (cooked !~~ '\[quote[^\]]*\]'::text))
  ->  CTE Scan on posts_with_bbcode_quotes  (cost=0.00..146.50 rows=7325 width=4)
  ->  Index Scan using posts_pkey on posts  (cost=0.42..4.55 rows=1 width=372)
        Index Cond: (id = posts_with_bbcode_quotes.targeted_posts)

Hmmm, TBH I have no idea why that wouldn’t work for you. Of course my localhost doesn’t have as many posts so that might have something to do with it.

Please try putting a LIMIT 1 at the end of the query and see if you can spot a possible issue in the post.

EDIT
I just tried and found one that looks like this
11,65,"[quote=""ModGuy1, post:9, topic:16""] Even though
But I don’t know why it doesn’t match the “\]”

Nothing unique about the resulting post. It contains multiple quotes, like this:

[quote]Some text

Some more text[/quote]

[quote]Even more text

And the end of text!![/quote]

I looked through the first 5 posts that are returned by the query, but could not find a pattern that would set them apart from the rest.

I reviewed the Postgres pattern matching documentation and the problem is due to a mistake. I apologize for the wasted time and effort. I was erroneously escaping where I shouldn’t have been escaping.inside of the character class. Thus “[^\]]” was equivalent to “not a backslash or closing square bracket”. Though I did not test to confirm I am reasonably certain that the string contained invisible whitespace i.e. “\s” or “\n” characters that would break the intended pattern. Try this

WITH posts_with_bbcode_quotes AS (SELECT posts.id AS targeted_posts 
             FROM posts 
             WHERE posts.cooked ~* '\[quote[^]]*]'
             ORDER BY posts.id)
SELECT COUNT(DISTINCT(posts.id))  
FROM posts_with_bbcode_quotes, posts 
WHERE posts.id IN (posts_with_bbcode_quotes.targeted_posts)
2 Likes

This returns a count of 12143.

I am grateful for your effort in this @Mittineague, and I enjoy learning as we go along.

1 Like