Fix quotes after phpBB import

It should for the most part be possible to modify the regex

~* '\[quote[^]]*]'
.....
regexp_replace(posts.raw, '\[(/)?quote([^]]*)]', E'\n[\\1quote\\2]\n', 'g')

so that it can find (the ~* case insensitive “LIKE” line) and replace (the table.field, pattern, replacement, flag line) characters and combinations of characters.

The trick is knowing exactly what to match so that it will both match what you want it to and not match what you want it not to. If the pattern is always and only the same characters it can be matched more easily. But maybe at times too easily eg. it is easy to match “cat” but you wouldn’t want to change “category” to “egory”.

As for trailing content a pattern something like ‘character string(.)*’ (google “regex everything atom”) should work as long as the string doesn’t include any “special” characters.

As for the font and color bbcode tags, if these are in only raw and not causing a problem with the cooked, I wouldn’t bother worrying about them. If they are a problem, changing the string “quote” to “font” and then “color” and changing the replacement to an empty string should work if you only want to remove the tags entirely. If case is a concern, changing the flag to ‘gi’ should work. Test First!

Documentation:
https://www.postgresql.org/docs/9.3/static/functions-matching.html

2 Likes

Obviously, the “Sent from my iPhone” strings have nothing to do with posts_with_bbcode_quotes, so how should this be structured?

The posts_with_bbcode_quotes is only an arbitrary name I gave it. It could just as easily have been something different. Some prefer shorter names such as “p”, but I prefer to use names that are more descriptive so I can easily know what they are and recognize them when I see them. In other words, there is no “magic” with that name.

For “sent by iphone” it could be named posts_with_iphone_string and the pattern could be simply the string "Sent from my iPhone”.

For simple string matches regex is probably overkill and
LIKE '%Sent from my iPhone%'
would be sufficient and a bit more efficient, but as this is hopefully only a one-time thing with only a relatively small number of posts I don’t think the difference would be significant.

Where regex will be needed however is in the replace since some text that is unknown will need to be “captured”

1 Like

Ah! I was looking for such a table in the Discourse docs. :smiley:

The regex for capturing the string and everything behind it is
Sent from my iPhone\s*([^]*)

I don’t know where you found that pattern, but it looks off to me. It doesn’t error when you try it?

See it here:
https://regexr.com/3k2mg

I have not worked it into Data Explorer yet to test.

1 Like

There’s also this version:

Sent from my iPhone[^]]+

Edit: this one works, but I will need to add a stop at <img src so that attched images are left intact.

These last links are to code sharing sites? I get the impression you do not understand how regex works and are using one example that works in one situation and substituting bits and pieces in hopes it will work in a different situation. Going about it this way will lead to pain. Regex isn’t really a language where “trying different characters without understanding what they do in hopes it will work” is a good idea. It is much more a “trying different characters that I understand how they work and testing to make sure they do”.

As I posted previously it is very important, I’d say essential, that you know exactly what text you will be working with and exactly what you want to do with it. Making best guesses will rarely if ever work as hoped.

They are Regex builder/testers, where I can use the actual content from the forum to test and tweak the expression.

OK, here’s my translation for this pattern
Sent from my iPhone[^]]+
the string “Sent from my iPhone” followed by one or more characters that are not the closing square bracket character.

I used this for the bbcode quote tags because it was a relatively safe assumption that the next closing square bracket after the string [quote would be that specifying the end of the tag. Not a 100% safe assumption, a tag like [quote hash="ft3fk]dr"] would have broke and for something critically important that possibility would need to be addressed.

As for removing “this string and everything after it”, despite my aversion to using the “regex everything atom” because of the problems it can cause, in this case it is probably appropriate to use it. But if you do not want to remove “everything after it” and want to keep certain things the pattern could get extremely complex. For example, this for “most” phone numbers
Phone Number Regular Expression Validation - JavaScript - The SitePoint Forums

^((\+\d{1,3}(-| )?\(?\d\)?(-| )?\d{1,3})|(\(?\d{2,3}\)?))(-| )?(\d{3,4})(-| )?(\d{4})(( x| ext)\d{1,5}){0,1}$

Unless you do want to remove everything after the string “Sent from my iPhone” manually editing the posts individually, in spite of how unattractive that may seem, is likely to result in less time spent and less errors.

That should have been a < instead of ] to look for the start of <img but it doesn’t take into account other possible occurrences of <, so it is not a good one.

I’ve been experimenting with lookarounds, and this does what I need:

(?=Sent from my iPhone)(.*?)(?=<img src)

Example:

I am going to spend some time testing for weird cases.

A quick count of “Sent from my iPhone” and “—Original Message—” returns about 1k occurrences each, and there are other similar strings in various languages. It wouldn’t be much fun to do them manually. :pick:

Hi @omarfilip,

I have completed my import from IPB recently with help of @pfaffman

I found, however, using rake commands to slow for fixing quotes and stuff as my forum has 1.3 mln posts it does take time.

I prefer to do all of the stuff directly on the database and then to rebake.

For the quotes for instance I would do:

UPDATE posts SET raw = REPLACE(raw, '[quote=', ''||chr(10)||'[quote=') where raw like '%[quote=%';
UPDATE posts SET raw = REPLACE(raw, '[/quote]', '[/quote]'||chr(10)||'') where raw like '%[/quote]%'; 

and then rebake.

I would actually fix everything I can before I even start rebake as you can see you changes live by simple regenerating HTML or editing the posts.

Hope it helps!

3 Likes

This worked good for me on an SMF forum. I wish it detected whether line breaks were already there, but I can live with the extra line breaks.

There is a mention plugin installed on my SMF forum so there’s a bunch of bbcodes that look like this:

[member=4953]username[/member]

I think I just need to replace the [member=4953] part with the @ symbol and then delete the [/member] closing tag. I couldn’t figure out how to modify the quote regexp_replace to work on these member mention tags.

Any help here would be great appreciated. Thanks!!!

Don’t you need a line break before and after each opening or closing quote tag? I think your solution only adds one before the opening tag and after the closing tag, which gets you halfway there. For example, the closing tag would have to be:

||chr(10)||[/quote]||chr(10)||

Note: I was confused by the single quotes in the replace string so I left those out.

Also, what do you mean by doing it directly on the database? Weren’t they already doing the UPDATE directly on the database? Thanks.

It’s a matter of what syntax you prefer.

The Ruby console using ActiveRecord syntax. Or the Postgres console using PostgreSQL syntax. They both have the ability to modify the database. And they both have the ability to break the database. IMHO this is something that should be not experimented with on a live site, available backup or no.

I strongly recommend you set up a testing environment and work with that before doing anything with a live site’s database.

1 Like

Oh ya, I’m definitely doing all this on a Vagrant box. I am comfortable with SQL queries, I just am somewhat new to Ruby/Rails so I didn’t realize there was a different console you could use on the database besides psql.

Would really appreciate your help with the mentions tags I described a couple posts up. I was able to come up with this regex which matches the member tags:

\[(\/)?member([^]]*)]

As demonstrated here: https://regexr.com/3lavf

It was giving me an error on RegExr if I didn’t escape the forward slash. Yours wasn’t escaped in the quote query so I guess it’s not necessary. Now I just need to figure out how to replace the opening tag with the @ symbol and delete the closing tag. Might need two separate queries to accomplish that.

There are some differences between different “flavors” of regex, eg. PCRE vs. POSIX. Anyway, in this case, when inside of a character class, the closing square bracket does not need to be escaped as long as it is the first character in the class.

The (\/)? capture group is “zero or one” backslash and will match both opening and closing bbcode tags. If you want to treat opening and closing bbcode tags differently, (replace opening, remove closing) then I think two separate regex would be the simplest (if not necessary) way to do it.

2 Likes

I did not find this necessary. One before and one after did the trick.
The easiest way to experiment is as mentioned by @Mittineague create clone where you can break things, you can do that easily if you use something like digital ocean.

Then do an experiment on one post only. Go to its edit box and you will see straight away if you changes are taking expected effect.

I have to say that, for some reason, I had a hard time getting regex to work on my Postgres database. Maybe that is actually a problem with IDE - Datagrip from Jetbrains.

I’ll test it and post here when I get it to work.

2 Likes

Hi,

Ok the easiest would be to break that into 2 separate replace like so:

UPDATE posts set raw = regexp_replace(raw, '\[member=(\d){0,4}]', '@', 'g') where raw like '%member=%';
UPDATE posts set raw = replace(raw, '[/member]', '') where raw like '%[/member]%';

Remember that sometimes depending on your enviroment you would have to escape special characters.

I would suggest you use a playground like this one to understand what is going on first:

https://regexr.com/

There is also a nice tutorial here:

https://regexone.com/lesson/introduction_abcs

For instance here is what I used for old emoticons that were still present on my board from old times:

update posts set raw = REPLACE(raw, 'B)', ' :slight_smile: ') where raw like '%B)%';
update posts set raw = REPLACE(raw, ':LOL:', ' :wink: ') where raw like '%:LOL:%';
update posts set raw = REPLACE(raw, ':wub:', ' :heart_eyes: ') where raw like '%:wub:%';
update posts set raw = REPLACE(raw, ':0', ' :open_mouth: ') where raw like '%:0%';
update posts set raw = REPLACE(raw, ':o', ' :open_mouth: ') where raw like '%:o%';

and some regex to clear up old user imports of whole html pages…

UPDATE posts set raw = regexp_replace(raw, '\[/?(size|font)=?“?"?\d{0,2}”?"?\]', '', 'g')                       where raw like '%size%' or raw like '%font%';
UPDATE posts set raw = regexp_replace(raw, '\[font=“?"?(\w+)(.(\w+))*“?"?\]', '', 'g')                          where raw like '%font=%';
UPDATE posts set raw = regexp_replace(raw, '\[/(font|color|background|center|i)\]', '', 'g')                    where raw like '%/font%' or raw like '%/color%' or raw like '%/background%' or raw like '%/center%' or raw like '%[/i%';
UPDATE posts set raw = regexp_replace(raw, '\[(center|i)\]', '', 'g')                                           where raw like '%[center%' or raw like '%[i%';
UPDATE posts set raw = regexp_replace(raw, '\[/?(left)\]', '', 'g')                                             where raw like '%left%';
UPDATE posts set raw = regexp_replace(raw, '\[/?(color="?#?).{0,8}"?\]', '', 'g')                               where raw like '%color=%';
UPDATE posts set raw = regexp_replace(raw, '\[(background=rgb).{10,15}\]', '', 'g')                             where raw like '%background=rgb%';

Hope it helps someone!

6 Likes

Thanks for the regular expressions!

To clarify, this is only true if the quote itself is one line. If there are any line breaks inside the quoted text (ie multiple paragraphs), then the opening and closing quote tags need to be on their own line. So a line break before and after each tag is necessary some of the time.

2 Likes