User data corruption in phpBB3 to WP/Discourse migration

Occasionally, and seemingly only right after a fresh import of user data from our old phpBB3-based site, we’re seeing an issue where some user data gets corrupted with another user’s data during the WP-to-Discourse synchronization. It doesn’t happen often and isn’t reproducible on demand, which unfortunately has led our dev team to largely write it off.

In the first case, one of my test user accounts was removed as part of a fresh import of data, but that test user’s avatar was then assigned to a different user’s profile and I was logged in as that user once the import completed.

In a second case, I registered a test user in WP and when the synchronization with Discourse happened, this test user took an existing user’s username in Discourse and some of their custom profile and group data. See screenshot…

In both these cases, duplicate users in Discourse were involved in the test user’s account corruption. Ex: agmolnar and agmolnar1 and tbm960c and tbm960c1

We had a number of these duplicate users which I gather most likely came from anonymous users on the imported phpBB3 data file.

Has anyone seen something similar before or have any hints as to what the problem might be here? Would it be worthwhile to have our team scrub the phpBB3 import file of anon users before doing a fresh import?

Thanks for any suggestions.

Hi Ryan,

You might have a look at these files and see if you can find some useful information:

Also:

And:

Maybe you can modify this part not to import anonymous users, or find some clue that would lead to your weird issue.

Thanks Coin-coin. If we scrub anon users from the user data file before importing, I assume all anonymous posts will be assigned to the ‘system’ user as they are anyway. Is there any reason I may not be aware of to keep the anonymous users from our phpBB instance in the import file?

I’m gathering from this and other threads such as…

The issue for us seems related to three things: the anonymous users, which are essentially duplicate users that do not exist in WP; a fresh import to WP, which changes the structure of WP user IDs but not Discourse IDs by eliminating newly created WP test users; and the fact that Discourse tries to associate users first by external WP ID.

When we do a fresh import, it removes some WP test users from the database. User IDs of a couple of anonymous phpBB users that I searched for all appear to have the highest integer user IDs in Discourse (4505, 4506, etc). So it seems when we run a fresh import and delete test accounts in WP, a newly created user in WP then gets synced by the old ID of the now non-existent old test user.

By forcing users to be matched by email instead of external ID during the fresh import, as described in post #5 linked above, we should be able to preclude the possibility of any unwanted merging of old and new users.
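To make the failure mode concrete, here is a minimal Python sketch of the two matching strategies. It is only a toy model of what I think is happening, not Discourse's or the WP plugin's actual code; `ForumUser`, `match_by_external_id`, `match_by_email`, the usernames, and the example addresses are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ForumUser:
    """Hypothetical stand-in for a forum account linked to a WP account."""
    username: str
    email: str
    external_id: Optional[str] = None  # WP user ID recorded at the last sync


def match_by_external_id(users: List[ForumUser], wp_id: int, wp_email: str) -> Optional[ForumUser]:
    """Return the forum user whose stored external ID equals the WP ID, if any."""
    for user in users:
        if user.external_id == str(wp_id):
            return user
    return None


def match_by_email(users: List[ForumUser], wp_id: int, wp_email: str) -> Optional[ForumUser]:
    """Return the forum user whose email equals the WP email, if any."""
    for user in users:
        if user.email.lower() == wp_email.lower():
            return user
    return None


# Forum state left over from before the fresh import: this account is still
# linked to WP ID 4505, which belonged to a test user that no longer exists.
forum_users = [ForumUser("old_test_user", "old-test@example.com", external_id="4505")]

# After the fresh import, WP hands the same ID to a brand-new registration.
new_wp_id, new_wp_email = 4505, "newcomer@example.com"

wrong = match_by_external_id(forum_users, new_wp_id, new_wp_email)
right = match_by_email(forum_users, new_wp_id, new_wp_email)

print(wrong.username if wrong else None)  # old_test_user (stale link is reused)
print(right.username if right else None)  # None (no match, so a new account is created)
```

In this model the stale external ID is what ties the new registration to the old account; matching by email finds nothing, so the two stay separate.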

Does this all sound reasonable?

Thank you

A problem with this is that post attribution to unique users will be lost, so it will be hard to follow a conversation where all posts seem to come from a single system user - you won’t be able to tell distinct participants apart from each other.

Exactly; conceptually, you could either leave things as is and manually use the admin user profile UI to merge users that should be the same ones, OR you could simply send email as external ID instead as you suggest, which will cause new logins to be connected to an existing account with a matching email address.

The latter is clearly the path of least resistance :+1:

@kiefferr I noticed some details in the images you included and removed them in case you didn’t mean to share them. Please feel free to re-upload images if needed. :slight_smile: :+1:


Are you saying that you are running both the discourse and phpbb3 communities at the same time and periodically import data from phpbb3 again?

That’s what it seems like at this point as our dev of a new WP/Discourse site drags on, but no, Discourse is still in dev. We are replacing a custom site that was built around phpbb3.

We did an initial import and then decided to do another test import ahead of a final import before cutover. I’m glad we did, because I wouldn’t want to be chasing these bugs down on a live site.


Hi guys, I have a follow-up if you would be so kind. I shared the above findings with the dev team regarding forcing users to be matched by email instead of external IDs during the final import/migration right before we go live, as well as deleting SSO records so they can be re-propagated cleanly.

But they now seem to think there was something wrong with the original phpBB3 export data, namely that there are duplicate/anon users in that data as well as some phpBB users without associated emails. The following all seems like something Discourse should be able to handle in terms of importing phpBB3 data. Am I wrong? Especially regarding how Discourse assigns fictitious usernames to anon users, this is standard operating procedure.

If we need to obtain cleaner data from our current phpBB3 install, we can probably do that. But it doesn’t seem like we need to or should mess with the phpBB data. That wasn’t really the problem.

We performed a new installation of Discourse and successfully imported data from phpBB.

Here are our findings:
The “…_users” table contains a total of 3270 records.
Upon downloading the imported users from Discourse, we observed that there are 3251 users in Discourse.
During our analysis, we discovered that several users have an appended “1” in their Discourse usernames, and in most cases the “1” was already part of their username in the phpBB data. The one exception is “redacted_username1,” which does not exist in the phpBB data, although the user “redacted_username” does.

The email associated with the Discourse username “redacted_username1” is “anonymous_52996ba94025464fdf3e5f3ae131bdf5@no-email.invalid.” This suggests that the username “redacted_username” was already taken when an anonymous user with the same name was imported, so “1” was appended to the anonymous user’s username.
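As an illustration of how a “1” can end up on a username during import, here is a short Python sketch of a collision-avoiding allocator. This is a hypothetical model written for this report, not the importer's real logic; `allocate_username` and the `taken` set are invented names.

```python
def allocate_username(wanted: str, taken: set) -> str:
    """Return `wanted` if it is free, otherwise append the first free integer suffix.

    `taken` holds lower-cased usernames already in use and is updated in place.
    """
    if wanted.lower() not in taken:
        taken.add(wanted.lower())
        return wanted
    suffix = 1
    while f"{wanted}{suffix}".lower() in taken:
        suffix += 1
    candidate = f"{wanted}{suffix}"
    taken.add(candidate.lower())
    return candidate


taken = set()
first = allocate_username("redacted_username", taken)   # first record keeps the name
second = allocate_username("redacted_username", taken)  # later duplicate gets a suffix
print(first, second)  # redacted_username redacted_username1
```

Under this model, which record gets the suffix depends only on the order in which the two same-named records are processed.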

To prevent such occurrences in future imports, we need to obtain fresh data that excludes anonymous and other unwanted users.

The user count discrepancy between phpBB and Discourse is 19. Within the phpBB data, there are 53 users who lack an associated email for their accounts.

When searching for “anonymous” users in Discourse, it returns 32 users with anonymous emails who are currently suspended. The usernames assigned to these anonymous users cannot be found in the phpBB data. This implies that Discourse is assigning fictitious usernames to anonymous users, which could potentially cause future errors.

In summary, 19 users without email were not imported, while 32 users were synchronized as anonymous emails with fabricated usernames.
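To see exactly which accounts make up a discrepancy like this rather than only counting them, the two username lists can be diffed. The Python sketch below assumes both sides have been exported to plain lists of usernames; `reconcile` and the sample names are hypothetical.

```python
from typing import Dict, Iterable, List


def reconcile(phpbb_usernames: Iterable[str], discourse_usernames: Iterable[str]) -> Dict[str, List[str]]:
    """Compare two username lists case-insensitively and report the differences."""
    source = {name.lower() for name in phpbb_usernames}
    imported = {name.lower() for name in discourse_usernames}
    return {
        # Present in the phpBB export but absent after the import.
        "missing_in_discourse": sorted(source - imported),
        # Present in Discourse but with no counterpart in the phpBB export.
        "only_in_discourse": sorted(imported - source),
    }


# Tiny made-up sample; real input would come from the two exports.
report = reconcile(["alice", "Bob", "carol"], ["alice", "bob", "bob1"])
print(report["missing_in_discourse"])  # ['carol']
print(report["only_in_discourse"])     # ['bob1']
```

The `only_in_discourse` list is where renamed or generated usernames would show up.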

Please share clean PHPBB data so we can import it.

Furthermore, if you have any thoughts on it please let us know.

Unless you are going to delete those users from the database, I would modify the script to ignore those users. Getting a partial dump is likely not to be easily repeatable.
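A rough sketch of that kind of filter, in Python for illustration only (the importer itself is not written in Python): it assumes each user row is a dict with `username`, `user_email`, and `user_type` keys, and that a `user_type` of 2 marks ignored/anonymous accounts. Both are assumptions to check against your own data, and `should_import` is a hypothetical helper.

```python
from typing import Dict, List

# Assumption: this value marks ignored/anonymous accounts in your phpBB data.
ANONYMOUS_USER_TYPE = 2


def should_import(row: Dict) -> bool:
    """Skip rows that have no email address or are flagged as anonymous."""
    email = (row.get("user_email") or "").strip()
    if not email:
        return False
    if row.get("user_type") == ANONYMOUS_USER_TYPE:
        return False
    return True


# Made-up rows standing in for the users table.
rows: List[Dict] = [
    {"username": "real_member", "user_email": "member@example.com", "user_type": 0},
    {"username": "no_email_user", "user_email": "", "user_type": 0},
    {"username": "Anonymous", "user_email": "anon@example.com", "user_type": 2},
]

importable = [row for row in rows if should_import(row)]
print([row["username"] for row in importable])  # ['real_member']
```

Keep in mind the point made earlier in the thread: posts by skipped users still need an owner, so they would all fall back to a single account and lose their individual attribution.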

If you can just delete those users from the current site, that seems like a good solution. My preference in situations like yours is to read the live database directly rather than repeatedly transferring a dump.
