phpBB importer fails to import anonymous users differing only by letter case

When importing anonymous users from phpBB I get the following error:

Error on record: {:email=>“anonymous_no_email_Надежда”, :username=>“111104”, :name=>"", :created_at=>Mon, 10 Nov 2014 15:09:15 UTC +00:00, :active=>true, :trust_level=>0, :approved=>true, :approved_by_id=>-1, :approved_at=>2016-07-11 10:40:21 +0000, :import_mode=>true, :last_emailed_at=>2016-07-11 10:40:21 +0000}
/var/www/discourse/vendor/bundle/ruby/2.3.0/gems/rack-mini-profiler-0.10.1/lib/patches/db/pg.rb:90:in `exec’: PG::UniqueViolation: ERROR: duplicate key value violates unique constraint “index_users_on_email” (ActiveRecord::RecordNotUnique)
DETAIL: Key (lower(email::text))=(anonymous_no_email_надежда) already exists.
: INSERT INTO “users” (“email”, “username”, “name”, “created_at”, “active”, “trust_level”, “approved”, “approved_by_id”, “approved_at”, “last_emailed_at”, “username_lower”, “updated_at”) VALUES (‘anonymous_no_email_Надежда’, ‘111104’, ‘’, ‘2014-11-10 15:09:15.000000’, ‘t’, 0, ‘t’, -1, ‘2016-07-11 10:40:21.494781’, ‘2016-07-11 10:40:21.599263’, ‘111104’, ‘2016-07-11 10:40:21.622126’) RETURNING “id”

It seems that we have guest posts under the names “НАДЕЖДА” and “Надежда” - it’s the same name, the first one merely being all-caps. As the importer doesn’t treat anonymous user names as case-insensitive, it tries to create two accounts here (probably not wrong). Trouble is, the generated email address violates uniqueness requirements which treat email addresses as case-insensitive. Looking at the source code, this issue doesn’t seem to be limited to Cyrillic, it would be just the same with Latin.

I’m running the import script from Discourse 1.5.3, master seems to have the same issue however. Given that treating account names as case-sensitive is desired, maybe the simplest solution is adding a hash of the username to the generated “email address”? This patch seems to solve the issue:

diff --git a/script/import_scripts/phpbb3/importers/user_importer.rb b/script/import_scripts/phpbb3/importers/user_importer.rb
index ca50197..3f91a52 100644
--- a/script/import_scripts/phpbb3/importers/user_importer.rb
+++ b/script/import_scripts/phpbb3/importers/user_importer.rb
@@ -47,11 +47,13 @@ module ImportScripts::PhpBB3
     end
 
     def map_anonymous_user(row)
+      require 'digest/md5'
       username = row[:post_username]
+      md5 = Digest::MD5.hexdigest(username)
 
       {
         id: username,
-        email: "anonymous_no_email_#{username}",
+        email: "anonymous_no_email_#{username}_#{md5}",
         username: username,
         name: '',
         created_at: Time.zone.at(row[:first_post_time]),
1 Like

Duplicate email addresses are not allowed. This should be sanitized before it gets to the importer code.

2 Likes

There are guest posters - they have no email addresses (simply because they don’t have any account on phpBB). These “email addresses” are fake and have been generated by the importer.

I’ll look into this… I’m not sure if this is the best way to fix that issue. I guess anonymous usernames should be case-insensitive…

2 Likes

Well that’s wrong. Hostnames are case-insensitive, but usernames in email should be case-sensitive.

https://tools.ietf.org/html/rfc5321#section-2.3.11

An address normally consists of user and domain specifications. The
standard mailbox naming convention is defined to be
“local-part@domain”; contemporary usage permits a much broader set of
applications than simple “user names”. Consequently, and due to a
long history of problems when intermediate hosts have attempted to
optimize transport by modifying them, the local-part MUST be
interpreted and assigned semantics only by the host specified in the
domain part of the address.

Key point: the local-part MUST be interpreted and assigned semantics only by the host specified, which means you are not allowed to make any assumptions (like case sensitivity) about it.

1 Like

No, we’re not having that discussion again, thanks very much. In practical terms, emails are not case sensitive.

(For the record, I used to hold your position. Now I very much… do not.)

4 Likes

Sure the patch proposed seems fine, nonsense email vs nonsense email with md5 makes no diff

4 Likes

Not sure about that - it will actually change the name displayed next to posts. Consider one anonymous user calling himself “guest” and another who calls himself “GUEST” - both would be the same after import. Not a huge loss but still…

  • Discourse doesn’t support usernames that differ by case. So, one would be called “guest” and the other “GUEST1”.
  • Everyone can post as “guest” in phpBB if anonymous users are enabled. There’s no way of knowing if three posts by “guest” actually were posted by the same person or three different people. And who knows, maybe “guest” and “GUEST” are the same person…
  • Importing anonymous users from phpBB is just a simple feature in order to not have a bunch of posts by “system”.

How come that my import worked then? It seems that the username field is filled with a numeric value for former anonymous users. I currently have three users with essentially the same name:

discourse=> select id, username, name, email from users where lower(email) like '%надежда%';
   id   | username | name |                            email                            
--------+----------+------+-------------------------------------------------------------
 106024 | 111145   |      | anonymous_no_email_надежда_ed2ee7915ec20077dec081f89fb3b026
 105979 | 111103   |      | anonymous_no_email_НАДЕЖДА_8f9097b8a900315c958aacbe0a531bf6
 105980 | 111104   |      | anonymous_no_email_Надежда_4e6c0123d6375b565d1cba7d8eb3b831
(3 rows)

All of them could be the same person of course. But the displayed name didn’t change compared to phpBB.

Well, they all got a different username, in your case all numbers (111145, 111103, 111104), because Cyrillic usernames are not allowed. Only the name

And don’t get me wrong, I’m not saying your solution doesn’t work. I’ll definitely try it out and implement a proper fix…

1 Like

Ah, didn’t realize that Cyrillic was important here. You are right, other anonymous users got proper usernames - and those would be treated as case-insensitive. Surprisingly, it seems that we’ve had only one such conflict despite 17k anonymous users…

This is now fixed:
https://github.com/discourse/discourse/pull/4350

2 Likes