Publishing DB backups without exposing users' private information

At bitcoincashresearch.org we are looking to offer a certain level of decentralization of the forum, including the possibility that anyone who considers it necessary could rebuild a given instance of the forum in another domain.

For that to be possible, we should publish DB backups periodically. But we face the problem of exposing users' private information, such as emails, IPs, etc.
We could remove that private information from the DB backups manually, but that doesn't seem to be the smartest or neatest option, even more so with periodic backups.

Are you aware of any solution that could work in that regard? Or some example of something similar that has been done?

That’s a pretty wacky edge case!

So you'd want to remove IPs and email addresses? Without email addresses, there would be no way for users to reclaim their accounts (if they used a password, they could log in, but then they would not be able to put their email address back in, as there would be no way to validate the change).

I don’t see any way to create a useful backup that didn’t have at least email addresses.


Here are public SQL dumps of a Discourse forum: Index: if-archive/info/intficforum

You could get some ideas from there.


I agree that it could seem like a wacky edge case. But the idea behind it isn't so wacky, since it works towards avoiding the forum's centralization.
I understand that this may not be very important for some forums, but it may be for others.
I also agree that removing users and passwords (among other information) from the backup would prevent those same users from logging into the new instance.
That’s why I said that it didn’t seem like the smartest way.
I should probably rephrase my question.
Is there a known way or recommended procedure to publish database backups so that anyone can rebuild a given instance of the forum without exposing users’ private information?

I would like to add some context. There is a very specific need / use case.

There is an operational Discourse forum that has important content, and that content will become more important over time. Because of the nature of the ecosystem that uses it, we have learned that avoiding central points of failure is very important.

With an export today, if you exclude authentication info, it should be possible to publicly publish the full site content but as @pfaffman pointed out, you end up with an irreversible break where users can no longer authenticate and the exported site becomes read-only.

Therefore I think what Leandro needs is a feature in Discourse that allows users to log in through cryptographic challenges rather than traditional account/password schemes. The export would then include only that part of the account, and none of the email addresses, password hashes, etc. In the alternate copy of the site, users who took advantage of the feature can log in and go through an email/standard account recovery procedure.

When doing that full publication, it will obviously be very important not to include any of the traditional account authentication information, such as emails and password hashes. It's so important that, in any version of Discourse with this feature, the sensitive info should be kept in a separate place from the rest of the site data so that it is impossible to export it accidentally.

I hope that gives a little more context to chew on.


Also these changes are obviously very non-trivial. It would be good to hear feedback, issues and alternatives. Maybe resources can be pulled together on our ecosystem side to create a fork that implements the idea.


That can be done by adding support for WebAuthn as a passwordless authentication method, as explained in Webauthn support.

That, plus a service that cleans up the backup file by removing the fields you don't want to expose.
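A minimal sketch of what such a cleanup step might look like, assuming the backup has already been parsed into rows represented as dictionaries. The table and column names in `SENSITIVE_COLUMNS` are illustrative assumptions, not a verified list from the Discourse schema; a real service would need to audit the actual schema first.

```python
from typing import Any

# Hypothetical map of table name -> columns to blank out.
# These names are assumptions for illustration; audit the real schema.
SENSITIVE_COLUMNS: dict[str, set[str]] = {
    "users": {"ip_address", "registration_ip_address"},
    "user_emails": {"email"},
}


def scrub_row(table: str, row: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the row with sensitive columns set to None."""
    sensitive = SENSITIVE_COLUMNS.get(table, set())
    return {col: (None if col in sensitive else val) for col, val in row.items()}


def scrub_table(table: str, rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Scrub every row of a table; the input rows are left untouched."""
    return [scrub_row(table, row) for row in rows]
```

Tables that are not listed pass through unchanged, so the list of sensitive columns has to be kept up to date as the schema evolves.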


That’s another solution to login.

Oh, and regular users don't have to approve email address changes if they're logged in, so a dump with email addresses removed would be OK for all users who can log in using whatever credentials are in the database (a password is the easiest).

A plugin that either stripped the email addresses or encrypted them (I think I know how to do that with reasonable ease) could solve the problem.

In a plugin I’ve encrypted some fields like this:

I think that it might be possible to override the UserEmail model similarly and encrypt the email addresses. There isn’t much code in the UserEmail model and I suspect it changes very infrequently, so it might not be too dangerous a change. Or it might not work at all.

Filtering out IP addresses might be a bit more tricky, as I think it’d be hard to override the user model. For that you might create a plugin that removes those IPs one way or another.


@Falco and @pfaffman thanks a lot for your feedback and advice.

We will investigate Webauthn to see if we can follow that path, same for your plugin @pfaffman.
I will get back to you in a couple of days with some comments, questions, or conclusions.


Another possibility on the horizon is the use of an augmented PAKE (Password-Authenticated Key Exchange), so that the password is both practically unrecoverable from the database and never touches the wire.

Unfortunately, they’re all still solidly in the realm of experimental cryptography, and not ready for easy deployment. iOS iCloud sync uses a PAKE.


If you want to keep the email/password login, but in a way that lets anyone enter the account of any user in the backed-up database, you can achieve that by generating an email address based on the username for every user, like <username>@email.invalid.
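As a rough sketch of that substitution, assuming users are available as dictionaries with a `username` key (the field names here are hypothetical, not taken from the Discourse schema):

```python
def placeholder_email(username: str) -> str:
    """Build a non-deliverable address from a username.

    The .invalid top-level domain is reserved by RFC 2606, so these
    addresses can never receive mail.
    """
    return f"{username}@email.invalid"


def anonymize_emails(users: list[dict]) -> list[dict]:
    """Return copies of the users with their email replaced by a placeholder."""
    return [{**user, "email": placeholder_email(user["username"])} for user in users]
```

Because usernames are already public on the forum, the generated addresses reveal nothing new.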

For passwords and logins, assuming that Discourse stores salted password hashes (I haven't looked, but I assume so), you can set a known password such as 123456 on an account in your live database (a fake account, or your own if you change the password back afterwards) and note the resulting password hash and salt. Then run a statement in the new (cloned) database that sets every user's password hash and salt to those values. Every user will then have the same hash and salt and, consequently, the same password, so for a user foo you could log in with the email foo@email.invalid and the password 123456.
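The reason this works is that a salted hash depends only on the password and the salt, so copying one (hash, salt) pair to every account gives every account the same password. A small sketch of that idea, assuming salted PBKDF2-HMAC-SHA256; the iteration count and the `password_hash` and `salt` field names are assumptions, and instead of reading the pair back from the live database this sketch computes it directly, which only matches a real forum if the hashing scheme is identical:

```python
import hashlib
import secrets

ITERATIONS = 100_000  # assumed value; a real clone must match the forum's setting


def hash_password(password: str, salt: str, iterations: int = ITERATIONS) -> str:
    """Salted PBKDF2-HMAC-SHA256 hash, hex-encoded."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt.encode(), iterations)
    return digest.hex()


def share_credentials(users: list[dict], password: str) -> list[dict]:
    """Give every user the same (hash, salt) pair, hence the same password."""
    salt = secrets.token_hex(16)
    password_hash = hash_password(password, salt)
    return [{**user, "password_hash": password_hash, "salt": salt} for user in users]


def check_password(user: dict, attempt: str) -> bool:
    """Verify a login attempt against the stored hash and salt."""
    expected = user["password_hash"]
    return secrets.compare_digest(hash_password(attempt, user["salt"]), expected)
```

Reading the pair from a throwaway account in the live database, as described above, avoids having to know the exact hashing parameters.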

Other than that, you might want to delete the private messages (if they are not needed), because they may have sensitive data.

And finally, it's good to look for other fields that may hold confidential data, like the (admin) settings.
