Problematic IP address fields

Continuing the discussion from Providing data for GDPR:

I did a first pass over Discourse’s tables, and I found several places where IP addresses are being accidentally correlated with user IDs. This is toxic data generating liability for Discourse forums.

List of problematic IP address fields:

  • :white_check_mark: :x: incoming_links: stores timestamped IP address correlated with user ID and an exact post ID, topic ID, and Referer: header
    Fixed Storage: PR#5826
  • :white_check_mark: :white_check_mark: search_logs: stores timestamped IP address correlated with user ID and exact search term
    Fixed Storage, Retention: PR#5851
  • :white_check_mark: topic_link_clicks: stores timestamped IP address correlated with user ID and clicked link
    Fixed Storage: PR#5852
  • :white_check_mark: topic_views: stores timestamped IP address
    Fixed Storage: PR#5850
  • user_profile_views: stores timestamped IP address

Non-problematic IP address fields:

  • user_auth_tokens, user_auth_token_logs: stores timestamped IP address correlated with user ID and device identifier
    • Data purged periodically, but :warning: there should be a conditional notice in the Privacy Policy if verbose auth token logs are enabled
  • screened_email, screened_url, screened_ip_addresses: Only created when a user is banned for being a spammer.
  • api_keys: List of IPs inputted by admin.

Filing this as a privacy-bug.

8 Likes

You mean like this?

I am unclear what you mean by “accidental”?

  • incoming links are from a remote client that may or may not be logged in, in which case IP is relevant

  • searches are from a remote client that may or may not be logged in, in which case IP is relevant

  • outgoing links are clicked by a remote client that may or may not be logged in, in which case IP is relevant

I’ll also remind you the definition of bug, which is listed here,

A bug report means something is broken, preventing normal/typical use of Discourse

and recategorize this.

Correct, but from what I can tell the IP is being saved even if you’re logged in. The IP is shown nowhere in the UI and has no data retention policies attached (a .cleanup! method, etc).

That’s why I called that a bug but the topic_view not.

Although it’s not ‘broken’, this is something that will be preventing normal use of Discourse, since storing IP addresses without consent or a good reason is going to be illegal in a significant part of the world really soon.

Just for my understanding, how is storing the IP address relevant for incoming links and searches?

I do understand the outgoing links case, (although using a cookie would be a cleaner solution).

4 Likes

Yes, just like the EU’s cookie law prevented use of websites without a cookie notice. Total jurisdictional destruction, worldwide, immediately preventing use of every single website without a cookie notice. :woman_facepalming:

If the argument is that the IP isn’t necessary, that is fine, but filing it as a bug will be met with extreme resistance.

2 Likes

Actually. Maybe you don’t get to see them, but almost every EU-based site that I visit for the first time does show a cookie warning.

Companies wanting to use Discourse are asking (us) questions about this, and they will choose forum software that is compliant, simply because it’s a checklist item for the legal department.

Don’t forget that GDPR is pretty forgiving: as long as you communicate well and as long as you have a good use for things, you can get away with a lot. So that is why I am genuinely asking what those IP addresses are used for in case of incoming links and search.

3 Likes

I do understand that it is hard to feel the sentiment in Europe here if you’re not actually in Europe. Maybe a good comparison is to look at how a lot of US-based online services are responding to SESTA right now: some of them are even completely or partially closing down, just because they don’t know how to comply.

Well, GDPR is not causing that much panic, but people do want to be sure that they’re compliant. And as long as they’re not sure, they’re not even starting a forum.

4 Likes

I am happy to strip the IP address for logged-in users in those 3 cases. I don’t see it as adding any extra value anyway, because we have user_id.

If we want a log of all “historic” IP addresses a user has had, we need a different table for that. Storing IP addresses at random for logged-in users is pointless.

Seems like an easy change to me: make ip_address nullable on the table, and then scrub.

I feel this whole line of argument is dangerous and bad.

  1. Remove IP address logging from X tables because it is pointless and adds no value and random liability.

VS

  2. Remove IP address logging for table because reasons.

(1) is a much stronger and more valuable argument that applies universally. GDPR is intended to protect privacy: demonstrate how privacy is potentially impacted, and then make a case for the change.

15 Likes

I feel so too, it was more like a sidestep (that’s why I made a separate post).
It was merely a response to the EU Cookie Law being brought into this discussion.

Back on the topic -

It’s not about scrubbing them, it’s about not storing them in the first place.


I’m still wondering about the reason for storing IP addresses for incoming links and search logs?

Scrub relates to the migration that adds the feature; new rows should store either user id or IP, not both.
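The two halves of this (not storing both values on new rows, and scrubbing the rows that already hold both) can be sketched roughly as follows. This is only an illustration against an in-memory SQLite table: Discourse itself is a Rails application on PostgreSQL, and the `log_search` and `scrub` helpers, as well as the reduced `search_logs` schema, are hypothetical stand-ins rather than Discourse code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical schema: ip_address is nullable so it can be omitted.
conn.execute(
    "CREATE TABLE search_logs ("
    "id INTEGER PRIMARY KEY, term TEXT, user_id INTEGER, ip_address TEXT)"
)


def log_search(conn, term, user_id=None, ip_address=None):
    """Forward-looking rule: store user_id or ip_address, never both."""
    if user_id is not None:
        ip_address = None
    conn.execute(
        "INSERT INTO search_logs (term, user_id, ip_address) VALUES (?, ?, ?)",
        (term, user_id, ip_address),
    )


def scrub(conn):
    """Historical cleanup: drop the IP wherever a user_id is already present."""
    cur = conn.execute(
        "UPDATE search_logs SET ip_address = NULL "
        "WHERE user_id IS NOT NULL AND ip_address IS NOT NULL"
    )
    return cur.rowcount


# A legacy row written before the change, holding both values.
conn.execute(
    "INSERT INTO search_logs (term, user_id, ip_address) VALUES (?, ?, ?)",
    ("gdpr", 7, "203.0.113.5"),
)
log_search(conn, "privacy", user_id=7, ip_address="203.0.113.5")  # IP dropped
log_search(conn, "privacy", ip_address="198.51.100.9")  # anonymous: IP kept

scrubbed = scrub(conn)
rows = conn.execute(
    "SELECT user_id, ip_address FROM search_logs ORDER BY id"
).fetchall()
```

After the scrub, only the anonymous row still carries an IP address.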

2 Likes

And even if you are in Europe, you only feel the sentiment if you are in a big org or are active within online marketing, recruiting etc.
But within these fields, it is VERY much a topic du jour. I just counted 20+ meetups about GDPR in April within a two-hour drive: https://www.eventbrite.com/d/netherlands--amsterdam/gdpr/

If Discourse has an official page, with a big fat green checkmark next to GDPR, that could be quite good for adoption.

3 Likes

“Not storing them in the first place” is a forward-looking change, “scrubbing” is removing the historical data.

Same as the TopicViewItem code, it’s used as a uniqueness measure so only one per day is counted from the same IP.

Could probably use hashed IPs, too.
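As a rough illustration of that uniqueness measure (not the actual TopicViewItem code), the sketch below counts a view at most once per topic, viewer key, and day. The `DailyViewCounter` class is hypothetical, and `viewer_key` is an assumption: it could be a raw IP, a hashed IP, or a user id, since the counter only needs equality.

```python
from datetime import date


class DailyViewCounter:
    """Counts at most one view per (topic, viewer key, day)."""

    def __init__(self):
        self._seen = set()
        self.counts = {}

    def record(self, topic_id, viewer_key, day):
        key = (topic_id, viewer_key, day)
        if key in self._seen:
            return False  # already counted today
        self._seen.add(key)
        self.counts[topic_id] = self.counts.get(topic_id, 0) + 1
        return True


counter = DailyViewCounter()
first = counter.record(1, "198.51.100.9", date(2018, 4, 1))   # counted
repeat = counter.record(1, "198.51.100.9", date(2018, 4, 1))  # same day: ignored
next_day = counter.record(1, "198.51.100.9", date(2018, 4, 2))  # counted again
```

Because only equality of the key matters here, swapping the raw IP for a hash of it would not change the counts.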

2 Likes

Not for rolling up IP bans, we can’t. You need the actual IP for that.

But for view counting, click counting, search logs… you absolutely can.

Basically everything except the users table and the screened_ table could be hashed IPs.

And api_keys because that’s configuration, not logging

1 Like

Possibly, but unless you are volunteering to do the engineering work and regression testing for free, there’s substantial work there for marginal upside.

Probably easier to focus on the low hanging fruit here of stuff that’s easy to do and moves us toward the goal.

2 Likes

Hashed IPs can be brute forced fairly easily (by calculating the hash of all 2^32 IPv4 addresses and finding the one that matches the relevant hash in the database). Maybe there isn’t much difference between storing a real IP, and a hashed IP?
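To make the brute-force concern concrete, here is a small sketch. It assumes the stored value is an unsalted SHA-256 of the dotted-quad string (an assumption for illustration, not something Discourse does), and it searches only a /16 so that it runs quickly. The full 2^32 space is the same loop, just longer.

```python
import hashlib
import ipaddress


def hash_ip(ip):
    """Unsalted hash of the textual IP, as a hypothetical stored value."""
    return hashlib.sha256(ip.encode("ascii")).hexdigest()


def brute_force(target_hash, network):
    """Try every address in the network until one hashes to the target."""
    for addr in ipaddress.ip_network(network):
        candidate = str(addr)
        if hash_ip(candidate) == target_hash:
            return candidate
    return None


stored = hash_ip("203.0.113.77")
recovered = brute_force(stored, "203.0.0.0/16")  # 65,536 candidates
```

The search recovers the original address, which is why an unsalted hash of an IPv4 address offers little more protection than the address itself.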

But what if a secret salt is included in the hash, and the secret salt is forgotten and replaced with a new secret salt each month? Then old IP hashes from within the same month could still be compared with each other. Let’s say some accounts were created a few years ago (edit: or months ago), and now suddenly become active and start misbehaving. Then one can see if they likely belong to the same person, and look up even more accounts by that person, by looking up the IP hash. But since the salt was forgotten, it would no longer be possible to “reverse” it and find the real IP. Both privacy and a bit of security, at the same time.

(Probably not a good idea to do now, because it would be complicated. But maybe good to have in mind … if starting to think about this again … some years later)
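A minimal sketch of the rotating-salt idea might look like the following. It uses HMAC-SHA256 with a random secret key instead of plain salt concatenation, which is a substitution on my part; the `RotatingIpHasher` class and the `"YYYY-MM"` period labels are hypothetical.

```python
import hashlib
import hmac
import secrets


class RotatingIpHasher:
    """Keyed IP hashing where the key is discarded when the period changes."""

    def __init__(self):
        self._period = None
        self._key = None

    def _key_for(self, period):
        if period != self._period:
            # Overwrite the old key: hashes from earlier periods can no
            # longer be recomputed, so they cannot be brute forced either.
            self._key = secrets.token_bytes(32)
            self._period = period
        return self._key

    def hash_ip(self, ip, period):
        key = self._key_for(period)
        return hmac.new(key, ip.encode("ascii"), hashlib.sha256).hexdigest()


hasher = RotatingIpHasher()
april_a = hasher.hash_ip("203.0.113.77", "2018-04")
april_b = hasher.hash_ip("203.0.113.77", "2018-04")
may = hasher.hash_ip("203.0.113.77", "2018-05")
```

Within one period the same IP always maps to the same hash, so rows can be correlated. Across periods the hashes differ, which means comparisons only work between rows hashed in the same month.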

1 Like

I actually doubt that an IP address has a lot of identifying value nowadays, with mobile internet, wifi and dynamic IPs, especially over a multi-year period.

Yes, but anything more accurate is even worse for privacy, so…

1 Like

Until any changes are made to Discourse, would there be any issue with me regularly running a query on my server to remove the problematic IP addresses (i.e. those where there is also a user_id) from incoming_links, search_logs, and topic_link_clicks?

1 Like

I think it’s the other way around. If there is a user id, you at least had the opportunity to ask for consent. I think the IP addresses without user ID are more problematic, since they belong to unknown people who were just passing by, and never gave permission to store any personal data at all.

To answer your question, I don’t think it will cause issues. Don’t forget to clean up your access logs as well.
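For the access logs, one possible approach is to truncate the client address rather than delete whole lines. The sketch below assumes a log format where the client IP is the first space-separated field (as in the common nginx and Apache formats); the `anonymize_line` helper and the /24 and /48 prefix lengths are illustrative choices, and whether truncation is sufficient is a legal question this sketch does not answer.

```python
import ipaddress


def anonymize_line(line):
    """Zero the host part of the leading client IP in one access log line."""
    first, sep, rest = line.partition(" ")
    try:
        addr = ipaddress.ip_address(first)
    except ValueError:
        return line  # first field is not an IP: leave the line untouched
    prefix = 24 if addr.version == 4 else 48  # assumed truncation lengths
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return f"{network.network_address}{sep}{rest}"


line = '203.0.113.77 - - [01/Apr/2018:12:00:00 +0000] "GET /t/1 HTTP/1.1" 200 512'
masked = anonymize_line(line)
```

Lines whose first field is not an IP address pass through unchanged.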

1 Like