Problematic IP address fields

gdpr
privacy

(Kane York) #1

Continuing the discussion from Providing data for GDPR:

I did a first pass over Discourse’s tables, and I found several places where IP addresses are being accidentally correlated with user IDs. This is toxic data generating liability for Discourse forums.

List of problematic IP address fields:

  • :white_check_mark: :x: incoming_links: stores timestamped IP address correlated with user ID and an exact post ID, topic ID, and Referer: header
    Fixed Storage: PR#5826
  • :white_check_mark: :white_check_mark: search_logs: stores timestamped IP address correlated with user ID and exact search term
    Fixed Storage, Retention: PR#5851
  • :white_check_mark: topic_link_clicks: stores timestamped IP address correlated with user ID and clicked link
    Fixed Storage: PR#5852
  • :white_check_mark: topic_views: stores timestamped IP address
    Fixed Storage: PR#5850
  • user_profile_views: stores timestamped IP address

Non-problematic IP address fields:

  • user_auth_tokens, user_auth_token_logs: stores timestamped IP address correlated with user ID and device identifier
    • Data purged periodically, but :warning: there should be a conditional notice in the Privacy Policy if verbose auth token logs are enabled
  • screened_email, screened_url, screened_ip_addresses: Only created when a user is banned for being a spammer.
  • api_keys: List of IPs inputted by admin.

Filing this as a privacy-bug.


GDPR countdown and compliance
Legal Tools Plugin
(Jeff Atwood) #2

You mean like this?

I am unclear what you mean by “accidental”?

  • incoming links are from a remote client that may or may not be logged in, in which case IP is relevant

  • searches are from a remote client that may or may not be logged in, in which case IP is relevant

  • outgoing links are clicked by a remote client that may or may not be logged in, in which case IP is relevant

I’ll also remind you the definition of bug, which is listed here,

A bug report means something is broken, preventing normal/typical use of Discourse

and recategorize this.


(Kane York) #3

Correct, but from what I can tell the IP is being saved even if you’re logged in. The IP is shown nowhere in the UI and has no data retention policies attached (a .cleanup! method, etc).

That’s why I called that a bug but the topic_view not.


(Richard - DiscourseHosting.com) #4

Although it’s not ‘broken’, this is something that will be preventing normal use of Discourse, since storing IP addresses without consent or a good reason is going to be illegal in a significant part of the world really soon.

Just for my understanding, how is storing the IP address relevant for incoming links and searches?

I do understand the outgoing links case, (although using a cookie would be a cleaner solution).


(Jeff Atwood) #5

Yes, just like the EU’s cookie law prevented use of websites without a cookie notice. Total jurisdictional destruction, worldwide, immediately preventing use of every single website without a cookie notice. :woman_facepalming:

If the argument is that the IP isn’t necessary, that is fine, but filing it as a bug will be met with extreme resistance.


(Richard - DiscourseHosting.com) #6

Actually. Maybe you don’t get to see them, but almost every EU-based site that I visit for the first time does show a cookie warning.

Companies wanting to use Discourse are asking (us) questions about this, and they will choose forum software that is compliant, simply because it’s a checklist item for the legal department.

Don’t forget that GDPR is pretty forgiving: as long as you communicate well and as long as you have a good use for things, you can get away with a lot. So that is why I am genuinely asking what those IP addresses are used for in case of incoming links and search.


(Richard - DiscourseHosting.com) #7

I do understand that it is hard to feel the sentiment in Europe here if you’re not actually in Europe. Maybe a good comparison is to look how a lot of US based online services are responding to SESTA right now: some of them are even completely or partially closing down, just because they don’t know how to comply.

Well, GDPR is not causing that much panic, but people do want to be sure that they’re compliant. And as long as they’re not sure, they’re not even starting a forum.


(Sam Saffron) #8

I am happy to strip IP address for logged on users for those 3 cases, I don’t see it as adding any extra value anyway cause we have user_id.

If we want a log of all “historic” ip addresses a user had we need a different table for that. Storing IP addresses at random for logged on users is pointless.

Seems like an easy change to me: make ip_address nullable on the table, and then scrub.

I feel this whole line of argument is dangerous and bad.

  1. Remove IP address logging from X tables because it is pointless and adds no value and random liability.

VS

  2. Remove IP address logging for table because reasons.

(1) is a much, much stronger and more valuable argument that applies universally. GDPR is intended to protect privacy: demonstrate how privacy is potentially impacted and then make a case for the change.


(Richard - DiscourseHosting.com) #9

I feel so too, it was more like a sidestep (that’s why I made a separate post).
It was merely a response to the EU Cookie Law being brought into this discussion.

Back on the topic -

It’s not about scrubbing them, it’s about not storing them in the first place.


I’m still wondering about the reason for storing IP addresses for incoming links and search logs?


(Sam Saffron) #10

Scrub relates to the migration that adds the feature. New rows should store either user id or IP, not both.
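As a rough illustration of the "user id or IP, not both" rule for new rows, here is a minimal sketch. The function name `link_row` and the dict shape are hypothetical and are not Discourse code; the only assumption taken from the thread is that the IP column is nullable.

```python
from typing import Optional


def link_row(user_id: Optional[int], ip_address: str) -> dict:
    """Build a log row that keeps the IP only for anonymous visitors.

    Hypothetical helper for illustration: a logged-in visitor is already
    identified by user_id, so the IP is dropped before the row is stored.
    """
    if user_id is not None:
        return {"user_id": user_id, "ip_address": None}
    return {"user_id": None, "ip_address": ip_address}


logged_in = link_row(42, "203.0.113.7")
anonymous = link_row(None, "203.0.113.7")
```

With this rule a row never carries both identifiers, so the scrub migration only has to deal with rows written before the change.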


(Bas van Leeuwen) #11

And even if you are in Europe, you only feel the sentiment if you are in a big org or are active within online marketing, recruiting etc.
But within these fields, it is VERY much a topic du jour. I just counted 20+ meetups about GDPR in April within a 2 hours drive: https://www.eventbrite.com/d/netherlands–amsterdam/gdpr/

If discourse has an official page, with a big fat green checkmark next to GDPR, that could be quite good for adoption.


(Kane York) #12

“Not storing them in the first place” is a forward-looking change, “scrubbing” is removing the historical data.

Same as the TopicViewItem code, it’s used as a uniqueness measure so only one per day is counted from the same IP.

Could probably use hashed IPs, too.
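A minimal sketch of that idea, counting at most one view per topic per day per visitor while keeping only a hash of the IP. The class and method names are hypothetical and this is not the actual TopicViewItem code; a plain unsalted SHA-256 is used only to keep the example short.

```python
import hashlib
from datetime import date


class DailyViewCounter:
    """Counts one view per (topic, day, visitor) without keeping raw IPs."""

    def __init__(self) -> None:
        self.seen = set()   # (topic_id, day, hashed visitor key)
        self.views = {}     # topic_id -> number of counted views

    @staticmethod
    def visitor_key(ip_address: str) -> str:
        # Only the digest is kept; the raw address is never stored.
        return hashlib.sha256(ip_address.encode("utf-8")).hexdigest()

    def record(self, topic_id: int, ip_address: str, day: date) -> bool:
        """Return True if the view was counted, False if it was a repeat."""
        key = (topic_id, day, self.visitor_key(ip_address))
        if key in self.seen:
            return False
        self.seen.add(key)
        self.views[topic_id] = self.views.get(topic_id, 0) + 1
        return True


counter = DailyViewCounter()
first = counter.record(1, "198.51.100.4", date(2018, 5, 1))
repeat = counter.record(1, "198.51.100.4", date(2018, 5, 1))
next_day = counter.record(1, "198.51.100.4", date(2018, 5, 2))
```

The uniqueness check only ever compares keys for equality, which is why a hash can stand in for the address here.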


(Jeff Atwood) #13

Not for rolling up IP bans, we can’t. You need the actual IP for that.


(Kane York) #14

But for view counting, click counting, search logs… you absolutely can.

Basically everything except the users table and the screened_ table could be hashed IPs.

And api_keys because that’s configuration, not logging


(Jeff Atwood) #15

Possibly, but unless you are volunteering to do the engineering work and regression testing for free, there’s substantial work there for marginal upside.

Probably easier to focus on the low hanging fruit here of stuff that’s easy to do and moves us toward the goal.


(KajMagnus) #16

Hashed IPs can be brute forced fairly easily (by calculating the hash of all 2^32 IPs and finding the one that matches the relevant hash in the database). Maybe there isn’t much difference between storing a real IP, and a hashed IP?

But what if a secret salt is included in the hash? And the secret salt is forgotten & replaced with a new secret salt each month? Then old IP hashes from within the same month could still be compared with each other. Let’s say some accounts were created a few years ago (edit: or months ago), and now suddenly become active and start misbehaving. Then one can see if they likely belong to the same person, and look up even more accounts by that person, by looking up the IP hash. But, since the salt was forgotten, it would no longer be possible to “reverse” it and find the real IP. Both privacy and a bit of security, at the same time.

(Probably not a good idea to do now, because it would be complicated. But maybe good to have in mind … if starting to think about this again … some years later)
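To make both points concrete, here is a small sketch. It brute-forces an unsalted hash over a /24 instead of the full 2^32 space so that it runs instantly (the full scan is the same loop, only longer), and then shows a keyed hash with a per-month secret. All names are hypothetical, and HMAC-SHA256 is my assumption for how the "secret salt" could be mixed in.

```python
import hashlib
import hmac
import ipaddress
import secrets
from typing import Optional


def plain_hash(ip: str) -> str:
    """Unsalted hash of an IP address."""
    return hashlib.sha256(ip.encode("utf-8")).hexdigest()


def brute_force(target_hash: str, network: str) -> Optional[str]:
    """Hash every address in `network` and return the one that matches."""
    for addr in ipaddress.ip_network(network):
        if plain_hash(str(addr)) == target_hash:
            return str(addr)
    return None


def keyed_hash(ip: str, monthly_secret: bytes) -> str:
    """Hash that cannot be recomputed once monthly_secret is discarded."""
    return hmac.new(monthly_secret, ip.encode("utf-8"), hashlib.sha256).hexdigest()


# An unsalted hash is recovered by simply trying the candidates.
leaked = plain_hash("192.0.2.55")
recovered = brute_force(leaked, "192.0.2.0/24")

# With a secret that is rotated each month, hashes match within a month
# but not across months, and cannot be brute-forced without the secret.
may_secret = secrets.token_bytes(32)
june_secret = secrets.token_bytes(32)
same_month = keyed_hash("192.0.2.55", may_secret) == keyed_hash("192.0.2.55", may_secret)
across_months = keyed_hash("192.0.2.55", may_secret) == keyed_hash("192.0.2.55", june_secret)
```

The keyed variant only protects old rows if the previous month's secret really is deleted everywhere, backups included.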


(Richard - DiscourseHosting.com) #17

I actually doubt that an IP address has a lot of identifying value nowadays, with mobile internet, wifi and dynamic IPs, especially over a multi-year period.


(Kane York) #19

Yes, but anything more accurate is even worse for privacy, so…


#20

Until any changes are made to discourse, would there be any issue with me regularly running a query on my server to remove the problematic IP addresses (i.e. those where there is also a user_id) from incoming_links, search_logs, and topic_link_clicks?


(Richard - DiscourseHosting.com) #21

I think it’s the other way around. If there is a user id, you at least had the opportunity to ask for consent. I think the IP addresses without user ID are more problematic, since they belong to unknown people who were just passing by, and never gave permission to store any personal data at all.

To answer your question, I don’t think it will cause issues. Don’t forget to clean up your access logs as well.
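For reference, the periodic cleanup asked about in #20 could look roughly like the sketch below. It runs against an in-memory SQLite stand-in so that it is self-contained; a real Discourse install uses PostgreSQL. The column names `user_id` and `ip_address` and their nullability are assumptions based on this thread, so check your schema (and take a backup) before running anything similar on a live database.

```python
import sqlite3

# Tables named in the question; the column names are assumed, not verified.
TABLES = ("incoming_links", "search_logs", "topic_link_clicks")


def scrub_logged_in_ips(conn: sqlite3.Connection) -> int:
    """Null out ip_address on rows that also have a user_id.

    Returns the number of rows changed. Rows without a user_id are left
    alone, since the IP is their only visitor key.
    """
    total = 0
    for table in TABLES:
        cur = conn.execute(
            f"UPDATE {table} SET ip_address = NULL "
            "WHERE user_id IS NOT NULL AND ip_address IS NOT NULL"
        )
        total += cur.rowcount
    conn.commit()
    return total


# Stand-in data: one logged-in row and one anonymous row per table.
conn = sqlite3.connect(":memory:")
for table in TABLES:
    conn.execute(
        f"CREATE TABLE {table} "
        "(id INTEGER PRIMARY KEY, user_id INTEGER, ip_address TEXT)"
    )
    conn.execute(f"INSERT INTO {table} (user_id, ip_address) VALUES (1, '203.0.113.9')")
    conn.execute(f"INSERT INTO {table} (user_id, ip_address) VALUES (NULL, '203.0.113.10')")

scrubbed = scrub_logged_in_ips(conn)
remaining = conn.execute(
    "SELECT COUNT(*) FROM search_logs WHERE ip_address IS NOT NULL"
).fetchone()[0]
```

This only covers the logged-in rows from the question. Whether the anonymous rows should also be aged out is the separate retention point raised in the reply above.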