Problematic IP address fields

riking · March 25, 2018, 7:06am

Continuing the discussion from Providing data for GDPR:

I did a first pass over Discourse’s tables, and I found several places where IP addresses are being accidentally correlated with user IDs. This is toxic data generating liability for Discourse forums.

List of problematic IP address fields:

incoming_links: stores timestamped IP address correlated with user ID and an exact post ID, topic ID, and Referer: header
- Fixed Storage: PR#5826
search_logs: stores timestamped IP address correlated with user ID and exact search term
- Fixed Storage, Retention: PR#5851
topic_link_clicks: stores timestamped IP address correlated with user ID and clicked link
- Fixed Storage: PR#5852
topic_views: stores timestamped IP address
- Fixed Storage: PR#5850
user_profile_views: stores timestamped IP address

Non-problematic IP address fields:

user_auth_tokens, user_auth_token_logs: stores timestamped IP address correlated with user ID and device identifier
- Data purged periodically, but there should be a conditional notice in the Privacy Policy if verbose auth token logs are enabled
screened_email, screened_url, screened_ip_addresses: Only created when a user is banned for being a spammer.
api_keys: List of IPs inputted by admin.

Filing this as a privacy-bug.

codinghorror · March 25, 2018, 8:14am

You mean like this?

I am unclear what you mean by “accidental”?

incoming links are from a remote client that may or may not be logged in, in which case IP is relevant
searches are from a remote client that may or may not be logged in, in which case IP is relevant
outgoing links are clicked by a remote client that may or may not be logged in, in which case IP is relevant

I’ll also remind you of the definition of a bug, which is listed here:

A bug report means something is broken, preventing normal/typical use of Discourse

and recategorize this.

riking · March 25, 2018, 8:27am

Correct, but from what I can tell the IP is being saved even if you’re logged in. The IP is shown nowhere in the UI and has no data retention policies attached (a .cleanup! method, etc).

That’s why I called that a bug but the topic_view not.

RGJ · March 25, 2018, 8:54am

Although it’s not ‘broken’, this is something that will be preventing normal use of Discourse, since storing IP addresses without consent or a good reason is going to be illegal in a significant part of the world really soon.

Just for my understanding, how is storing the IP address relevant for incoming links and searches?

I do understand the outgoing links case, (although using a cookie would be a cleaner solution).

codinghorror · March 25, 2018, 10:00am

Yes, just like the EU’s cookie law prevented use of websites without a cookie notice. Total jurisdictional destruction, worldwide, immediately preventing use of every single website without a cookie notice.

If the argument is that the IP isn’t necessary, that is fine, but filing it as a bug will be met with extreme resistance.

RGJ · March 25, 2018, 10:17am

Actually. Maybe you don’t get to see them, but almost every EU-based site that I visit for the first time does show a cookie warning.

Companies wanting to use Discourse are asking (us) questions about this, and they will choose forum software that is compliant, simply because it’s a checklist item for the legal department.

Don’t forget that GDPR is pretty forgiving: as long as you communicate well and as long as you have a good use for things, you can get away with a lot. So that is why I am genuinely asking what those IP addresses are used for in case of incoming links and search.

RGJ · March 25, 2018, 10:23am

I do understand that it is hard to feel the sentiment in Europe here if you’re not actually in Europe. Maybe a good comparison is to look at how a lot of US-based online services are responding to SESTA right now: some of them are even completely or partially closing down, just because they don’t know how to comply.

Well, GDPR is not causing that much panic, but people do want to be sure that they’re compliant. And as long as they’re not sure, they’re not even starting a forum.

sam · March 25, 2018, 11:47pm

I am happy to strip IP address for logged on users for those 3 cases, I don’t see it as adding any extra value anyway cause we have user_id.

If we want a log of all “historic” ip addresses a user had we need a different table for that. Storing IP addresses at random for logged on users is pointless.

Seems like an easy change to me: make ip_address nullable on the table, and then scrub.

I feel this whole line of argument is dangerous and bad.

(1) Remove IP address logging from X tables because it is pointless and adds no value and random liability.

VS

(2) Remove IP address logging for table because reasons.

(1) is a much, much stronger and more valuable argument that applies universally. GDPR is intended to protect privacy: demonstrate how privacy is potentially impacted and then make a case for the change.

RGJ · March 26, 2018, 8:05am

I feel so too, it was more like a sidestep (that’s why I made a separate post).
It was merely a response to the EU Cookie Law being brought into this discussion.

Back on the topic -

It’s not about scrubbing them, it’s about not storing them in the first place.

I’m still wondering about the reason for storing IP addresses for incoming links and search logs?

sam · March 26, 2018, 8:18am

Scrub relates to the migration that adds the feature; new rows should store either the user id or the IP, not both.
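
A minimal sketch of that rule for new rows, written in Python for illustration rather than as Discourse's actual Ruby code. The helper name `identity_columns` and the dictionary shape are hypothetical; only the column names `user_id` and `ip_address` come from this thread.

```python
from typing import Optional, TypedDict


class LogRow(TypedDict):
    user_id: Optional[int]
    ip_address: Optional[str]


def identity_columns(user_id: Optional[int], ip_address: str) -> LogRow:
    """Pick the identifying columns for a new log row: user id or IP, never both."""
    if user_id is not None:
        # Logged-in request: the user id already identifies the actor,
        # so the IP is dropped before the row is written.
        return {"user_id": user_id, "ip_address": None}
    # Anonymous request: the IP is the only identifier available.
    return {"user_id": None, "ip_address": ip_address}
```

This only works if `ip_address` is nullable on the table, which is the schema change suggested earlier in the thread.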

Bas · March 26, 2018, 9:11am

And even if you are in Europe, you only feel the sentiment if you are in a big org or are active within online marketing, recruiting etc.
But within these fields, it is VERY much a topic du jour. I just counted 20+ meetups about GDPR in April within a two-hour drive: https://www.eventbrite.com/d/netherlands--amsterdam/gdpr/

If Discourse has an official page, with a big fat green checkmark next to GDPR, that could be quite good for adoption.

riking · March 26, 2018, 9:33pm

“Not storing them in the first place” is a forward-looking change, “scrubbing” is removing the historical data.

Same as the TopicViewItem code, it’s used as a uniqueness measure so only one per day is counted from the same IP.
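
Roughly, the uniqueness idea looks like the sketch below. This is an illustration in Python, not the actual TopicViewItem code: `ViewCounter` and its fields are hypothetical, and keying logged-in viewers by user id instead of IP is an assumption in line with the rest of this thread.

```python
from datetime import date
from typing import Dict, Optional, Set, Tuple

ViewKey = Tuple[int, date, str]


class ViewCounter:
    """Counts at most one view per topic, per day, per viewer."""

    def __init__(self) -> None:
        self._seen: Set[ViewKey] = set()
        self.counts: Dict[int, int] = {}

    def record(self, topic_id: int, viewed_on: date,
               user_id: Optional[int], ip_address: str) -> bool:
        # Assumption: a logged-in viewer is identified by user id, so the
        # IP is only needed as a uniqueness key for anonymous viewers.
        viewer = f"user:{user_id}" if user_id is not None else f"ip:{ip_address}"
        key = (topic_id, viewed_on, viewer)
        if key in self._seen:
            return False  # already counted today
        self._seen.add(key)
        self.counts[topic_id] = self.counts.get(topic_id, 0) + 1
        return True
```

Because the IP is only used as an equality key here, any stable stand-in for it would serve the same purpose.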

Could probably use hashed IPs, too.

codinghorror · March 26, 2018, 9:58pm

Not for rolling up IP bans, we can’t. You need the actual IP for that.

riking · March 26, 2018, 10:06pm

But for view counting, click counting, search logs… you absolutely can.

Basically everything except the users table and the screened_ table could be hashed IPs.

And api_keys because that’s configuration, not logging

codinghorror · March 26, 2018, 10:07pm

Possibly, but unless you are volunteering to do the engineering work and regression testing for free, there’s substantial work there for marginal upside.

Probably easier to focus on the low hanging fruit here of stuff that’s easy to do and moves us toward the goal.

KajMagnus · March 27, 2018, 8:56am

Hashed IPs can be brute forced fairly easily (by calculating the hash of all 2^32 IPs and finding the one that matches the relevant hash in the database). Maybe there isn’t much difference between storing a real IP, and a hashed IP?
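
A small sketch of that brute-force attack, assuming an unsalted SHA-256 of the dotted-quad string (the hash choice and the function names are illustrative assumptions, not anything Discourse does). It searches a single /16 so it runs quickly; the same loop over `0.0.0.0/0` covers all 2^32 addresses.

```python
import hashlib
import ipaddress
from typing import Optional


def hash_ip(ip: str) -> str:
    """Unsalted hash of an IPv4 address (assumed scheme, for illustration)."""
    return hashlib.sha256(ip.encode("ascii")).hexdigest()


def brute_force(target_hash: str, network: str) -> Optional[str]:
    """Hash every address in `network` and return the one that matches."""
    for addr in ipaddress.ip_network(network):
        candidate = str(addr)
        if hash_ip(candidate) == target_hash:
            return candidate
    return None


# A hash as it might sit in a database table.
leaked = hash_ip("198.51.100.23")
# 65,536 candidates are enough to recover the address from this /16.
recovered = brute_force(leaked, "198.51.0.0/16")
```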

But what if a secret salt is included in the hash? And the secret salt is forgotten & replaced with a new secret salt each month? Then IP hashes from within the same month could still be compared with each other. Let’s say some accounts were created a few years ago (edit: or months ago), and now suddenly become active and start misbehaving. Then one can see if they likely belong to the same person, and look up even more accounts by that person, by looking up the IP hash. But, since the salt was forgotten, it would no longer be possible to “reverse” it and find the real IP. That gives both privacy and a bit of security at the same time.
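
One way to sketch the rotating-salt idea is a keyed hash (HMAC-SHA256) whose key is thrown away when the period changes. The HMAC is a stand-in for "hash with a secret salt", and the class name, the period strings, and the in-memory key handling are all hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import Optional


class RotatingIpHasher:
    """Hashes IPs with a secret key that is replaced whenever the period changes."""

    def __init__(self) -> None:
        self._period: Optional[str] = None
        self._key: bytes = b""

    def _key_for(self, period: str) -> bytes:
        if period != self._period:
            # New period: overwrite ("forget") the old key with a fresh one.
            self._period = period
            self._key = secrets.token_bytes(32)
        return self._key

    def hash_ip(self, ip: str, period: str) -> str:
        key = self._key_for(period)
        return hmac.new(key, ip.encode("ascii"), hashlib.sha256).hexdigest()


hasher = RotatingIpHasher()
march_a = hasher.hash_ip("203.0.113.7", "2018-03")
march_b = hasher.hash_ip("203.0.113.7", "2018-03")  # same key, same hash
april = hasher.hash_ip("203.0.113.7", "2018-04")    # key rotated
```

Within one period, equal IPs give equal hashes, so rows can still be matched. Once the key is gone, old hashes can neither be brute forced nor compared with hashes from a later period.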

(Probably not a good idea to do now, because it would be complicated. But maybe good to have in mind … if starting to think about this again … some years later)

RGJ · March 27, 2018, 9:56am

I actually doubt that an IP address has a lot of identifying value nowadays, with mobile internet, wifi and dynamic IPs, especially over a multiple-year period.

riking · March 27, 2018, 10:55pm

Yes, but anything more accurate is even worse for privacy, so…

aclarke · April 20, 2018, 12:38pm

Until any changes are made to Discourse, would there be any issue with me regularly running a query on my server to remove the problematic IP addresses (i.e. those where there is also a user_id) from incoming_links, search_logs, and topic_link_clicks?
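
A sketch of the kind of periodic cleanup being asked about, using Python's `sqlite3` so it can run standalone. Discourse itself runs on PostgreSQL. The sketch assumes each of the three tables has `user_id` and `ip_address` columns and that `ip_address` is nullable, which may not hold on an unmodified install. `scrub_logged_in_ips` is a hypothetical helper.

```python
import sqlite3

# Table names taken from the question above.
TABLES = ("incoming_links", "search_logs", "topic_link_clicks")


def scrub_logged_in_ips(conn: sqlite3.Connection) -> int:
    """Null out ip_address on rows that also carry a user_id.

    Returns the number of rows changed. Assumes ip_address is nullable.
    """
    scrubbed = 0
    for table in TABLES:
        # Table names are fixed constants above, never user input.
        cur = conn.execute(
            f"UPDATE {table} SET ip_address = NULL "
            "WHERE user_id IS NOT NULL AND ip_address IS NOT NULL"
        )
        scrubbed += cur.rowcount
    conn.commit()
    return scrubbed
```

Rows without a `user_id` are left untouched, so anonymous traffic keeps its IP under this query.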

RGJ · April 20, 2018, 9:07pm

I think it’s the other way around. If there is a user id, you at least had the opportunity to ask for consent. I think the IP addresses without user ID are more problematic, since they belong to unknown people who were just passing by, and never gave permission to store any personal data at all.

To answer your question, I don’t think it will cause issues. Don’t forget to clean up your access logs as well.
