Not for rolling up IP bans, we can’t. You need the actual IP for that.
But for view counting, click counting, search logs… you absolutely can.
Basically everything except the users table and the screened_ table could be hashed IPs.
And api_keys, because that’s configuration, not logging.
Possibly, but unless you are volunteering to do the engineering work and regression testing for free, there’s substantial work there for marginal upside.
Probably easier to focus on the low-hanging fruit here: stuff that’s easy to do and moves us toward the goal.
Hashed IPs can be brute forced fairly easily (by calculating the hash of all 2^32 IPs and finding the one that matches the relevant hash in the database). Maybe there isn’t much difference between storing a real IP, and a hashed IP?
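To make the brute-force concern concrete, here is a minimal sketch. It assumes the IP is stored as an unsalted SHA-256 of its dotted-quad string (the thread does not say which hash would be used), and it only scans a /24 so it finishes instantly; the function names are made up for illustration. Scanning all 2^32 addresses is the same loop over a larger range.

```python
import hashlib
import ipaddress
from typing import Optional


def hash_ip(ip: str) -> str:
    """Unsalted SHA-256 of the dotted-quad string (assumed storage format)."""
    return hashlib.sha256(ip.encode("ascii")).hexdigest()


def brute_force(target_hash: str, network: str) -> Optional[str]:
    """Hash every address in `network` and return the one that matches."""
    for addr in ipaddress.ip_network(network):
        if hash_ip(str(addr)) == target_hash:
            return str(addr)
    return None


# Pretend this hash was read from the database.
stored = hash_ip("203.0.113.77")

# An attacker who can guess the range recovers the address directly.
recovered = brute_force(stored, "203.0.113.0/24")
print(recovered)
```

Because there is no secret involved, anyone with a copy of the table can run the same search.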
But what if a secret salt is included in the hash? And the secret salt is forgotten and replaced with a new secret salt each month? Then IP hashes from within the same month could still be compared with each other. Let’s say some accounts were created a few years ago (edit: or months ago), and now suddenly become active and start misbehaving. Then one can see if they likely belong to the same person, and look up even more accounts by that person, by looking up the IP hash. But, since the salt was forgotten, it would no longer be possible to “reverse” it and find the real IP. Both privacy and a bit of security, at the same time.
(Probably not a good idea to do now, because would be complicated. But maybe good to have in mind … if starting to think about this again … some years later)
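A rough sketch of the rotating-secret idea, assuming HMAC-SHA256 as the keyed hash and a secret that lives only in memory. The class name `MonthlyIpHasher` and the `month` string argument are hypothetical, not anything from Discourse.

```python
import hashlib
import hmac
import secrets


class MonthlyIpHasher:
    """Keyed IP hashing where the key is discarded when the month changes."""

    def __init__(self):
        self._month = None
        self._secret = None

    def hash_ip(self, ip: str, month: str) -> str:
        if month != self._month:
            # New month: overwrite the old secret so earlier hashes
            # can no longer be recomputed or brute forced.
            self._secret = secrets.token_bytes(32)
            self._month = month
        return hmac.new(self._secret, ip.encode("ascii"), hashlib.sha256).hexdigest()


hasher = MonthlyIpHasher()
may_a = hasher.hash_ip("203.0.113.77", "2018-05")
may_b = hasher.hash_ip("203.0.113.77", "2018-05")
june = hasher.hash_ip("203.0.113.77", "2018-06")
print(may_a == may_b, may_a == june)
```

Note the trade-off this makes visible: hashes from the same month match each other, but the same IP seen in two different months produces unrelated hashes, so only activity within one month can be linked.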
I actually doubt that an IP address has a lot of identifying value nowadays, with mobile internet, wifi and dynamic IPs, especially over a multi-year period.
Yes, but anything more accurate is even worse for privacy, so…
Until any changes are made to discourse, would there be any issue with me regularly running a query on my server to remove the problematic IP addresses (i.e. those where there is also a user_id) from incoming_links, search_logs, and topic_link_clicks?
I think it’s the other way around. If there is a user id, you at least had the opportunity to ask for consent. I think the IP addresses without user ID are more problematic, since they belong to unknown people who were just passing by, and never gave permission to store any personal data at all.
To answer your question, I don’t think it will cause issues. Don’t forget to clean up your access logs as well.
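For anyone wanting to try the cleanup asked about above, a sketch of the idea follows. It is only an illustration: it uses an in-memory SQLite database as a stand-in for the real PostgreSQL one, and the column names `ip_address` and `user_id` (and the assumption that `ip_address` may be set to NULL) are guesses that should be checked against the actual schema before running anything similar on a live site.

```python
import sqlite3

# Tables named in the question above.
TABLES = ("incoming_links", "search_logs", "topic_link_clicks")


def scrub_logged_in_ips(conn: sqlite3.Connection) -> int:
    """Null out ip_address on rows that also carry a user_id; return rows changed."""
    total = 0
    for table in TABLES:
        cur = conn.execute(
            f"UPDATE {table} SET ip_address = NULL "
            "WHERE user_id IS NOT NULL AND ip_address IS NOT NULL"
        )
        total += cur.rowcount
    conn.commit()
    return total


# Stand-in schema and data, purely for demonstration.
conn = sqlite3.connect(":memory:")
for table in TABLES:
    conn.execute(
        f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, user_id INTEGER, ip_address TEXT)"
    )
    conn.execute(
        f"INSERT INTO {table} (user_id, ip_address) "
        "VALUES (1, '203.0.113.5'), (NULL, '198.51.100.9')"
    )

scrubbed = scrub_logged_in_ips(conn)
print(scrubbed)
```

This only touches rows with a user_id, as in the question; the rows for anonymous visitors are left in place and would need their own handling.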
The GDPR is something I take very seriously.
However, my forums do not have legal teams to pick through this new and poorly-defined law.
The fines are huge, and there are always axe-grinding members looking to cause trouble for a forum. For me to keep running forums, I need to know the software I’m using is compliant with the new law.
If I interpret the law correctly then we need to ensure the following:
If IPs have been stored for users without their consent, they absolutely need to be scrubbed from our database and no longer stored for anonymous visitors.
When a signed-up (or signing-up) user visits the forum, they need to see a consent screen with an unticked box and an explanation of how the IP will be used.
If consent is not given, they cannot be allowed to use the forum.
For the record, I absolutely deplore laws like this, as they do a poor job of protecting our rights yet harm millions of businesses and scare the hell out of well-meaning and ethical operators.
I’m absolutely relying on the Discourse team here to take some action to protect its forum operators.
They can be stored, but no longer than necessary for a legitimate purpose.
For rate limiting, there is a legitimate interest and this period is pretty short.
For deduplicating link clicks, there is a legitimate interest but they need only to be stored in Redis for 24 hours. I don’t see any reason at all to keep them in the database.
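As a sketch of what short-lived deduplication looks like, here is an in-memory stand-in for an expiring Redis key. The class name, the `(link_id, ip)` key shape, and the injectable clock are assumptions made for illustration; this is not how Discourse implements it.

```python
import time


class ClickDeduplicator:
    """Remember (link, IP) pairs for a fixed time, then forget them."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires = {}  # (link_id, ip) -> expiry timestamp

    def should_count(self, link_id, ip):
        now = self._clock()
        # Drop every entry whose time is up, so no IP outlives the TTL
        # once the deduplicator is used again.
        self._expires = {k: exp for k, exp in self._expires.items() if exp > now}
        key = (link_id, ip)
        if key in self._expires:
            return False  # already counted within the window
        self._expires[key] = now + self._ttl
        return True


dedup = ClickDeduplicator()
first = dedup.should_count(42, "203.0.113.77")
second = dedup.should_count(42, "203.0.113.77")
print(first, second)
```

The click counter is incremented only when `should_count` returns true, so the IP itself never needs to reach the database.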
I don’t see the purpose or a legitimate interest for keeping IP addresses in search logs or incoming links.
In contrast to the opening post, I do think the topic_views and user_profile_views are problematic. After all, Redis is already deduplicating IP addresses, so there is no need to store the IP address longer than topic view duration hours.
Thanks for the info. Out of interest, where are legitimate purposes and storage limits defined in the lawbooks?
Lawful purposes and legitimate interests are in article 6 of GDPR.
Recital 49 talks about usage of data for network and information security.
Recital 47 mentions fraud prevention and direct marketing as a legitimate interest. Deduplicating link clicks and topic views could be considered fraud prevention.
There are no hard storage limits defined. The time you need to keep an IP address in order to deduplicate statistics depends on the granularity of the accumulated statistics.
Sent in the first PR for cleaning this up: "IncomingLink: do not store IP of logged-in users" (discourse/discourse Pull Request #5826).
Just sent in 3 more PRs:
aaaand the linkback bot is going crazy with the edits to the OP, oops…
Very good, @sam can review these and make the call on 2.0 versus 2.1 depending on risk.
Although I absolutely welcome these PRs, I do want to emphasize that storing the IP addresses of visitors without an account (for longer than needed for deduplication) is a much more problematic issue, since those people cannot easily be asked to give their consent.
Topic_views vs Topic.views
Yeah, I was starting to work on that and it’s a bit tricky due to all the various ways that topic view data is used for logged-in users! And topic views are interesting in that only the first time a user or IP sees a topic is counted right now - it doesn’t reset daily like some of the other data.
One thing I should mention, since it can help with GDPR stuff: when IPs are anonymized, all of the problematic IPs identified in the OP are replaced.
This behavior is only available via plugins right now, but it does work.
12 posts were merged into an existing topic: GDPR countdown and compliance
@riking once we get ALL of these sorted we can start looking at “data hoarding” reduction.
So, for example, we can roll up incoming links daily, throwing away IPs and only keeping anon vs logged-in counts per day (and follow a similar pattern for search).
But first let’s sort out all these PRs.
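To illustrate the daily roll-up described above, here is a small sketch. The row shape (dicts with `created_at`, `user_id`, and `ip_address`) and the function name are assumptions for the example, not the real incoming_links schema.

```python
from datetime import datetime


def roll_up_daily(rows):
    """Collapse raw rows into per-day anon vs logged-in counts, dropping IPs."""
    counts = {}
    for row in rows:
        day = row["created_at"].date().isoformat()
        bucket = counts.setdefault(day, {"anon": 0, "logged_in": 0})
        kind = "logged_in" if row.get("user_id") is not None else "anon"
        bucket[kind] += 1
    return counts


raw_rows = [
    {"created_at": datetime(2018, 5, 20, 9, 0), "user_id": 7, "ip_address": "203.0.113.5"},
    {"created_at": datetime(2018, 5, 20, 9, 5), "user_id": None, "ip_address": "198.51.100.9"},
    {"created_at": datetime(2018, 5, 20, 17, 30), "user_id": None, "ip_address": "198.51.100.10"},
    {"created_at": datetime(2018, 5, 21, 8, 0), "user_id": 7, "ip_address": "203.0.113.5"},
]

daily = roll_up_daily(raw_rows)
print(daily)
```

Once the daily totals are stored, the raw rows (and their IPs) can be deleted without losing the statistics.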