Therefore, ‘commonpassword’ has appeared 91 times in processed password dumps.
This makes it practical to query a very large dataset without shipping a copy to every single Discourse site. As prior art, WordFence (a WordPress firewall plugin) has integrated it to block admin logins with weak passwords starting today (a password reset is enforced on login):
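For reference, here's a minimal sketch of how the range query works: only the first 5 hex characters of the password's SHA-1 are sent to the service, and the returned bucket of suffixes is matched locally, so the full hash never leaves the machine. The helper names here are mine, not from any existing plugin.

```ruby
require 'digest'
require 'net/http'

# Split a password's SHA-1 into the 5-char prefix sent to the API and
# the 35-char suffix we match locally (the k-anonymity range model).
def hash_parts(password)
  full = Digest::SHA1.hexdigest(password).upcase
  [full[0, 5], full[5..]]
end

# Given the API's response body ("SUFFIX:COUNT" lines), return how many
# times the password appeared in the processed dumps (0 if absent).
def count_in_response(body, suffix)
  body.each_line do |line|
    s, count = line.strip.split(':')
    return count.to_i if s == suffix
  end
  0
end

# The actual network call: fetch the whole bucket for our prefix.
def pwned_count(password)
  prefix, suffix = hash_parts(password)
  body = Net::HTTP.get(URI("https://api.pwnedpasswords.com/range/#{prefix}"))
  count_in_response(body, suffix)
end
```

A bucket is a few hundred suffixes, so the service learns only that the password is one of several hundred candidates.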
Integrating this as an alternative to the 10k common-password list (many entries of which are moot due to length limits) seems like a good idea.
Discourse-hosted sites could use a local copy of the hash lists to avoid excess network requests, while self-installs would need to use the web service with custom caching.
Yeah, I would rather have someone extract from that list the top 10k most common passwords that are 10 characters or more (the minimum allowed Discourse password length). If you would like to submit that as a PR, Kane, go for it; it would be happily accepted.
That requires re-finding the original data breaches and processing them. The database is distributed exclusively as SHA-1 hashes, to make it harder to use as a password-spraying list.
We could probably ask Troy to produce a filtered list? Would be work for him though.
Hmm… a bloom filter bitfield on 500M elements with a false-positive probability of one-in-a-thousand (0.001) is, if I’ve done my arithmetic correctly, about 900MB. Certainly too big to ship in core, but might be suitable for those sites which don’t want to take the API call hit (for performance, stability, or privacy)? I’ve contacted Troy to see if he’d be able to provide a 10+ length list.
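To double-check that arithmetic: the standard optimal-size formula for a Bloom filter is m = -n·ln(p)/(ln 2)², with the optimal number of hash functions k = (m/n)·ln 2. Plugging in the numbers from above:

```ruby
n = 500_000_000   # elements (hashes in the dataset)
p = 0.001         # target false-positive probability

bits  = -n * Math.log(p) / Math.log(2)**2  # optimal filter size in bits
mb    = bits / 8 / 1_000_000.0             # ...in megabytes
k     = (bits / n) * Math.log(2)           # optimal number of hash functions

# mb comes out at roughly 899 MB, with ~10 hash probes per lookup,
# which matches the ~900MB back-of-envelope figure.
```

So the ~900MB estimate holds up, and lookups would need about 10 probes each.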
So, I had a bit of a brainwave: if ol’ Troy Boy doesn’t want to share a filtered list that meets our minimum length requirements, what if we instead cracked all the short hashes ourselves, leaving us with just the hashes of the long passwords to reject? We can filter in bulk: any hash that cracks to a sub-10-character string gets tossed from the list, and we can do the comparison in bulk as we go.
Unfortunately, while a basic AWS g2 instance has a fairly beefy GPU available, it’s still going to take a long time (50 days or so) to enumerate all the 8-character combinations. Worse, because the list of target hashes is so big, you can’t load them all in one go (you run out of GPU memory); instead you have to split them into chunks (I used the first character of the hash to get 16 buckets), so you’re either going to have to run 16 instances in parallel or wait 16× as long (which costs the same amount of money, if you stick to using AWS).
That’s just the 8-character passwords, too; 9 characters is going to take significantly longer (fewer characters take significantly less time, to the point where everything at 6 characters or less is noise).
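To put numbers on why each extra character hurts so much: with the full 95-character printable-ASCII charset (an assumption; the real cracking run may have used a smaller mask), the keyspace grows by a factor of 95 per character, so 9 characters is ~95× the 50-day estimate:

```ruby
charset = 95            # printable ASCII, assumed charset
eight   = charset**8    # ~6.6 quadrillion candidates
nine    = charset**9
growth  = nine / eight  # each extra character multiplies the work by 95

# Implied hash rate for the 50-day figure, just as a sanity check:
implied_rate = eight / (50 * 86_400.0)  # hashes/second, ~1.5 GH/s
```

~1.5 GH/s of SHA-1 is plausible for a single mid-range GPU, so the 50-day figure passes the smell test, and 9 characters lands in the ~13-year range on the same hardware.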
So, unless someone is feeling particularly overloaded with money, or we just want to filter out the really low-hanging fruit, I guess this little experiment was a bust. Pity, would have been a nice way around the problem.
You’d also need to make sure that the k-anonymity property still holds after all the removals: that every 5-char hex prefix still yields a minimum of 1 matched hash per bucket.
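That check is cheap to script: group the surviving hashes by their 5-hex-char prefix and report any of the 16⁵ possible buckets that came up empty. A sketch with made-up hashes (the function name is mine):

```ruby
# Return every 5-hex-char prefix bucket that no surviving hash falls
# into. A non-empty result means some range queries would return
# nothing, which leaks information about what was removed.
def empty_buckets(hashes)
  buckets = hashes.group_by { |h| h[0, 5] }
  (0...16**5)
    .map { |i| i.to_s(16).upcase.rjust(5, '0') }
    .reject { |prefix| buckets.key?(prefix) }
end
```

In practice you’d stream the hash file rather than hold it in memory, but the bucketing logic is the same.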
I was thinking of using the data in a bloom filter, to ship with Discourse (probably as an aftermarket “enhanced password security” setting), rather than as an online service. I don’t hold with the idea of putting a third-party service in the middle of local logins.
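For the record, the filter itself is not much code; here's a minimal sketch (not Discourse code, and a real deployment would use a proper bit-packed file format and tuned parameters from the sizing math above):

```ruby
require 'digest'

# Minimal Bloom filter: m bits, k probe positions derived from the
# SHA-1 of the item with an index salt.
class BloomFilter
  def initialize(bits:, hashes:)
    @m = bits
    @k = hashes
    @field = Array.new((bits + 7) / 8, 0)  # bitfield as a byte array
  end

  def add(item)
    probes(item).each { |i| @field[i / 8] |= (1 << (i % 8)) }
  end

  # May return a false positive, never a false negative -- fine here,
  # since the worst case is rejecting a password that wasn't breached.
  def include?(item)
    probes(item).all? { |i| @field[i / 8] & (1 << (i % 8)) != 0 }
  end

  private

  def probes(item)
    (0...@k).map do |salt|
      Digest::SHA1.hexdigest("#{salt}:#{item}").to_i(16) % @m
    end
  end
end
```

With the ~900MB bitfield shipped as an optional download, the membership test is entirely local: no third party ever sees anything derived from the password.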