Upgrading common password prevention - Pwned Passwords v2

Troy Hunt has released Pwned Passwords v2 as part of his Have I Been Pwned service, which tracks data breaches: Have I Been Pwned: Pwned Passwords

The password checker uses a great privacy-preserving and cache-friendly design that’s dead simple to integrate. The blog post (Troy Hunt: I Wanna Go Fast: Why Searching Through 500M Pwned Passwords Is So Quick) has the details, but here’s a quick summary:

$ echo -n 'commonpassword' | shasum
bee858a53297f2feec01e084c3e110c296a7fd72  -
$ curl -sL https://api.pwnedpasswords.com/range/BEE85 | grep '8A53297F'
8A53297F2FEEC01E084C3E110C296A7FD72:91

Therefore, ‘commonpassword’ has appeared 91 times in the processed password dumps.
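A minimal Ruby sketch of the local half of that check (method name and structure are my own, not Discourse code): only the 5-character prefix is ever sent to the API, and the suffix comparison happens on your machine.

```ruby
require 'digest'

# Local half of the k-anonymity range check: given the API response body
# for the hash's 5-char prefix, return how many times the password has
# appeared in breaches (0 if absent). Illustrative sketch only.
def count_in_range_body(password, body)
  suffix = Digest::SHA1.hexdigest(password).upcase[5..]
  body.each_line do |line|
    found, count = line.strip.split(':')
    return count.to_i if found == suffix
  end
  0
end
```

The `body` here would be the response of `GET https://api.pwnedpasswords.com/range/<prefix>`, where `<prefix>` is the first 5 hex characters of the password’s SHA-1.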

This makes it possible to query a very large dataset without keeping a copy on every single Discourse site. As prior art, WordFence (a WordPress firewall plugin) has already integrated it to block admin logins with weak passwords (password resets are enforced on login).

Integrating this as an alternative to the 10k password list (many of which are moot due to length limits) seems like a good idea.

Discourse-hosted sites could use a local copy of the hash lists to avoid excess network requests, while self-installs would need to use the web service with custom caching.
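For the self-install case, the custom caching could be as simple as a per-prefix TTL cache in front of the range endpoint. A plain-Ruby sketch (class and method names are mine; a real install would more likely lean on Redis or Rails.cache):

```ruby
# Per-prefix TTL cache for range API responses, so repeated signups that
# share a 5-char hash prefix don't each hit the network. Sketch only.
class RangeCache
  Entry = Struct.new(:body, :expires_at)

  def initialize(ttl_seconds)
    @ttl = ttl_seconds
    @store = {}
  end

  # Returns the cached body for this prefix, or runs the block (the HTTP
  # fetch) and caches its result for ttl_seconds.
  def fetch(prefix)
    entry = @store[prefix]
    return entry.body if entry && entry.expires_at > Time.now

    body = yield
    @store[prefix] = Entry.new(body, Time.now + @ttl)
    body
  end
end
```

Because the API is keyed on 5-hex-char prefixes, there are only 16^5 (about a million) possible cache keys, which keeps this tractable.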

Previous discussions: Min Password Length vs Block Common Passwords


Sure, though we should start with a plugin first. It also needs to fall back to our current 10k password list, because you never know with APIs.


Yeah, I would rather have someone extract from that list the top 10k most common passwords that are 10 chars or more (the minimum allowed Discourse password length). If you would like to submit that as a PR, Kane, go for it; it would be happily accepted.

That requires re-finding the original data breaches and processing them. The database is distributed exclusively as SHA-1 hashes to make it hard to use it as a password spamming list.

We could probably ask Troy to produce a filtered list? It would be work for him, though.

Hmm… a bloom filter bitfield on 500M elements with a false-positive probability of one-in-a-thousand (0.001) is, if I’ve done my arithmetic correctly, about 900MB. Certainly too big to ship in core, but might be suitable for those sites which don’t want to take the API call hit (for performance, stability, or privacy)? I’ve contacted Troy to see if he’d be able to provide a 10+ length list.
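For reference, the standard Bloom filter sizing formula is m = -n·ln(p)/(ln 2)² bits. A quick sketch to reproduce that arithmetic (function name is mine):

```ruby
# Optimal Bloom filter size in bytes for n elements at false-positive
# probability p, using m = -n * ln(p) / (ln 2)^2 bits.
def bloom_filter_bytes(n, p)
  bits = -n * Math.log(p) / (Math.log(2)**2)
  (bits / 8).ceil
end
```

`bloom_filter_bytes(500_000_000, 0.001)` comes out to roughly 900 MB, matching the estimate above; tightening p to one-in-a-million would add about half again as much.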


Sure, go ahead and ask him. We cannot add a dependency like this; we need good lists of 10-char-plus common passwords.

The good news is that once you get to 11, 12, 13 plus chars the number of duplicated passwords drops by many, many orders of magnitude.


… aaaaand Troy’s said nope. So we’re back to hammering his web service, I guess.

Nah, we don’t need his help. If he is unwilling to provide it, we will find common aggregated password sources elsewhere.

I guess I will just close this topic then @riking?

Also, you misunderstood this: the 10k list we have and distribute is the 10k most common passwords of 10 chars or more. So it is quite concentrated.


So, I had a bit of a brainwave: if ol’ Troy Boy doesn’t want to share a filtered list that meets our minimum length requirements, what if we instead just cracked the hashes of all the short passwords, leaving us with just the hashes of the long passwords to reject? We can do it in bulk – any < 10 character candidate string which hashes to a listed value marks that hash as a short password, and it gets tossed from the list as we go.

Unfortunately, while a basic AWS g2 instance has a fairly beefy GPU available, it’s still going to take a long time (like 50 days or so) to enumerate all the 8 character combinations. Worse, because the list of candidates is so big, you can’t load them all up in one go (you run out of GPU memory), but instead have to split them into chunks (I used the first character of the hash to get 16 buckets), so you’re either going to have to run 16 instances in parallel, or wait 16x as long (which costs the same amount of money, if you stick to using AWS).
That’s just the 8 character passwords, too – 9 characters is going to take significantly longer (fewer characters take significantly less time, too, to the point where it’s noise at 6 or fewer).
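A back-of-envelope version of that estimate (the 1.5 GH/s hash rate is an assumption chosen to be consistent with the ~50 day figure above, not a measured g2 benchmark; 95 is the printable-ASCII charset size):

```ruby
# Days to exhaust every password of a given length over a 95-character
# printable-ASCII alphabet at an assumed single-GPU SHA-1 rate.
ASSUMED_RATE = 1.5e9 # hashes/second -- assumption, not a benchmark

def brute_force_days(length, rate = ASSUMED_RATE)
  (95.0**length) / rate / 86_400
end
```

8 characters lands around 50 days; each extra character multiplies the keyspace (and the runtime) by 95, which is why 9 characters is hopeless on a hobby budget.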

So, unless someone is feeling particularly overloaded with money, or we just want to filter out the really low-hanging fruit, I guess this little experiment was a bust. Pity, would have been a nice way around the problem.


You’d also need to make sure that the k-anonymity property still holds at the end after all the removals - that a 5-char hex prefix still results in a minimum of 1 matched hash per bucket.
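A sketch of that sanity check (assuming the lists are arrays of uppercase SHA-1 hex strings; names are mine): every 5-char prefix present in the original list must still be present after filtering, i.e. no range bucket may end up empty.

```ruby
require 'set'

# After removing cracked short-password hashes, confirm no 5-hex-char
# prefix bucket has been emptied entirely (an empty bucket would reveal
# that every password under that prefix was short).
def prefixes_preserved?(original_hashes, filtered_hashes)
  before = original_hashes.map { |h| h[0, 5] }.to_set
  after  = filtered_hashes.map { |h| h[0, 5] }.to_set
  before == after
end
```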

edit: oops replying to closed topics again


I was thinking of using the data in a bloom filter, to ship with Discourse (probably as an aftermarket “enhanced password security” setting), rather than as an online service. I don’t hold with the idea of putting a third-party service in the middle of local logins.
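To make the shape of that concrete, here is a toy Bloom filter (deliberately naive: salted SHA-1 for the k hash functions, and a boolean array instead of a packed bitfield; a shipped "enhanced password security" build would do both properly, but the lookup shape is the same):

```ruby
require 'digest'

# Toy Bloom filter: k hash positions per item, derived from index-salted
# SHA-1. Membership tests can false-positive but never false-negative.
class BloomFilter
  def initialize(bits, hash_count)
    @bits = Array.new(bits, false)
    @m = bits
    @k = hash_count
  end

  def add(item)
    positions(item).each { |i| @bits[i] = true }
  end

  def include?(item)
    positions(item).all? { |i| @bits[i] }
  end

  private

  def positions(item)
    (0...@k).map { |n| Digest::SHA1.hexdigest("#{n}:#{item}").to_i(16) % @m }
  end
end
```

At signup you would test the candidate password against the pre-built filter: a miss definitively means "not in the breach corpus", and a hit means "reject it" (accepting the small false-positive rate the filter was sized for).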