Category Archives: have I been pwned?

The Race to the Bottom of Credential Stuffing Lists; Collections #2 Through #5 (and More)

Presently sponsored by: Live Workshop! Watch the Varonis DFIR team investigate a cyberattack using our data-centric security stack

The Race to the Bottom of Credential Stuffing Lists; Collections #2 Through #5 (and More)

A race to the bottom is a market condition in which there is a surplus of a commodity relative to the demand for it. Often the term is used to describe labour conditions (workers versus jobs), and in simple supply and demand terms, once there's so much of something all vying for the attention of those consuming it, the value of it plummets.

On reflecting over the last 3 and a half weeks, this is where we seem to be with credential stuffing lists today and I want to use this blog post to explain the thinking whilst also addressing specific questions I've had regarding Collections #2 through #5.

The 773 Million Record "Collection #1" Data Breach

On Thursday 17 Jan, I loaded 773M records into Have I Been Pwned (HIBP) which I titled "Collection #1". I explained how this data originated from multiple different sources and was likely obtained over a period of many years before being amalgamated together and passed around as one massive stash. There were 2.7B rows of email addresses and passwords in total, but only 1.6B them were unique (my own identical record appeared half a dozen times). In other words, there was a huge amount of redundancy.

I made the call to load the data into HIBP based primarily on 3 facts:

  1. The data was sufficiently unique: more than 18% of the email addresses had not been seen in HIBP before
  2. The data was in broad circulation: multiple parties had contacted me and passed on Collection #1
  3. There was a large number of previously unseen passwords: of the 21M unique ones, half of them weren't already in HIBP's Pwned Passwords

Being conscious that there would be many questions about this data and that the origins and impact of it could be easily misrepresented, I carefully detailed every important fact. I pushed the blog post out on that Thursday morning my time and later that day, hopped on a plane to Europe. As the rest of world woke up to the story, all hell broke loose. I have never, ever received so many emails, tweets, blog comments and every other form of communication you can imagine in such a short period of time. I'd also never seen so much traffic on HIBP:

I spent a significant part of the flight chewing through Emirates' bandwidth just responding to messages. I landed in Oslo, met friends and drove up into the mountains for a snowboarding trip with the flood of communications continuing. Jet lagged, overwhelmed by it all and frankly, just wanting downtime with good company, I turned on the out of office, closed comments on the blog post and almost completely stopped engaging on Twitter. (Side note: Scott Helme and I talked about burnout in my weekly update from London, in part due to the experiences I had dealing with the above.)

If I'm honest, that experience with the flood of communication coupled with disconnecting from life for a few days in a remote cabin with friends had a profound effect on me in many ways. I'm sure I'll talk more about them in future, but one was that I've very consciously reduced my engagement on email and Twitter frankly, to save my sanity. That's a bit tangential here though, back to Collection #1.

I'm frustrated about the hyperbole this incident managed to attract. The mass media picked it up with gusto and it made headlines all around the world in the most mainstream of publications. Inevitably, whether deliberately for the headlines or accidentally because it's simply not the world they live in, the truth was stretched time and time again. Despite my best efforts to report everything I knew with candour, things got out of control. For the most part I ignored this, only occasionally venting my frustration as someone brought it to the fore:

Of course, there was nothing missing from the post and each time I asked the question it was met with silence. (Incidentally, Lorenzo who wrote that Motherboard piece is a top-notch infosec journo I've worked with many times before and he reported accurately in that piece.) I'm sharing this because I want to ensure that those who expressed their dismay at the way this story unfolded understand that it bugged the hell out of me too.

But I will say this: because this incident reached an unprecedented number of people and gained such worldwide traction, the impact of it on normal, everyday people's behaviour was significant. They learned about the phenomenon that is data breaches and credential stuffing lists, they read about password managers and 2FA and inevitably, many of them subsequently made behavioural changes to their security practices. Over-inflated headlines or not, the outcome of this on everyday consumers was positive.

The Other Collections

When I was originally contacted about Collection #1, that was the extent I knew of this series - that there was 1 collection. But very quickly it became apparent that it was merely the first of 5 collections and it was far from the biggest. Collection #1 was 87GB of data but collections #2 through #5 totalled another 845GB on top of that. Instead of the 2.7B rows from the Collection #1, the headlines were now talking about 25B which, admittedly, is quite the catchy title. Dozens of people reached out to me with links to the additional data and indeed, the media lapped up news of the larger collections as well. Inevitably, I got bombarded with questions about the subsequent collections:

The Race to the Bottom of Credential Stuffing Lists; Collections #2 Through #5 (and More)

Keeping in mind my previous comments about overwhelming amounts of communication and workload, the thought of processing a 10x volume of data over Collection #1 wasn't exactly exciting me. Nevertheless, I grabbed the additional collections whilst travelling, flew home just over a week ago and began analysis. Before doing that, I had a working theory that the subsequent collections would be more of the same, but I wanted hard numbers on it so I began running the data against the existing 6.5B records in HIBP.

Spam, Spam, Spam Everywhere

Back when I originally began looking at Collection #1, one of the first things I did was to run a sample selection of email addresses against HIBP to get a sense of how many of them were unique. As mentioned earlier, it turned out to be just over 18% which was quite significant for such a large list. The very first thing I did with collections #2 through #5 was to choose slices of the data and check them against HIBP. This meant choosing a random file from amongst the 85k+ in the data, extracting all the email addresses then grabbing a random 100 sample and looking for uniqueness. After checking hundreds of files, here's when I found:

Tested 457 files, 280 were a 100% match
Tested 44,426 addresses, found 5,282 unique ones not already in HIBP, only 11.89 % unique

For the sake of transparency, I've published the complete output of this process which shows just how much crossover there is with existing data. As you scroll through that list, you'll see that over 61% of the files tested were a 100% much to HIBP; every single one of those random 100 email addresses tested was already in there. (Sidenote: after running the report, I realised that some of the source files didn't contain email addresses and as such reported "Of 0 random email addresses, 0 are already in HIBP". That's fine, but it skewed the 61% number down as the file was counted as not being an exact match.)

Some of these were quite predictable:

Collection #5\Collection #5\Dump HASH\
Of 100 random email addresses, 100 are already in HIBP

There's an easy explanation for that:

Then there were files at the other end of the extreme:

Collection #5\Collection #5\EU combos\49.txt
Of 100 random email addresses, 1 are already in HIBP

Curious, I took a closer look and found 100k rows heavily orientated towards Eastern European TLDs; over 20k .ua (Ukraine), another 10k .uz (Uzbekistan), 5k .kz (Kazakhstan) etc. I have no idea how many of these are actual addresses nor which breaches they originated from if they're indeed genuine, obviously there's nothing given away by the file name. The problem with all of this data (as with Collection #1), is that it's just about impossible to establish authenticity and a bunch of it is very likely not what it's represented to be.

Here's a perfect example: when running the check, one of the very first results I saw was this one:

The Race to the Bottom of Credential Stuffing Lists; Collections #2 Through #5 (and More)

It piqued my interest as it's an Aussie TLD for a site I'd never heard of yet apparently, 100% of the email addresses are already in HIBP. So I delved into the file and was immediately struck by the occurrence of a different TLD which, upon counting its occurrences across the 436-line file, showed a strangely high hit rate:

The Race to the Bottom of Credential Stuffing Lists; Collections #2 Through #5 (and More)

The file itself was then a combination of email addresses and SHA-1 hashes along with email addresses and then simply the number 1 after it. This is unusual as not only is there no consistency to the format, but it's also clearly compromised of different types of information.

During the course of the last week, I had a few chats with Vinny Troia of Night Lion Security. Vinny has supported HIBP in the past with data he's located floating around the web and we had a good discussion about the nature of these collections which he was also analysing. He also lamented the volume of garbage in them, pointing to examples such as this (the asterisks all represent the same 4-digit number):


Then there's my own data. I'd already found it in Collection #1 half a dozen times with an old throwaway password I had legitimately used many years ago. I noted it in the original blog post but didn't dig any further. This time, however, I probed deeper; I wanted context for the data.

Here it is in "Collection #2\Collection #2\DUMPS dehashed\"

The Race to the Bottom of Credential Stuffing Lists; Collections #2 Through #5 (and More)

I had to look up in order to work out what it was. Turns out it's a Vietnamese e-commerce site selling phones so yeah, not exactly the sort of place I'd frequent.

And here it is in "Collection #5\DUMP dehashed\  add pass.txt":

The Race to the Bottom of Credential Stuffing Lists; Collections #2 Through #5 (and More)

No, I've not screwed up the image, the file it's in is identical to the Vietnamese phone one. The password is identical too and firstly, under no circumstances did I ever use that password on Dropbox and secondly, the password I had in the Dropbox breach was randomly generated and exposed as a bcrypt hash I shared publicly when reporting on the breach.

So you see my point about "spam, spam, spam" - these collections are absolutely riddled with junk. That's not to say they don't contain legitimate usernames and passwords because quite clearly, some of them are, rather that the actual unique legitimate entries across all the collections is a small subset of what the headlines suggest.

It's a Very Deep Bottom

Following the events above, I received dozens of messages (maybe even hundreds, I honestly lost track) about other collections of credentials. Not collections represented as being part of the same series (i.e. Collection #6), but rather entirely separate sets of data. A few thousand here from a phishing page, a few hundred thousand over there in a public Google Doc, untold numbers more in pastes that HIBP may not have already indexed. I've seen a lot of breached data over a lot of years but even for me, I was honestly left a bit stunned by all of this. It. Just. Never. Ends.

A couple of days ago there was a story going around about yet another collection of data that began as follows:

A massive 600 gigabyte file containing about 2.2 billion compromised usernames and passwords has been spotted floating about the dark web, freely available to anyone who cares to download it via torrent.

As I've lamented before, stories of the dark web are frequently exaggerated and the reality is that huge volumes of data like this are socialised much more broadly "on the clear web" than the headlines suggest. For example:

The Race to the Bottom of Credential Stuffing Lists; Collections #2 Through #5 (and More)

I could go on with literally dozens of similar examples; a billion credentials here, another few billion there, many freely circulating and others costing inconsequential amounts of money:

The Race to the Bottom of Credential Stuffing Lists; Collections #2 Through #5 (and More)

And every time you think you've seen how deep the iceberg goes, you realise it's just another tip floating around a sea of our personal data:

   /$$    /$$$$$$        /$$$$$$$  /$$$$$$ /$$       /$$       /$$$$$$  /$$$$$$  /$$   /$$
 /$$$$   /$$__  $$      | $$__  $$|_  $$_/| $$      | $$      |_  $$_/ /$$__  $$| $$$ | $$
|_  $$  |__/  \ $$      | $$  \ $$  | $$  | $$      | $$        | $$  | $$  \ $$| $$$$| $$
  | $$     /$$$$$/      | $$$$$$$   | $$  | $$      | $$        | $$  | $$  | $$| $$ $$ $$
  | $$    |___  $$      | $$__  $$  | $$  | $$      | $$        | $$  | $$  | $$| $$  $$$$
  | $$   /$$  \ $$      | $$  \ $$  | $$  | $$      | $$        | $$  | $$  | $$| $$\  $$$
 /$$$$$$|  $$$$$$/      | $$$$$$$/ /$$$$$$| $$$$$$$$| $$$$$$$$ /$$$$$$|  $$$$$$/| $$ \  $$
|______/ \______/       |_______/ |______/|________/|________/|______/ \______/ |__/  \__/
             /$$ /$$$$$$$$ /$$      /$$  /$$$$$$  /$$$$$$ /$$        /$$$$$$  /$$         
            /$$/| $$_____/| $$$    /$$$ /$$__  $$|_  $$_/| $$       /$$__  $$|  $$        
           /$$/ | $$      | $$$$  /$$$$| $$  \ $$  | $$  | $$      | $$  \__/ \  $$       
          /$$/  | $$$$$   | $$ $$/$$ $$| $$$$$$$$  | $$  | $$      |  $$$$$$   \  $$      
         /$$/   | $$__/   | $$  $$$| $$| $$__  $$  | $$  | $$       \____  $$   \  $$     
        /$$/    | $$      | $$\  $ | $$| $$  | $$  | $$  | $$       /$$  \ $$    \  $$    
       /$$/     | $$$$$$$$| $$ \/  | $$| $$  | $$ /$$$$$$| $$$$$$$$|  $$$$$$/     \  $$   
      |__/      |________/|__/     |__/|__/  |__/|______/|________/ \______/       \__/   

In case the ASCII art is lost on you, that's "13 BILLION /EMAILS\" in a readme file accompanied by an 88GB file containing that number of email and password pairs. It was about this time that the penny finally dropped in terms of just how comedic it was becoming to have numbers that seemed both artificially large and apparently there for shock value. It's like I'd seen this somewhere before...

The Race to the Bottom of Credential Stuffing Lists; Collections #2 Through #5 (and More)

All of this data in all of these locations has caused me to ask some pretty fundamental questions about the point of these lists as they relate to HIBP:

What's the point of loading billions after billions of email addresses from credential stuffing lists? What makes a new list worth adding to the 6.5B addresses already in HIBP? And if I'm going to be honest with myself, what's changed since I loaded Collection #1 that would cause me not to load subsequent lists?

The answer to the last question is a combination of the frenzy that first list created coupled with the emergence of untold numbers of other lists. What's changed is that there's way more data circulating than I've ever seen before and if I go loading all of that into HIBP, I fear the signal to noise ratio will go through the floor. Some people already felt that was the case with Collection #1 and whilst I still maintain loading that list was the right thing to do in the climate of the time, a constant stream of notifications about old incidents that have merely re-purposed the same data is quickly going to create a groundswell of unhappy subscribers.


Somehow, the Collection #1 incident turned into a feeding frenzy of media, breach traders, security firms and industry voices alike, all vying for a piece of the attention. Whilst there was undoubtedly value in the awareness it created, an increasing infatuation on which list is the largest or who's sitting on the largest stash of data is just downright counterproductive. It becomes a sideshow of superlative news headlines as the discussion turns to "who's is biggest" rather than "what should we actually be doing about this".

For now, I don't see subsequent lists like these going into HIBP unless there's something sufficiently unique about them. Users of the service have a pretty good idea by now where they've been exposed and what they should do about it, I want to keep focusing on the discrete incidents that are clearly attributable back to a source. Speaking of which:

The 773 Million Record “Collection #1” Data Breach

Presently sponsored by: Live Workshop! Watch the Varonis DFIR team investigate a cyberattack using our data-centric security stack

The 773 Million Record

Many people will land on this page after learning that their email address has appeared in a data breach I've called "Collection #1". Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.

Let's start with the raw numbers because that's the headline, then I'll drill down into where it's from and what it's composed of. Collection #1 is a set of email addresses and passwords totalling 2,692,818,238 rows. It's made up of many different individual data breaches from literally thousands of different sources. (And yes, fellow techies, that's a sizeable amount more than a 32-bit integer can hold.)

In total, there are 1,160,253,228 unique combinations of email addresses and passwords. This is when treating the password as case sensitive but the email address as not case sensitive. This also includes some junk because hackers being hackers, they don't always neatly format their data dumps into an easily consumable fashion. (I found a combination of different delimiter types including colons, semicolons, spaces and indeed a combination of different file types such as delimited text files, files containing SQL statements and other compressed archives.)

The unique email addresses totalled 772,904,991. This is the headline you're seeing as this is the volume of data that has now been loaded into Have I Been Pwned (HIBP). It's after as much clean-up as I could reasonably do and per the previous paragraph, the source data was presented in a variety of different formats and levels of "cleanliness". This number makes it the single largest breach ever to be loaded into HIBP.

There are 21,222,975 unique passwords. As with the email addresses, this was after implementing a bunch of rules to do as much clean-up as I could including stripping out passwords that were still in hashed form, ignoring strings that contained control characters and those that were obviously fragments of SQL statements. Regardless of best efforts, the end result is not perfect nor does it need to be. It'll be 99.x% perfect though and that x% has very little bearing on the practical use of this data. And yes, they're all now in Pwned Passwords, more on that soon.

That's the numbers, let's move onto where the data has actually come from.

Data Origins

Last week, multiple people reached out and directed me to a large collection of files on the popular cloud service, MEGA (the data has since been removed from the service). The collection totalled over 12,000 separate files and more than 87GB of data. One of my contacts pointed me to a popular hacking forum where the data was being socialised, complete with the following image:

The 773 Million Record

As you can see at the top left of the image, the root folder is called "Collection #1" hence the name I've given this breach. The expanded folders and file listing give you a bit of a sense of the nature of the data (I'll come back to the word "combo" later), and as you can see, it's (allegedly) from many different sources. The post on the forum referenced "a collection of 2000+ dehashed databases and Combos stored by topic" and provided a directory listing of 2,890 of the files which I've reproduced here. This gives you a sense of the origins of the data but again, I need to stress "allegedly". I've written before about what's involved in verifying data breaches and it's often a non-trivial exercise. Whilst there are many legitimate breaches that I recognise in that list, that's the extent of my verification efforts and it's entirely possible that some of them refer to services that haven't actually been involved in a data breach at all.

However, what I can say is that my own personal data is in there and it's accurate; right email address and a password I used many years ago. Like many of you reading this, I've been in multiple data breaches before which have resulted in my email addresses and yes, my passwords, circulating in public. Fortunately, only passwords that are no longer in use, but I still feel the same sense of dismay that many people reading this will when I see them pop up again. They're also ones that were stored as cryptographic hashes in the source data breaches (at least the ones that I've personally seen and verified), but per the quoted sentence above, the data contains "dehashed" passwords which have been cracked and converted back to plain text. (There's an entirely different technical discussion about what makes a good hashing algorithm and why the likes of salted SHA1 is as good as useless.) In short, if you're in this breach, one or more passwords you've previously used are floating around for others to see.

So that's where the data has come from, let me talk about how to assess your own personal exposure.

Checking Email Addresses and Passwords in HIBP

There'll be a significant number of people that'll land here after receiving a notification from HIBP; about 2.2M people presently use the free notification service and 768k of them are in this breach. Many others, over the years to come, will check their address on the site and land on this blog post when clicking in the breach description for more information. These people all know they were in Collection #1 and if they've read this far, hopefully they have a sense of what it is and why they're in there. If you've come here via another channel, checking your email address on HIBP is as simple as going to the site, entering it in then looking at the results (scrolling further down lists the specific data breaches the address was found in):

The 773 Million Record

But what many people will want to know is what password was exposed. HIBP never stores passwords next to email addresses and there are many very good reasons for this. That link explains it in more detail but in short, it poses too big a risk for individuals, too big a risk for me personally and frankly, can't be done without taking the sorts of shortcuts that nobody should be taking with passwords in the first place! But there is another way and that's by using Pwned Passwords.

This is a password search feature I built into HIBP about 18 months ago. The original intention of it was to provide a data set to people building systems so that they could refer to a list of known breached passwords in order to stop people from using them again (or at least advise them of the risk). This provided a means of implementing guidance from government and industry bodies alike, but it also provided individuals with a repository they could check their own passwords against. If you're inclined to lose your mind over that last statement, read about the k-anonymity implementation then continue below.

Here's how it works: let's do a search for the word "P@ssw0rd" which incidentally, meets most password strength criteria (upper case, lower case, number and 8 characters long):

The 773 Million Record

Obviously, any password that's been seen over 51k times is terrible and you'd be ill-advised to use it anywhere. When I searched for that password, the data was anonymised first and HIBP never received the actual value of it. Yes, I'm still conscious of the messaging when suggesting to people that they enter their password on another site but in the broader scheme of things, if someone is actually using the same one all over the place (as the vast majority of people still do), then the wakeup call this provides is worth it.

As of now, all 21,222,975 passwords from Collection #1 have been added to Pwned Passwords bringing the total number of unique values in the list to 551,509,767.

Whilst I can't tell you precisely what password was against your own record in the breach, I can tell you if any password you're interested in has appeared in previous breaches Pwned Passwords has indexed. If one of yours shows up there, you really want to stop using it on any service you care about. If you have a bunch of passwords and manually checking them all would be painful, give this a go:

This is 1Password's Watchtower feature and it can take all your stored passwords and check them against Pwned Passwords in one go. The same anonymity model is used (neither 1Password nor HIBP ever see your actual password) and it enables bulk checking all in one go. I'm conscious that many people reading this won't be using a password manager of any kind in the first place and that's an absolutely pivotal part of how to deal with this incident so I'll come back to that a little later. Apparently, this feature along with integrated HIBP searches and notifications when new breaches pop up is one of the most-loved features of 1Password which is pretty cool! For some background on that, without me knowing in advance, they launched an early version of this only a day after I released V2 with the anonymity model (incidentally, that was a key motivator for later partnering with them):

For those using Pwned Passwords in their own systems (EVE Online, GitHub, Okta et al),  the API is now returning the new data set and all cache has now been flushed (you should see a very recent "last-modified" response header). All the downloadable files have also been revised up to version 4 and are available on the Pwned Passwords page via download courtesy of Cloudflare or via torrents. They're in both SHA1 and NTLM formats with each ordered both alphabetically by hash and by prevalence (most common passwords first).

Why Load This Into HIBP?

Every single time I came across a data set that's not clearly a breach of a single, easily identifiable service, I ask the question - should this go into HIBP?  There are a number of factors that influence that decision and one of them is uniqueness; is this a sufficiently new set of data with a large volume of records I haven't seen before? In determining that, I take a slice of the email addresses and ran them against HIBP to see how many of them had been seen before. Here's what it looked like after a few hundred thousand checks:

The 773 Million Record

In other words, there's somewhere in the order of 140M email addresses in this breach that HIBP has never seen before.

The data was also in broad circulation based on the number of people that contacted me privately about it and the fact that it was published to a well-known public forum. In terms of the risk this presents, more people with the data obviously increases the likelihood that it'll be used for malicious purposes.

Then there's the passwords themselves and of the 21M+ unique ones, about half of them weren't already in Pwned Passwords. Keeping in mind how this service is predominantly used, that's a significant number that I want to make sure are available to the organisations that rely on this data to help steer their customers away from using higher-risk passwords.

And finally, every time I've asked the question "should I load data I can't emphatically identify the source of?", the response has always been overwhelmingly "yes":

People will receive notifications or browse to the site and find themselves there and it will be one more little reminder about how our personal data is misused. If - like me - you're in that list, people who are intent on breaking into your online accounts are circulating it between themselves and looking to take advantage of any shortcuts you may be taking with your online security. My hope is that for many, this will be the prompt they need to make an important change to their online security posture. And if you find yourself in this data and don't feel there's any value in knowing about it, ignore it. For everyone else, let's move on and establish the risk this presents then talk about fixes.

What's the Risk If My Data Is in There?

I referred to the word "combos" earlier on and simply put, this is just a combination of usernames (usually email addresses) and passwords. In this case, it's almost 2.7 billion of them compiled into lists which can be used for credential stuffing:

Credential stuffing is the automated injection of breached username/password pairs in order to fraudulently gain access to user accounts.

In other words, people take lists like these that contain our email addresses and passwords then they attempt to see where else they work. The success of this approach is predicated on the fact that people reuse the same credentials on multiple services. Perhaps your personal data is on this list because you signed up to a forum many years ago you've long since forgotten about, but because its subsequently been breached and you've been using that same password all over the place, you've got a serious problem.

By pure coincidence, just last week I wrote about credential stuffing attacks and how they led many people to believe that Spotify had suffered a data breach. In that post, I embedded a short video that shows how easily these attacks are automated and I want to include it again here:

Within the first 20 seconds, the author of the video has chosen a combo list just like the one three quarters of a billion people are in via this Combination #1 breach. Another 20 seconds and the software is testing those accounts against Spotify and reporting back with email addresses and passwords that can logon to accounts there. That's how easy it is and also how indiscriminate it is; it's not personal, you're just on the list! (For people wanting to go deeper, check out Shape Security's video on credential stuffing.)

To be clear too, this is not just a Spotify problem. Automated tools exist to leverage these combo lists against all sorts of other online services including ones you shop at, socialise at and bank at. If you found your password in Pwned Passwords and you're using that same one anywhere else, you want to change each and every one of those locations to something completely unique, which brings us to password managers.

Get a Password Manager

You have too many passwords to remember, you know they're not meant to be predictable and you also know they're not meant to be reused across different services. If you're in this breach and not already using a dedicated password manager, the best thing you can do right now is go out and get one. I did that many years ago now and wrote about how the only secure password is the one you can't remember. A password manager provides you with a secure vault for all your secrets to be stored in (not just passwords, I store things like credit card and banking info in mine too), and its sole purpose is to focus on keeping them safe and secure.

A password manager is also a rare exception to the rule that adding security means making your life harder. For example, logging on to a mobile app is dead easy:

I chose the password manager 1Password all those years ago and have stuck with it ever it since. As I mentioned earlier, they partnered with HIBP to help drive people interested in personal security towards better personal security practices and obviously there's some neat integration with the data in HIBP too (there's also a dedicated page explaining why I chose them).

If a digital password manager is too big a leap to take, go old school and get an analogue one (AKA, a notebook). Seriously, the lesson I'm trying to drive home here is that the real risk posed by incidents like this is password reuse and you need to avoid that to the fullest extent possible. It might be contrary to traditional thinking, but writing unique passwords down in a book and keeping them inside your physically locked house is a damn sight better than reusing the same one all over the web. Just think about it - you go from your "threat actors" (people wanting to get their hands on your accounts) being anyone with an internet connection and the ability to download a broadly circulating list Collection #1, to people who can break into your house - and they want your TV, not your notebook!


Because an incident of this size will inevitably result in a heap of questions, I'm going to list the ones I suspect I'll get here then add to it as others come up. It'll help me handle the volume of queries I expect to get and will hopefully make things a little clearer for everyone.

Q. Can you send me the password for my account?
I know I touched on it above but it's always the single biggest request I get so I'm repeating it here. No, I can't send you your password but I can give you a facility to search for it via Pwned Passwords.

Q. How long ago were these sites breached?
It varies. The first site on the list I shared was 000webhost who was breached in 2015, but there's also a file in there which suggests 2008. These are lots of different incidents from lots of different time frames.

Q. What can I do if I'm in the data?
If you're reusing the same password(s) across services, go and get a password manager and start using strong, unique ones across all accounts. Also turn on 2-factor authentication wherever it's available.

Q. I'm responsible for managing a website, how do I defend against credential stuffing attacks?
The fast, easy, free approach is using the Pwned Passwords list to block known vulnerable passwords (read about how other large orgs have used this service). There are services out there with more sophisticated commercial approaches, for example Shape Security's Blackfish (no affiliation with myself or HIBP).

Q. How can I check if people in my organisation are using passwords in this breach?
The entire Pwned Passwords corpus is also published as NTLM hashes. When I originally released these in August last year, I referenced code samples that will help you check this list against the passwords of accounts in an Active Directory environment.

Q. I'm using a unique password on each site already, how do I know which one to change?
You've got 2 options if you want to check your existing passwords against this list: The first is to use 1Password's Watch Tower feature described above. If you're using another password manager already, it's easy to migrate over (you can get a free 1Password trial). The second is to check all your existing passwords directly against the k-anonymity API. It'll require some coding, but's its straightforward and fully documented.

Q. Is there a list of which sites are included in this breach?
I've reproduced a list that was published to the hacking forum I mentioned and that contains 2,890 file names. This is not necessarily complete (nor can I easily verify it), but it may help some people understand the origin of their data a little better.

Q. Will you publish the data in collections #2 through #5?
Until this blog post went out, I wasn't even aware there were subsequent collections. I do have those now and I need to make a call on what to do with them after investigating them further.

Q. Where can I download the source data from?
Given the data contains a huge volume of personal information that can be used to access other people's accounts, I'm not going to direct people to it. I'd also ask that people don't do that in the comments section.

Comments Are Now Closed

After several hundred comments in a very short period of time, I'm closing this post for further contributions. Moderating them has consumed a significant amount of time that I've mostly dealt with whilst flying from Australia to Europe. I now need to focus on a short period of downtime followed by a couple of weeks of conference talks. Thank you all for your engagement, I'll talk more about this post in the next weekly update video I'll post on Friday 25.

Have I Been Pwned – The Sticker

Presently sponsored by: Live Workshop! Watch the Varonis DFIR team investigate a cyberattack using our data-centric security stack

Have I Been Pwned - The Sticker

So today is Have I Been Pwned's (HIBP's) 5th birthday. I started this project out of equal parts community service and curiosity and then somehow, over the last 5 years it's grown into something massive; hundreds of thousands of unique sessions a day, millions of subscribers, working with governments around the world and even fronting up to testify in Congress. I'd love to say I had the foresight to see all this coming but I didn't. Not one bit of it. I just did the things I thought made sense at the time and that was that.

As the 5th birthday approached, I asked people what I should do. There were many good suggestions but chief among those which were actually feasible (and frankly, that I liked the idea of!) was to make stickers. So I did:

Have I Been Pwned - The Sticker

The sticker is 5.97" x 1" (the olden day equivalent of 15.16cm x 2.54cm) with nice rounded corners. I just bought 1,000 of them off Sticker Mule with free shipping all the way down here to Australia and I'll be handing them out on my travels around the world. Apparently this link will get you $10 off your order and give me a $10 credit too so that I can put it towards buying more HIBP stickers and handing them out to more people.

If you'd like to order stickers, you can do it yourself directly via the Sticker Mule website (again, that's a referral link in both our best interests!) Sticker Mule doesn't have a way for me to share the exact approved artwork, but I had a better idea anyway; there's now a public GitHub repo on my account called hibp-stickers. You can simply grab the artwork from there (use the .ai Illustrator file in the "Wide Logo" folder) then create a new custom order of rounded corner stickers with a custom size matching the dimensions I mentioned earlier. Order as few or as many as you like, upload the Illustrator file, wait for them to come back to you with a proof (took a couple of hours for me) and it's job done.

I've got 2 requests for this and the first is to simply share generously. Print out a bunch and hand them out at conferences or during security talks or wherever else you like, just enjoy them and socialise them. The second request is that if you're the creative type and can do a better job of a sticker than me (highly likely), make a pull request on that GitHub repo and I'll happily accept any good ones and socialise them accordingly (feel free to drop in a readme crediting yourself too). I'd love to see what people do with these so tweet out a pic and mention me and I'll share your good work.

Along with the stickers, I'm also going to be running a live-streamed AMA at 19:00 my time tonight which will be 09:00 Tuesday morning for London, 04:00 on the US east coast and 01:00 on the west coast (so yeah... sorry US friends!) If I've set all this up right, it should magically start playing in the window below. If I haven't, keep an eye on my Twitter account and I'll communicate whatever other plans I make there.

Thank you everyone who's been a part of this journey from the supporters to the contributors to the pwned. In a week where we've just had news of both the Marriott / Starwood and Quora breaches, clearly the need for HIBP is greater than ever and I look forward to continuing to run it well into the future.