Well that was a hell of a week of travel. Seriously, the Denver situation was just an absolute mess but when looking at the video from the day I was meant to fly in, maybe being stuck in LA wasn't such a bad thing after all:
As of 1:30 p.m., all runways are closed, but the terminal & concourses are open. Airlines have cancelled flights for early afternoon/evening. Conditions on Peña Blvd. are poor; visibility is extremely low, conditions are icy. Consider the @RideRTD A Line when traveling to DEN. pic.twitter.com/AvGxVcZgeP
But despite the dramas I did still (just) make it and got to do my talk so as close as it was, I'm still yet to miss one. This week I'm talking about a bunch of different travel things, upcoming events, data breaches and those ridiculous bloody cookie warnings everyone hates so much. Next week I'll be in Seattle and will probably also be pushing the update out a little late, but I will still be pushing it out. Until then, here's the week that was:
The reason I don't know if it makes it better or worse is that on the one hand, it's ridiculous that in a part of the world that's more privacy-focused than most it essentially boils down to "take this cookie or no access for you" whilst on the other hand, the Dutch DPA somehow thinks that this makes any sense to (almost) anyone:
And the Dutch DPA’s guidance makes it clear internet visitors must be asked for permission in advance for any tracking software to be placed — such as third-party tracking cookies; tracking pixels; and browser fingerprinting tech — and that that permission must be freely obtained. Ergo, a free choice must be offered.
Is this really what we want? To continue chucking up cookie warnings to everyone and somehow expecting them to make an informed decision about the risks they present? 99% of people are going to click through them anyway (note: this is a purely fabricated figure based on the common-sense assumption that people will generally click through anything that gets in the way of performing the task they set out to complete in the first place). And honestly, how on earth is your average person going to make an informed decision on a message like this:
Do you know how hard it is to explain OAuth to technical people, let alone the masses? Oh wait - it's not OAuth - it's Oath but even I didn't get that at first because nobody really reads these warnings anyway! And now that I have read it and I know it's Oath, what does that really mean? Oh look, a big blue button that will make it all go away and allow me to do what I came here for in the first place...
But say you are more privacy focused and you wanted to follow that link in the original tweet. Here's your fix:
And if you're smart enough to actually understand what cookies are and be able to make an informed decision when prompted with a warning like TechCrunch's, then you're smart enough to know how to right click on a link and open it incognito. Or run an ad blocker. Or something like a Pi-hole.
Or you move to Australia because apparently, we don't deserve the same levels of privacy down here. Or have I got that back to front and Europeans don't deserve the same slick UX as we get down here? You know, the one where you click on a link to read an article and you actually get to read the article!
So let's be European for a moment and see how that experience looks - let's VPN into Amsterdam and try to control my privacy on TechCrunch:
Are you fucking serious? This is what privacy looks like? That's 224 different ad networks that are considered "IAB Partners" (that'd be the Interactive Advertising Bureau) and I can control which individual ones can set cookies. And that's in addition to the 10 Oath foundational partners:
And the ridiculous thing about it is that tracking isn't entirely dependent on cookies anyway (and yes, I know the Dutch situation touched on browser fingerprinting in general too). Want to see a perfect example? Have a go of Am I Unique and you'll almost certainly be told that "Yes! You can be tracked!":
Over one million samples collected and yet somehow, I am a unique snowflake that can be identified across requests without a cookie in sight. How? Because even though I'm running the current version of Chrome on the current version of Windows, less than 0.1% of people have the same user agent string as me. Less than 0.1% of people also have their language settings the same as mine. Keep combining these unique attributes and you have a very unique fingerprint:
The list goes on well beyond that screen grab too - time zone, screen resolution and even the way the canvas element renders on the page. It's kinda cool in a kinda creepy way.
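To make the "keep combining attributes" idea concrete, here's a minimal Python sketch. It is not how Am I Unique works internally; the attribute names and values are made up for illustration, and multiplying the shares together assumes the attributes are independent, which real browser attributes are not.

```python
import hashlib
import math


def fingerprint(attributes):
    """Hash a set of browser attributes into one stable identifier.

    Keys are sorted so the same attributes always give the same hash,
    regardless of the order they were collected in.
    """
    canonical = "\n".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def combined_share(shares):
    """Share of visitors matching every attribute, ASSUMING independence."""
    return math.prod(shares)


def identifying_bits(share):
    """Bits of identifying information carried by a trait with this share."""
    return -math.log2(share)


# Hypothetical values, purely for illustration.
browser = {
    "user_agent": "ExampleBrowser/1.0 (ExampleOS)",
    "language": "en-AU",
    "time_zone": "Australia/Brisbane",
    "screen": "2560x1440",
}
print(fingerprint(browser))

# Two attributes each shared by under 0.1% of visitors, as in the post.
print(combined_share([0.001, 0.001]))
print(identifying_bits(0.001))
```

No cookie is involved at any point: the identifier is recomputed from whatever the browser volunteers on each request.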
And here's the bit that really bugs me (ok, it all bugs me but this is the worst): how do we expect your normal everyday person to differentiate between cookie warnings and warnings like these:
I know what these are and you probably do too by virtue of being on this blog, but do you really think most people who have been conditioned to click through the warning that's sitting between them and the content they wish to read understand the difference between this and a cookie warning? We literally have banks telling people just to ignore these warnings:
German bank @comdirect recommends to just ignore the warning about an insecure connection in their online banking app.
So in summary, everyone clicks through cookie warnings anyway, if you read them you either can't understand what they're saying or the configuration of privacy settings is a nightmare, depending on where you are in the world you either don't get privacy or you don't get UX hell, if you understand the privacy risks then it's easy to open links incognito or use an ad blocker, you can still be tracked anyway and finally, the whole thing is just conditioning people to make bad security choices. That is all.
Heaps of stuff going on this week with all sorts of different bits and pieces. I bought a massive new stash of HIBP stickers (10k oughta last... a few weeks?), I'll be giving them out at a heap of upcoming events, I was on the Darknet Diaries podcast (which is epic!) plus there are more insights into the ShareThis data breach and the ginormous verifications.io incident. Oh - and Udemy is still pirating my content, here's the tweet if you'd like to let them know how you feel about that:
I'm not intentionally pushing these out later than usual, but events have just been such over the last few weeks that it's worked out that way. This one really is a short one though as there hasn't been a lot of newsworthy stuff going on this week, other than the new Instamics I picked up which are rather cool. The audio recording did work well (I mentioned in the video I wasn't sure if it was functioning correctly), and it's pretty damn good quality for what it is. Certainly better than my old Røde lapel mic, but obviously not up to the standard of the Electro-Voice I use for professional recording.
Next week I expect I'll be a little more organised and have some more content but until then, here's a succinct 14 minutes worth of what's new on my side:
It was another travel week so another slightly delayed weekly update, but still plenty of stuff going on all the same. Along with a private Sydney workshop earlier on, I'm talking about some free upcoming NDC meetup events in Brisbane and Melbourne which I'd love to get a great turnout for. I've just ordered 10k more HIBP stickers to last me through upcoming events so they'll be coming with me.
In other news, there was old news appearing as new news about how hosed you are if your machine is compromised with the level of hosing extending to your password manager. This will inevitably be another one of these times where something gets blown out of proportion (and context) in some of the news headlines then we'll all go back to more sane discussions about assessing relative risks, likelihoods and impacts. There's also a very steady feed of breaches making their way into HIBP after appearing for sale on dark web marketplaces so I give a bit of an update on those as well.
All that and more this week in a slightly shorter form than usual, enjoy!
Another week, another conference. This time it was Microsoft Ignite in Sydney and as tends to happen at these events, many casual meetups, chats, beers, selfies, delivery of HIBP stickers and an all-round good time, albeit an exhausting one. That's why I'm a day late this week having finally arrived home late last night.
Moving on though, I've got a bunch of other events coming up, particularly in conjunction with the folks at NDC. Brisbane in a couple of weeks, Gold Coast in April then Minnesota in May. Oh - plus Oslo in June and stretching out beyond that, Sydney in October. The link in the references below about how conferences can help keep speakers happy (or piss them off, as it may be), explains why I keep doing these events. All that plus more data breach news and my thoughts on the subsequent lists of credential stuffing data.
A race to the bottom is a market condition in which there is a surplus of a commodity relative to the demand for it. Often the term is used to describe labour conditions (workers versus jobs), and in simple supply and demand terms, once there's so much of something all vying for the attention of those consuming it, the value of it plummets.
On reflecting over the last 3 and a half weeks, this is where we seem to be with credential stuffing lists today and I want to use this blog post to explain the thinking whilst also addressing specific questions I've had regarding Collections #2 through #5.
The 773 Million Record "Collection #1" Data Breach
On Thursday 17 Jan, I loaded 773M records into Have I Been Pwned (HIBP) which I titled "Collection #1". I explained how this data originated from multiple different sources and was likely obtained over a period of many years before being amalgamated together and passed around as one massive stash. There were 2.7B rows of email addresses and passwords in total, but only 1.16B of them were unique (my own identical record appeared half a dozen times). In other words, there was a huge amount of redundancy.
I made the call to load the data into HIBP based primarily on 3 facts:
The data was sufficiently unique: more than 18% of the email addresses had not been seen in HIBP before
The data was in broad circulation: multiple parties had contacted me and passed on Collection #1
There was a large number of previously unseen passwords: of the 21M unique ones, half of them weren't already in HIBP's Pwned Passwords
Being conscious that there would be many questions about this data and that the origins and impact of it could be easily misrepresented, I carefully detailed every important fact. I pushed the blog post out on that Thursday morning my time and later that day, hopped on a plane to Europe. As the rest of the world woke up to the story, all hell broke loose. I have never, ever received so many emails, tweets, blog comments and every other form of communication you can imagine in such a short period of time. I'd also never seen so much traffic on HIBP:
A week after the start of unprecedented traffic levels on @haveibeenpwned, I thought I'd share some stats on volumes and how everything performed, beginning with the total number of users to the site: pic.twitter.com/WAGzOTwNxx
I spent a significant part of the flight chewing through Emirates' bandwidth just responding to messages. I landed in Oslo, met friends and drove up into the mountains for a snowboarding trip with the flood of communications continuing. Jet lagged, overwhelmed by it all and frankly, just wanting downtime with good company, I turned on the out of office, closed comments on the blog post and almost completely stopped engaging on Twitter. (Side note: Scott Helme and I talked about burnout in my weekly update from London, in part due to the experiences I had dealing with the above.)
If I'm honest, that experience with the flood of communication coupled with disconnecting from life for a few days in a remote cabin with friends had a profound effect on me in many ways. I'm sure I'll talk more about them in future, but one was that I've very consciously reduced my engagement on email and Twitter, frankly, to save my sanity. That's a bit tangential here though, back to Collection #1.
I'm frustrated about the hyperbole this incident managed to attract. The mass media picked it up with gusto and it made headlines all around the world in the most mainstream of publications. Inevitably, whether deliberately for the headlines or accidentally because it's simply not the world they live in, the truth was stretched time and time again. Despite my best efforts to report everything I knew with candour, things got out of control. For the most part I ignored this, only occasionally venting my frustration as someone brought it to the fore:
There were more than 3k words in that blog post detailing every single thing I knew about the data, what specifically do you think was missing?
Of course, there was nothing missing from the post and each time I asked the question it was met with silence. (Incidentally, Lorenzo who wrote that Motherboard piece is a top-notch infosec journo I've worked with many times before and he reported accurately in that piece.) I'm sharing this because I want to ensure that those who expressed their dismay at the way this story unfolded understand that it bugged the hell out of me too.
But I will say this: because this incident reached an unprecedented number of people and gained such worldwide traction, the impact of it on normal, everyday people's behaviour was significant. They learned about the phenomenon that is data breaches and credential stuffing lists, they read about password managers and 2FA and inevitably, many of them subsequently made behavioural changes to their security practices. Over-inflated headlines or not, the outcome of this on everyday consumers was positive.
The Other Collections
When I was originally contacted about Collection #1, that was the extent I knew of this series - that there was 1 collection. But very quickly it became apparent that it was merely the first of 5 collections and it was far from the biggest. Collection #1 was 87GB of data but collections #2 through #5 totalled another 845GB on top of that. Instead of the 2.7B rows from the Collection #1, the headlines were now talking about 25B which, admittedly, is quite the catchy title. Dozens of people reached out to me with links to the additional data and indeed, the media lapped up news of the larger collections as well. Inevitably, I got bombarded with questions about the subsequent collections:
Keeping in mind my previous comments about overwhelming amounts of communication and workload, the thought of processing a 10x volume of data over Collection #1 wasn't exactly exciting me. Nevertheless, I grabbed the additional collections whilst travelling, flew home just over a week ago and began analysis. Before doing that, I had a working theory that the subsequent collections would be more of the same, but I wanted hard numbers on it so I began running the data against the existing 6.5B records in HIBP.
Spam, Spam, Spam Everywhere
Back when I originally began looking at Collection #1, one of the first things I did was to run a sample selection of email addresses against HIBP to get a sense of how many of them were unique. As mentioned earlier, it turned out to be just over 18% which was quite significant for such a large list. The very first thing I did with collections #2 through #5 was to choose slices of the data and check them against HIBP. This meant choosing a random file from amongst the 85k+ in the data, extracting all the email addresses then grabbing a random sample of 100 and looking for uniqueness. After checking hundreds of files, here's what I found:
Tested 457 files, 280 were a 100% match
Tested 44,426 addresses, found 5,282 unique ones not already in HIBP, only 11.89% unique
For the sake of transparency, I've published the complete output of this process which shows just how much crossover there is with existing data. As you scroll through that list, you'll see that over 61% of the files tested were a 100% match to HIBP; every single one of those random 100 email addresses tested was already in there. (Sidenote: after running the report, I realised that some of the source files didn't contain email addresses and as such reported "Of 0 random email addresses, 0 are already in HIBP". That's fine, but it skewed the 61% number down as the file was counted as not being an exact match.)
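For anyone wanting to picture the sampling process, here's a rough Python sketch of it. This is not the script that produced the published output; the email regex is a simple approximation, and `known_addresses` is a hypothetical stand-in (a plain set) for whatever lookup against HIBP was actually used.

```python
import random
import re

# Rough email matcher, good enough for a sketch but not a full validator.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def sample_overlap(text, known_addresses, sample_size=100, rng=random):
    """Sample addresses from one file's text and count the known ones.

    Returns (number_sampled, number_already_known). A file with no
    email addresses in it returns (0, 0).
    """
    addresses = sorted({match.lower() for match in EMAIL_PATTERN.findall(text)})
    sample = rng.sample(addresses, min(sample_size, len(addresses)))
    already_known = sum(1 for address in sample if address in known_addresses)
    return len(sample), already_known


def percent(part, whole):
    """Percentage of whole represented by part, to two decimal places."""
    return round(100 * part / whole, 2)


# The totals quoted above.
print(percent(5282, 44426))  # 11.89, share of tested addresses not already in HIBP
print(percent(280, 457))     # 61.27, share of files that were a 100% match
```

Note that a file returning `(0, 0)` still counts towards the 457 files tested, which is the skew described in the sidenote above.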
Some of these were quite predictable:
Collection #5\Collection #5\Dump HASH\www.babynames.com.txt
Of 100 random email addresses, 100 are already in HIBP
There's an easy explanation for that:
Then there were files at the other end of the extreme:
Collection #5\Collection #5\EU combos\49.txt
Of 100 random email addresses, 1 are already in HIBP
Curious, I took a closer look and found 100k rows heavily orientated towards Eastern European TLDs; over 20k .ua (Ukraine), another 10k .uz (Uzbekistan), 5k .kz (Kazakhstan) etc. I have no idea how many of these are actual addresses nor which breaches they originated from if they're indeed genuine, obviously there's nothing given away by the file name. The problem with all of this data (as with Collection #1), is that it's just about impossible to establish authenticity and a bunch of it is very likely not what it's represented to be.
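A TLD breakdown like that takes only a few lines of code. The Python below is an illustrative sketch, not the tooling used for the analysis, and the sample addresses are invented.

```python
from collections import Counter


def tld_counts(addresses):
    """Count addresses per top-level domain, ignoring rows with no usable domain."""
    counts = Counter()
    for address in addresses:
        _, _, domain = address.rpartition("@")
        if "." in domain:
            counts[domain.rsplit(".", 1)[1].lower()] += 1
    return counts


# Invented sample rows, purely for illustration.
sample = ["a@example.ua", "b@example.UA", "c@example.kz", "not-an-address"]
for tld, count in tld_counts(sample).most_common():
    print(f".{tld}: {count}")
```

It says nothing about whether the addresses are genuine, only how they're distributed.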
Here's a perfect example: when running the check, one of the very first results I saw was this one:
It piqued my interest as it's an Aussie TLD for a site I'd never heard of yet apparently, 100% of the email addresses are already in HIBP. So I delved into the file and was immediately struck by the occurrence of a different TLD which, upon counting its occurrences across the 436-line file, showed a strangely high hit rate:
The file itself was then a combination of email addresses and SHA-1 hashes along with email addresses and then simply the number 1 after it. This is unusual as not only is there no consistency to the format, but it's also clearly comprised of different types of information.
During the course of the last week, I had a few chats with Vinny Troia of Night Lion Security. Vinny has supported HIBP in the past with data he's located floating around the web and we had a good discussion about the nature of these collections which he was also analysing. He also lamented the volume of garbage in them, pointing to examples such as this (the asterisks all represent the same 4-digit number):
Then there's my own data. I'd already found it in Collection #1 half a dozen times with an old throwaway password I had legitimately used many years ago. I noted it in the original blog post but didn't dig any further. This time, however, I probed deeper; I wanted context for the data.
Here it is in "Collection #2\Collection #2\DUMPS dehashed\thegioididong.com.txt"
I had to look up thegioididong.com in order to work out what it was. Turns out it's a Vietnamese e-commerce site selling phones so yeah, not exactly the sort of place I'd frequent.
And here it is in "Collection #5\DUMP dehashed\DropBox.com add pass.txt":
No, I've not screwed up the image, the file it's in is identical to the Vietnamese phone one. The password is identical too and firstly, under no circumstances did I ever use that password on Dropbox and secondly, the password I had in the Dropbox breach was randomly generated and exposed as a bcrypt hash I shared publicly when reporting on the breach.
So you see my point about "spam, spam, spam" - these collections are absolutely riddled with junk. That's not to say they don't contain legitimate usernames and passwords because quite clearly, some of them do, rather that the number of actual unique legitimate entries across all the collections is a small subset of what the headlines suggest.
It's a Very Deep Bottom
Following the events above, I received dozens of messages (maybe even hundreds, I honestly lost track) about other collections of credentials. Not collections represented as being part of the same series (i.e. Collection #6), but rather entirely separate sets of data. A few thousand here from a phishing page, a few hundred thousand over there in a public Google Doc, untold numbers more in pastes that HIBP may not have already indexed. I've seen a lot of breached data over a lot of years but even for me, I was honestly left a bit stunned by all of this. It. Just. Never. Ends.
A massive 600 gigabyte file containing about 2.2 billion compromised usernames and passwords has been spotted floating about the dark web, freely available to anyone who cares to download it via torrent.
In case the ASCII art is lost on you, that's "13 BILLION /EMAILS\" in a readme file accompanied by an 88GB file containing that number of email and password pairs. It was about this time that the penny finally dropped in terms of just how comedic it was becoming to have numbers that seemed both artificially large and apparently there for shock value. It's like I'd seen this somewhere before...
All of this data in all of these locations has caused me to ask some pretty fundamental questions about the point of these lists as they relate to HIBP:
What's the point of loading billions after billions of email addresses from credential stuffing lists? What makes a new list worth adding to the 6.5B addresses already in HIBP? And if I'm going to be honest with myself, what's changed since I loaded Collection #1 that would cause me not to load subsequent lists?
The answer to the last question is a combination of the frenzy that first list created coupled with the emergence of untold numbers of other lists. What's changed is that there's way more data circulating than I've ever seen before and if I go loading all of that into HIBP, I fear the signal to noise ratio will go through the floor. Some people already felt that was the case with Collection #1 and whilst I still maintain loading that list was the right thing to do in the climate of the time, a constant stream of notifications about old incidents that have merely re-purposed the same data is quickly going to create a groundswell of unhappy subscribers.
Somehow, the Collection #1 incident turned into a feeding frenzy of media, breach traders, security firms and industry voices alike, all vying for a piece of the attention. Whilst there was undoubtedly value in the awareness it created, an increasing infatuation with which list is the largest or who's sitting on the largest stash of data is just downright counterproductive. It becomes a sideshow of superlative news headlines as the discussion turns to "whose is biggest" rather than "what should we actually be doing about this".
For now, I don't see subsequent lists like these going into HIBP unless there's something sufficiently unique about them. Users of the service have a pretty good idea by now where they've been exposed and what they should do about it; I want to keep focusing on the discrete incidents that are clearly attributable back to a source. Speaking of which:
I'm back home! It was an amazing trip in many ways, not least of which was the time it gave both Scott and myself to reflect on workload and managing lives which can be a bit of a never-ending series of commitments. To that effect, I've been backing off Twitter a bit and as I say in this update, I very quickly remembered why after a couple of short engagements yesterday. But moving forward, it's Microsoft Ignite in Sydney next week and that should be a great event, plus I'm talking about Google's Password Checkup extension and the other credential stuffing list "collections" I keep getting asked about. On that last point, I explain my hesitation with them in the video so for those curious about my opinion, hopefully this helps shed some light on things.
I'm pumping this weekly update out a little bit later, pushing it just before I get on the plane back home to Australia. I've just wrapped up a week in London with Scott doing all things NDC including a couple of days of workshops and a couple of talks each. We discuss that, and how the UK seems to have an odd infatuation with doing anything that could even remotely be deemed a health and safety risk.
On a more serious note, we talk about the emotional toll of the things we do, namely the never ending charging forward with projects like Report URI and HIBP, along with the training, conference talks and what seems like a never-ending pit of emails. I really want to talk more about this in future because whilst I don't personally feel like I'm suffering from burn-out, I can see how that would be the inevitable conclusion of doing too much of this for too long. As I say in the video, I (and Scott) welcome all comments on this.
So it's been a bit of a crazy week. I got onto the plane in Australia on Thursday evening just as Europe was waking up to the news of the 773M email address credential stuffing list I loaded into HIBP. And then the flood began; blog comments, emails, tweets - it was an absolute deluge. I spent the flight fielding the ones I could, landed in Oslo and dealt with more on the way up the mountain then frankly, got there and tuned out. Out of office on, blog comments closed and tweets ignored. This trip was planned downtime with my son and good friends and I really needed it.
In this week's update, I talk about the coverage of that event with Scott Helme while sitting in Oslo during a break in our workshops. We also talked about what frankly, became a bit of a spectacle: the VLC debate about serving updates over HTTP. I'll link to that in the references below and you can hear Scott's and my thoughts on it there. Next week, we'll both be in London at the NDC conference so Scott will join me again for another update then.
And then there was the biggest data breach to go into HIBP ever! I wrote that sentence from home just after publishing all the data, then I got on a plane...
Holy cow that's a lot of emails! Hundreds upon hundreds of emails came in whilst on the way to Dubai, more than I'll ever be able to respond to. Plus, I'm actually trying to have some downtime with my son on this trip particularly over the next few days so a bunch of stuff is going to have to go unanswered or at best, delayed. Mind you, a heap of them were asking questions already addressed in the blog post, but that's just the nature of the internet.
What I will say is that if you're interested in more details on this incident, do read the comments. It'll give you a sense of the way this sort of thing impacts everyday people, and it'll also give you a sense of the sort of comments I have to deal with after these incidents...
Many people will land on this page after learning that their email address has appeared in a data breach I've called "Collection #1". Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.
Let's start with the raw numbers because that's the headline, then I'll drill down into where it's from and what it's composed of. Collection #1 is a set of email addresses and passwords totalling 2,692,818,238 rows. It's made up of many different individual data breaches from literally thousands of different sources. (And yes, fellow techies, that's a sizeable amount more than a 32-bit integer can hold.)
In total, there are 1,160,253,228 unique combinations of email addresses and passwords. This is when treating the password as case sensitive but the email address as not case sensitive. This also includes some junk because hackers being hackers, they don't always neatly format their data dumps into an easily consumable fashion. (I found a combination of different delimiter types including colons, semicolons, spaces and indeed a combination of different file types such as delimited text files, files containing SQL statements and other compressed archives.)
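For the fellow techies, the counting rule described above (password case sensitive, email address not) can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the code used to process Collection #1: it only handles delimited text rows with the three delimiters mentioned, and it silently drops anything it can't parse.

```python
import re

# Delimiters mentioned above: colon, semicolon and space.
DELIMITER = re.compile(r"[:; ]")


def parse_row(line):
    """Split one row into (email, password), or return None for junk rows."""
    parts = DELIMITER.split(line.rstrip("\r\n"), maxsplit=1)
    if len(parts) != 2 or "@" not in parts[0] or not parts[1]:
        return None
    email, password = parts
    # Email is normalised to lower case; the password is left exactly as found.
    return email.lower(), password


def unique_combinations(lines):
    """Return the set of distinct (email, password) pairs across all rows."""
    return {row for row in map(parse_row, lines) if row is not None}


rows = [
    "Alice@Example.com:Secret",
    "alice@example.com:Secret",   # same pair once the email is lower-cased
    "alice@example.com:secret",   # different password, so a different pair
    "bob@example.org;pw1",
    "carol@example.net pw2",
]
print(len(unique_combinations(rows)))  # 4
```

Splitting on the first delimiter only means a password containing a colon or space survives intact.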
The unique email addresses totalled 772,904,991. This is the headline you're seeing as this is the volume of data that has now been loaded into Have I Been Pwned (HIBP). It's after as much clean-up as I could reasonably do and per the previous paragraph, the source data was presented in a variety of different formats and levels of "cleanliness". This number makes it the single largest breach ever to be loaded into HIBP.
There are 21,222,975 unique passwords. As with the email addresses, this was after implementing a bunch of rules to do as much clean-up as I could including stripping out passwords that were still in hashed form, ignoring strings that contained control characters and those that were obviously fragments of SQL statements. Regardless of best efforts, the end result is not perfect nor does it need to be. It'll be 99.x% perfect though and that x% has very little bearing on the practical use of this data. And yes, they're all now in Pwned Passwords, more on that soon.
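To give a feel for what those clean-up rules might look like, here's a hedged Python sketch. The patterns are my own guesses at reasonable heuristics (hex strings of common hash lengths, bcrypt-style prefixes, a handful of SQL keywords); they are not the actual rules applied to this data and they will misclassify some edge cases, such as a genuine password that happens to be 32 hex characters.

```python
import re

# Assumed heuristics: hex strings the length of MD5, SHA-1 or SHA-256 output.
HEX_HASH = re.compile(r"^(?:[0-9a-fA-F]{32}|[0-9a-fA-F]{40}|[0-9a-fA-F]{64})$")
# Assumed heuristic: the usual bcrypt layout of prefix, cost and 53 characters.
BCRYPT_HASH = re.compile(r"^\$2[aby]\$\d{2}\$.{53}$")
# Assumed heuristic: a few keywords that suggest a fragment of a SQL dump.
SQL_FRAGMENT = re.compile(r"\b(?:INSERT INTO|CREATE TABLE|VALUES\s*\()", re.IGNORECASE)


def looks_like_password(value):
    """Return False for values that are probably hashes, binary junk or SQL."""
    if not value:
        return False
    if any(ord(ch) < 32 or ord(ch) == 127 for ch in value):
        return False  # contains control characters
    if HEX_HASH.match(value) or BCRYPT_HASH.match(value):
        return False  # still in hashed form
    if SQL_FRAGMENT.search(value):
        return False  # fragment of a SQL statement
    return True


candidates = ["P@ssw0rd", "5f4dcc3b5aa765d61d8327deb882cf99", "abc\x00def"]
print([value for value in candidates if looks_like_password(value)])
```

As with the real thing, the result won't be perfect, and it doesn't need to be.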
That's the numbers, let's move onto where the data has actually come from.
Last week, multiple people reached out and directed me to a large collection of files on the popular cloud service, MEGA (the data has since been removed from the service). The collection totalled over 12,000 separate files and more than 87GB of data. One of my contacts pointed me to a popular hacking forum where the data was being socialised, complete with the following image:
As you can see at the top left of the image, the root folder is called "Collection #1" hence the name I've given this breach. The expanded folders and file listing give you a bit of a sense of the nature of the data (I'll come back to the word "combo" later), and as you can see, it's (allegedly) from many different sources. The post on the forum referenced "a collection of 2000+ dehashed databases and Combos stored by topic" and provided a directory listing of 2,890 of the files which I've reproduced here. This gives you a sense of the origins of the data but again, I need to stress "allegedly". I've written before about what's involved in verifying data breaches and it's often a non-trivial exercise. Whilst there are many legitimate breaches that I recognise in that list, that's the extent of my verification efforts and it's entirely possible that some of them refer to services that haven't actually been involved in a data breach at all.
However, what I can say is that my own personal data is in there and it's accurate; right email address and a password I used many years ago. Like many of you reading this, I've been in multiple data breaches before which have resulted in my email addresses and yes, my passwords, circulating in public. Fortunately, only passwords that are no longer in use, but I still feel the same sense of dismay that many people reading this will when I see them pop up again. They're also ones that were stored as cryptographic hashes in the source data breaches (at least the ones that I've personally seen and verified), but per the quoted sentence above, the data contains "dehashed" passwords which have been cracked and converted back to plain text. (There's an entirely different technical discussion about what makes a good hashing algorithm and why the likes of salted SHA1 is as good as useless.) In short, if you're in this breach, one or more passwords you've previously used are floating around for others to see.
So that's where the data has come from, let me talk about how to assess your own personal exposure.
Checking Email Addresses and Passwords in HIBP
There'll be a significant number of people that'll land here after receiving a notification from HIBP; about 2.2M people presently use the free notification service and 768k of them are in this breach. Many others, over the years to come, will check their address on the site and land on this blog post when clicking in the breach description for more information. These people all know they were in Collection #1 and if they've read this far, hopefully they have a sense of what it is and why they're in there. If you've come here via another channel, checking your email address on HIBP is as simple as going to the site, entering it in then looking at the results (scrolling further down lists the specific data breaches the address was found in):
But what many people will want to know is what password was exposed. HIBP never stores passwords next to email addresses and there are many very good reasons for this. That link explains it in more detail but in short, it poses too big a risk for individuals, too big a risk for me personally and frankly, can't be done without taking the sorts of shortcuts that nobody should be taking with passwords in the first place! But there is another way and that's by using Pwned Passwords.
This is a password search feature I built into HIBP about 18 months ago. The original intention of it was to provide a data set to people building systems so that they could refer to a list of known breached passwords in order to stop people from using them again (or at least advise them of the risk). This provided a means of implementing guidance from government and industry bodies alike, but it also provided individuals with a repository they could check their own passwords against. If you're inclined to lose your mind over that last statement, read about the k-anonymity implementation then continue below.
Here's how it works: let's do a search for the word "P@ssw0rd" which incidentally, meets most password strength criteria (upper case, lower case, number and 8 characters long):
Obviously, any password that's been seen over 51k times is terrible and you'd be ill-advised to use it anywhere. When I searched for that password, the data was anonymised first and HIBP never received the actual value of it. Yes, I'm still conscious of the messaging when suggesting to people that they enter their password on another site but in the broader scheme of things, if someone is actually using the same one all over the place (as the vast majority of people still do), then the wakeup call this provides is worth it.
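For anyone curious about what "anonymised first" means in practice, here's a minimal sketch of the idea rather than the code the HIBP page actually runs. It assumes the public range endpoint at `https://api.pwnedpasswords.com/range/`, which returns `SUFFIX:COUNT` lines for a 5-character SHA-1 prefix; the function names are made up for illustration:

```python
import hashlib
import urllib.request

RANGE_URL = "https://api.pwnedpasswords.com/range/"  # public k-anonymity endpoint


def split_hash(password):
    """SHA-1 the password, then split the hex digest into the 5-char prefix
    that gets sent and the 35-char suffix that never leaves your machine."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_range(range_body, suffix):
    """Look for our suffix in a range response of 'SUFFIX:COUNT' lines."""
    for line in range_body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0  # suffix absent: not seen in the indexed breaches


def pwned_count(password):
    """Fetch every hash sharing our prefix, then do the match locally."""
    prefix, suffix = split_hash(password)
    request = urllib.request.Request(
        RANGE_URL + prefix,
        headers={"User-Agent": "pwned-passwords-sketch"},  # arbitrary label
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return count_in_range(response.read().decode("utf-8"), suffix)
```

Calling `pwned_count("P@ssw0rd")` only ever puts the first 5 characters of the hash on the wire; the comparison against the rest of the hash happens on your side.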
As of now, all 21,222,975 passwords from Collection #1 have been added to Pwned Passwords bringing the total number of unique values in the list to 551,509,767.
Whilst I can't tell you precisely what password was against your own record in the breach, I can tell you if any password you're interested in has appeared in previous breaches Pwned Passwords has indexed. If one of yours shows up there, you really want to stop using it on any service you care about. If you have a bunch of passwords and manually checking them all would be painful, give this a go:
This is 1Password's Watchtower feature and it can take all your stored passwords and check them against Pwned Passwords in one go. The same anonymity model is used (neither 1Password nor HIBP ever see your actual password). I'm conscious that many people reading this won't be using a password manager of any kind in the first place and that's an absolutely pivotal part of how to deal with this incident so I'll come back to that a little later. Apparently, this feature along with integrated HIBP searches and notifications when new breaches pop up is one of the most-loved features of 1Password which is pretty cool! For some background on that, without me knowing in advance, they launched an early version of this only a day after I released V2 with the anonymity model (incidentally, that was a key motivator for later partnering with them):
For those using Pwned Passwords in their own systems (EVE Online, GitHub, Okta et al), the API is now returning the new data set and all cache has now been flushed (you should see a very recent "last-modified" response header). All the downloadable files have also been revised up to version 4 and are available on the Pwned Passwords page via download courtesy of Cloudflare or via torrents. They're in both SHA1 and NTLM formats with each ordered both alphabetically by hash and by prevalence (most common passwords first).
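The ordered-by-hash files are the ones that make offline lookups practical, because a sorted file can be binary searched instead of scanned from the top. Here's a rough sketch, assuming each line has the form `HASH:COUNT` in upper-case hex and that you've downloaded the ordered-by-hash variant; the function name is hypothetical:

```python
import os


def lookup_sorted_hash_file(path, target_hash):
    """Binary search a text file of 'HASH:COUNT' lines sorted by hash.

    Returns the count for target_hash, or 0 if the hash is not present.
    """
    target = target_hash.strip().upper().encode("ascii")

    def line_after(handle, offset):
        # Return the first complete line beginning after `offset`
        # (or the very first line of the file when offset is 0).
        handle.seek(offset)
        if offset > 0:
            handle.readline()  # discard the partial line we landed in
        return handle.readline()

    with open(path, "rb") as handle:
        low, high = 0, os.path.getsize(path)
        # Narrow down to the smallest byte offset whose following line
        # holds a hash that is >= the one we are looking for.
        while low < high:
            mid = (low + high) // 2
            line = line_after(handle, mid)
            if not line or line.split(b":", 1)[0] >= target:
                high = mid
            else:
                low = mid + 1
        line = line_after(handle, low).strip()

    if not line:
        return 0
    found_hash, _, count = line.partition(b":")
    return int(count) if found_hash == target else 0
```

Each lookup touches only a few dozen lines even in a file with hundreds of millions of entries, which is why the sort order matters.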
Why Load This Into HIBP?
Every single time I come across a data set that's not clearly a breach of a single, easily identifiable service, I ask the question - should this go into HIBP? There are a number of factors that influence that decision and one of them is uniqueness; is this a sufficiently new set of data with a large volume of records I haven't seen before? In determining that, I took a slice of the email addresses and ran them against HIBP to see how many of them had been seen before. Here's what it looked like after a few hundred thousand checks:
In other words, there's somewhere in the order of 140M email addresses in this breach that HIBP has never seen before.
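The extrapolation behind a figure like that is simple: measure what fraction of a sample hasn't been seen before, then scale it up to the whole set. A rough sketch with purely illustrative inputs (they are not the actual counts from my checks) and a hypothetical function name:

```python
def estimate_unseen(total_addresses, sample_size, sample_unseen):
    """Scale the unseen share of a sample up to the full data set."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= sample_unseen <= sample_size:
        raise ValueError("sample_unseen must be between 0 and sample_size")
    return round(total_addresses * sample_unseen / sample_size)


# Illustrative only: roughly three quarters of a billion addresses, and a
# sample in which 18 out of every 100 addresses were new.
estimate = estimate_unseen(750_000_000, 100_000, 18_000)
print(f"{estimate:,} addresses estimated as never seen before")
```

The bigger the sample, the tighter the estimate, which is why it's worth running a few hundred thousand checks before trusting the number.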
The data was also in broad circulation based on the number of people that contacted me privately about it and the fact that it was published to a well-known public forum. In terms of the risk this presents, more people with the data obviously increases the likelihood that it'll be used for malicious purposes.
Then there's the passwords themselves and of the 21M+ unique ones, about half of them weren't already in Pwned Passwords. Keeping in mind how this service is predominantly used, that's a significant number that I want to make sure are available to the organisations that rely on this data to help steer their customers away from using higher-risk passwords.
And finally, every time I've asked the question "should I load data I can't emphatically identify the source of?", the response has always been overwhelmingly "yes":
People will receive notifications or browse to the site and find themselves there and it will be one more little reminder about how our personal data is misused. If - like me - you're in that list, people who are intent on breaking into your online accounts are circulating it between themselves and looking to take advantage of any shortcuts you may be taking with your online security. My hope is that for many, this will be the prompt they need to make an important change to their online security posture. And if you find yourself in this data and don't feel there's any value in knowing about it, ignore it. For everyone else, let's move on and establish the risk this presents then talk about fixes.
What's the Risk If My Data Is in There?
I referred to the word "combos" earlier on and simply put, this is just a combination of usernames (usually email addresses) and passwords. In this case, it's almost 2.7 billion of them compiled into lists which can be used for credential stuffing:
Credential stuffing is the automated injection of breached username/password pairs in order to fraudulently gain access to user accounts.
In other words, people take lists like these that contain our email addresses and passwords then they attempt to see where else they work. The success of this approach is predicated on the fact that people reuse the same credentials on multiple services. Perhaps your personal data is on this list because you signed up to a forum many years ago you've long since forgotten about, but because it's subsequently been breached and you've been using that same password all over the place, you've got a serious problem.
By pure coincidence, just last week I wrote about credential stuffing attacks and how they led many people to believe that Spotify had suffered a data breach. In that post, I embedded a short video that shows how easily these attacks are automated and I want to include it again here:
Within the first 15 seconds, the author of the video has chosen a combo list just like the one three quarters of a billion people are in via this Collection #1 breach. Another 30 seconds and the software is testing those accounts against Spotify and reporting back with email addresses and passwords that can log on to accounts there. That's how easy it is and also how indiscriminate it is; it's not personal, you're just on the list! (For people wanting to go deeper, check out Shape Security's video on credential stuffing.)
To be clear too, this is not just a Spotify problem. Automated tools exist to leverage these combo lists against all sorts of other online services including ones you shop at, socialise at and bank at. If you found your password in Pwned Passwords and you're using that same one anywhere else, you want to change each and every one of those locations to something completely unique, which brings us to password managers.
Get a Password Manager
You have too many passwords to remember, you know they're not meant to be predictable and you also know they're not meant to be reused across different services. If you're in this breach and not already using a dedicated password manager, the best thing you can do right now is go out and get one. I did that many years ago now and wrote about how the only secure password is the one you can't remember. A password manager provides you with a secure vault for all your secrets to be stored in (not just passwords, I store things like credit card and banking info in mine too), and its sole purpose is to focus on keeping them safe and secure.
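To make "not predictable and not reused" concrete, this is roughly what the generator inside a password manager does for you. It's a minimal sketch using Python's `secrets` module; the alphabet, default length and function names are assumptions for illustration, not 1Password's actual algorithm:

```python
import secrets
import string

# Assumed character set; real generators let you tune this per site.
ALPHABET = string.ascii_letters + string.digits + "!@#$%^&*-_"


def generate_password(length=24):
    """Build a password from a cryptographically secure random source."""
    if length < 12:
        raise ValueError("use at least 12 characters")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def generate_for_sites(sites):
    """One independent password per site, so a breach of one site
    gives an attacker nothing to try against the others."""
    return {site: generate_password() for site in sites}
```

Nobody is going to remember the output of `generate_password()`, and that's exactly the point: the vault remembers it so you don't have to.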
A password manager is also a rare exception to the rule that adding security means making your life harder. For example, logging on to a mobile app is dead easy:
I chose the password manager 1Password all those years ago and have stuck with it ever since. As I mentioned earlier, they partnered with HIBP to help drive people interested in personal security towards better personal security practices and obviously there's some neat integration with the data in HIBP too (there's also a dedicated page explaining why I chose them).
If a digital password manager is too big a leap to take, go old school and get an analogue one (AKA, a notebook). Seriously, the lesson I'm trying to drive home here is that the real risk posed by incidents like this is password reuse and you need to avoid that to the fullest extent possible. It might be contrary to traditional thinking, but writing unique passwords down in a book and keeping them inside your physically locked house is a damn sight better than reusing the same one all over the web. Just think about it - you go from your "threat actors" (people wanting to get their hands on your accounts) being anyone with an internet connection and the ability to download a broadly circulating list like Collection #1, to people who can break into your house - and they want your TV, not your notebook!
Because an incident of this size will inevitably result in a heap of questions, I'm going to list the ones I suspect I'll get here then add to it as others come up. It'll help me handle the volume of queries I expect to get and will hopefully make things a little clearer for everyone.
Q. Can you send me the password for my account? I know I touched on it above but it's always the single biggest request I get so I'm repeating it here. No, I can't send you your password but I can give you a facility to search for it via Pwned Passwords.
Q. How long ago were these sites breached? It varies. The first site on the list I shared was 000webhost who was breached in 2015, but there's also a file in there which suggests 2008. These are lots of different incidents from lots of different time frames.
Q. I'm responsible for managing a website, how do I defend against credential stuffing attacks? The fast, easy, free approach is using the Pwned Passwords list to block known vulnerable passwords (read about how other large orgs have used this service). There are services out there with more sophisticated commercial approaches, for example Shape Security's Blackfish (no affiliation with myself or HIBP).
Q. How can I check if people in my organisation are using passwords in this breach? The entire Pwned Passwords corpus is also published as NTLM hashes. When I originally released these in August last year, I referenced code samples that will help you check this list against the passwords of accounts in an Active Directory environment.
Q. I'm using a unique password on each site already, how do I know which one to change? You've got 2 options if you want to check your existing passwords against this list: The first is to use 1Password's Watchtower feature described above. If you're using another password manager already, it's easy to migrate over (you can get a free 1Password trial). The second is to check all your existing passwords directly against the k-anonymity API. It'll require some coding, but it's straightforward and fully documented.
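For that second option, the coding involved is modest. The sketch below checks a batch of passwords while only ever handing 5-character SHA-1 prefixes to the lookup. Note that `fetch_range` is a hypothetical callable you'd supply yourself (for instance a thin wrapper around an HTTP GET to the documented range endpoint) which returns the `SUFFIX:COUNT` response text for a prefix:

```python
import hashlib


def check_passwords(passwords, fetch_range):
    """Return {password: times_seen} using k-anonymity range lookups.

    fetch_range(prefix) must return the response body for that 5-char
    SHA-1 prefix as text, one 'SUFFIX:COUNT' entry per line.
    """
    range_cache = {}  # prefix -> {suffix: count}; each prefix is fetched once
    results = {}
    for password in passwords:
        digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
        prefix, suffix = digest[:5], digest[5:]
        if prefix not in range_cache:
            entries = {}
            for line in fetch_range(prefix).splitlines():
                candidate, _, count = line.strip().partition(":")
                if candidate and count.isdigit():
                    entries[candidate.upper()] = int(count)
            range_cache[prefix] = entries
        # The full hash is only ever compared locally, never sent anywhere.
        results[password] = range_cache[prefix].get(suffix, 0)
    return results
```

Anything that comes back with a count above zero is one to change first. Keep the passwords in memory only and be careful not to log them while you're at it.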
Q. Is there a list of which sites are included in this breach? I've reproduced a list that was published to the hacking forum I mentioned and that contains 2,890 file names. This is not necessarily complete (nor can I easily verify it), but it may help some people understand the origin of their data a little better.
Q. Will you publish the data in collections #2 through #5? Until this blog post went out, I wasn't even aware there were subsequent collections. I do have those now and I need to make a call on what to do with them after investigating them further.
Q. Where can I download the source data from? Given the data contains a huge volume of personal information that can be used to access other people's accounts, I'm not going to direct people to it. I'd also ask that people don't do that in the comments section.
Well, it's one more sunny weekly update then snow time again so I've gone particularly beachy today. I'm also particularly breachy, talking about a massive combo list I'm presently pondering for inclusion in HIBP. These lists are frequently used for account takeover attacks against the likes of Spotify which is the subject of this week's blog post. Plus, I'm talking a bit about a bunch of Ubiquiti bits I'll be installing soon to fix the problem seen below:
Oh - and I did end up heading out on the water with Kevin Mitnick, albeit on the boat. I think it's alright. Maybe...
Time and time again, I get emails and DMs from people that effectively boil down to this:
Hey, that paste that just appeared in Have I Been Pwned is from Spotify, looks like they've had a data breach
Many years ago, I introduced the concept of pastes to HIBP and what they essentially boil down to is monitoring Pastebin and a bunch of other services for when a trove of email addresses is dumped online. Very often, those addresses are accompanied by other personal information such as passwords. When an HIBP subscriber's address appears in one of these incidents, they get an automated notification and often, it seems, they then reach out to me.
Here's a perfect example of what I'm talking about, this one eventually triggering an email to me just last week:
Let's imagine you're the first person on the list; you get a notification from HIBP, you check out the paste and see your Hotmail account listed there alongside your Spotify password and the plan you're subscribed to. Clearly a Spotify breach, right?
No, and the passwords are the very first thing that starts to give it all away. Just looking at them, they're obviously terrible, but plugging the first one into Pwned Passwords gives you a sense of just how terrible it is:
They may not all be that bad (the next one in the list has only been seen twice), but the point is that it's a password that's clearly been seen before and were I to dig back into the source data, there's a good chance it's been seen in a breach alongside that email address too. Then there's the fact that the password is in plain text and I don't know precisely how Spotify store their passwords, but it'd be a very safe bet that by now it's a decent modern-day hashing algorithm. If they had a breach then yes, hashes may be cracked, but that's not what's happening here.
We're simply seeing the successful result of credential stuffing attacks. Regular readers will appreciate the mechanics of this already but all those who I point here for whom this is new, this attack simply takes exposed credentials from a data breach and tries them on another site. The attack is simple but effective due to the prevalence of password reuse. If you were using the same password on LinkedIn when they had their data breach as you are on Spotify today and someone grabbed that password from the breach and tried it on Spotify, you can see the problem. That's it, job done, they're into your account.
Spotify "breaches" like this are enormously common. I just went and looked at the pastes HIBP has collected since the clock ticked over to 2019 and found 20 of them already:
Digging further, I found over a thousand pastes with "Spotify" in the title. These are often removed by Pastebin pretty quickly but looking through some that remain, it's precisely the same pattern as the earlier example. I grabbed a random email address out of one of them and checked it on HIBP:
The same address appears over and over in pastes and each time, the same password appears alongside it. Picking one from the list above that hasn't yet been removed shows a page full of examples like this (with a password Pwned Passwords has seen 4 times before):
This one is interesting for a couple of reasons and the first is the use of the term "combo". I've written about combo lists before and they're essentially combinations of email addresses and passwords used to test against services in credential stuffing attacks. Thousands. Millions. Billions of them, in some cases. The second interesting observation in that image is the "Spotify Cracker" reference. The first Google result for the term shows a popular cracking forum with the following image (password seen 447 times in Pwned Passwords):
This is a tool for breaking into Spotify accounts. I wouldn't normally link through to content of that type, but context is important. For people wondering why they're getting alerts from HIBP because their Spotify account is in a paste somewhere, have a flick through some of those pages. 61 of them at the time of writing, each with 20 posts thanking the OP for their work in order to get access to the tool. So what does it do? Have a quick watch of this:
It's a slightly different piece of software based on what's visible, but the objective is the same and the premise is simple: download the tool, pass in the combo list then let it run. Credentials from the list are then tested against Spotify (yes, security friends, there's a very good question to be asked here as to why this is still possible...) and results appear on the screen.
Now, this isn't to say that someone who finds their Spotify account on one of these lists shouldn't worry because it wasn't a breach per se. Instead, they need to look inwardly and adjust their own security practices. Get a password manager (8 years on and I still use 1Password every day), create strong and unique passwords on every account and enable 2-factor authentication where available. Well, except that there's still no 2FA support on Spotify so just enable it on every other service that supports it (and most big ones do these days).
And why would someone "hack" (I use the term loosely because they literally logged in with the correct username and password) Spotify accounts? The obvious answer is that they have a monetary value, but I also posit that it's very often just curiosity driving this behaviour. Take a look at a video such as this SQL injection tutorial; I've used it in talks before to illustrate the randomness of attacks as well as the sophistication of those behind many of them. Is the person in this video an evil cyber hacker hell-bent on causing chaos, or just a curious kid whose moral compass is yet to be properly calibrated? That may not make Spotify users feel any better about the end result, but it's important context for this post.
In doing a bit of searching for this piece I found heaps of results for "spotify data breach" that led to discussions highlighting what I've covered above. For example, this one from August on the Spotify community site where the original post begins with:
Someone had access to my pasword [sic] (which is totally unbreakable and diferent [sic] from the one i use in other accounts)
I don't know what their password was, but I do know that I've had dozens of discussions with people making precisely the same claims only to discover "their" password is in Pwned Passwords a few hundred times! Or they entered it into a phishing site somewhere. If we apply Occam's Razor to this (the simplest solution is the most likely one), the password was compromised. I want to illustrate this point via the following Tweet:
For ref, here are the details on my 1Password entry for Pinterest. Definitely the strong, unique one I showed in my tweet. pic.twitter.com/d3sSR8PCu1
This is Scott Helme, a world-renowned security researcher who understands these concepts as well as anyone I can imagine. This tweet is part of a broader discussion where his Pinterest account was logged into by an unknown party and per the image above, Scott was convinced his password was both strong and unique. A couple of hours later, Scott's view is, well, somewhat "different":
Just goes to show, it's sometimes easy to miss these things! I'm now wondering how many other old accounts I have lurking around out there... 🤔 5/5
I spoke to Scott about this incident again whilst writing this post and we both reflected on just how easy it is to have issues like this, even if you're convinced your security is spot on. It's precedents like this which cause me to pause and question every strongly made claim of personal security prowess in the wake of examples such as the Spotify community one above.
Reading through that thread only reinforces the view that this was a simple account takeover issue and not a sophisticated hack. For example, this comment:
It's such a shame to see Spotify blaming its users for getting hacked instead of fixing the problem. Got my playlists deleted and the hacker created a playlist called "Get Hacked".
Imagine you're a hacker - a real one with the capabilities to break into a company with hundreds of millions of users and worth billions of dollars - what are you going to do? Are you just going to mess with people's playlists "for the lulz"? No, at the very least you're going to cash in on their public bug bounty or if you're really the malicious type, you're going to monetise their users in a much more surreptitious fashion.
Scroll down a little further and someone is referencing HIBP as "proof" of a hack. Here's what happened to the guy's account:
I got a notification from haveibeenpwned.com and did nothing about it until some random kept playing weird music on a device I did not recognize while I was trying to listen on my normal device. It was annoying, I kept getting pulled out of my song because we started battling for control of what device and what song the audio was to be heard on. I started playing really loud and obnoxious noise music for the hacker while I changed my password.
Now again, let's apply Occam's Razor: is this an elite hacker who's discovered some previously unknown zero-day vulnerability, or someone who's exploited the victim's password and then simply has a different taste in music?
The community thread references a paste titled "Más de 300 cuentas premium de Spotify" ("More than 300 Spotify premium accounts") which has since been deleted from Pastebin (and HIBP doesn't save the contents beyond just the email addresses). But 4 days earlier there was a paste titled "Más de 50 cuentas premium de spotify" which still stands today and its content lines up very closely with the others discussed above; it's simply the output of another automated tool exploiting weak credentials.
I'll end on one final point because if I don't, it'll come through in the comments anyway: online security is a shared responsibility. Some people are quick to play the "victim blaming" card when I write about incidents that can be traced back to weak security practices. Clearly, that's not causing me to sugar-coat the root cause of these incidents but that said (and I touched on this earlier), this is prevalent enough that Spotify also needs to look internally at why this is still occurring. Their job is to stop this form of attack at the platform level and our job as users of the service is to protect our accounts via some basic security practices.
So no, Spotify wasn't hacked, they just allowed malicious parties to log in with other people's poor passwords.
And then it was 2019. Funny how quickly it gets away from you, someone just posted on my 2018 retrospective blog post this week and asked why I didn't include my congressional testimony and if I'm honest, it took me a bit to think about why as well (it was in 2017). But we're here now so it's back to business as usual blog wise.
This week is dominated by the personal finance lessons blog post. This has gotten massive traction this week and has been read by tens of thousands of people. But perhaps what surprises me most is that out of all the feedback I've had, there's only been one negative comment. O-n-e. Frankly, I'm not even sure he actually absorbed the content as the comment was very specifically addressed in the post, but that forms one little part of everything I cover in this week's update. I also touch on the aforementioned 2018 retrospective which I've been doing these last few years as a little reminder of what I've been up to.
This is (probably?) the longest weekly update I've done so far and I do hope it helps add a bit more personality and context to that finance blog post. Do please continue to share feedback and ask questions, I've really enjoyed seeing people get motivated by it.
I started doing these retrospectives 3 years ago in my first year of independence. I reckon they're a good thing for everyone to do if not in written form then at least mentally to look back on your achievements of the year. They're a great way of reflecting on success (and indeed, on failures) and they also help explain why we all feel so damn tired by the end of the year!
Here's my 2018 highlights, starting with travel:
"Oh yeah, I'm totally gonna travel less this year" - me every single year
In reality, my travel ended up looking like this:
That's the same number as last year, 4 more days and another 8,000km. On the other hand, it's 12 less cities and 1 less country and the main reason for that is I've been trying to cram less into trips. I've also been travelling with family far more so whilst those 140 days equate to 38% of my year, there were 14 days in Hawaii, 10 days at the Aussie snow, 11 days in Texas and 17 days in Canada where I wasn't flying solo. That's 52 days where it wasn't just a lonely slog so I'm pretty happy about that.
I actually got a bit of a surprise when I pulled the list of my most popular blog posts for 2018:
The surprise was that after the home page, the most popular page hit on my site was the one about Online Spambot, a post I published in August 2017. I guess it's maintained its traction due to it being referenced in the HIBP description and there being a huge number of people finding themselves pwned. In fact, I'm sure that's why the next 3 blog posts are up there too because they're all from similar incidents (number 6 in that list was also from 2017).
If I'm honest though, my favourite post of the year was the one I published earlier this week on New Year's Eve - 10 Personal Finance Lessons for Technology Professionals. I love this post. I love the reaction it's had. I love that based on so much of the positive feedback I've had it might actually improve people's lives in a way I don't think any previous post has before. Who knows, maybe this is something I'll even write more about in 2019 if there's an appetite.
The sponsorship model continued strongly too. It's been resoundingly well-received by both browsers to the site and the sponsors themselves and I've already booked 2019 out until August.
Geez, where to start... Probably with my 2018 events page which lists everything I did of a public nature. What it doesn't do is list all the private events which pretty dramatically increases that list. Of the ones I can talk about, they included:
How safe is your #password?! I hope you tuned into #NETUG tonight to find out from security expert @troyhunt!! A huge thank you to Troy for presenting for us tonight. If you missed out, never fear, I’m sure SSW TV will post the video in the coming week! pic.twitter.com/jtHlTyJ9l6
Troy Hunt @troyhunt speaking at his best, he shows that not all hackers come with hoodies and green command line screens, and how security is taken so lightly at even the bigger enterprises. #APIdaysAU pic.twitter.com/19lb65Xhkm
#Sibos Big Issue Debate @troyhunt at Hacker: I have an objection to the thought that increasing security makes things harder. There are technologies that achieve both objectives. We need to help people understand that the technologies are there and use them effective. pic.twitter.com/XaYneE72fJ
The positive stories are the ones you don't see here; the ones that are no longer on the list. Sites like the ABC in Australia, the Daily Mail in the UK and Roblox in the US. They're the largest sites in their respective countries to drop off the list and there have been many, many more in the same boat. I've actually had developers from many organisations reach out requesting that the list be refreshed just so their site drops off. Shaming works in powerful ways 🙂
HTTPS is Easy
I didn't want to just shame organisations doing the wrong thing, I also wanted to help everyone get better at HTTPS. After all, HTTPS is easy, so I built HTTPS Is Easy:
This became a great 4-part reference series with 5-minute videos which live up to the title. I'm enormously happy with how it was received, and frankly a bit overwhelmed that the community stepped up and translated it into 19 different languages including: Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Indonesian, Italian, Norwegian, Persian, Polish, Portuguese, Russian, Slovenian, Spanish and Swedish. That's pretty awesome!
Have I Been Pwned
Geez, where to start on this one... In point form:
Added 76 new data breaches
Which encompassed 829,391,906 additional records
Signed up 445,720 new subscribers
Sent 1,224,377 breach notification emails to them
And sent another 239,277 notifications to those monitoring domains
And probably 100 other things that should be in a retrospective but just flew by in a blur! But there was another aspect of HIBP which really took off in 2018 and it deserves its own heading:
Pwned Passwords
When I launched version 2 in Feb, this service really started to get traction. The k-anonymity model courtesy of Cloudflare was the real killer feature and a special mention goes to Junade Ali on that:
If you don't know the back-story on those lava lamps, this is a fun vid that's only a few minutes long:
Back to Pwned Passwords, the premise of checking to see if a password has been previously breached before allowing someone to use it has gained a lot of traction. I've seen dozens of use cases first hand (and there's probably hundreds I'll never know about), with EVE Online being the first big one:
Here is the message we display to our users during login if we find their password in the @haveibeenpwned list. We try to explain calmly what is wrong and how to fix it. Provide links and guidance. pic.twitter.com/9rc8svI0yS
Okta built an absolutely awesome browser extension:
Why I like @Okta's Passprotect plugin: if sites use outdated password complexity rules and tell people that "P@ssword1" is a good password, the plugin still warns the password is unsafe. It uses @troyhunt's pwned passwords and the k-anonimity model to securely check the password pic.twitter.com/wBFWMzrtra
Of those who do consume the k-anonymity API, I'm usually serving up somewhere between 4 and 6 million requests a day:
There were a couple of cache flushes in there but just to give you a sense of how well optimised the service is to serve content directly from Cloudflare's edge nodes and not hit the origin server, here's the last week:
That's a 99% cache hit ratio 😎
Scott wrote a year in review piece this week so I'll defer to his overview for that but in short: heaps of new reporting types, a wizard that makes creating CSPs way easier, the launch of Report URI JS, heaps of both free subscriber and commercial customer growth and we're also pushing a few reports through these days too:
But the highlight - without a doubt in my mind - is covered in this next section:
What. A. Year! In fact, what a couple of weeks and it all began with AusCERT's Award for Information Security Excellence, presented in my home town:
I'm 2 weeks to the day out from heading back to Europe so the whole show starts again very soon. In many ways, 2019 will be more of the same but in other ways, there's a bunch of new things on the horizon. I've already committed to events in 3 new places I've never been before in the first half of the year so that'll be cool.
Beyond that, I honestly don't know. I have a view about 6 months out around travel commitments but the nature of this industry and indeed the role I play today is that I have absolutely no idea what will pop up overnight, let alone further along into 2019. But that's ok, it keeps things entertaining 🙂