Category Archives: have I been pwned?

Welcoming the Irish Government to Have I Been Pwned

Over the last year and a bit I've been working to make more data in HIBP freely available to governments around the world that want to monitor their own exposure in data breaches. Like the rest of us, governments regularly rely on services that fall victim to attacks resulting in data being disclosed, and just like the commercial organisations monitoring domains on HIBP, understanding that exposure is important. To date, the UK, Australian, Spanish and Austrian governments have come on board HIBP for nationwide government domain monitoring and today, I'm happy to welcome the Irish government as well. They now have access to all .gov.ie domains and a handful of other government ones on different TLDs.

A big welcome to the Irish National Cyber Security Centre!

Authentication and the Have I Been Pwned API

The very first feature I added to Have I Been Pwned after I launched it back in December 2013 was the public API. My thinking at the time was that it would make the data more easily accessible so that more people could go and do awesome things with it: build mobile clients, integrate it into security tools and surface more information to more people to enable them to do positive and constructive things with the data. I highlighted 3 really important attributes at the time of launch:

There is no authentication.

There is no rate limiting.

There is no cost.

One of those changed nearly 3 years ago now - I had to add a rate limit. The other 2 are changing today and I want to clearly explain why.

Identifying Abusive API Usage

Let me start with a graph:

This is executions of the V2 API that enables you to search an individual email address. There are 1.06M requests in that 24-hour period, with 491k of them in the last 4 hours. Even with the rate limit of 1 request every 1,500ms per IP address enforced, that graph shows a very clear influx of requests peaking at 14k per minute. How? Well, let's pull the logs from Cloudflare and see:

This is the output of a little log analyser I wrote that breaks requests down by ASN (and other metrics) over the past hour. There were 15,573 requests from AS23969 across 82 unique IP addresses. Have a look at where those IP addresses came from:

There is no conceivable way that this is legitimate, organic usage of the API from Thailand. The ASN is owned by TOT Public Company Limited, a local Thai telco that has somehow ended up with a truckload of IP addresses hitting HIBP at just the right rate to not trigger the rate limit. The next top ASN is Biznet Networks in Indonesia. Then Claro in Brazil. After that there's Digital Ocean and then another Indonesian telco, Telkomnet. It makes for a geographical spread that's entirely inconsistent with legitimate usage by genuine consumers (no, HIBP isn't actually big in Iran!):

Late last year after seeing a similar pattern with a well-known hosting provider, I reached out to them to try and better understand what was going on. I provided a bunch of IP addresses which they promptly investigated and reported back to me on:

1- All those servers were compromised. They were either running standalone VPSs or cpanel installations.

2- Most of them were running WordPress or Drupal (I think only 2 were not running any of the two).

3- They all had a malicious cron.php running

This helped me understand the source of the problem, but it didn't get me any closer to actually blocking the abusive behaviour. For the sake of transparency, let me talk about how I tried to tackle this because that will help everyone understand why I've arrived at a very different model to what I started with.

Combating Abuse with Firewall Rules

Firewall rules on Cloudflare are amazingly awesome. It takes just a few seconds to have a rule like this in place:

Make more than 40 requests in a minute and you're in the naughty corner for a day. Only thing is, that's IP-based and, per the earlier section on abusive patterns, actors with large numbers of IP addresses can largely circumvent this approach. It's still a fantastic turn-key solution that seriously raises the bar for anyone wanting to get around it, but someone determined enough will find a way.

No problems, I'll just take abusive ASNs like the Thai one above and give them the boot. I scripted a lot of them based on patterns in the log files and created a firewall rule to block them.

That works pretty quickly and is very effective, except for the fact that there's an awful lot of ASNs out there being abused. Plus, it has side-effects I'll come back to shortly too.

So how about looking at user agent strings instead? I mean, I could always just block the ones bad actors are using, except that was never going to work particularly well for obvious reasons (you can always define whatever one you like). That said, there were a heap of browser UAs which clearly were (almost) never legitimate for a client making API calls, so I blocked those as well.

That shouldn't have come as a surprise to anyone as the API docs were actually quite clear about this:

The user agent should accurately describe the nature of the API consumer such that it can be clearly identified in the request. Not doing so may result in the request being blocked.

Problem is, people don't read docs and I ended up with a heap of default user agents (such as curl's) which were summarily blocked. And, of course, the user agent requirement was easily circumvented as I expected it would be and I simply started seeing randomised strings in the UA.

Another approach I toyed with (very transiently) was blocking entire countries from accessing the API. I was always really hesitant to do this, but when 90% of the API traffic was suddenly coming from a country in West Africa, for example, that was a pretty quick win.

I'm only writing about this here now because as the new model comes into place, all of this will be redundant. Plus, I wanted to shed some light on the API behaviour some people may have previously seen which they couldn't quite work out, and that brings me to the next section.

The Impact on Legitimate Usage

The attempts described above to block abuse of the API also blocked a lot of good requests. I feel bad about that because it made something I'd always intended to be easily accessible difficult for some people to use. I hope that by explaining the background here, people will understand why the approaches above were taken and indeed, why the changes I'm going to talk about soon were necessary.

I got far more emails than I could respond to from people whose API requests were being blocked. Often this was due to simply not meeting the API requirements, for example failing to provide a descriptive UA string. Other times it was because they were on the same network as abusive users. There were also those who simply smashed through the rate limit too quickly and got themselves banned for a day. And then there were genuine API users in that West African country who found themselves unable to use the service. I was constantly balancing the desire to make the API easily accessible whilst simultaneously trying to ensure it wasn't taken advantage of. In the end, the path forward was clear - the API would need to be authenticated.

The New Model: Authenticated Requests

I held back on this for a long time because adding auth to the API adds a barrier to entry. It also adds coding effort on my end as well as management overhead. However, by early this year it had become clear that this was the only way forward: requests would have to be auth'd. Doing this solves a heap of problems in one fell swoop:

  1. The rate limit could be applied to an API key, thus solving the problem of abusive actors with multiple IP addresses
  2. Abuse associated with an IP, ASN, user agent string or country no longer has to impact other requests matching the same pattern
  3. The rate limit can be just that - a limit rather than also dishing out punishment via the 24 hour block

Making an authenticated call is a piece of cake: you just add an hibp-api-key header as follows:

GET https://haveibeenpwned.com/api/v3/breachedaccount/test@example.com
hibp-api-key: [your key]
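
If you're calling the API from code rather than crafting the request by hand, it's exactly the same header. Here's a minimal sketch in TypeScript of what that might look like (the function name and user agent string are just illustrative, and it assumes a runtime such as Node 18+ where fetch is available and the user-agent header can be set):

// A rough sketch of an authenticated V3 call; pass in your own API key
async function getBreachesForAccount(account: string, apiKey: string) {
  const response = await fetch(
    `https://haveibeenpwned.com/api/v3/breachedaccount/${encodeURIComponent(account)}`,
    {
      headers: {
        'hibp-api-key': apiKey,
        'user-agent': 'my-breach-checker' // describe your client, per the API docs
      }
    }
  )

  if (response.status === 404) return []  // address not found in any breach
  if (response.status === 401) throw new Error('Missing or invalid API key')
  if (response.status === 429) throw new Error('Rate limit exceeded - retry shortly')

  return await response.json() // truncated breach data by default in V3
}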

However, this wasn't going to completely solve the problem; rather, it moved the challenge to the way in which API keys were provisioned. It's no good putting controls around the key itself if a bad actor could just come along and register a heap of them. Anti-automation on the form where a key can be requested is one thing; stopping someone from manually registering, say, 20 of them with different email addresses and massively amplifying their request rate is quite another. I had to raise the bar just high enough to dissuade people from doing this, which brings me to the financial side of things.

There's a US$3.50 per Month Fee to Use the API

Clearly not everyone will be happy with this so let me spend a bit of time here explaining the rationale. This fee is first and foremost to stop abuse of the API. The actors I've seen taking advantage of it are highly unlikely to front up with a credit card and provide what amounts to personally identifiable data (i.e. make a credit card payment) in order to mass enumerate the API.

In choosing the $3.50 figure, I wanted to ensure it was a number that was inconsequential to a legitimate user of the service. That's about what a latte costs at my local coffee shop so spending a few bucks a month to search through billions of records seems like a pretty damn good deal, especially when that rate limit enables 57.6k requests per day.

One thing I want to be crystal clear about here is that the $3.50 fee is in no way an attempt to monetise something I always wanted to provide for free. I hope the explanation above helps people understand that, and the fact that the API has run for the last 5 and a half years without any auth whatsoever clearly demonstrates that financial gain has never been the intention. Plus, the service I'm using to implement auth and rate limits comes with a direct cost to me:

This is from the Azure API Management pricing page, which is the service I'm using to provision keys and control rate limits (I'll write a more detailed post on this later on - it's kinda awesome). I chose the $3.50 figure because it represents someone making one million calls. Some people will make much less, some much more - that rate limit represents a possible 1.785 million calls per month. Plus, there are still the costs of function executions, storage queries and egress bandwidth to consider, not to mention the slice of the $3.50 that Stripe takes for processing the payment (all charges are routed through them). The point is that the $3.50 number is pretty much bang on the mark for the cost of providing the service.

What this change does is simultaneously give me a much higher degree of confidence that the API will be used in an ethical fashion whilst also ensuring that those who use it have a much more predictable experience, without me dipping deeper and deeper into my own pocket.

The API is Revving to Version 3 (and Has Some Breaking Changes)

With this change, I'm revising the API up to version 3. All documentation on the API page now reflects that and also reflects a few breaking changes, the first of which is obviously the requirement for auth. When using V3, any unauthenticated requests will result in an HTTP 401.

The second breaking change relates to how the versioning is done. Back in 2014, I wrote about how your API versioning is wrong and headlined it with this graphic:

I outlined 3 different possible ways of expressing the desired version in API calls, each with their own technical and philosophical pros and cons:

  1. Via the URL
  2. Via a custom request header
  3. Via the accept header

After 4 and a bit years, far and away the most popular method, with an uptake of more than 90%, is versioning via the URL. So that's all V3 supports. I don't care about the philosophical arguments to the contrary, I care about working software and in this case, the people have well and truly spoken. I don't want to have to maintain code and provide support for something people barely use when there's a perfectly viable alternative.

Next, I'm inverting the condition expressed in the "truncateResponse" query string. Previously, a call such as this would return all metadata for a breach:

GET https://haveibeenpwned.com/api/v2/breachedaccount/test@example.com

You'd end up with not just the name of the breach, but also how many records were in it, all the impacted data classes, a big long description and a whole bunch of other largely redundant information. I say "redundant" because if you're hitting the API over and over again, you're pulling back the same info for each account that appears in the same breach. Using the "truncateResponse" parameter reduced the response size by 98% but because it wasn't the default, it wasn't used that much. I want to drive the adoption of small responses because not only are they faster for the consumer, they also reduce my bandwidth bill, which is one of the most expensive components of HIBP. You can still pull back all the data for each breach if you'd like, you just need to pass "truncateResponse=false" as true is now the default. (Just a note on that: you're far better off making a single call to get all breached sites in the system and then referencing that collection by breach name after querying an individual email address.)
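
In other words, if you do still want the full breach objects under V3, the call now looks like this:

GET https://haveibeenpwned.com/api/v3/breachedaccount/test@example.com?truncateResponse=false
hibp-api-key: [your key]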

I'm also inverting the "includeUnverified" parameter. The original logic for this was that when I launched the concept of unverified breaches, I didn't want existing consumers of the API to suddenly start getting results for breaches which may not be real. However, with the passage of time I've come across a couple of issues with this and the first is that a heap of people consumed the API with the default params (which wouldn't include unverified breaches) and then contacted me asking "why does the API return different results to the front page of HIBP?" The other issue is that I simply haven't flagged very many breaches as unverified and I've also added other classes of breach which deviate from the classic model of loading a single incident clearly attributable to a single site such as the original Adobe breach. There are now spam lists, for example, as well as credential stuffing lists and returning all data by default is much more consistent with the ethos of considering all breached data to be in scope.

The other major thing related to breaking stuff is this:

Versions 1 and 2 of the API for searching breaches and pastes by email address will be disabled 4 weeks from today, on August 18.

I have to do this on an aggressive time frame; whilst I don't, all the problems mentioned above with abuse of the API continue. When we hit that August due date, the APIs will begin returning HTTP 400 "Bad Request" and that will be the end of them.

One important distinction: this doesn't apply to the APIs that don't pull back information about an email address; the API listing all breaches in the system, for example, is not impacted by any of the changes outlined here. It can be requested with version 3 in the path, but also with previous versions of the API. Because it returns generic, non-personal data it doesn't need to be protected in the same fashion (plus it's really aggressively cached at Cloudflare). Same too for Pwned Passwords - there's absolutely zero impact on that service.

During the next 4 weeks I'll also be getting more aggressive with locking down firewall rules on the previous versions at the first sign of misuse until they're discontinued entirely. It's an easy fix if you're blocked on V2 - get an API key and roll over to V3. Now, about that key...

Protecting the API Key (and How My Problem Becomes Your Problem)

Now that API keys are a thing, let me touch briefly on some of the implications of this as it relates to those of you who've built apps on top of HIBP. And just for context, have a look at the API consumers page to get a sense of the breadth we're talking about; I'll draw some examples out of there.

For code bases such as Brad Dial's Pwny Corral, it's just a matter of adding the hibp-api-key header and a configuration for the key. Users of the script will need to go through the enrolment process to get their own key then they're good to go.

In a case like What's My IP Address' Data Breach Check, we're talking about a website with a search feature that hits their endpoint and then they call HIBP on the server side. The HIBP API key will sit privately on their end and the only thing they'll really need to do is stop people from hammering their service so it doesn't exceed the HIBP rate limit for that key. This is where it becomes their (your) problem rather than mine and that's particularly apparent in the next scenario...

Rich client apps designed for consumer usage, such as Outer Corner's Secrets app, will need to proxy API hits through their own service. You don't want to push the HIBP API key out with the installer, plus you also need to be able to control the rate at which all of your customers collectively hit the service, so that one user of Secrets smashing through the rate limit doesn't make the service unavailable for everyone else.

One last thing on the rate limit: because it's no longer locking you out for a day if exceeded, making too many requests results in a very temporary lack of service (usually single digit seconds). If you're consuming the new auth'd API, handle HTTP 429 responses from HIBP gracefully and ask the user to try again momentarily. Now, with that said, let me give you the code to make it dead easy to both proxy those requests and control the rate at which your subscribers hit the service; here's how to do it with Cloudflare workers and rate limits:

Proxying With a Cloudflare Worker (and Setting Rate Limits)

The fastest way to get up and running with proxying requests to V3 of the HIBP API is with a Cloudflare Worker. This is "serverless code on the edge" or in other words, script that runs on Cloudflare's 180 edge nodes around the world such that when someone makes a request for a particular route, the script kicks in and executes. It's easiest just to have a read of the code below:
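
What follows is a rough sketch rather than a drop-in implementation: it assumes the account to search arrives on the path of your own route and that your API key is bound to the worker as a variable named HIBP_API_KEY (the names are illustrative).

// Sketch of a Cloudflare Worker that proxies requests through to HIBP V3,
// keeping the API key on the server side
declare const HIBP_API_KEY: string

addEventListener('fetch', (event: FetchEvent) => {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request: Request): Promise<Response> {
  // Take the account from the incoming path, e.g. /breachedaccount/test@example.com
  const url = new URL(request.url)
  const account = url.pathname.split('/').pop()

  if (!account) {
    return new Response('No account specified', { status: 400 })
  }

  // Call HIBP server side so the key never reaches the client
  const hibpResponse = await fetch(
    `https://haveibeenpwned.com/api/v3/breachedaccount/${encodeURIComponent(account)}`,
    {
      headers: {
        'hibp-api-key': HIBP_API_KEY,
        'user-agent': 'my-proxy-worker' // a descriptive UA, per the API docs
      }
    }
  )

  // Pass the status straight through: 200 with breach data, 404 for no breaches,
  // 429 if the key's rate limit has been exceeded
  return new Response(hibpResponse.body, {
    status: hibpResponse.status,
    headers: { 'content-type': 'application/json' }
  })
}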

Stand up a domain on Cloudflare's free tier (if you're not on there already), then it's $5 per month to send 10M queries through your worker, which is obviously way more than you can send to the HIBP API anyway. And while you're there, go and use the firewall rules to lock down a rate limit so your own API isn't hammered too much (keeping in mind some of the challenges I faced when doing this).

The point is that if you need to protect the API key and proxy requests, it's dead simple to do.

"But what if you just..."

I'll get a gazillion suggestions of how I could do this differently. Every single time I talk about the mechanics of how I've built something I always do! The model described in this blog post is the best balance of a whole bunch of different factors: the sustainability of the service, the desire to limit abuse, leveraging the areas my skills lie in, the limited availability of my time and so on and so forth. There are many other factors that also aren't obvious, so as much as suggestions for improvements are very welcome, please keep in mind that they may not work in the broader sense of what's required to run this project.

Caveats

There's a couple of these and they're largely due to me trying to make sure I get this feature out as early as possible and continue to run things on a shoestring, cost-wise. Firstly, there's no guarantee of support. We do the same thing with entry-level Report URI pricing and it's simply because it's enormously hard to do with the time constraints of a single person running this. That said, if anything is buggy or broken I definitely want to know about it. Secondly, there's no way to retrieve or rotate the API key. If you extend the one-off subscription you'll get the same key back, or if you cancel an existing subscription and take a new one you'll also get the same key. I'll build out better functionality around this in the future.

Edit: You can now regenerate API keys by going to the API key page and clicking the "regenerate key" button. This will invalidate the old key and issue a new one.

I'm sure there'll be others that pop up and I'll expand on the items above if I've missed any here.

Summary

The changes I've outlined here strike a balance between making the API available for good purposes, making it harder to use for bad purposes, ensuring stability for all those in the former category and, crucially, making it sustainable for me to operate. That last point in particular is critical, both in terms of reducing abuse and in terms of reducing the overhead on me of trying to achieve that objective and supporting those who ran into the previously mentioned blocks.

I expect there'll be many requests to change or evolve this model; other payment types, no payment at all for certain individuals or organisations, higher rate limits and so on and so forth. At this stage, my focus is on keeping the service sustainable as Project Svalbard marches forward and once that comes to fruition, I'll be in a much better position to revisit suggestions (also, there's a UserVoice for that). For now, I hope that this change leads to a much more sustainable service for everyone.

You can go and grab an API key right now from the API menu on the HIBP website.

Edit: Just to be crystal clear, this doesn't impact Pwned Passwords. Cloudflare picks up pretty much all the costs for running that so the service is still freely accessible.

Pwned Passwords, Version 5

Almost 2 years ago to the day, I wrote about Passwords Evolved: Authentication Guidance for the Modern Era. This wasn't so much an original work on my part as it was a consolidation of advice from the likes of NIST, the NCSC and Microsoft about how we should be doing authentication today. I love that piece because so much of it flies in the face of traditional thinking about passwords, for example:

  1. Don't impose composition rules (upper case, lower case, numbers, etc)
  2. Don't mandate password rotation (enforced changing of it every few months)
  3. Never implement password hints

And of most relevance to the discussion here today, don't allow people to use passwords that have already been exposed in a data breach. Shortly after that blog post I launched Pwned Passwords with 306M passwords from previous breach corpuses. I made the data downloadable and also made it searchable via an API, except there are obvious issues with enabling someone to send passwords to me, even if they're hashed as they were in that first instance. Fast forward to Feb last year and with Cloudflare's help, I launched Pwned Passwords version 2 with a k-anonymity model. The data was all still downloadable if you wanted to run the whole thing offline, but k-anonymity also gave people the ability to hit the API without disclosing the original password. Subsequent updates to the corpus of breached passwords saw versions 3 and 4 arrive as more passwords flowed in from new breaches whilst the system also continued to grow and grow.

Today, after another 6 months of collecting passwords, I'm releasing version 5 of the service. During this time I collected 65M passwords from breaches where they were made available in plain text (I don't crack passwords for this service). Because Pwned Passwords already had 551M records as of V4, new corpuses of passwords are increasingly adding very few new ones, so V5 contributes an additional... 3,768,890 passwords. That may not seem like a lot in comparison, but by virtue of an entire half year passing I wanted to get the existing public set updated to the current numbers. It doesn't just add new ones though; those 65M occurrences all contribute to the existing prevalence counts for passwords that have been seen before.

New passwords include such strings as "Mynoob" (seen 1,208 times), "Find_pass" (303 times) and "guns and robots" (134 times). There are often biases in password distribution due to the sources they're obtained from, for example the prevalence of the service's name or other attributes or relationships to the breached site.

All 555,278,657 passwords are now available for download if you're running the service offline. If you're using the k-anonymity API then there's nothing more to do - I've already flushed cache at Cloudflare so you're now getting the latest and greatest set of bad passwords. If you want to be sure you're getting the latest data via the API, check that the "last-modified" response header has a July date rather than a January date.

And just while I'm here talking about updates to the corpus of Pwned Passwords, I'm really conscious that releases are happening on a half-yearly cadence which means a bunch of new passwords sit on my side for months before anyone can start black-listing them. This is one of the things that's high on my post-Project Svalbard list; I'd love to see a constant firehose of new passwords being integrated into this service. Not six-monthly, not monthly and frankly, not even weekly - I want to see passwords in there as soon as I get them. The shorter the period between a breached password entering circulation and it appearing in Pwned Passwords, the more impact the service can have on the scourge of credential stuffing. Stay tuned!

As time has passed and more organisations have implemented the service, some really fantastic implementations have come out of the community. I wrote about a bunch of them last year in my post on Pwned Passwords in Practice, but it's the work they've done at EVE Online that really stands out:

Obviously these are all some of my favourite things (HIBP, 1Password and Report URI), but it's the improvements made to the user selection of passwords that make me particularly happy:

When we first implemented the check, about 19% of logins were greeted with the message that their password was not safe enough. Today, this has dropped down to around 11-12% and hopefully will continue to go down.

That's a massive drop that has a profoundly positive impact not just on the individuals using EVE Online, but to the company itself too. Account takeover attacks are a massive problem on the web today and if you reduce the proportion of customers using known bad passwords by up to 42%, you make a direct impact on the cost the organisation has to bear when dealing with the problem.

The NTLM hashes have been really well-received too as they've allowed organisations to quickly check the proportion of their Active Directory users with known bad passwords. Consistently, I'm hearing the results of this exercise are... alarming:

I've been really happy to see a bunch of community offerings appear around the NTLM hashes in particular. Most notable is this one by Ryan Newington:

What's great about this work is that not only can it stop people from making bad password choices in the first place, but there's also a reference towards the bottom that'll allow you to run it against your entire set of AD users on demand. And just like Pwned Passwords itself, it's 100% free and you can go and grab it all right now.

So that's Pwned Passwords V5 now live. Implement the k-anonymity API with a few lines of code or if you want to run it all offline, download the data directly. Either way, take it and do awesome things with it!
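
If you want a starting point for the k-anonymity API, here's a rough sketch in TypeScript (assuming Node 18+ so that both the crypto module and a global fetch are available; the function name is just illustrative): SHA-1 the password locally, send only the first 5 characters of the hash to the range endpoint, then match the remaining suffix against the response on your own machine.

import { createHash } from 'crypto'

// Rough sketch of a k-anonymity lookup: only the first 5 hex characters of the
// SHA-1 hash ever leave the machine; the suffix is matched locally
async function pwnedCount(password: string): Promise<number> {
  const hash = createHash('sha1').update(password).digest('hex').toUpperCase()
  const prefix = hash.slice(0, 5)
  const suffix = hash.slice(5)

  const response = await fetch(`https://api.pwnedpasswords.com/range/${prefix}`)
  const body = await response.text()

  // Each line of the response is "HASHSUFFIX:COUNT"
  for (const line of body.split('\n')) {
    const [candidate, count] = line.trim().split(':')
    if (candidate === suffix) return parseInt(count, 10)
  }
  return 0 // not found in the corpus
}

// Any count above zero means the password has appeared in a breach and shouldn't be used
pwnedCount('P@ssw0rd').then(count => console.log(`Seen ${count} times`))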