This is the third post in a series of four, in which we set out to revisit various BeyondCorp topics and share lessons that were learnt along the internal implementation path at Google.
The first post in this series focused on providing necessary context for how Google adopted BeyondCorp, Google’s implementation of the zero trust security model. The second post focused on managing devices - how we decide whether or not a device should be trusted and why that distinction is necessary. This post introduces the concept of tiered access, its importance, how we implemented it, and how we addressed associated troubleshooting challenges.
High level architecture for BeyondCorp
What is Tiered Access?
In a traditional client certificate system, certificates are only given to trusted devices. Google used this approach initially as it dramatically simplified device trust. With such a system, any device with a valid certificate can be trusted. At predefined intervals, clients prove they can be trusted and a new certificate is issued. It’s typically a lightweight process and many off-the-shelf products exist to implement flows that adhere to this principle.
However, there are a number of challenges with this setup:
- Not all devices need the same level of security hardening (e.g. non-standard issue devices, older platforms required for testing, BYOD, etc.).
- These systems don’t easily allow for nuanced access based on shifting security posture.
- These systems tend to evaluate a device based on a single set of criteria, regardless of whether devices require access to highly sensitive data (e.g. corporate financials) or far less sensitive data (e.g. a dashboard displayed in a public space).
The next challenge introduced by traditional systems is the inherent requirement that a device must meet your security requirements before it can get a certificate. This sounds reasonable on paper, but it unfortunately means that existing certificate infrastructure can’t be used to aid device provisioning. This implies you must have an additional infrastructure to bootstrap a device into a trusted state.
The most significant challenge is the large amount of time in between trust evaluations. If you only install a new certificate once a year, this means it might take an entire year before you are able to recertify a device. Therefore, any new requirements you wish to add to the fleet may take up to a year before they are fully in effect. On the other hand, if you require certificates to be installed monthly or daily, you have placed a significant burden on your users and/or support staff, as they are forced to go through the certification issuance process far more often, which can be time consuming and frustrating. Additionally, if a device is found to be out of compliance with security policy, the only option is to remove all access by revoking the certificate, rather than degrading access, which can create a frustrating all-or-nothing situation for the user.
Tiered access attempts to address all these challenges, which is why we decided to adopt it. In this new model, certificates are simply used to provide the device’s identity, instead of acting as proof of trust. Trust decisions are then made by a separate system which can be modified without interfering with the certificate issuance process or validity. Moving the trust evaluation out-of-band from the certificate issuance allows us to circumvent the challenges identified above in the traditional system. Below are three ways in which tiered access helps address these concerns.
Different access levels for different security states
By separating trust from identity, we can define infinite levels of trust, if we so desired. At any point in time, we can define a new trust level, or adjust existing trust level requirements, and reevaluate a device's compliance. This is the heart of the tiered access system. It provides us the flexibility to define different device trust criteria for low sensitivity applications from those used for high trusted applications.
Solving the bootstrapping challenge
Multiple trust states enable us to use the system to initiate an OS installation. We can now allow access to bootstrapping (configuration and patch management) services based solely on whether we own the device. This enables provisioning to occur from untrusted networks allowing us to replace the traditional IP-based checks.
Configurable frequency of trust evaluations
The frequency of device trust evaluation is independent from certificate issuance in a tiered access setup. This means you can evaluate trust as often as you feel necessary. Changes to trust definitions can be immediately reflected across the entire fleet. Changes to device posture can similarly immediately impact trust.
We should note that the system’s ability to quickly remove trust from devices can be a double edged sword. If there are bugs in the trust definitions or evaluations themselves, this can also quickly remove trust from ‘good’ devices. You must have the ability to adequately test policy changes to mitigate the blast radius from these types of bugs, and ideally canary changes to subsets of the fleet for a baking period. Constant monitoring is also critical. A bug in your trust evaluation system could cause it to start mis-evaluating trust. It’s wise to add alarms if the system starts dropping (or raising) the trust of too many machines at once. The troubleshooting section below provides additional techniques to help minimize the impact of misconfigured trust logic.
How did we define access tiers?
The basic concept of tiers is relatively straightforward: access to data increases as the device security hardening increases. These tiers are useful for coarse grain access control of client devices, which we have found to be sufficient in most cases. At Google, we allow the user to choose the device tier that allows them to weigh access needs with security requirements and policy. If a user needs access to more corporate data, they may have to accept more device configuration restrictions. If a user wants more control over their device and less restrictions but don’t need access to higher risk resources, they can choose a tier with less access to corporate data. For more information about the properties of a trusted platform you can measure, visit our paper about Maintaining a Healthy Fleet
We knew this model would work in principle, but we didn’t know how many access tiers we should define. As described above, the old model only had two tiers: Trusted and Untrusted. We knew we wanted more than that to enable trust build up at the very least, but we didn’t know the ideal number. More tiers allow access control lists to be specified with greater fidelity at the cost of confusion for service owners, security engineers, and the wider employee base alike.
At Google, we initially supported four distinct tiers ranging from Untrusted to Highly-Privileged Access. The extremes are easy to understand: Untrusted devices should only access data that is already public while Highly-Privileged Access devices have greater privilege internally. The middle two tiers allowed system owners to design their systems with the tiered access model in mind. Certain sensitive actions required a Highly-Privileged Access device while less sensitive portions of the system could be reached with less trusted devices. This degraded access model sounded great to us security wonks. Unfortunately, employees were unable to determine what tier they should choose to ensure they could access all the systems they needed. In the end, we determined that the extra middle tier led to intense confusion without much benefit.
In our current model, the vast majority of devices fit in one of three distinct tiers: Untrusted, Basic Access, and Highly-Privileged Access. In this model, system owners are required to choose the more trusted path if their system is more sensitive. This requirement does limit the finesse of the system but greatly reduces employee confusion and was key to a successful adoption.
In addition to tiers, our system is able to provide additional context to access gateways and underlying applications and services. This additional information is useful to provide finer grained, device-based access control. Imposing additional device restrictions on highly sensitive systems, in addition to checking the coarse grain tier, is a reasonable way to balance security vs user expectations. Because highly sensitive systems are only used by a smaller subset of the employee population, based on role and need, these additional restrictions typically aren’t a source of user confusion. With that in mind, please note that this article only covers device-based controls and does not address fine-grained controls based on a user’s identity.
At the other end of the spectrum, we have OS installation/remediation services. These systems are required in order to support bootstrapping a device which by design does not yet adhere to the Basic Access tier. As described earlier, we use our certificates as a device identity, not trust validation. In the OS installation case, no reported data exists, but we can make access decisions based on the inventory data associated with that device identity. This allows us to ensure our OS and security agents are only installed on devices we own and expect to be in use. Once the OS and security agents are up and running, we can use them to lock down the device and prove it is in a state worthy of more trust.
How did we create rules to implement the tiers?
Device-based data is the heart of BeyondCorp and tiered access. We evaluate trust tiers using data about each device at Google to determine its security integrity and tier level. To obtain this data, we built an inventory pipeline which aggregates data from various sources of authority within our enterprise to obtain a holistic, comprehensive view of a device's security posture. For example, we gather prescribed company asset inventory in one service and observed data reported by agents on the devices in other services. All of this data is used to determine which tier a device belongs in, and trust tiers are reevaluated every time corporate data is changed or new data is reported.
Trust level evaluations are made via "rules", written by security and systems engineers. For example, for a device to have basic access, we have a rule that checks that it is running an approved operating system build and version. For that same device to have highly-privileged access, it would need to pass several additional rules, such as checking the device is encrypted and contains the latest security patches. Rules exist in a hierarchical structure, so several rules can combine to create a tier. Requirements for tiers across device platforms can be different, so there is a separate hierarchy for each. Security engineers work closely with systems engineers to determine the necessary information to protect devices, such as determining thresholds for required minimum version and security patch frequency.
Rule Enforcement and User Experience
To create a good user experience, rules are created and monitored before being enforced. For example, before requiring all users to upgrade their Chrome browser, we monitor how many users will drop trust if that rule was enforced. Dashboards track rule impact on Googlers over 30 day periods. This enables security and systems teams to evaluate rule change impact before they affect end users.
To further protect employee experience, we have measures called grace periods and exceptions. Grace periods provide windows of a predefined duration where devices can violate rules but still maintain trust and access, providing a fallback in case of unexpected consequences. Furthermore, grace periods can be implemented quickly and easily across the fleet in case for disaster recovery purposes. The other mechanism is called exceptions. Exceptions allow rule authors to create rules for the majority while enabling security engineers to make nuanced decisions around individual riskier processes. For example, if we have a team of Android developers specializing on user experience for an older Android version, they may be granted an exception for the minimum version rule.
How did we simplify troubleshooting?
Troubleshooting access issues proves challenging in a system where many pieces of data interact to create trust. We tackle this issue in two ways. First, we have a system to provide succinct and actionable explanations to end users on how to resolve problems on their own. Second, we have the capability to notify users when their devices have lost trust or are about to lose trust. The combination of these efforts improves the user experience of the tiered access solution and reduces toil for those supporting it.
We are able to provide self-service feedback to users by closely integrating the creation of rule policy with resolution steps for that policy. In other words, security engineers who write rule policies are also responsible for attaching steps on how to resolve the issue. To further aid users, the rule evaluation system provides details about the specific pieces of data causing the failure. All this information is fed into a centralized system that generates user-friendly explanations, guiding users to self-diagnose and fix problems without the need for IT support. Likewise, a tech may not be able to see pieces of PII about a user when helping fix the device. These cases are rare but necessary to protect the parties involved in these scenarios. Having one centralized debugging system helps deal with all these nuances, enabling us to provide detailed and safe explanations to end users in accordance with their needs.
Remediation steps are communicated to users in several ways. Before a device loses trust, notification pop-ups appear to the user explaining that a loss of access is imminent. These pop-ups contain directions to the remediation system so the user can self-diagnose and fix the problem. This circumvents user pain by offering solutions before the problem impacts the user. Premeditated notifications work in conjunction with the aforementioned grace periods, as we provide a window in which users can fix their devices. If the issue is not fixed and the device goes out of compliance, there is still a clear path on what to do. For example, when a user attempts to access a resource for which they do not have permission, a link appears on the access denied page directing them to the relevant remediation steps. This provides fast, clear feedback on how to fix their device and reduces toil on the IT support teams.
In the next and final post in this series, we will discuss how we migrated services to be protected by the BeyondCorp architecture at Google.
In the meantime, if you want to learn more, you can check out the BeyondCorp research papers. In addition, getting started with BeyondCorp is now easier using zero trust solutions from Google Cloud (context-aware access) and other enterprise providers.
Thank you to the editors of the BeyondCorp blog post series, Puneet Goel (Product Manager), Lior Tishbi (Program Manager), and Justin McWilliams (Engineering Manager).