Short-Lived Certificates @ Netflix

Published in FACILELOGIN, May 11, 2016

Today we held our 6th Silicon Valley IAM meetup at the WSO2 office in Mountain View. We were glad to have Bryan Payne from Netflix speak on the topic ‘PKI at Scale Using Short-Lived Certificates’. Bryan leads the Platform Security team at Netflix; before Netflix, he was the Director of Security Research at Nebula. This post is based on Bryan’s talk at the meetup and other related resources.

What is a Short-Lived Certificate?

A short-lived certificate is identical to a regular certificate, except that the validity period is a short span of time such as a few days. Such certificates expire shortly, and most importantly, fail closed after expiration on clients without the need for a revocation mechanism. There are suggestions to make the certificate lifetime as short as four days, matching the average length of time that an OCSP response is cached.
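
The fail-closed behaviour can be pictured with a minimal Python sketch. It is not how any particular TLS library implements the check; the function name and the four-day dates are assumptions made up for illustration. The point is that the client needs nothing beyond its own clock and the dates stamped on the certificate.

```python
from datetime import datetime, timedelta, timezone


def is_within_validity(not_before: datetime, not_after: datetime, now: datetime) -> bool:
    """Return True only while `now` falls inside the certificate's validity window."""
    return not_before <= now <= not_after


# Hypothetical certificate issued for four days.
issued_at = datetime(2016, 5, 10, 12, 0, tzinfo=timezone.utc)
expires_at = issued_at + timedelta(days=4)

print(is_within_validity(issued_at, expires_at, issued_at + timedelta(days=3)))  # True
print(is_within_validity(issued_at, expires_at, issued_at + timedelta(days=5)))  # False
```

A stolen key therefore stops being useful at most four days after issuance in this example, without the client ever contacting the CA.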

Short-lived certificates are nothing new: research in this area goes back as early as 1998. What has changed is that our ecosystems are getting very large and far more automated, so all of a sudden short-lived certificates make a lot of sense. Had we talked about this four years ago, we would only have confused people. Today, everyone who hears about it is eager to find ways to get it done.

Bryan Payne from Netflix delivering his talk on short-lived certificates at the Silicon Valley IAM Meetup — @ WSO2–05/10/2016.

Server Public/Private Key Pair

Short-lived certificates need to be generated frequently. Does that mean that each time a new certificate is generated, the corresponding server has to generate a new public/private key pair, create a CSR and get it signed by the certificate authority? No. During the certificate signing process, the public key of the server is signed by the certificate authority together with the expiration date and other related metadata. When regenerating the certificate at short intervals, the same public key and metadata can be signed again with only the expiration date changed. In other words, one can keep generating short-lived certificates against the same server keys.

Bryan has a different view on this. If you are concerned about the server keys being compromised, it’s not a good idea to keep the same keys and renew the short-lived certificate with just a new expiration date. You need to regenerate the server keys, create a CSR and get a fresh certificate from the certificate authority. That’s the approach Netflix follows.
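
The two renewal styles can be contrasted with a toy Python sketch. Everything in it is an assumption made for illustration: an HMAC stands in for the CA's real asymmetric signature, random bytes stand in for a server public key, and the names `ToyCertificate`, `issue` and `new_key_placeholder` are hypothetical. It is not real X.509 and not Netflix's code.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ToyCertificate:
    subject: str
    public_key: bytes  # stand-in for the server's real public key
    not_after: datetime
    signature: bytes


def _payload(subject: str, public_key: bytes, not_after: datetime) -> bytes:
    # The fields the CA vouches for: identity, key and expiration date.
    return b"|".join([subject.encode(), public_key.hex().encode(), not_after.isoformat().encode()])


def issue(ca_secret: bytes, subject: str, public_key: bytes, not_after: datetime) -> ToyCertificate:
    # HMAC is only a stand-in for the CA's asymmetric signature.
    signature = hmac.new(ca_secret, _payload(subject, public_key, not_after), hashlib.sha256).digest()
    return ToyCertificate(subject, public_key, not_after, signature)


def verify(ca_secret: bytes, cert: ToyCertificate) -> bool:
    expected = hmac.new(
        ca_secret, _payload(cert.subject, cert.public_key, cert.not_after), hashlib.sha256
    ).digest()
    return hmac.compare_digest(expected, cert.signature)


def new_key_placeholder() -> bytes:
    # Random bytes standing in for the public half of a freshly generated key pair.
    return secrets.token_bytes(32)


ca_secret = secrets.token_bytes(32)
now = datetime(2016, 5, 10, tzinfo=timezone.utc)
first = issue(ca_secret, "service.example", new_key_placeholder(), now + timedelta(days=4))

# Style 1: keep the server key and move only the expiration date forward.
same_key = issue(ca_secret, first.subject, first.public_key, first.not_after + timedelta(days=4))

# Style 2 (the approach described in the talk): a fresh key for every renewal.
rotated = issue(ca_secret, first.subject, new_key_placeholder(), first.not_after + timedelta(days=4))

print(same_key.public_key == first.public_key)  # True
print(rotated.public_key == first.public_key)   # False
```

With the second style, a key that leaks is useful to an attacker only until the certificate bound to it expires; with the first, the leaked key keeps working for every future renewal.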

Why Short-Lived Certificates?

Certificate revocation is a hard problem to solve, even though multiple options are available:

CRL (Certificate Revocation List / RFC 2459)
OCSP (Online Certificate Status Protocol / RFC 2560)
OCSP Stapling (RFC 6066)
OCSP Stapling Required (draft-hallambaker-muststaple-00)

CRLs are not a widely used technique. The client that initiates the TLS handshake has to get the long list of revoked certificates from the corresponding certificate authority (CA) and then check whether the server certificate is on that list. For example, every single time it touches a web site, the browser has to download this lengthy document from the corresponding certificate authority. To avoid that, the browser can cache the CRL locally, but then security decisions are made based on stale data. Eventually people recognized that CRLs were not going to work and started building something new, which is OCSP.

The paper Towards Short-Lived Certificates[1] identifies the following four drawbacks of CRLs.
1. A study on real-world CRLs indicated that more than 30% of revocations occur within the first two days after certificates are issued. For CAs, there is a tradeoff between their CRL publishing frequency and operational costs. For CAs that update CRL with longer intervals, there is a risk of not blocking recently revoked certificates in time.
2. Since CRLs themselves can grow to be megabytes in size, clients often employ caching strategies, otherwise large transfers will be incurred every time a CRL is downloaded. This introduces cache consistency issues where a client uses an out-of-date CRL to determine revocation status.
3. Browsers have historically been forgiving to revocation failures (a.k.a “fail open”) so as not to prevent access to popular web sites in case their CAs are unreachable. In practice, they either ignore CRL by default, or do not show clear indications when revocation fails. Unfortunately, this lets a network attacker defeat revocation by simply corrupting revocation requests between the user and the CA.
4. It should also be noted that the location of the CRL (indicated by the CRL distribution point extension) is a noncritical component of a certificate description, according to RFC5280. This means that for certificates without this extension, it is up to the verifier to determine the CRL distribution point itself. If it cannot, CRLs may be ignored.
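
The cache consistency problem in the second drawback can be made concrete with a small Python sketch. The `CachedCrl` class, the 24-hour cache lifetime and the serial number are assumptions for illustration; real clients parse signed CRL files rather than sets of integers.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional, Set


class CachedCrl:
    """Caches a revocation list and downloads it again only after `max_age`."""

    def __init__(self, fetch: Callable[[], Set[int]], max_age: timedelta):
        self._fetch = fetch
        self._max_age = max_age
        self._serials: Set[int] = set()
        self._fetched_at: Optional[datetime] = None

    def is_revoked(self, serial: int, now: datetime) -> bool:
        if self._fetched_at is None or now - self._fetched_at >= self._max_age:
            # Copy the list so later changes at the CA are not visible until the next download.
            self._serials = set(self._fetch())
            self._fetched_at = now
        return serial in self._serials


# Hypothetical CA state: serial 42 gets revoked shortly after the first download.
ca_revoked: Set[int] = set()
crl = CachedCrl(fetch=lambda: ca_revoked, max_age=timedelta(hours=24))

t0 = datetime(2016, 5, 10, tzinfo=timezone.utc)
print(crl.is_revoked(42, t0))                        # False: nothing revoked yet
ca_revoked.add(42)
print(crl.is_revoked(42, t0 + timedelta(hours=2)))   # False: answered from the stale cache
print(crl.is_revoked(42, t0 + timedelta(hours=25)))  # True: cache expired and was refreshed
```

For 24 hours in this example the client keeps trusting a certificate the CA has already revoked, which is exactly the window a short validity period is meant to bound.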

In the OCSP world things are a little better than with CRLs. The browser or the TLS client can check the status of a specific certificate without downloading the whole list of revoked certificates from the certificate authority. In other words, each time the browser sees a web site, it has to talk to the corresponding OCSP responder to validate the status of the server certificate. That creates a huge amount of traffic on the OCSP responder. Clients can still cache the OCSP decision, but that leads to the same old problem of making decisions on stale data. OCSP also creates a single point of failure. Taking down a few OCSP responders via a DDoS attack could possibly take down the entire internet.

Google has announced plans to disable OCSP altogether in Chrome, one of the world’s most popular browsers, and instead reuse its existing software update mechanism to maintain a list of revoked certificates.

What if the OCSP responder fails to respond? Should the browser block the user from visiting the corresponding web site? Or just warn the user and give him/her the option to proceed? In most cases what happens is a soft failure, which is the latter.

The paper Towards Short-Lived Certificates[1] identifies the following four drawbacks of OCSP.
1. OCSP validation increases client side latency because verifying a certificate is a blocking operation, requiring a round trip to the OCSP responder to retrieve the revocation status (if no valid response found in cache). A previous study indicates that 91.7% of OCSP lookups are costly, taking more than 100ms to complete, thereby delaying HTTPS session setup.
2. OCSP may provide real-time responses to revocation queries, however it is unclear whether the responses actually contain updated revocation information. Some OCSP responders may rely on cached CRLs on their backend. It was observed that DigiNotar’s OCSP responder was returning good responses well after they were attacked.
3. Similar to CRLs, there are also multiple ways that an OCSP validation can be defeated, including traffic filtering or forging a bogus response by a network attacker. Most importantly, revocation checks in browsers fail open. When they cannot verify a certificate through OCSP, most browsers do not alert the user or change their UI; some do not even check the revocation status at all. We note that failing open is necessary since there are legitimate situations in which the browser cannot reach the OCSP responder.
4. OCSP also introduces a privacy risk: OCSP responders know which certificates are being verified by end users and thereby responders can, in principle, track which sites the user is visiting. OCSP stapling is intended to mitigate this privacy risk, but is not often used.

With OCSP stapling, the browser or the client does not need to go to the OCSP responder each time it sees a web site. The web server that hosts the web site gets the OCSP response from the corresponding OCSP responder and staples (attaches) it to the certificate it presents during the TLS handshake. Since the OCSP response is signed by the corresponding certificate authority, the client can accept it by validating the signature. This makes things a little better: instead of the client, the web server now has to talk to the OCSP responder. Even in this case, if the OCSP response is not attached to the certificate, the client falls back to a soft failure.

With OCSP Must-Staple, the server certificate itself promises the client that an OCSP response will be stapled to it during the TLS handshake. If the OCSP response is missing, the client must reject the connection immediately rather than soft-fail, and block the user from visiting the corresponding web site.
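
The difference between plain stapling and Must-Staple comes down to what the client does when no response is stapled. A simplified Python sketch of that decision follows; the `OcspStatus` enum and `accept_connection` function are hypothetical names, and the sketch assumes the client does not fall back to querying the OCSP responder itself.

```python
from enum import Enum
from typing import Optional


class OcspStatus(Enum):
    GOOD = "good"
    REVOKED = "revoked"


def accept_connection(stapled: Optional[OcspStatus], must_staple: bool) -> bool:
    """Decide whether the client proceeds, given the stapled status (None if absent)."""
    if stapled is OcspStatus.REVOKED:
        return False
    if stapled is OcspStatus.GOOD:
        return True
    # Nothing stapled: soft-fail (accept) unless the certificate demands stapling.
    return not must_staple


print(accept_connection(None, must_staple=False))  # True: soft failure, user proceeds
print(accept_connection(None, must_staple=True))   # False: hard failure, connection rejected
```

The soft-fail branch is what lets a network attacker defeat revocation by stripping the response; Must-Staple closes that gap at the cost of requiring every server to staple reliably.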

How Do Short-Lived Certificates Work? The Netflix Model

From the end user’s perspective, short-lived certificates behave the same way as normal certificates do today, except that they have a very short lifetime. The browser or the TLS client does not need to perform CRL or OCSP validation against short-lived certificates; it simply relies on the expiration time stamped on the certificate itself.

The challenge with short-lived certificates lies mostly in their deployment and maintenance. Automation comes to the rescue. This is not much different from the high-velocity deployments people have moved to today. In such deployments the key is to have a way to detect and fix failures with close to zero downtime, since failures are of course inevitable. Netflix uses the Simian Army, which is a whole set of tools that introduce chaos to the running system. This is done deliberately, and it helps Netflix to live in a world of chaos and recover from such situations with little or no impact on its business operations. Netflix also has Chaos Kong, which takes down not individual instances but a complete AWS region. These exercises are carried out at Netflix almost every month, but nobody even notices. The same model can be followed in a short-lived certificate deployment: it can be used to test the system, stress the system and make sure it works under various failure scenarios.

A layered approach for a short-lived certificate deployment.

Netflix suggests using a layered approach to build a short-lived certificate deployment. You would have a system identity, or long-lived credential, that resides in a TPM (Trusted Platform Module) or an SGX (Software Guard Extensions) enclave with strong protection around it. You then use that credential to get a short-lived certificate, and use the short-lived certificate for your web service, which is consumed by TLS clients. The web service can refresh its short-lived certificates regularly using its long-lived credential. Having the short-lived certificate is not enough on its own: the underlying platform that hosts the service (or the TLS terminator) must support dynamic updates to the server certificate. Many TLS terminators support dynamically reloading server certificates, but in most cases not with zero downtime.

It is assumed that long-lived credentials are well secured and hard to compromise. If that is the case, then why do we need short-lived credentials? Why not use the well-secured long-lived credentials themselves? The answer lies in performance. The long-lived credentials are protected by a TPM or SGX, and loading them very frequently is a costly operation.

At the end of the day Netflix proposes a deployment like the following. A developer commits code to the Git repository, and Spinnaker then takes care of continuous deployment and produces the AMI.

Spinnaker is an open source multi-cloud Continuous Delivery platform developed by Netflix for releasing software changes with high velocity and confidence.

During startup, each instance is provisioned with access to its long-lived and short-lived credentials. This credential bootstrap is done by Metatron, a credential management tool at Netflix. At the moment Metatron is in beta and there are plans to open source it in the future. Once the initial credentials are provisioned to the server instance, it talks to the Netflix Lemur API to get the short-lived certificates. Each time it gets a new certificate, the server environment is updated with it.

Netflix Lemur manages TLS certificate creation. While not able to issue certificates itself, Lemur acts as a broker between CAs and environments providing a central portal for developers to issue TLS certificates with ‘sane’ defaults.

Behind the scene — short-lived certificates.

Microservices and Short-Lived Certificates

Netflix has the largest microservices deployment on earth. They use short-lived certificates to secure the interactions between microservices — more precisely, Netflix uses TLS mutual authentication with short-lived certificates to secure inter-microservices communication. Each microservice can act as both a client and a service.
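
To give a feel for what TLS mutual authentication means on the service side, here is a minimal sketch using Python's standard `ssl` module. It is not Netflix's implementation; the function name and the file path parameters are placeholders. The essential setting is `CERT_REQUIRED`, which makes the server reject any client that does not present a certificate signed by a trusted CA.

```python
import ssl
from typing import Optional


def build_mtls_server_context(
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
    client_ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a server-side TLS context that requires a client certificate."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Mutual TLS: the handshake fails unless the client presents a valid certificate.
    context.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        # The service's own (short-lived) certificate and private key.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if client_ca_file:
        # The CA that issues certificates to the calling services.
        context.load_verify_locations(cafile=client_ca_file)
    return context


context = build_mtls_server_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # True
```

Since each microservice is both a client and a service, the calling side would load its own certificate into a client context in the same way, so that both ends of every connection prove their identity.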

The Netflix short-lived certificate deployment focuses only on traffic within the data center, not the public-facing front. Only server-to-server communication, that is, the communication between microservices, is protected with short-lived certificates.

Other Implementations/Research

The paper Towards Short-Lived Certificates[1] implements its certificate authority in Java, served over Apache Tomcat as a web application. The web server issues an HTTP GET request to the certificate authority server, specifying the common name for the certificate it wishes to retrieve as well as a unique certificate identifier. This identifier allows a web server to have multiple certificates under the same common name stored by the same certificate authority.

These identifiers are chosen by the owners of the servers when they register with the certificate authority for short-lived certificates. They allow a server to have multiple certificates under a common name, say if they wish to use a different private/public key pair or want a certificate that is a wild card certificate and one that is domain specific. In either the pre-signed or on-demand mode, the CA’s servlet looks for an appropriate certificate on the filesystem. In on-demand mode, the validity period of the matching template certificate is updated and signed with the certificate authority’s private key. The private key is stored encrypted on the certificate authority’s server, and is decrypted and brought into memory at start-up.

The signing and general certificate handling is done using the IAIK cryptography libraries. The pre-signed certificates are signed offline using a different key and are transferred to the servers manually. Each pre-signed and on-demand certificate is made valid for four days to match the length of time for which an OCSP response is cached.

They also implemented a server-side program in Java targeting Apache web servers. The program is set up as a cron job that executes every day. When the program runs, it checks whether the certificate is close to expiration. If it is, the program issues a GET request to the certificate authority for either a pre-signed or an on-demand certificate. Once the certificate is obtained, it is stored on the filesystem in the standard PEM format.

The Apache SSL configuration files are set such that the file locations of the certificates are symbolic links. When the new certificate is stored on the filesystem, the program has to re-point the symbolic link to the new certificate and optionally clean up old, expired ones. After this, the server certificate-downloading program issues a graceful restart command to Apache. This ensures the web server restarts and loads the new certificates without disrupting any existing connections.
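
The renewal job described above can be sketched in a few lines of Python (the paper's own program is written in Java). The function names, file naming scheme and placeholder PEM text are assumptions; fetching the certificate from the CA and the graceful Apache restart are deliberately left out. The sketch shows the two local steps: deciding whether the certificate is close to expiration, and re-pointing the symbolic link atomically so the web server never sees a missing file.

```python
import os
import tempfile
from datetime import datetime, timedelta, timezone


def needs_renewal(not_after: datetime, now: datetime, renew_before: timedelta) -> bool:
    """True once the certificate is within `renew_before` of its expiration."""
    return now >= not_after - renew_before


def install_certificate(pem_text: str, cert_dir: str, link_name: str, version: str) -> str:
    """Store a new PEM file and atomically re-point the symbolic link at it."""
    target = os.path.join(cert_dir, f"server-{version}.pem")
    with open(target, "w", encoding="ascii") as handle:
        handle.write(pem_text)
    link_path = os.path.join(cert_dir, link_name)
    temp_link = link_path + ".tmp"
    if os.path.lexists(temp_link):
        os.remove(temp_link)
    os.symlink(target, temp_link)
    # Renaming over the old link is atomic on POSIX filesystems.
    os.replace(temp_link, link_path)
    return target


expires = datetime(2016, 5, 14, tzinfo=timezone.utc)
print(needs_renewal(expires, datetime(2016, 5, 10, tzinfo=timezone.utc), timedelta(days=2)))  # False
print(needs_renewal(expires, datetime(2016, 5, 12, tzinfo=timezone.utc), timedelta(days=2)))  # True

with tempfile.TemporaryDirectory() as cert_dir:
    install_certificate("first placeholder PEM\n", cert_dir, "server.pem", "2016-05-10")
    install_certificate("second placeholder PEM\n", cert_dir, "server.pem", "2016-05-13")
    with open(os.path.join(cert_dir, "server.pem"), encoding="ascii") as handle:
        print(handle.read().strip())  # second placeholder PEM
```

With a job that runs once a day and certificates valid for four days, the renewal margin has to be comfortably larger than one day, otherwise a single failed run could let the certificate expire.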

In principle, the model proposed by the paper Towards Short-Lived Certificates[1] does not differ much from what Netflix follows.

The paper Protecting Browsers from Network Intermediaries [5] proposes short-lived certificates as an approach to improve TLS performance. Under this proposal, certificate authorities could configure the validity period of short-lived certificates to match the average validity lifetime of an OCSP response measured in the real world, which was 4 days. Such certificates expire quickly and, most importantly, fail closed (are treated as insecure) after expiration on clients without the need for a revocation mechanism. Furthermore, when a web site purchases a year-long certificate, the certificate authority’s response is a URL that can be used to download short-lived certificates on demand. The URL remains active for the year, but issues certificates that are valid for only a few days.
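
To get a sense of scale, the following Python sketch counts how many certificates such a URL would hand out over one year. It assumes back-to-back, non-overlapping four-day windows starting on a made-up date; a real deployment would more likely renew early so that consecutive certificates overlap.

```python
import math
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


def issuance_windows(
    start: datetime, subscription_days: int, validity_days: int
) -> List[Tuple[datetime, datetime]]:
    """Split a subscription into consecutive validity windows."""
    end = start + timedelta(days=subscription_days)
    count = math.ceil(subscription_days / validity_days)
    windows = []
    for index in range(count):
        not_before = start + timedelta(days=index * validity_days)
        # The last window is cut short so it never outlives the subscription.
        not_after = min(not_before + timedelta(days=validity_days), end)
        windows.append((not_before, not_after))
    return windows


windows = issuance_windows(datetime(2016, 5, 10, tzinfo=timezone.utc), 365, 4)
print(len(windows))  # 92
```

Roughly ninety issuances per site per year is why this model only becomes practical once both the certificate authority and the web server automate the whole cycle.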

Summary

Short-lived certificates are nothing new. The concept has been around for a couple of decades and it is slowly making its way into the mainstream. Most of the deficiencies found in CRL, OCSP, OCSP stapling and OCSP Must-Staple are addressed by short-lived certificates.

The short-lived certificates can be either used at the public facing front or just within a data center. The focus of Netflix at the moment is the latter. According to Bryan, it would take some time to build the certificate authority ecosystem to support short-lived certificates for public facing web sites.

References

[1] Towards Short-Lived Certificates: http://www.w2spconf.com/2012/papers/w2sp12-final9.pdf
[2] PKI at Scale Using Short-Lived Certificates: http://www.meetup.com/Silicon-Valley-IAM-User-Group/events/230537915/
[3] Improving Revocation: OCSP Must-Staple and Short-lived Certificates: https://blog.mozilla.org/security/2015/11/23/improving-revocation-ocsp-must-staple-and-short-lived-certificates/
[4] The current state of certificate revocation (CRLs, OCSP and OCSP Stapling): https://www.maikel.pro/blog/current-state-certificate-revocation-crls-ocsp/
[5] Protecting Browsers from Network Intermediaries: http://repository.cmu.edu/cgi/viewcontent.cgi?article=1430&context=dissertations
[6] An End-to-End Measurement of Certificate Revocation in the Web’s PKI: http://www.cs.umd.edu/~dml/papers/revocations_imc15.pdf