Reflections on certificates, Part 2

By Enno Rey on 27 Apr 2023

I had initially planned to focus Part 2 of this series (see Part 1) on discussing more use cases, but I thought I’d share some certificate best practices instead, to make this little series more practical 😉.

The following advice addresses three main risks:

1. Service outages due to expiring certificates or failing checks (as in: the TLS handshake terminates because ‘something doesn’t match’).
2. Compromise of the private key.
3. Violation of a security objective to which a specific certificate is supposed to contribute (you do use each certificate for an explicit purpose that you fully and clearly understand, right?). For example, a certificate is used for user authentication, but the only check the endpoint performs is whether the certificate is within its validity period.

It should be noted that measures mitigating (1) might increase the risks of (2) or (3) in the above list or the other way around. You’ll have to find the proper balance in your environment. This requires an understanding of the trust relationships, tradeoffs, and so on.

Furthermore, note that guidance coming from the fine folk in your infosec department is usually very much centred around (2) and (3). They don’t have to operate the services that actually employ certificates for one well-defined reason or another.

Being a fan of a ‘10 golden rules’ approach (see a similar post on IPv6 security), I’ll make it ten. Also, people working with certificates occasionally refer to a ‘certificate lifecycle’, which could look like the following. This can help to understand the process.

Figure 1: The certificate lifecycle.

Inventory

Understand and, potentially even better, document. Depending on your role, it’s okay to just reflect on these questions:

Where are certificates used in your environment?
What purposes are they used for (and which related checks are performed — see below on different types)?
What lifetimes do they have, and what happens when they end?

There’s a fair chance that some of your services connect to external systems (via HTTPS, evidently), so include those in this exercise. Apply the principles laid out here to them as well, in talks (or, for bold minds, audits) with the parties responsible for them.
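As a starting point for such an inventory, here is a minimal sketch in Python using only the standard library. It condenses a certificate into the facts asked about above: which names it covers and when it expires. The helper names `summarize_cert` and `fetch_cert` are illustrative assumptions, not part of any particular tool.

```python
import socket
import ssl
from datetime import datetime, timezone


def summarize_cert(cert: dict, now: datetime) -> dict:
    """Condense a dict from SSLSocket.getpeercert() into an inventory row."""
    # notAfter looks like "Jan  5 09:34:43 2018 GMT"; the ssl module can parse it.
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    # Keep only DNS names from the subject alternative names (SANs).
    dns_names = [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]
    return {
        "dns_names": dns_names,
        "expires": expires,
        "days_left": (expires - now).days,
    }


def fetch_cert(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Fetch the (validated) certificate a server presents. Needs network access."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()
```

For a list of endpoints you depend on, something like `summarize_cert(fetch_cert(host), datetime.now(timezone.utc))` per host gives a first, rough inventory. It only sees certificates that are reachable over TLS from where the script runs, so it complements documentation rather than replacing it.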

Be prepared

Reflect on failure scenarios and how you want to deal with them, and discuss those with the relevant stakeholders in advance. An outage caused by an expiring certificate might not be the best moment to debate whether it’s ok to disable certificate lifetime checks on a specific system or service, or to look for someone who approves the PR implementing such a change. Maybe even write down the results of these conversations (runbooks come to mind).

This should include documenting how to revoke a certificate in an emergency, such as a key compromise. This overall exercise mainly addresses risk (1), but the revocation procedure also addresses risk (2).

Protect the privkeys

Do whatever is needed to protect private keys. This can include storing them in an encrypted manner, using an appropriate passphrase (which you don’t store together with the keys 😉), strictly limiting access rights to them, and limiting transfers. This is meant to protect against risk (2).
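Two of these measures (restrictive access rights and passphrase protection) can be spot-checked with a few lines of Python. The sketch below assumes POSIX file permissions and PEM-encoded keys, and the function name `key_file_findings` is made up for illustration. The passphrase test is a simple heuristic that looks for the word `ENCRYPTED` in the PEM header, so treat a clean result as a hint, not as proof.

```python
import os
import stat


def key_file_findings(path: str) -> list:
    """Return human-readable findings for a private key file (POSIX only)."""
    findings = []

    # Any permission bit for group or others is more access than a key needs.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        findings.append(f"{path}: accessible by group/others (mode {mode:o})")

    # Heuristic: encrypted PEM keys carry "ENCRYPTED" in their header lines.
    with open(path, "rb") as handle:
        head = handle.read(4096)
    if b"-----BEGIN" in head and b"ENCRYPTED" not in head:
        findings.append(f"{path}: PEM key does not appear to be passphrase-protected")

    return findings
```

An empty list means neither of the two issues was spotted. The check says nothing about where the passphrase lives or how often the key has been copied around.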

Inevitability when installing a certificate

At the very moment of installing a certificate on a system, think long and hard about that future moment when it expires. Make sure that proper auto-renewal mechanisms are in place. In case of manual renewal, know who will be in charge, which steps to perform, and so on (did I already mention the value of runbooks?).

Align on use cases and objectives

A certificate is always used in a communication process (for example, between a client and a server, or between a user/system and a network device granting Wi-Fi or VPN access). These parties might belong to different organizations, have different security objectives, and might have a different understanding of what those security objectives imply for the (types and strictness of) checks to be performed. Aligning — via some type of communication — on those can have an impact both on avoiding and dealing with failure scenarios. I’m aware this may sound like a lot of overhead, but a short conversation in advance can save you from some headaches later.

These conversations may involve infosec folk, perhaps even on both sides. This can generally lead to interesting learnings, and to quite a few ‘oops, we thought it was ok to…’ moments. Remember the above example of performing certificate-based user authentication and just looking at the validity period? Of course, such a thing would never happen in real life. Never!

Automation is your friend

We all know that automating operational procedures is pretty much always a good idea, but there are probably not many domains where this is as true as when certificates come into play. This applies not only to renewal (where things have gotten significantly better in recent years) but also to initial deployment in distributed settings, for example, on load balancers or on Wi-Fi controllers (where, in some spaces, things might not yet be fully there). It pays to spend ~~some~~ significant energy on this; you will thank yourself later.

Understand which checks you really need

Generally, four types of checks can be differentiated:

Lifetime. This is the most basic check, and you might not even be able to disable it in a specific setting. You probably never want to ignore this one, but grace periods can save your ~~life~~ service uptime here and there, and that’s totally ok as long as the implications and tradeoffs (service availability vs strict security objectives) are well understood.
Identity. Again, this is a basic check (‘am I connecting to the right server, represented by the certificate that it shows me?’), but it raises the question ‘how to define identity?’. Which identity does a wildcard certificate constitute? Those are not in use in your environment, you tell me? Well, at times developers *love* them (and Let’s Encrypt might happily hand them out once one has passed the initial domain validation). Ok, I get it, that’s only in dev, not in prod, you say? Also, it’s a common approach to use subject alternative names (SANs) in load-balanced settings, which can lead to interesting situations during troubleshooting. In short, defining and checking identity might be more complex than it seems.
Other checks on various fields of a certificate (for example, parsing a piece from the distinguished name to determine some group membership, which in turn, leads to some security decision like authorizing access to a resource). In the context of this post, I have just one piece of advice for you — don’t!
Revocation checks. As I stated before, revocation checking usually opens a whole new can of worms, and it’s probably in this space where the objectives of operations personnel and infosec people most heavily differ. This brings me directly to the next point.
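To make the check types a bit more tangible, here is a small Python sketch. It shows what a default TLS client context in Python's standard `ssl` module enforces, plus a hypothetical application-level lifetime check with a grace period. The `lifetime_ok` function and its `grace` parameter are assumptions for illustration; the `ssl` module itself offers no grace-period setting.

```python
import ssl
from datetime import datetime, timedelta, timezone


def lifetime_ok(not_before, not_after, now, grace=timedelta(0)):
    """Lifetime check; `grace` tolerates a certificate that expired recently."""
    return not_before <= now <= not_after + grace


# What a default client context enforces out of the box:
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # chain and lifetime are verified
print(context.check_hostname)                    # identity: hostname vs certificate names
# CRL-based revocation checking is opt-in, so this flag is not set by default.
print(bool(context.verify_flags & ssl.VERIFY_CRL_CHECK_LEAF))
```

Whether a grace period is acceptable at all, and how long it may be, is exactly the kind of tradeoff between availability and strict security objectives that should be agreed on beforehand.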

Be careful with revocation checking

Revocation checking brings new entities, roles, responsibilities, and processes into the picture. These can lead to all types of outage scenarios. On the other hand, you have to deal with the capability-inherent issue of revocation discussed in the first post. I know a number of environments explicitly foregoing revocation checking, for good operational reasons. Short lifetimes and proper renewal procedures can help to mitigate the related risks (‘compensating controls’ is favourable language when you talk to your infosec group or to ‘the auditors’).

Monitoring and alerting

Take care of proper monitoring and alerting, especially (but not only) in the context of expiring certificates. Activities in this domain mostly address risk (1). I will cover approaches and tools in more detail in a future post, but for now, from an operations perspective, this can be considered to be the most important element of this list, together with the next one.
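Until that post, here is a minimal sketch of the core of an expiry alert in Python: map the remaining lifetime to an alert level. The thresholds of 30 and 7 days and the level names are assumptions chosen for illustration, not recommendations; pick values that match your renewal cadence.

```python
from datetime import datetime, timezone

# Hypothetical thresholds in days; tune them to your renewal cadence.
WARN_DAYS = 30
CRITICAL_DAYS = 7


def expiry_status(not_after: datetime, now: datetime) -> str:
    """Map a certificate's remaining lifetime to an alert level."""
    remaining_days = (not_after - now).total_seconds() / 86400
    if remaining_days < 0:
        return "EXPIRED"
    if remaining_days <= CRITICAL_DAYS:
        return "CRITICAL"
    if remaining_days <= WARN_DAYS:
        return "WARNING"
    return "OK"
```

With auto-renewal in place, a `WARNING` is less a reminder to renew and more a signal that the renewal automation itself has stopped working.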

Use auto-renewal wherever you can

This is simply based on the observation that certificate expiry is the most common cause of certificate-related outages. Automatic renewal (at least for the majority of certificates) is a must in most environments, and supporting technologies like ACME exist these days. Two quick notes here:

Do you want to immediately revoke an old certificate once a new one is generated? Doing so can avoid all types of interesting situations resulting from temporary co-existence, but it might also prevent you from rolling back changes in case that’s required.
Keep in mind that just pushing the new certificate(s) might not be enough. Very often services have to be restarted to use new certificates.
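The second note can be turned into a simple post-renewal check: compare the certificate a service actually presents with the one on disk. The Python sketch below assumes the PEM file contains a single certificate (the leaf, without a chain); the function names are made up for illustration.

```python
import hashlib
import ssl


def fingerprint(der: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate."""
    return hashlib.sha256(der).hexdigest()


def service_uses_cert_on_disk(served_der: bytes, pem_on_disk: str) -> bool:
    """True if the certificate a service presents equals the one on disk.

    Assumes `pem_on_disk` holds exactly one certificate (no chain).
    """
    return fingerprint(served_der) == fingerprint(ssl.PEM_cert_to_DER_cert(pem_on_disk))
```

The served certificate can be obtained from a live TLS connection with `SSLSocket.getpeercert(binary_form=True)`. If the comparison fails after a renewal, the service most likely still holds the old certificate in memory and needs a reload or restart.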

As a bonus, all of the elements of the certificate infrastructure itself, in particular the CRL, should support IPv6 😉.

tl;dr

To increase the maturity of certificate use within an environment, the following recommendations are worth considering:

Inventory
Be prepared
Protect the privkeys
Inevitability when installing a certificate
Align on use cases and objectives
Automation is your friend
Understand which checks you really need
Be careful with revocation checking
Monitoring and alerting
Use auto-renewal wherever you can

I’m always happy to receive feedback or comments on practices in your world of certificates. Thank you for reading this far, and stay tuned for the next post of the series.

Enno Rey is a long-term IPv6 enthusiast with extensive practical experience in the space.

This post was originally published on the Internet Protocol Blog.

