A couple of weeks ago I wrote an article about some issues with the Internet’s Public Key Infrastructure. In particular, I was looking at what happens if you want to ‘unsay’ a public key certificate and proclaim to the rest of the Internet that henceforth this certificate should no longer be trusted. In other words, I was looking at approaches to certificate revocation. Revocation is challenging in many respects, not the least of which is the observation that some browsers and platforms simply do not use any method to check the revocation status of a certificate and the resultant trust in public key certificates is uncomfortably unconditional.
I’ve had a number of conversations on this topic since posting that article, and I thought I would collect my own opinions of how we managed to create this rather odd situation where a system designed to instil trust and integrity in the digital environment has evidently failed in that endeavour.
I should admit at the outset that I have a pretty low opinion of the webPKI, where all of us are essentially forced to trust what is, for me, a festering mess of inconsistent behaviours and some poor operational practices that fail to even provide the palliative veneer of universal trust, let alone being capable of being a robust, secure and trustable framework.
So you’ve been warned: this is a strongly opinionated opinion piece!
We need a secure and trustable infrastructure. We need to be able to provide assurance that the service we are contacting is genuine, that the transaction is secured from eavesdroppers, and that we leave no useful traces behind us. Why has our public key certificate system failed the Internet so badly?
Is it that cryptography itself is letting us down?
It doesn’t appear to be the case. The underpinnings of public/private key cryptography are relatively robust, providing of course, that we choose key lengths and algorithms that are computationally infeasible to break. This form of cryptography is a feat worthy of any magical trick: a robust system where the algorithm is published, and even one of the two keys is published, but even when you provide material that was encrypted with the private key, this body of knowledge still makes the task of computing the private key practically infeasible. It’s not that the task is theoretically impossible, but it is intended to be practically impossible. The effort to exhaustively check every possible candidate value is intentionally impractical with today’s compute power and even with the compute power we can envisage in the coming years.
This bar of impracticality is getting higher because of continually increasing computational capability, and with the looming prospect of quantum computing. It’s already a four-year-old document, but the US NSA report published in January 2016 (CNSA Suite and Quantum Computing FAQ) proposes that a secure system with an anticipated 20-year secure lifetime should use RSA with key lengths of 3072 bits or larger and Elliptic Curve Cryptography using ECDSA with NIST P-384.
Let’s assume that we can keep ahead of this escalation in computing capability and continue to ensure that in our crypto systems the task of the attacker is orders of magnitude harder than the task of the user.
So it’s not the crypto itself that is failing us. Cryptography is the foundation of this secure framework, but it also relies on many other components. It’s these other related aspects of the PKI that are experiencing problems and issues. Here are a few:
- We are often incapable of keeping a secret. Anyone who learns your private key can impersonate you and nobody else can tell the difference.
- The relationship or authority that a public key certificate is supposed to attest might be subverted and the wrong party might be certified by a certification authority. And we’ve seen instances where trusted Certification Authorities (CAs) have been hacked, compromised or coerced to issue certificates to the wrong party under false pretences.
- The architecture of the distributed trust system used by the Internet’s PKI makes the system itself only as trustable as the worst-performing CA. It doesn’t matter how good your CA might be, if every user can be duped by a falsely issued certificate from a corrupted CA, then the damage has been done.
- For many years domain name certificates were orders of magnitude more expensive than domain name registrations, yet the system was continually undermined by poor operational practices with a result that these expensive instruments of trust were, in fact, untrustable.
- Trust anchors are distributed ‘blind’ by browser vendors and are axioms of trust: things certified by a trust anchor are automatically valid, but where a CA fails to maintain robust procedures and issues certificates under compromised conditions, we as users and end consumers of the security framework, are exposed without any conscious buy in: it just happens as a function of the existence of their trust anchor in our system’s software. We either have to be experts to know how to flush these out, or rely on others to help us update.
We’ve seen two styles of response to these problems. One is to try and fix these problems while leaving the basic design of the system untouched. The other is to run away and try something completely different.
Let’s fix this mess!
The fix crew have come up with many ideas over the years. Much of the work has concerned CA ‘pinning’. The problem is that the client does not know which particular CA issued the authentic certificate, so if any of the other trusted CAs have been coerced or fooled into issuing a false certificate, the user would be none the wiser when presented with this fake certificate. With around one hundred generally trusted CAs out there, there is an uncomfortably large attack surface. This has proved to be a tough problem to solve in a robust manner. The various pinning solutions proposed so far rely on an initial leap of faith in the form of ‘trust on first use’.
HTTP Public Key Pinning (HPKP) (RFC 7469) enjoyed some favour for a while but it has since been deprecated. The approach was to include a hash of the ‘real’ public key in the delivered web content. As the RFC itself conceded, it’s not a perfect defence against MITM attackers, and it’s not a defence against compromised keys.
If an attacker can intrude in this initial HTML exchange, then the user can still be misled.
One deployed pinning solution is effective, namely the incorporation of the CA fingerprint for a number of domain names into the source code of the Google Chrome browser. While this works for Google’s domain names, it obviously doesn’t work for anyone else, so it’s not a generally useful solution to the pinning problem inherent in a very diverse distributed trust framework.
The fix crew also came up with Certificate Transparency (RFC 6962). The idea is that all issued certificates should be logged and the log receipt attached to the certificate. Users should not trust a certificate unless there is a log receipt attached to the certificate. A fraudulently issued certificate would not be accepted by a user unless it also had a duly signed log receipt. So even though a bad actor might be able to coerce a CA to issue a fake certificate, to ensure that users will trust this certificate, the bad actor will still have to log the certificate and attach the log receipt to the certificate in order to have the intended victim(s) accept the certificate.
Each log entry is a certificate and its validated certificate chain. The logs are Merkle Tree Hash logs so that any form of tampering with the log will break the Merkle chain. The receipt of lodgement in one or more transparency logs is attached to the certificate as an extension. All this is intended to produce the result that an incorrectly issued certificate will be noticed. Users should not accept certificates that do not have an attached log receipt. A log may accept certificates that are not yet fully valid and certificates that have expired. As a log is irrevocable, revoked certificates are also maintained in the log.
Again, all this sounds far better than it really is. The case of Symantec certifying example.com is a good illustration as to why this approach has its weaknesses. It took six months for someone to notice that particular entry in the transparency logs! Yes, that’s six months! As long as attacks extend over weeks or months then these logs might be useful, but in a world where an attack takes just a few minutes and where the attacker really doesn’t care about the trail they leave behind, these certificate transparency logs are again merely palliative measures.
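To make the tamper-evidence property of these Merkle Tree Hash logs a little more concrete, here is a minimal sketch of the tree hash as I read it from RFC 6962: leaves are hashed with a 0x00 prefix, interior nodes with a 0x01 prefix, and the list is split at the largest power of two smaller than its length. The byte strings standing in for log entries are purely hypothetical placeholders, not real certificate chains, and this is an illustration rather than a log implementation.

```python
import hashlib
from typing import List


def leaf_hash(entry: bytes) -> bytes:
    # Leaf nodes are hashed with a 0x00 prefix.
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # Interior nodes are hashed with a 0x01 prefix over both children.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_tree_hash(entries: List[bytes]) -> bytes:
    """Compute the Merkle Tree Hash over an ordered list of log entries."""
    n = len(entries)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(entries[0])
    # k is the largest power of two strictly smaller than n.
    k = 1
    while k * 2 < n:
        k *= 2
    return node_hash(merkle_tree_hash(entries[:k]),
                     merkle_tree_hash(entries[k:]))


# Hypothetical placeholder entries, not real certificates.
log = [b"cert-1", b"cert-2", b"cert-3"]
root = merkle_tree_hash(log)
tampered = merkle_tree_hash([b"cert-1", b"cert-X", b"cert-3"])
print(root != tampered)
```

Changing any single entry changes the root hash, which is why a log operator cannot quietly rewrite history. It says nothing, of course, about how quickly anyone will actually look at what has been logged.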
The fix crew attacked the weak enrolment processes in certificates by creating a more rigorous form of enrolment termed ‘Extended Validation’ (EV) certificate. Aside from being a cynical exercise on the part of the certificate industry to create a more expensive class of certificates, these EV certificates appear to have been a complete failure. Users hardly notice the lock icon in the browser bar, and whether the lock is green-yellow or a shade of chartreuse is completely unnoticed. So the idea of making the certificate’s subject undertake more work, pay more money to generate a subtly distinguished public key certificate that produces barely noticeable results for end-users seems like a rather poor idea in my view!
And then there’s Let’s Encrypt who took the exact opposite path to try and “fix this mess”. Instead of expensive certificates that have a high-touch enrolment procedure, Let’s Encrypt went the other way with plentiful, free short-lived certificates issued through a fully automated process. It’s not that other CAs hadn’t fully automated their enrolment process, it’s just that Let’s Encrypt went there openly. The obvious outcome is that Let’s Encrypt is destroying any residual value in supposedly ‘high trust’ certificates by flooding the world with low trust (if any) short-term certificates. The proof of possession tests for such certificates are readily circumvented through either DNS attacks or host attacks on the web server systems. The counter argument is that the certificates are short-lived and any damage from such a falsely-issued certificate is time-limited. These certificates are good enough for low trust situations and nothing more, insofar as they provide good channel security, but only mediocre authenticity. But we are now dominated by the race to the bottom and these low trust certificates are now being used for everything, including fast attacks. After all, it’s not the CA you are using that determines your vulnerability to such attacks, but the CA that the attacker can use. A cynic might call this move to abundant free certificates with lightweight enrolment procedures a case of destruction from the inside.
No matter how hard the “let’s fix this” crew try, the window of vulnerability of fraudulently issued certificates is still around a minimum of a week, and the certificate system is groaning under even that load. So it looks pretty much as if the fix crew has failed. But there is a heap of money still in certificates, despite Let’s Encrypt, and a lot of people are still being paid to insist that the PKI certificate boat is still keeping itself above the waterline! They’re probably wrong, but their job is to deny that there is a problem even as their vessel is heading down to the seafloor.
The runaway crew headed to the DNS. The DNS is truly magical — it’s massive, it’s fast, it’s timely, and it seems to work despite being subject to consistent attacks of various forms and various magnitudes. And finally, after some 20 years of playing around, we have DNSSEC, which means that when I query your DNSSEC-signed zone I can assure myself that the answer I get from the DNS is authentic, timely and unaltered. And all I need to trust to pull this off is the root zone KSK key.
Not a hundred or so trust points, none of which back each other up, creating a hundred or more points of vulnerability, but a single anchor of trust. The DNS is almost the exact opposite of the PKI. In the PKI each CA has a single point of publication and offers a single service point. Part of the reason why OCSP is not well accepted is that CAs do not avail themselves of massively replicated service infrastructure. So, many trusted CAs, but a very limited set of CA publication points, each of which is a critical point of vulnerability. The DNS uses an antithetical approach. A single root of a name hierarchy, but with the name content massively replicated in a publication structure that avails itself of mutual backup. DNSSEC has a single anchor of trust, but with many different ways to retrieve the data. Yes, you can manage your zone with a single authoritative server and a single unicast publication point and thereby create a single point of vulnerability, but you can also avail yourself of multiple secondary services, anycast-based load sharing, and short TTLs that give the data publisher some degree of control over local caching behaviours.
The single trust model was, in fact, a tenet, a goal of the original RFC 3280 authors: they didn’t expect an explosion of many points of trust and had hoped the IETF was going to ‘step up’ and become some kind of de facto community-managed point of trust for most open-Internet contexts. This was never going to be accepted by banking and finance (who had already formed their closed group for credit cards) or the military (who had already adopted PKI for armed forces identity cards) or governments, but for common use among people, it would have been interesting had it become true.
Arguably DANE is a ‘third model’. It is a venerable adage in Computer Science that any problem can be solved (or at least pushed to be a different problem!) by adding another layer of indirection, and the IETF is nothing if not expert at adding extra complexity, trowelling it on as an added complex layer of indirection.
So DANE. Let’s put these public keys in the DNS. After all, the thing we are trying to associate securely is a TLS public key with a domain name. Why do we have these middleware notaries called CAs? Why not just put the key in the DNS? DANE was always going to be provocative to the CA industry, and predictably they were vehemently opposed to the concept. There was strong resistance to adding DANE support into browsers: DNSSEC was insecure, the keys used to sign zones were too short, but the killer argument was “it takes too much time to validate a DNS answer”. Which is true. Any user of CZNIC’s TLSA validator extension in their browser found that the results were hardly encouraging as the DNSSEC validation process operated at a time scale that set new benchmarks in slow browsing behaviour. No doubt the validator could’ve been made faster by ganging up all the DNSSEC validation queries in parallel, but even if it did this the additional DNS round trip time would still have been noticeable.
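As a rough illustration of what ‘putting the key in the DNS’ looks like, the sketch below builds the text form of a TLSA record under the common ‘3 1 1’ parameters (DANE-EE usage, SubjectPublicKeyInfo selector, SHA-256 matching type). The function name `tlsa_record` and the placeholder key bytes are my own inventions for illustration. A real record would hash the DER-encoded SubjectPublicKeyInfo extracted from the server’s actual certificate.

```python
import hashlib


def tlsa_record(host: str, port: int, spki_der: bytes, proto: str = "tcp") -> str:
    """Format a '3 1 1' TLSA record for the given service (illustrative helper)."""
    # The owner name encodes the port and transport protocol of the service.
    owner = f"_{port}._{proto}.{host.rstrip('.')}."
    # Matching type 1: SHA-256 over the selected data (here, the SPKI).
    digest = hashlib.sha256(spki_der).hexdigest()
    # 3 = DANE-EE usage, 1 = SubjectPublicKeyInfo selector, 1 = SHA-256.
    return f"{owner} IN TLSA 3 1 1 {digest}"


# Hypothetical placeholder bytes standing in for a DER-encoded public key.
placeholder_spki = b"not-a-real-subject-public-key-info"
print(tlsa_record("www.example.com", 443, placeholder_spki))
```

Once that record sits in a DNSSEC-signed zone, a validating client can check the key presented in the TLS handshake against it without consulting any CA, which is precisely what made the idea so provocative.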
The DNSSEC folk came up with a different approach. Rather than parallel queries, they proposed DNSSEC chained responses as additional data (RFC 7901). This approach relies on the single DNSSEC trust anchor. Each signed name has a unique validation path so the queries to retrieve the chain of interlocking DNSKEY and DS records are predictable, and it’s not the queries that are important, it’s the responses. Because all these responses are themselves DNSSEC-signed it does not matter how the client gets these responses: DNSSEC validation will verify that these are authentic, so it’s quite feasible for an authoritative server to bundle these responses up together with the response to the original query. It’s a nice idea as it cuts the DNSSEC validation overhead to 0 additional RTTs. The only issue is that it becomes a point of strain to create very large UDP responses, because the Internet is just a little too hostile to fragmented UDP packets. DNS over TCP makes this simple, and with the current fascination with TLS-variants of DNS over TLS (DOT) and DNS over HTTPS (DOH), adding a chained validation package as additional data in a TCP/TLS response would be quite feasible. However, the DNS has gone into camel-resistance mode these days and new features in the DNS are being regarded with suspicion bordering on paranoia. So far the DNS vendors have not implemented RFC 7901 support, which is a shame because eliminating the time penalty for validation makes the same good sense as multi-certificate stapled OCSP responses (RFC 6961 and RFC 8446). It’s often puzzling to see one community (the TLS folk) say that a concept is a good idea and see the same concept be shunned in another (the DNS folk).
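The predictability of that validation path is easy to sketch. The snippet below lists the DNSKEY and DS record sets a validator would need in order to walk from a signed zone up to the root trust anchor. It makes the simplifying assumption that every label boundary is a zone cut, which is not true of the DNS in general, and the function name `validation_chain` is hypothetical. It performs no DNS queries and no cryptographic checks.

```python
from typing import List, Tuple


def validation_chain(zone: str) -> List[Tuple[str, str]]:
    """List the (owner name, record type) pairs linking `zone` to the root.

    Assumption: every label boundary is a zone cut (a simplification).
    """
    labels = [label for label in zone.rstrip(".").split(".") if label]
    chain: List[Tuple[str, str]] = []
    for i in range(len(labels)):
        name = ".".join(labels[i:]) + "."
        # The zone's own keys, then the parent's signed pointer to them.
        chain.append((name, "DNSKEY"))
        chain.append((name, "DS"))
    # The root DNSKEY is matched against the locally configured trust anchor.
    chain.append((".", "DNSKEY"))
    return chain


for owner, rrtype in validation_chain("example.com."):
    print(owner, rrtype)
```

Since this list depends only on the name being validated, a server that already knows the name can assemble all the corresponding signed responses in advance and hand them over in one go.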
Then came DANE plus DNSSEC chain stapling as a TLS extension, similar to OCSP stapling. The fix folk were vehemently opposed. They argued that DNSSEC is commonly implemented in the wrong way (DNSSEC validation is commonly implemented in the recursive resolver, not on the client’s system in the stub resolver). The problem with today’s model of DNSSEC validation is that the end client has no reason to implicitly trust any recursive resolver, nor are there any grounds to believe that an open unencrypted UDP exchange between a stub resolver and recursive resolver is not susceptible to a MITM attack. So what we are doing today with DNSSEC validation in the DNS is just the wrong model, they claim, and there is a strong element of truth here. Every endpoint needs to perform DNSSEC validation for itself. We think that DNSSEC validation could scale from a few thousand recursive resolvers performing validation to a few billion end clients performing validation, as the additional load would be absorbed by these same recursive resolvers. But that’s not the only problem of scaling up the system to reach all the endpoints. For example, is a KSK roll still feasible when there are a few billion relying parties that need to track the state of the transition of trust from one key to the next?
But the current model of misplaced trust is not the only criticism of DNSSEC. DNSSEC’s crypto was too weak, they say. There is a common belief that everyone uses RSA-1024 to sign in DNSSEC and these days that’s not a very strong crypto setting. There is also the problem with stapled DNSSEC chain data that a man-in-the-middle can strip the stapled TLS extension, as there is no proof of existence. None of these are in and of themselves major issues, although the stripping issue is substantive and would require some signalling of existence in the signed part of the certificate.
However, it strongly looks as if the PKI folk want a PKI solution, not a DNS solution, and the DNS folk have largely given up trying to convince the PKI and browser folk to change their minds. So Firefox and Chrome continue to follow the ‘fix it’ path.
The TLS DNSSEC chain extension never got past a draft (draft-ietf-tls-dnssec-chain-extension-07) because the DNS proponents of that approach appeared to come to the realization that the PKI/browser folk were just too locked into a PKI-based approach and the PKI folk were convinced that patching up the increasing mess of PKI insecurity was a ‘better’ approach than letting DNSSEC into their tent. But maybe they have a point. Maybe it’s unwise to pin the entire Internet security framework onto a single key, the DNS root zone KSK, and hang all Internet security off this. Maybe it might be more resilient to use more than one approach so that we are not vulnerable to a single point of potential failure. And then there is the constant bleating of enterprises (and one or two national environments) who want mandatory HTTPS proxies so they can spy on what their users are doing. After all, they argue, deliberately compromised security for the ‘right’ motives is not a compromise at all!
Scaling is hard
The fundamental problem here is not (as was said at the start) the mathematics behind the cryptography. The problem is the organizational dynamics of managing these systems at scale, in a worldwide context. We not only have a distribution and management problem, we have different goals and intent: some people want to provide strong hierarchical controls on the certificates and keys because it entrenches their role in providing services. Some want to do it because it gives them a point of control to intrude into the conversation. Others want to exploit weaknesses in the system to leverage advantage. But end-users are simple. Users just want to be able to trust that the websites and services they connect to, and share their credentials and passwords with, are truly the ones they expected to be using. If we can’t trust our communications infrastructure then we don’t have a useful communications infrastructure.
What a mess!
The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC.