Towards The Next DNS Fix

Ultimately, I can’t at all complain about armchair engineering.  The whole point of Source Port Randomization as an interim fix was to get things to the level that we could all have the big messy discussion about what to do now, without being illuminated by the actively burning state of the DNS infrastructure.

Now.  When it comes to fixing DNS, we have to operate under the same constraint as when we suggest fixes to web browsers.  Just as you’re not allowed to break the web, you’re not allowed to break DNS.  There are indeed many things we could do to make the web a safer place, “if only a bunch of people would re-code their web sites”.  That is, unfortunately, a naive approach that doesn’t actually lead to things getting any safer.  If nobody will deploy the fix, it’s just as if the fix didn’t happen.

We needed this DNS fix to happen.

As I’ve said a couple of times, Dan Bernstein was right.  Source Port Randomization (SPR) is not perfect — I’m pretty embarrassed that we didn’t recognize how common interactions would be with firewalls — but it’s a remarkably flexible and thorough improvement to the status quo.  When I said in my talk that there’s fifteen ways around the TTL, I wasn’t kidding.  From magic query types that are uncached by a recursive server, to nonexistent query types that are ignored by an authoritative server, there may not be a TTL to override.  Or perhaps the attacker actually provides records for 1.google.com, 2.google.com, 3.google.com, and so on.  In other words, the attacker might not even try to overwrite the NS for a domain — he may just want to get a domain in.  How would this be useful?  Consider the web security model, and Mike Perry’s research on cookies.  1.google.com will collect the cookie for Google just fine.

Or perhaps, as in the case of Google Analytics and Facebook and most large, CDN hosted sites, the actual TTL to override needs to be small, for reliability and scaling purposes.

In all of these situations, Source Port Randomization — a solution forged in 1999, long before we recognized all these problematic variant attacks — poses a significant barrier to attack.  It’s not a panacea, but it was never said to be one.  The hope, and it’s not unreasonable, is that it’s a lot easier for secondary defenses to detect and correct for a flood of billions of packets, than a couple of thousand.  SPR’s purpose was to provide a safer environment for an active discussion that would hopefully yield better fixes.  And that’s what it’s doing!

So, let’s finally start talking about the better fixes that are emerging.  Specifically, the problem is — how do we stop the blind attacker who’s willing to send us four billion packets in order to pollute a name?  Four major strategies are, at least from what I’ve seen, making real strides towards a better fix.

1) DNSSEC. Say what you will about the perceived technical and political impossibility of this actually happening, but wow there’s been progress these last few weeks:  Besides lots of excited chatter that the roots are finally going to get signed, .GOV seems to be throwing some pretty serious resources at making DNSSEC happen. I’m neutral thus far on all the post-SPR solutions, and I’m really, aggressively neutral on DNSSEC.  The reality is there’s no harder task in all of IT than building a PKI, and the inescapable reality is that DNSSEC is a new identity infrastructure on the order of X.509.  It does solve the problems though, at least for the authoritative servers that opt into it, and the side benefits of having the system fixed in this particular way are rather compelling.

2) Layered Point Fixes. This is the approach Nominum is taking:  Basically, they’re bundling every point fix they can, and actively getting themselves into the position with their customers that as new bypasses are discovered, they can react quickly.  For example, when Nominum receives a packet with an incorrect TXID, they switch to TCP for that particular query.  This constrains an attacker in two ways:  First, they must force as many lookups as there are fake responses.  In other words, instead of being able to send 99.8% fake responses for each forced request, the attacker must send 50% requests, 50% responses.  Second, the attacker is constrained to the query rate that Nominum will actually send queries to a particular domain.

That alone, is not enough.  A slightly less efficient attack does not a fix make.  And so they port randomize.  But that too, is not enough — at least not for the long term.  And so they’re systematically building filters that attempt to detect as many weird variants as possible and attempt to address them on an attack-by-attack basis.

It’s certainly my preference to have a comprehensive fix.  But, pragmatically, I can’t deny that Nominum’s approach is yielding an increasingly harder target.

3) Attack Mode.  I’ll admit, this one appeals to me — that’s a change, I used to be a pretty staunch opponent, as I expect many people to still be.  But bear with me for a second.  Probably the most consistent signal of a blind cache poisoning attack is a spike in the number of responses received per second with an incorrect TXID (and, if you’re monitoring the network, incorrect destination port).  Even with a fully non-responsive upstream name server, this signal still survives, as the attacker needs to guess transaction IDs and ports and is going to guess wrong for a very long time.  This appears to hold true for all variants, known and even suspected.  Now, the concept of the SPR interim defense is that the brute force will either go too slow to be relevant for an attacker, or fast enough that the raw traffic levels will be noticed by even trivial network monitoring.

We can do better monitoring of DNS traffic with an IDS rather than just a traffic monitor, but you know who’s in a really good position to notice this attack?  The name server itself.  There’s no reason, inside the name server, that we can’t adapt to the attack — and change our posture to compensate.

Imagine for a moment that we monitored the absolute number of packets received with at least the wrong TXID.  (Depending on how we manage sockets, we might not see all the packets with the wrong source port.  We may not need to, or if we do, we can do so fairly trivially with libpcap filtering for source port 53.)  Assuming we were indeed receiving too many packets with the wrong transaction ID, we could deem ourselves…under attack.  What now?

I’ll tell you what we probably shouldn’t do:  Rate limit, either for all IP addresses, or for those that are specifically being spoofed.  (Remember, DNS servers enforce source address on incoming packets so they can correctly calculate bailiwicks — whether a particular server is allowed to speak for a given name in the first place.)  The problem with rate limiting is that, while it works very well to slow an attacker down, it also provides an attacker with a very consistent way to implement targeted denial of service attacks against DNS infrastructure.  Just flood bad replies, and the real reply will consistently get dropped.

A lot of security people are willing to tolerate DoS, in lieu of data corruption.  On one level, yes, it’s true, I’d rather have no service than corrupted service.  On the other, no service is in and of itself bad for business.  A trivial DoS that takes out Google for an ISP is more than just a problem — it’s a deployment blocker.

Again.  If nobody deploys your fix, it’s like you didn’t even write it.

That being said, DNS is a cruel mistress.  Due to the chained nature of DNS, reliable DoS attacks actually enable data corruption, by allowing an attacker to break the chain.  This has already been shown to cause headaches when an IPS blocks traffic to an authoritative server (mentioned earlier, and described in depth in my 2005 Black Ops talk).  But there are also implications to DNS clients, who will themselves now end up with nothing in their cache because a rate limited server couldn’t collect the data in the first place.

So, we shouldn’t drop traffic.  What can we do?  Perhaps, switch to TCP during the attack?  We know Nominum does this, at least on a per-query basis, when it detects an attack for that particular query.  So there’s some precedent.  But the resistance and nervousness around anything that allows you to force large numbers of servers to switch to TCP, for any reason, is significant.  It’s also impossible to ignore that a decent portion of recursive name servers cannot get 53/tcp out of their network, and that there are even a good number of authoritative name servers that refuse to host their DNS records over TCP.

There’s much less fear around debouncing — at least, well scoped debouncing.  This is just the technical way of saying, if you’re not sure about something, look it up twice.  You do need to make sure you get the same answer back both times — or else an attacker just forces you to debounce, and hopes he gets his contrary answer in both times.  And there remain interesting questions about what to do when the answers legitimately differ, because they come from a CDN that shuffles responses on a per-response basis for load balancing.  What now?  I’d like to avoid TCP, and triple and quadruple querying is only a little more likely to generate multiple queries with the same reply.  One option is to make use of this trick thought up by this neat new nameserver Paul Vixie showed me — I can’t find it right now, but I’ll put a link up once I do.  The idea he had was to wait around a few hundred milliseconds, seeing if a real server would show up with another reply.  If so, there’s an attack.  Now, when he did this, he was doing it all the time, so it was killing performance on DNS for all users of the protocol (again, deployment blocker).  But we’d only be doing this in attack mode.
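The look-it-up-twice idea can be sketched roughly as follows.  This is only an illustration under assumptions: `debounce`, `ask`, and the notion of an answer as a set of record values are hypothetical names, not taken from any real name server.

```python
from typing import Callable, FrozenSet, Optional

# Hypothetical representation: the set of record values carried by one reply.
Answer = FrozenSet[str]


def debounce(ask: Callable[[str], Optional[Answer]], name: str) -> Optional[Answer]:
    """Query twice and accept the answer only if both replies agree.

    `ask` is a stand-in for whatever sends a single query and returns a
    single reply, or None if nothing came back in time.
    """
    first = ask(name)
    second = ask(name)
    if first is None or second is None:
        return None  # no answer at all: the caller decides what happens next
    if first != second:
        return None  # disagreement: an attack, or a CDN shuffling its replies
    return first
```

Comparing sets rather than ordered lists tolerates a server that merely reorders the same records, but not one that hands out different subsets on each reply, which is exactly the CDN case that still needs another answer.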

Yes, I think Akamai would accept slightly slower DNS resolution during an active attack against their particular names, on the particular name server that’s being attacked.

There is one funny variant we’d need to handle, if we were to depend on the real name server exposing the fake reply.  What if the real name server is non-responsive, for whatever reason?  I think the answer here is to handle situations where no answer comes back, by then and only then refusing to accept any packets from that IP address for ten seconds.  In other words, if a query fails, and nobody replies successfully, blackhole that server for ten seconds.  Legitimate servers have an easy way around this DoS — actually respond to that first query — so I think it’s the one DoS I can accept.

One matter that hadn’t really come up was scope.  There are three scopes we can defend against:  Per-query, per-NS, and global.  In other words, we can apply attack mode logic, whatever it may be, to one specific query, all queries to a name server that we see under attack, or all queries in the world.  My suspicion is that unless we actively detect attacks against just an absurd number of name servers (in other words, if the absolute number of incorrect TXIDs is not accounted for by any particular NS, thus meaning an attacker who doesn’t care which names he poisons as long as he gets someone), then per-NS scope is good.

I don’t like per-query, due to variants that it’s just not going to cover.  There’s some controversy here too, though; “query-fate-sharing” scares people a little.

So, in summary, all this ends up collapsing to some variant of:

Monitor the absolute rate of packets received with the wrong TXID, and possibly Port.  (BIND already does this — check the stats code.)
If the rate of packets exceeds some threshold — possibly dynamically set by the number of outstanding queries per second — start tracking which IP’s are “sending” packets with the wrong TXID/Port.
If there are too many NS’s to track, go into global attack mode.  Otherwise, go into per-NS attack mode for those NS’s, for ten seconds.  Hold this attack mode open as long as the spawning incorrect TXID/Port behavior continues, plus twenty seconds.  (This prevents twiddling attack mode on and off really fast, which defeats the purpose.)
During attack mode, debounce within the scope of that attack mode.  If two answers are received that disagree, issue a single query, and make sure one and only one reply comes back.  If no replies come back, suppress queries to that address for some small number of seconds.
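The detection half of those steps could look something like the sketch below.  Everything in it is an assumption made for illustration: the class name `AttackMonitor`, the sliding-window bookkeeping, and all of the default constants are placeholders, and the caller is assumed to supply timestamps and the (possibly spoofed) source address of each wrong-TXID reply.

```python
from collections import deque


class AttackMonitor:
    """Tracks wrong-TXID replies and decides which scope is in attack mode."""

    def __init__(self, threshold=100, window=1.0, hold=20.0, max_tracked=50):
        self.threshold = threshold      # wrong-TXID replies per window (placeholder)
        self.window = window            # length of the counting window, in seconds
        self.hold = hold                # seconds to stay in attack mode after the last trigger
        self.max_tracked = max_tracked  # more suspect name servers than this means global mode
        self.recent = deque()           # (timestamp, claimed source IP) of wrong-TXID replies
        self.ns_until = {}              # name server IP -> time its attack mode expires
        self.global_until = 0.0         # time global attack mode expires

    def record_bad_txid(self, src_ip, now):
        """Note one reply with a wrong TXID and update attack-mode expiry times."""
        self.recent.append((now, src_ip))
        # Forget replies that have fallen out of the counting window.
        while self.recent and self.recent[0][0] <= now - self.window:
            self.recent.popleft()
        if len(self.recent) < self.threshold:
            return
        # The "senders" are the name servers the attacker is impersonating.
        suspects = {ip for _, ip in self.recent}
        if len(suspects) > self.max_tracked:
            self.global_until = now + self.hold
        else:
            for ip in suspects:
                self.ns_until[ip] = now + self.hold

    def in_attack_mode(self, ns_ip, now):
        """True if queries to this name server should be debounced right now."""
        return now < self.global_until or now < self.ns_until.get(ns_ip, 0.0)
```

Because every triggering packet pushes the expiry time forward, attack mode stays open for as long as the bad replies keep arriving plus the hold period, which is what keeps it from being twiddled on and off.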

The actual thresholds and constants would need to be figured out, but that’s roughly something I’m liking right now.  Sure, it looks complicated, but amusingly it’s still the simplest of the solutions listed thus far!

4) Case Sensitive DNS Responses (or ‘0x20’). This is David Dagon’s concept, and it’s interesting.  The concept is that DNS ignores case (www.foo.com is wWw.FOO.coM) but preserves case (if you ask for wWw.fOO.coM, you’ll get back wWw.fOO.coM).  So if we want more bits of randomness — if we want to get past 4 billion packets into more-packets-than-have-ever-been-sent-in-history — maybe we can use this trait.  As mentioned earlier, the problem with 0x20 is that an attacker can select names that don’t have enough case sensitive characters to add entropy.  Specifically, you can have numbers in a DNS name!  And so, when an attacker forces lookups for:

1.a11111111
1.a11111112
1.a11111113
1.a11111114
1.a11111115

0x20 can only provide one additional bit of entropy — and it’s not clear that one a is even required (it’s there to deal with the complaint ‘well, we’ll just detect completely numeric domains’).  And since all the above names have to be queried against the root servers, whoever corrupts those names gets to include whatever extra records he wants, because they’re all in bailiwick.  This is the exact problem that DNSSEC has — securing www.foo.com doesn’t just require securing foo.com, you also have to secure com and the roots themselves.  (XQID thought they got around this.  So close, but no.  I’ll post why later — this post is about fixes.)

Bottom line, 0x20 can’t secure the roots when there’s not enough characters to add sufficient entropy.

That being said, almost all real world names do have enough characters in them to add lots of entropy.  In fact, of all the non-DNSSEC solutions, 0x20 is the one that can not only work for the common case, but survive without source port randomization.  (The attack mode above just doesn’t work well enough when the attacker has a 1/65K chance of winning.)  It does need some coverage in those synthetic cases where there’s not enough entropy, or even in the real world cases of very short domains (ibm.com, for example).
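To make the entropy argument concrete, here is a minimal sketch of the case-randomizing idea.  The function names are hypothetical, and it assumes each ASCII letter in a name contributes exactly one coin-flip bit on top of the 16-bit TXID and roughly 16 bits of source port.

```python
import secrets


def case_bits(name: str) -> int:
    """Bits of entropy case randomization can add: one per ASCII letter."""
    return sum(1 for ch in name if ch.isascii() and ch.isalpha())


def randomize_case(name: str) -> str:
    """Pick upper or lower case for each letter with an unpredictable coin flip."""
    return "".join(
        (ch.upper() if secrets.randbits(1) else ch.lower())
        if ch.isascii() and ch.isalpha()
        else ch
        for ch in name
    )


def reply_matches(sent: str, echoed: str) -> bool:
    """Accept a reply only if it echoes the exact case pattern that was sent."""
    return sent == echoed


for name in ("1.a11111111", "ibm.com", "www.foo.com"):
    print(name, case_bits(name), "extra bits")
```

Under those assumptions the numeric name gains a single bit, ibm.com gains 6, and www.foo.com gains 9.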

Well, we have an entire debouncing framework described for Attack Mode.  Could we debounce when we don’t get enough entropy from the name?  Or perhaps we do so only when we detect 0x20 under attack, or when it is deployed on a network that, on either the authoritative or recursive side, canonicalizes away the case variation?

I’m not sure what the exact fix looks like.  But what’s clear to me now is what I was pretty sure of back in March:  The real fix, the comprehensive fix, is not going to be trivial.  It may be DNSSEC, it may not be, but it’s not going to be a one-character call-it-a-day point fix.  Say what you will about Source Port Randomization — conceptually, it’s several orders of magnitude cleaner than everything that’s yielding fruit now.  Dan Bernstein’s solution is good.  Doing better — by crypto, by filtering, by defending ourselves, or by another entropy source — will be hard.

Not impossible, but not the sort of thing 16 engineers in a room could pragmatically hope to accomplish.

Please Do Not Destroy The DNS In Order To Save It

So someone put together a “one character” patch to fix the “dns flaw”, and it hit Slashdot.

Would that one character could really save the day here.

There’s a lot wrong here, the key fact being there are just so many ways around TTL, which itself was never designed to be a security technology in the first place. Gabriel’s trick addresses one particular scenario. It’s not at all enough. Consider:

First of all, you don’t actually know that a nameserver is ever going to provide you a record, or that that record is going to be cached. We’re seeing bugs in both conditions. For example, PowerDNS wasn’t providing responses on strange query types. CNN doesn’t reply at all to nonexistent names. So there may not be a TTL to bypass.

Secondly, the more major the site, the smaller the TTL. One of the issues described in my slides was the fact that nothing prevents an attacker from replying multiple times to a single outbound query. Presume you can get 500 replies in before the real server does. Given that, you have about a 1 in 131 chance of hijacking the record. With Google Analytics’ TTL at 300, that’s about 5 hours on average — and you don’t have to send 4 billion packets, you’re still sending just a couple tens of thousands.

If Google Analytics gets taken, the web pretty much gets taken — welcome to the power of <script src="http://www.google-analytics.com"> putting foreign code into DOM’s around the world.

And it’s not like 300 is unusually low. Facebook’s at 30 seconds. That translates to about 30 minutes of security for Facebook — or their pizza’s free :)
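The arithmetic behind those two figures can be reproduced with a short sketch.  One assumption is labeled loudly here: the “about 5 hours” and “about 30 minutes” numbers line up with an attacker working through half of the roughly 131 race windows, one window per TTL expiry, so that is what `halfway_hours` computes.  The function names are hypothetical.

```python
TXID_SPACE = 2 ** 16  # the 16-bit transaction ID: 65536 possible values


def windows_per_win(replies_per_window: int) -> float:
    """Race windows per successful guess, i.e. 1 / (chance of a hit per window)."""
    return TXID_SPACE / replies_per_window


def halfway_hours(ttl_seconds: int, replies_per_window: int = 500) -> float:
    """Hours needed to get through half of those windows (assumed reading)."""
    return windows_per_win(replies_per_window) / 2 * ttl_seconds / 3600


print(round(windows_per_win(500)))        # about 131
print(round(halfway_hours(300), 1))       # TTL of 300 seconds, in hours
print(round(halfway_hours(30) * 60))      # TTL of 30 seconds, in minutes
```

Treating each window as an independent trial instead would give an expected wait of about twice these values, which does not change the conclusion.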

But there are records that do have long TTL’s, and that’s where things get really dicey. The records with the longest TTL’s in the world are all name server records. Google’s NS records have TTL’s at 345K seconds. Microsoft’s NS records have TTL’s at 143K seconds. Whether that’s a good idea or a bad idea, it’s reality. We allow in-bailiwick overwrite of cached NS records precisely because these very long TTL’d records sometimes need to be overwritten anyway. When Gabriel writes:

What’s the downside to my patch ? I guess we are now holding an
authoritative server to the promise not to change the NS record for
the duration of the TTL, which is kinda what the TTL is for in the
first place :)

What he’s saying is that Google and Microsoft should accept situations where their website is down for up to 95 hours (still too long). Now, granted, almost nobody’s going to actually hold onto a cached record for that long. But a single point of failure causing up to a week of residual outage out in the field is a very bad thing. A one character patch that caused such failures would be a serious problem indeed.

Now, all this being said, there’s lots of interesting thinking going on out there, and one of the things we all fully expected was a healthy discussion of all the possible options on the table. Maybe there’s a little more press than expected on one of those options, but I do think it’s good that we can now all see just how careful we need to be fixing this bug. There are a couple of approaches that are in fact converging on a safe and effective fix to the DNS, and I’ll be writing about them soon. In the meantime…nobody should presume any easy fix will actually solve the problem.

The Emergence Of A Theme

I’m not sure what it is, but there continues to be some sort of “competition” for “who can find the biggest bug” — as if attackers had to choose, and more importantly, as if any bug was so big that it could not be made even better by combined use with its “competition”.  Before my DNS talk, my old friend FX from Recurity Labs was comparing DNS issues to the Debian Non-Random Number Generator issue that caused all sorts of SSL certificates to offer no security value, and the SNMPv3 flaws that allowed infrastructure devices to be remotely administered by people who happened not to know the password.

Of course, after the talk, it became clear that the DNS hack and the Debian NRNG combined rather destructively — DNS allowed you to finally play MITM with all the SSL private keys you could trivially compute, and as Ben Laurie found, this included the keys for Sun’s OpenID authentication provider.  And, since the DNS hack turns Java back into a universal UDP and TCP gateway, we end up being able to log into SNMPv3 devices that would otherwise be protected behind firewalls.

So there’s no sense making a competition out of it.  There’s just an ever growing toolchest, growing from a single emerging theme:

Weaknesses in authentication and encryption, some of which have been known to at least some degree for quite some time and many of which are sourced in the core design of the system, continue to pose a threat to the Internet infrastructure at large, both by corrupting routing, and making those corrupted routes problematic.

Back in July, the genuinely brilliant Halvar Flake posted the following regarding the entire DNS issue:

“I fail to understand the seriousness with which this bug is handled though. Anybody who uses the Internet has to assume that his gateway is owned.”

And thus, why 75% of my Black Hat talk was on the real-world effectiveness of Man-In-The-Middle attacks: Most people aren’t as smart as Halvar.  I’m certainly not :)  Almost nobody assumes that their gateway is owned — and even those that do, and try to engineer around it, deploy ineffective protections that are only “secure unless there’s an attacker”.

I say this is a theme, because it is the unifying element between some of the year’s most high profile flaws.  There are two subclasses — some involve weak authentication migrating traffic from one location to another, while others involve weak authentication allowing an attacker to read or modify traffic migrated to him — but you’d have to have some pretty serious blinders to not see the unifying theme of weak authentication leads to pwnage.

Consider:

Luciano Bello’s Debian NRNG: This involves a core design requiring the generation of random numbers, but the random number generator required a random seed, but alas, the seed was made insufficiently random.  It’s an implementation flaw, but barely — and the effect was catastrophic failure against members of the X.509 PKI authentication system that had used the Debian NRNG, and thus by extension SSL’s encryption logic and OpenID (for Sun’s) authentication gateway.

Wes Hardaker’s SNMPv3 Bug: Here, we have an authentication protocol that allows an attacker to declare how many bytes he wants to have to correctly provide.  Now, the attacker can claim “just 1 please” — and he gets into any router suffering this bug within seconds.  That, by extension, allows control over all traffic traversing that router.
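The shape of that bug is easy to show in a few lines.  This is a sketch of the class of mistake, not the code of any actual SNMP agent: the function names are hypothetical, and HMAC-SHA1 truncated to 96 bits stands in for the SNMPv3 authentication code.

```python
import hashlib
import hmac

MAC_LEN = 12  # SNMPv3 authentication uses an HMAC truncated to 96 bits


def expected_mac(key: bytes, message: bytes) -> bytes:
    """The truncated HMAC a legitimate sender would attach to the message."""
    return hmac.new(key, message, hashlib.sha1).digest()[:MAC_LEN]


def verify_flawed(key: bytes, message: bytes, supplied: bytes) -> bool:
    """Buggy check: compares only as many bytes as the sender chose to supply."""
    return expected_mac(key, message)[: len(supplied)] == supplied


def verify_fixed(key: bytes, message: bytes, supplied: bytes) -> bool:
    """Insists on the full-length MAC and compares it in constant time."""
    return len(supplied) == MAC_LEN and hmac.compare_digest(
        expected_mac(key, message), supplied
    )


def forge_one_byte(key: bytes, message: bytes) -> int:
    """Count the guesses needed against the flawed check; never more than 256."""
    for guess in range(256):
        if verify_flawed(key, message, bytes([guess])):
            return guess + 1
    raise AssertionError("unreachable: one of the 256 byte values must match")
```

Against the flawed check a one-byte guess succeeds within 256 tries, while the fixed check rejects anything shorter than the full 12 bytes.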

Mike Zusman’s Insecure SSL-VPN’s: SSL is supposed to protect us, but there’s no sense creating a secure session to someone if you don’t actually know who they are.  Don’t worry though, by design anything that isn’t a web browser is terrifyingly likely to skip authentication entirely and just create an encrypted link to whoever’s responding.  One would think that SSL-VPN’s, whose sole purpose is to prevent attackers from accessing network traffic, would be immune.  But with 42% of certificates on the Internet being self-signed, and a lot of them being for SSL-VPN’s, one would be wrong.  By extension this auth failure exposes all traffic routed over these SSL-VPN’s.
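For a non-browser client, the difference between “encrypted” and “encrypted to someone you have actually identified” comes down to a couple of settings.  As a minimal sketch using Python’s standard `ssl` module (the function name `strict_client_context` is hypothetical), a client context that refuses unverified peers looks like this:

```python
import ssl


def strict_client_context() -> ssl.SSLContext:
    """Build a client-side context that authenticates the server.

    create_default_context() already requires a certificate that chains to
    a trusted root and matches the hostname; the assignments below simply
    make those two requirements explicit.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = True            # the name in the cert must match the peer
    ctx.verify_mode = ssl.CERT_REQUIRED  # no certificate chain, no session
    return ctx
```

A client that turns either setting off still gets an encrypted link, just not one that says anything about who is on the other end.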

Mike Perry’s Insecure Cookies: This gets interesting.  Here we have two different authentication protocols in place — one, from server to client, based on X.509.  The other, from client to server, based on a plaintext password (delivered, at least, over an encrypted session authenticated by the server-to-client cert).  But to prevent the user from needing to repeatedly type in their plaintext password, a password-equivalent token (or cookie) is handed to the user’s browser, which will be attached to every request within the securely encrypted channel.  Unfortunately, it’ll also be attached to every request which does not traverse the securely encrypted channel, because the cookies aren’t marked for secure-only.  Once the cookie leaks, of course, it’ll authenticate a bad guy who creates an encrypted session to that server.  So by extension bad guys get to play in any number of interesting sites.
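The missing piece is a single attribute on the cookie.  As a rough sketch with Python’s standard `http.cookies` module (the cookie name `session` and the helper name are made up for illustration), marking a cookie secure-only looks like this:

```python
from http.cookies import SimpleCookie


def session_cookie_header(token: str) -> str:
    """Return a Set-Cookie value for a session token restricted to HTTPS."""
    cookie = SimpleCookie()
    cookie["session"] = token
    # Without this flag the browser also attaches the cookie to plain HTTP
    # requests, where anyone on the path can read it.
    cookie["session"]["secure"] = True
    return cookie["session"].OutputString()


print(session_cookie_header("abc123"))
```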

My DNS flaw: Here we have a protocol that directly controls routing decisions, ultimately designed to authenticate its messages via a random number between 0 and 65535.  Guess the number, and change routing.  This was supposed to be OK, because you could only guess a certain number of times per day.  There was even an RFC entirely based around this time limit.  It turns out there’s a good dozen ways around that limit, allowing anonymous and even almost 100% packet spoofed compromise of routing decisions.  This, by extension, allowed exploitation of all traffic that was weakly authenticating.

It’s the same story, again and again.  And now, everyone’s talking about BGP.  So let’s do the same sort of analysis on BGP:

Kapela and Pilosov’s BGP flaw: In BGP, only the nearest neighbor is authenticated.  The concept is that all “members of the club” authenticate all other members, while the actual data they provide and distribute is trusted.  If it’s not actually trusted, anyone can hijack traffic from anyone else’s routes.

Pilosov’s done some cool work here.  It’s not the sort of devastating surprise some people seem to want it to be.  Indeed, that’s what makes it so interesting.  BGP was actually supposed to be broken, in this precise manner. Literally, in every day use, any BGP administrator has always had the ability to hijack anyone else’s traffic.  Pilosov has a new, even beautiful MITM attack, but as mine was not the first DNS attack, his is not the first BGP MITM.  Tales of using BGP to force traffic through a compromised router (possibly compromised through SNMPv3) are legion, and Javascript and the browser DOM blur things pretty fiercely in terms of the relevance of being able to pass through to the legitimate endpoint anyway.

That’s not to take away from the work.  It’s an interesting trick.  But we need to level set here:

First, if you’re not part of the BGP club, you’re just not running this attack.  Pakistan took out YouTube with BGP — but some random kid with the ability to spoof IP packets couldn’t.  In other words, we’re just not going to see a Metasploit module anyone can run to complete these sorts of attacks.  Now, there are some entertaining combinatorics that could be played — DNS to enable Java’s SNMPv3 access to internal routers at an ISP, and then from that internal router running the sort of BGP tricks Pilosov’s talking about.  This goes back to the utter folly of trying to rank these bugs independently from one another.  But these sort of combinatorics are at a fundamentally different level than the fire-and-forget antics that DNS allowed, and on a fundamental level, the number of potential attackers (and the number of involved defenders) on BGP is a lot lower.

Second, we have far better logging — and thus accountability — in the BGP realm than we do perhaps for any other protocol on the Internet.  Consider the archives at APNIC — yes, that’s route history going back to 1999 — and Renesys has even more.  That sort of forensic data is unimaginable for anything else, least of all DNS.  BGP may have its fair share of bad actors — consider spammers who advertise temporary ranges in unused space for mail delivery purposes, thus getting around blackholes — but any of the really nasty stuff leaves a paper trail unmatched by any other attack.

Third, BGP is something of a sledgehammer.  Yes, you’re grabbing traffic — but your control over exactly what traffic you grab is fairly limited.  Contrast that with DNS, which allows astonishingly fine grained targeting over exactly what you grab — indeed, you don’t even need to know in advance what traffic you want.  The victim network will simply offer you interesting names, and you get to choose on the fly which ones you’ll take.  These names may even be internal names, offering the impossible-with-BGP attack of hijacking traffic between two hosts on the exact same network segment.

Finally, BGP suffers some limitations in visibility.  Simply grabbing traffic is nice, but bidirectional flows are better than unidirectional flows, and when you pull something off via DNS, you’re pretty much guaranteed to grab all the traffic from that TCP session even if you stop any further poisoning attempts.  Contrast that with BGP, which operates at Layer 3 and thus may cause the IP packets to reroute at any point when the TCP socket is still active.

So, does that mean it’s always better to attack DNS than BGP?  Oh, you competitive people would like things to be so simple, wouldn’t you :) Pilosov and I talked for about a half hour at Defcon, and I’ve got nothing but respect for his work.  Let’s look at the other side of things for a moment.  First, BGP controls how you route to your name server — if not your recursive server, which may be inside your organization and thus immune to exterior routing protocol attack, then the authoritative servers your recursive servers depend on.  Something like this actually happened recently — witness the curious case of the Unauthorized L Roots, and note the astonishingly familiar potential attacks being described.  Yes, that’s precisely the scenario of BGP used to hijack root DNS servers — with such hijacking actually being noticed.

More importantly, much of my talk, in which I discuss the impacts of MITM attacks, applies to Kapela and Pilosov’s work as well.  It’s 2008, we still don’t have secure email, and that’s just as much of a problem in the face of BGP attacks as it is in the face of DNS attacks.

So, in summary, it’s an interesting side discussion regarding the similarities, differences, and overlaps between DNS and BGP attacks.   BGP has far fewer potential attackers, fewer necessary defenders, is a much less agile attack, and is way easier to monitor forensically (and indeed, with companies like Renesys, is being monitored forensically).  But so what?  It can work, and when it does, it can do much of the same damage we were afraid of via DNS.

We have now had three attacks, in one year, that underscore the fundamentally untrustworthy nature of routing.  DNS, BGP, and SNMPv3 all underscore the fact that the network should only be trusted as a best-effort data transmission system — that if you want to make sure everything’s OK, you can’t just assume — you need to cryptographically authenticate, you need to cryptographically encrypt, and you need to do these things to a level of security beyond “secure unless there’s an attacker.”

A lot of us — myself included, when I first started really looking at SSL — thought we were already distrusting the network.  We weren’t.  That’s what Mike Perry’s telling us, that’s what Mike Zusman’s telling us, and that’s what I’m telling you.

There are some real discussions to be had.  It’s 2008.  Where’s secure email?  Why is almost every autoupdater not from Microsoft thoroughly broken?  What is going on with non-browser network clients that can’t handle traffic from an untrusted server?  How are we going to migrate the web, and indeed all commercial network activity, to authenticated and encrypted protocols that respect the fundamentally untrustworthy nature of the network?

DNS vs. BGP vs. SNMPv3 is inside baseball.  The reality is as follows:

Weaknesses in authentication and encryption, some of which have been known to at least some degree for quite some time and many of which are sourced in the core design of the system, continue to pose a threat to the Internet infrastructure at large, both by corrupting routing, and making those corrupted routes problematic.

The question is what to do about it.

(That all being said, I’ll be writing shortly with an update on defenses against DNS.  There be news.)

My (Not So) Little Pwnie

:)

Experimental Mail Server Analyzer Online

I’ve modified the test scripts slightly, to allow arbitrary triggering agents (such as a mail server) to report back the quality of their DNS queries.  You may very well be surprised what NS’s your mail servers are configured to use.  More often than you’d think, people just don’t know.

SIGGRAPH 2008: The Quest for More Pixels

So, last week, I had the pleasure of being stabbed, scanned, physically simulated, and synthetically defocused. Clearly, I must have been at SIGGRAPH 2008, the world’s biggest computer graphics conference. While it usually conflicts with Black Hat, this year I actually got to stop by, though a bit of a cold kept me from enjoying as much of it as I’d have liked. Still, I did get to walk the exhibition floor, and the papers (and videos) are all online, so I do get to write this (blissfully DNS and security unrelated) report.

SIGGRAPH brings in tech demos from around the world every year, and this year was no exception. Various forms of haptic simulation (remember force feedback?) were on display. Thus far, the best haptic simulation I’d experienced was a robot arm that could “feel” like it was actually 3 pounds or 30 pounds. This year had a couple of really awesome entrants. By far the best was Butterfly Haptics’ Maglev system, which somehow managed to create a small vertical “puck” inside a bowl that would react, instantaneously, to arbitrary magnetic forces and barriers. They actually had two of these puck-bowls side by side, hooked up to an OpenGL physics simulation. The two pucks, in your hand, became rigid platforms in something of a polygon playground. Anything you bumped into, you could feel; anything you lifted would have weight. Believe it or not, it actually worked, far better than it had any right to. Most impressively, if you pushed your in-world platforms against each other, you directly felt the force from each hand on the other, as if there were a real-world rod connecting the two. Lighten up a bit on the right hand, and the left wouldn’t get pushed quite so hard. Everything else was impressive, but this was the first haptic simulation I’ve ever seen that tricked my senses into perceiving a physical relationship in the real world. Cool!

Also fun: This hack with ultrasonic transmitters by Takayuki Iwamoto et al, which was actually able to create free-standing regions of turbulence in air via ultrasonic interference. It really just feels like a bit of vibrating wind (just?), but it’s one step closer to that holy grail of display technology, Princess Leia.

Best cheap trick award goes to the Superimposing Dynamic Range (YouTube) guys. There’s just an absurd amount of work going into High Dynamic Range image capture and display, which can handle the full range of light intensities the human eye is able to process. People have also been having lots of fun projecting images, using a camera to see what was projected, and then altering the projection based on that. These guys went ahead and, instead of mixing a projector with a camera, they mixed it with a printer. Paper is very reflective, but printer toner is very much not, so they created a shared display out of a laser printout and its actively displayed image. I saw the effects on an X-Ray — pretty convincing, I have to say. Don’t expect animation anytime soon though :) (Side note: I did ask them about e-paper. They tried it — said it was OK, but not that much contrast.)

Always cool: Seeing your favorite talks productized. One of my favorite talks in previous years was out of Stanford — Synthetic Aperture Confocal Imaging. Unifying the output of dozens of cheap little Quickcams, these guys actually pulled together everything from Matrix-style bullet time to the ability to refocus images — to the point of being able to see “around” occluding objects. So of course Point Grey Research, makers of all sorts of awesome camera equipment, had to put together a 5×5 array of cameras and hook ‘em up over PCI express. Oh, and implement the Synthetic Aperture refocusing code, in realtime, demo’d at their booth, controlled with a Wii controller. Completely awesome.

Of course, some of the coolest stuff at SIGGRAPH is reserved for full conference attendees, in the papers section. One nice thing they do at SIGGRAPH however is ask everyone to create five minute videos of their research. This makes a lot of sense when what everyone’s researching is, almost by definition, visually compelling. So, every year, I make my way to Ke-Sen Huang’s collection of SIGGRAPH papers and take a look at the latest coming out of SIGGRAPH. Now, I have my own biases: I’ve never been much of a 3D modeler, but I started out doing a decent amount of work in Photoshop. So I’ve got a real thing for image based rendering, or graphics technologies that process pixels rather than triangles. Luckily, SIGGRAPH had a lot for me this year.

First off, the approach from Photosynth continues to yield Awesome. Dubbed “Photo Tourism” by Noah Snavely et al, this is the concept that we can take individual images from many, many different cameras, unify them into a single three dimensional space, and allow seamless exploration. After having far too much fun with a simple search for “Notre Dame” in Flickr last year, this year they add full support for panning and rotating around an object of interest. Beautiful work — I can’t wait to see this UI applied to the various street-level photo datasets captured via spherical cameras.

Speaking of cameras, now that the high end of photography is almost universally digital, people are starting to do some really strange things to camera equipment. Chia-Kai Liang et al’s Programmable Aperture Photography allows for complex apertures to be synthesized above and beyond just an open and shut circle, and Ramesh Raskar et al’s Glare Aware Photography evaded the megapixel race by filtering light by incident angle — a useful thing to do if you’re looking to filter glare that’s coming from inside your lens.

Another approach is also doing well: Shai Avidan and Ariel Shamir’s work on Seam Carving. Most people probably don’t remember, but when movies first started getting converted for home use, there was a fairly huge debate over what to do about the fact that movies are much wider (85% wider) than they are tall. None of the three solutions made everyone happy: Letterboxing (black bars on the top and bottom, to make everything fit), Pan and Scan (picking the “most interesting” square of video from the rectangular frame), or “Anamorphic” (just stretch everything). Letterboxing eventually won. I wonder what would have happened if this approach had been around. Basically, Avidan and Shamir find the “least energetic” line of pixels to either add or remove. Last year, they did this to photos. This year, they come out with Improved Seam Carving for Video Retargeting. The results are spookily awesome.
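For the curious, here’s roughly what finding the “least energetic” line of pixels can look like on a still image. This is a minimal sketch of the photo case only, not the video retargeting paper. The gradient-based energy function, the function names, and the tiny grayscale image are illustrative assumptions rather than anything taken from Avidan and Shamir’s code.

```python
def energy(img):
    """Assumed energy: sum of horizontal and vertical intensity differences."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            left, right = img[y][max(x - 1, 0)], img[y][min(x + 1, w - 1)]
            up, down = img[max(y - 1, 0)][x], img[min(y + 1, h - 1)][x]
            out[y][x] = abs(right - left) + abs(down - up)
    return out


def find_vertical_seam(e):
    """Dynamic programming: cheapest top-to-bottom path, moving at most
    one column left or right per row. Returns one column index per row."""
    h, w = len(e), len(e[0])
    cost = [row[:] for row in e]
    for y in range(1, h):
        for x in range(w):
            lo, hi = max(x - 1, 0), min(x + 1, w - 1)
            cost[y][x] += min(cost[y - 1][lo:hi + 1])
    # Backtrack from the cheapest cell in the bottom row.
    x = min(range(w), key=lambda i: cost[h - 1][i])
    seam = [x]
    for y in range(h - 1, 0, -1):
        lo, hi = max(x - 1, 0), min(x + 1, w - 1)
        x = min(range(lo, hi + 1), key=lambda i: cost[y - 1][i])
        seam.append(x)
    seam.reverse()
    return seam


def remove_seam(img, seam):
    """Drop one pixel per row, making the image one column narrower."""
    return [row[:x] + row[x + 1:] for row, x in zip(img, seam)]


if __name__ == "__main__":
    # Flat region on the left, a bright edge on the right.
    image = [[10, 10, 10, 90] for _ in range(3)]
    seam = find_vertical_seam(energy(image))
    print(seam, remove_seam(image, seam))
```

The seam runs through the flat region, so the bright edge survives the resize. Repeating the process narrows the image one column at a time.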


Speaking of spooky: Data-Driven Enhancement of Facial Attractiveness. Sure, everything you see is photoshopped, but it’s pretty astonishing to see this automated. I wonder if this is going to follow the same path as Seam Carving, i.e. photo today, video tomorrow.

Indeed, there’s something of a theme going on here, with video becoming inexorably easier and easier to manipulate in a photorealistic manner. One of my favorite new tricks out of SIGGRAPH this year goes by the name of Unwrap Mosaics. The work of Microsoft’s Pushmeet Kohli, this is nothing less than the beginning of Photoshop’s applicability to video — and not just simple scenes, but real, dynamic, even three dimensional motion. Stunning work here.

It’s not all about pixels though. A really fun paper called Automated Generation of Interactive 3D Exploded View Diagrams showed up this year, and it’s all about allowing complex models of real world objects to be comprehended in their full context. It’s almost more UI than graphics, but whatever it is, it’s quite cool. I especially liked the moment they’re like: heh, let’s see if this works on a medical model! Yup, works there too.

As mentioned earlier, the SIGGRAPH floor was full of various devices that could assemble a 3D model (or at least a point cloud) of any small object they might get pointed at. (For the record, my left hand looks great in silver triangles.) Invariably, these devices work like a sort of hyperactive barcode scanner, monitoring how long it takes for the red beam to return to a photodiode. But here’s an interesting question: How do you scan something that’s semi-transparent? Suddenly you can’t really trust all those reflections, can you? Clearly, the answer is to submerge your object in fluorescent liquid and scan it with a laser tuned to a frequency that’ll make its surroundings glow. Clearly. Fluorescent Immersion Range Scanning, by Matthias Hullin and crew from UBC, is quite a stunt.

So you might have heard that video cards can do more than just push pretty pictures. Now that Moore’s Law is dead (how long have we been stuck with 2GHz processors?), improvements in computational performance have had to come from fundamentally redesigning how we process data. GPUs have been one of a couple of players (along with massive multicore x86 and FPGAs) in this redesign. Achieving greater than 50x speed improvements over traditional CPUs on non-graphics tasks like, say, cracking MD5 passwords, they’re doing OK in this particular race. Right now, the great limiter remains the difficulty of programming the GPUs, and every month something new comes along to make this easier. This year, we get Qiming Hou et al’s BSGP: Bulk-Synchronous GPU Programming. Note the pride they have with their X3D parser: it’s not just about trivial algorithms anymore. (Of course, now I wonder when hacking GPU parsers will be a Black Hat talk. Short answer: Probably not very long.)

Finally, for sheer brainmelt, Towards Passive 6D Reflectance Field Displays by Martin Fuchs et al is just weird. They’ve made a display that’s view dependent — OK, well, lenticular displays will show you different things from different angles. Yeah, but this display is also illumination dependent — meaning, it shows you different things based on lighting. There’s no electronics in this material, but it’ll always show you the right image with the right lighting to match the environment. Weird.

All in all, a wonderfully inspiring SIGGRAPH. After being so immersed in breaking things, it’s always fun to play with awesome things being built.

On The Flip Side

What was once possible via 32,769 packets is still possible via somewhere between 134,217,728 and 4,294,967,296 packets.  Yep.  We’ve been saying that for a while now.  So has PowerDNS.  So has DJBDNS.  There’s nothing specific to BIND here, though I think most people understand that.

What’s going on here is a simple question:  Which would you rather build secondary layers of defense against?  Thousands of packets?  Or billions of packets?

Look.  We were looking at an attack before the patch that took ten seconds and was relatively invisible.  Four billion packets are many things, but subtle is not one of them.
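The arithmetic behind those packet counts is worth seeing on one screen. This is a back-of-the-envelope sketch, not a measurement.  It assumes a 16-bit transaction ID, and it assumes the 134,217,728 and 4,294,967,296 figures correspond to roughly 11 and 16 bits of usable source port entropy.  The packets-per-second rate is a purely hypothetical number for illustration.

```python
TXID_BITS = 16  # width of the DNS transaction ID


def search_space(port_bits: int) -> int:
    """Combinations of (source port, transaction ID) a blind spoofer must cover."""
    return 2 ** (TXID_BITS + port_bits)


def seconds_to_send(packets: int, packets_per_second: int) -> float:
    """Time to emit that many spoofed replies at a steady rate."""
    return packets / packets_per_second


if __name__ == "__main__":
    HYPOTHETICAL_PPS = 100_000  # assumed attacker send rate, illustration only
    cases = [
        ("fixed source port", 0),
        ("~11 bits of port entropy (assumed)", 11),
        ("full 16 bits of port entropy", 16),
    ]
    for label, port_bits in cases:
        space = search_space(port_bits)
        # On average an attacker hits after covering about half the space;
        # half of 65,536 plus one is presumably where 32,769 comes from.
        expected = space // 2 + 1
        print(f"{label}: space={space:,} expected={expected:,} "
              f"time={seconds_to_send(expected, HYPOTHETICAL_PPS):,.1f}s")
```

Whatever rate you plug in, the ratio is the point: the patched search space is thousands to tens of thousands of times larger, which is the difference between quiet and very, very loud.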

So there’s a reason you’re not hearing anyone saying, “don’t patch.”  And there’s a reason we’ve been telling everyone this is just a stopgap — that we’re still in trouble with DNS, just less.  But in business, we choose our risks every single day.  That’s why it’s called risk management, not risk elimination. 

And for the most part, people seem to get it.  Even the story I was somewhat worried about when I’d heard about it, John Markoff’s piece in the New York Times, is a remarkably fair treatment of the issue.  Back in March, we needed to come up with a solution to this problem that could viably protect as many people as possible in a short period of time.  DNSSEC has been in progress for nine years.  Asking people to deploy it over the course of a month would not have been a pragmatic approach.

DNSSEC may be the long term fix.  It certainly was not the short term fix.

But can we do better than source port randomization?  Possibly.  Now come the wild arguments about what we should really do to fix this issue.  That’s fine by me.  That was the idea.  But nobody should ever think that they have the One True Fix.  I’ve been out here arguing against the 65536-to-1 lottery that fixed source port DNS is.  That’s not to say I haven’t been analyzing all the other designs too, but I have to prioritize things that are in the field, putting customers at risk.  I think I’ve argued pretty persuasively that there’s no way 65536-to-1 can ever again offer a sufficient level of security.  I don’t know if anyone disagrees with that.

I’m looking forward to better options than source port randomization, even if it means we’ll accept the occasional Gig-E local LAN desktop flood attack hitting our 134M-to-1 to 4B-to-1 mitigations.

It’s the call DJB made all those years ago.  It was the right call to make.

Now, let’s talk about how entertaining a problem this is going to be to solve, to any degree past what DJB accomplished back in 1999.  Note, I’m not saying any final solution won’t use elements of everything that’s about to follow.  I’m just saying there are awesomely nasty attacks against everything, and people shouldn’t presume I or others won’t poke sucking chest wounds into seemingly elegant solutions.

First, the universal constraint on every solution is that it must cover the root servers and the TLDs, because they’re almost always a better target for poisoning: their position higher in the DNS hierarchy allows them to pollute any name below them.  In other words, you can totally opt foo.com into whatever security system you like, but unless A.GTLD-SERVERS.NET (the server for com) is itself secure, and unless the root servers that tell you where A.GTLD-SERVERS.NET is are also included in the solution, there’s no effective security whatsoever.  DNSSEC, minus the DLV hack, suffers this specific issue, and so does everything else.  You either need to be backward compatible all the way up the hierarchy, a trait that port randomization and some other solutions have, or you need to push code to them.

It’s not an impossible proposition to get the root and TLD servers to modify their infrastructure.  But that’s not the sort of thing you can make happen via a secret meeting at Microsoft :)  It’s a definite negative if they have to change anything.

One solution I’ve sort of liked, believe it or not, is the 0x20 proposal.  Basically, this idea from (I think) David Dagon and Paul Vixie notices that DNS is case insensitive (lower case is equal to upper case) but case preserving (a response contains the same caps as a request).  So, we can get extra bits of entropy by asking for wWw.DOXpaRA.cOM, and ignoring replies containing www.DOXPARA.com, WWW.doxpara.COM, etc.  0x20 has some notable corner cases, though: the shorter the domain, the less security can be guaranteed.  This is particularly problematic if you’re attacking the roots or TLDs, especially considering the ability to have almost 100% numeric names, like:

1.a11111111
1.a11111112
1.a11111113
1.a11111114
1.a11111115
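Here’s a small sketch of the 0x20 idea and of why those mostly numeric names hurt it.  This is not code from the proposal itself.  The function names are made up for illustration, and it works on plain strings rather than real DNS packets.

```python
import secrets


def encode_0x20(name: str) -> str:
    """Randomly flip the case of every ASCII letter in the query name."""
    return "".join(
        (ch.upper() if secrets.randbits(1) else ch.lower())
        if ch.isascii() and ch.isalpha() else ch
        for ch in name
    )


def reply_matches(sent_name: str, reply_name: str) -> bool:
    """Accept a reply only if it echoes the exact mixed-case name we sent."""
    return sent_name == reply_name


def entropy_bits(name: str) -> int:
    """Each letter contributes one bit; digits, dots and hyphens contribute none."""
    return sum(1 for ch in name if ch.isascii() and ch.isalpha())


if __name__ == "__main__":
    for name in ("www.doxpara.com", "1.a11111111"):
        print(name, "->", encode_0x20(name), f"({entropy_bits(name)} extra bits)")
```

A name like www.doxpara.com has thirteen letters to play with.  A name like 1.a11111111 has exactly one, so an attacker who picks the names gets to pick how little 0x20 helps.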

Another path that’s suggested is to double query (“debounce”, as one engineer suggested).  Debouncing is similar to the “run all DNS traffic over TCP” approach: it seems good, up until the moment you realize you’d kill DNS dead.  There’s a certain amount of spare capacity in the DNS system, but it is finite, and it is stressed.  There is absolutely not enough to handle a 100% increase in traffic over the course of a month.

Now raised is the possibility of “attack modes”: large scale state transitions during periods where the server recognizes (it’s not exactly subtle) that it’s under attack.  These have a lot of potential, except for the reality that they create something of a super amplification attack:  An attacker invests a small number of packets to push every name server on the net into attack mode, and the DNS infrastructure implodes.

Those wouldn’t be very good headlines.

That’s not to say there aren’t more targeted mitigations.  This is the sort of work Nominum has been doing with their server, to attempt to prevent individual targeted names from being poisoned.  The problem here is that as soon as the attack mode isn’t global, it becomes interesting for the attacker to repeatedly migrate out of the small range that’s in “lockdown” into a new target range.

And there are so many ranges.  There are dozens of variations on the attacks I’m presenting here.  For just one example, I may have shown off how to attack www.foo.com via 83.foo.com, but that doesn’t mean it’s not useful to just attack 83.foo.com.  The web security model, to varying degrees, trusts arbitrary subdomains as elements of their parents.  This was ultimately how I was able to rickroll the Internet at Toorcon.
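To make the subdomain point concrete, here is a simplified sketch of the kind of domain matching a browser applies when deciding whether a cookie goes along with a request.  It’s an approximation for illustration, not any browser’s actual code, and the function name is hypothetical.

```python
def cookie_applies(request_host: str, cookie_domain: str) -> bool:
    """True if a cookie scoped to cookie_domain would be sent to request_host.

    Simplified rule: the host is the domain itself, or ends with
    "." + domain.  Real browsers layer more checks on top of this.
    """
    host = request_host.lower().rstrip(".")
    domain = cookie_domain.lower().strip(".")
    return host == domain or host.endswith("." + domain)


if __name__ == "__main__":
    for host in ("www.foo.com", "83.foo.com", "evilfoo.com"):
        print(host, cookie_applies(host, "foo.com"))
```

Nothing in that check asks whether 83.foo.com was ever a real host.  If the attacker can make the name resolve, the parent’s cookies come along for the ride.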

All the rate limiting approaches have issues with attacks outside the limited range — with the added bonus that somebody’s not getting a reply, a nasty trait that makes attacks on downstream customers of data much more feasible.

Eventually, people realize we could use a better source of entropy, perhaps a prefix on each query name (XQID) or an extra RR (resource record) containing a cookie.  Now we need cooperation from the authoritative side of the DNS house.  This is tricky, precisely because while it’s one thing to be proud of “50%+ of the net has patched”, it’s quite another to say “well you can reach half the domains out there…”  The solutions I’ve seen do all have a story for backwards compatibility, storing/caching whether a given name server does or doesn’t support their particular variant.

But again, if the root servers, or the com servers, are not signed up for this system, there’s no incremental benefit to it:  The attacker can prevent your nice and secure name server from ever being used by other resolvers in the first place.

And so, we end up at cryptography.  DNSSEC is one approach.  I don’t think I need to go into the pragmatic, political issues that have made this so difficult.  From an engineering standpoint though, if we don’t have the headroom for TCP, do we really have the headroom for any cryptography?  Maybe.  DNSSEC is not the only possible trick either.  Link-based crypto, either via DTLS (keyed via the existing PKI, using the NS name as the Subject) or some TKEY/TSIG dance, could also work.
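The shared-key flavor of link-based crypto boils down to attaching a keyed MAC to each message.  The sketch below shows only that core idea using Python’s standard hmac module.  It is not the TSIG wire format, it skips key exchange and replay protection entirely, and the key, message bytes, and function names are placeholders I’ve assumed for illustration.

```python
import hashlib
import hmac


def sign(message: bytes, key: bytes) -> bytes:
    """Compute a keyed MAC over the raw message bytes."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(message: bytes, tag: bytes, key: bytes) -> bool:
    """Constant-time check that the tag matches the message under this key."""
    return hmac.compare_digest(sign(message, key), tag)


if __name__ == "__main__":
    shared_key = b"hypothetical-shared-secret"      # placeholder key
    reply = b"www.foo.com. 300 IN A 192.0.2.1"      # stand-in for a DNS reply
    tag = sign(reply, shared_key)

    print(verify(reply, tag, shared_key))
    forged = b"www.foo.com. 300 IN A 203.0.113.66"  # spoofed answer, same tag
    print(verify(forged, tag, shared_key))
```

An off-path attacker who guesses the port and transaction ID still has to produce a valid tag, which means guessing the key rather than winning a lottery.  The cost is exactly the headroom question above: every reply now carries a MAC that somebody has to compute and check.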

So, there’s lots of options.  Lots, and lots, and lots of options.  But, throughout this entire process of analysis, one thing was very clear:

DJB was right.  Almost every attack we find is strongly mitigated by source port randomization.  Mitigated, not eliminated, but mitigated just the same.  He may not have known how exactly to break BIND or MSDNS in 10 seconds in the real world; frankly, if he did, he’d have told us.  But he knew there had to be a way, as Hans Dobbertin knew in 1996 that eventually somebody was going to break MD5.  When Wang finally came out with her MD5 collisions in 2004, it wasn’t a surprise: MD5 had been federally decertified for years.  But it was still a pretty big deal, since, certifications aside, MD5 is everywhere.

Fixed source port DNS was everywhere.  Less so now.  I’m indescribably amazed and honored by that.  That’s a lot of hours by a lot of IT guys we’re looking at here.  I’m sure there are some pretty happy pizza shops right now.  But let’s be clear: there are bad guys in the field, and they are using this attack in interesting ways.  People who are patched are much, much safer than people who are not.

Finally, as important a question as “how should DNS really be fixed” is, I think the real question of the day is “why does DNS matter so much?”.  As Halvar Flake’s first post put it (“What, doesn’t everyone assume their gateway is owned, and thus use SSL/SSH?”), the underlying instability is the continuing assumption that there is a difference between networks that are hostile and networks that are safe.  DNS is a great way to exploit that delusion, especially behind firewalls, but SNMPv3 and BGP both enable all the attacks I’ve found here.  Even if we go from 32 bits of entropy to 128 bits, even if we deploy DNSSEC, we’re still going to deliver email insecurely.  We’re still going to have an almost entirely unauthenticated web.  We’re still going to ignore SSL certificate errors, and we’re still going to have application after application that can’t autoupdate securely.

That, at the end of the day, is a far larger problem than this particular DNS issue.

Summaries

Very nice summary of the “How” part of my talk here.

I do think “Why does DNS matter this much?” is a more important question.  It’s 2008 — why can I still not email securely between companies?  It’s a little sad that such a simple and basic bug can:

1) Break past most username/password prompts on websites, no matter how the site is built.
2) Break the Certificate Authority system used by SSL, because Domain Validation sends an email and email is insecure.
3) Expose the traffic of SSL VPNs, because heh, who needs to check certificates anyway.
4) Force malicious automatic updates to be accepted.
5) Cause millions of lines of totally unfuzzed network code to be exposed to attack.
6) Leak TCP and UDP connectivity behind the firewall, to any website, in an attack we thought we already fixed twice now.
7) Expose the traffic of tools that aren’t even pretending to be secure, because “it’s behind the firewall” or “protected by a split-tunneling IPsec VPN”.

It’s just DNS cache poisoning.  Why does it get to do this much damage? 

The whole “hostile vs. safe” network myth needs to die.  Every network is hostile — the DNS bug just made true something that should already have been assumed, but wasn’t.  And we need to get faster and better at fixing the infrastructure.  Using things until the moment of catastrophic failure — be they bridges, DNS, or MD5 — is a problem, and we can do better.

FX of Phenoelit made an important point a while back — everything you can do with this DNS attack, you can do with SNMPv3.  If you haven’t patched your routers — and that includes your internal routers, since Java’s giving UDP access out and you can thus issue SNMP queries with it (not their fault, the entire web security model collapses when DNS is broken and this is just yet another break) — you should probably do that too.

It’s going to be an interesting couple of months.  We’re going to see a lot of blended/combination attacks, as attacks we thought were infeasible in the real world suddenly start proving themselves entirely viable (at least, given insecure infrastructure).  The previously unfuzzed network clients are probably going to be particularly problematic – if you write a network app that is not a web browser, now is a good time to start feeding random (or even better, semi-random) data to it and switching the autoupdater to SSL.  New attacks are already popping up, only a few days in.  Ben Laurie just came out with a harrowing and beautiful advisory against some common OpenID deployments.  I knew about the intersection of DNS and OpenID, and I knew about the intersection of DNS and Debian’s badly generated certs (a problem which, I’d like to point out, is much harder to patch due to our continuing lack of an effective certificate revocation infrastructure).  But it took Ben Laurie to attack “Secure” OpenID providers using Debian Certs via DNS.  Fantastic, excellent work.

Best Thing Ever

Fake Dan Kaminsky is the best thing evar.  Mad bonus points for the Root Server Gas Pump.  I can’t even wrap my mind around how many shots I owe my crew right about now.

Pretty Pictures

Wow, this is pretty cool  :)  Post in comments if I should throw on the HD version.

Red — Unpatched
Yellow — Patched, but the NAT is screwing things up
Green — OK

(Update: HD Version, thanks to Clarified Networks!)
