Humanizing Automation

After two crashes of the Boeing 737 Max 8, the planes have been grounded due to concerns about their automated systems. Whether or not MCAS turns out to be the root cause, this is but the latest example of a major technological trend: computers making mistakes when controlling systems that humans used to control.

When we automate control, we trade human mistakes for system errors. When an automated plane crashes into the ground or an autonomous car runs over a human, the natural answer is to put the human back in charge. After all, wouldn’t a person have avoided tragedy?

Such an answer, though, misses how things have changed. People generally cannot take over at a moment’s notice – they don’t have situational awareness when they aren’t doing the task full time. Further, people may not even be capable of making timely decisions because the system was designed for automated control.

Automated systems are going to take erroneous actions, and people will not be able to take over and fix things. However, that does not mean that we have to blindly accept those errors.

It is easier when a person can notice the problem in a timely fashion. Imagine if you were a passenger in a plane that was flying towards the ground and you happened to be in the cockpit. Even if you had no knowledge of how a plane works, you’d be able to see that something was wrong and be able to tell the pilots (forcefully) that they were doing something very bad. We can imagine controls that would convey the same information to an automated system.

However, what if there is nobody to notice the problem?

We see this problem frequently in computer security. Whether you are trying to protect one or a thousand computers, nobody wants to constantly monitor defense systems. We tune systems so that they exhibit low false positives, which largely prevents them from shutting down legitimate activities. The reduced sensitivity of such settings, however, also means that they miss many forms of attacks.
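This tuning trade-off is easy to see in a toy example. Below is a minimal sketch with made-up traffic numbers (the distributions and thresholds are invented for illustration): raising the alarm threshold cuts false positives on normal activity, but lets more attacks slip through.

```python
import random

random.seed(0)

# Hypothetical request rates: normal traffic vs. attack traffic.
normal = [random.gauss(100, 15) for _ in range(1000)]
attacks = [random.gauss(160, 30) for _ in range(1000)]

def evaluate(threshold):
    """Return (false positive rate, miss rate) for a simple threshold detector."""
    false_positive_rate = sum(x > threshold for x in normal) / len(normal)
    miss_rate = sum(x <= threshold for x in attacks) / len(attacks)
    return false_positive_rate, miss_rate

for threshold in (120, 140, 160, 180):
    fp, miss = evaluate(threshold)
    print(f"threshold={threshold}: false positive rate={fp:.3f}, miss rate={miss:.3f}")
```

Pick the threshold that keeps the alarms quiet and you have chosen, implicitly, which attacks you will never see.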

If automated systems are to serve human interests, they must always attempt to achieve human-specified goals, even when there is no human in the loop. Automation often fails, though, when it tries too hard to achieve the goals that have been set for it. Runaway loops of goal seeking are often the root cause of disaster. Consider a stock trading program that keeps trying to make money by trading even when every trade loses money – it never gives up, and so it keeps making the problem worse until outside forces intervene. Similarly, in the 737 Max 8 accidents, we seem to have had pilots directly countering the actions of automated systems, and the automated systems kept fighting back until the plane hit the ground.
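The difference between a runaway loop and one that notices its own failure can be sketched in a few lines. This is a hypothetical trading bot, not any real system:

```python
def trade_until_broke(balance, trade_result, max_steps=100):
    """Naive goal-seeking loop: keeps trading no matter the consequences."""
    for _ in range(max_steps):
        balance += trade_result()
    return balance

def trade_with_circuit_breaker(balance, trade_result, max_losses=3, max_steps=100):
    """Same loop, but it notices repeated failure and stops to seek guidance."""
    consecutive_losses = 0
    for _ in range(max_steps):
        result = trade_result()
        balance += result
        consecutive_losses = consecutive_losses + 1 if result < 0 else 0
        if consecutive_losses >= max_losses:
            break  # admit the strategy is failing; escalate to a human
    return balance

losing_trade = lambda: -10  # every single trade loses money

print(trade_until_broke(1000, losing_trade))           # -> 0   (rode it all the way down)
print(trade_with_circuit_breaker(1000, losing_trade))  # -> 970 (gave up after 3 losses)
```

The circuit breaker is crude, but it encodes exactly the capability the runaway systems lacked: a way to conclude "this isn't working" and stop.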

People can observe a situation and realize their actions aren’t achieving their goals. People may get frustrated, bored, angry, or scared; very rarely will they be oblivious as to whether or not their choices are succeeding. Automated systems, however, are generally built to “do what they are told” no matter the consequences.

We will never make automated systems perfectly accurate. But, perhaps we can build systems that know when they are failing and can seek outside guidance. Sometimes, admitting a problem is the most intelligent thing any system can do.


Buzzword propagation

I just wanted to briefly comment on today’s Ars Technica article entitled “Tech firms want to save the auto industry—and the connected car—from itself.”  Specifically, there is this quote near the end:

Symantec’s Anomaly Detection starts off learning what “normal” is for a particular model of car during the development process, building up a picture of automotive information homeostasis by observing CANbus traffic during production testing. Out in the wild, it uses this profile of activity to compare that to the car it’s running on, alerting the Symantec and the OEM in the event of something untoward happening.

Yes, they are doing anomaly detection, i.e. “automotive information homeostasis.”  How about that.

Two things I should point out, though.  First, their solution adds 6% overhead.  That seems awfully high, especially since program behavior on a car should be relatively easy to model.  I wonder what learning algorithms they are using?

Second, they are using static profiles in production.  Static profiles can certainly be tuned to reduce false positives; however, such a choice also guarantees no profile diversity and thus removes one of the key advantages of anomaly detection, namely the possibility of detecting zero-day attacks on some systems (through profile diversity).
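For flavor, here is a minimal sketch of what a static, learned-in-the-factory profile might look like. The message IDs are invented, and a real system would model far more than ID membership (timing, payloads, sequences):

```python
# Hypothetical training phase: observe CAN message IDs during production testing.
training_traffic = ["0x100", "0x1A0", "0x2F0", "0x100"] * 250
profile = set(training_traffic)  # static profile shipped to every car of this model

def is_normal(frame_id):
    """Flag any message ID never seen during training."""
    return frame_id in profile

assert is_normal("0x100")
assert not is_normal("0x7DF")  # e.g., an unexpected diagnostic request
```

Because the same profile ships to every car of the model, an attack that evades it on one car evades it on all of them – exactly the loss of diversity that per-vehicle learning would avoid.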

But hey, it is certainly a step forward.


Automating defense versus offense

I just took a look at the Cyber Grand Challenge, a DARPA-sponsored event that will showcase systems that can play CTF (capture the flag) autonomously.

This event scares me.

Developing automated attackers and automated defenders might appear to be a way to develop techniques to automatically harden software.  Let the bots slug it out in the simplified, safe environment of the challenge and then, once they’ve proven themselves, turn them loose in the real world to defend real systems (or, at least, adapt their techniques to build practical defenses).

I am certain it won’t work out this way.

The attack techniques developed will generalize and will be very good at finding flaws in real systems.  The defensive techniques, however, will not generalize to the real world.  Thus the outcome of this challenge will be even better ways to attack systems and little improvement in protecting systems.

This difference will occur because in the real world defenses have to work while protecting normal system functioning.  The hard part about defense is not stopping the attacks, it is stopping the attacks while keeping your systems up and your users happy.  (After all, you can always stop the attack by just turning your system off.)  Sure, CTF competitions feature services that have to be kept running; these services are nothing like real-world services though, even when they are running the same binaries, simply because they aren’t being required to provide real services to real users.

Simulating real-world behavior accurately is equivalent to building a detector for anomalous behavior.  If you know what makes it “real”, you know what doesn’t belong.  It thus is not easy to do.  Past efforts in computer security to simulate realistic computer behavior for testing purposes have failed miserably (e.g., see John McHugh’s critique of the late 1990’s DARPA intrusion detection evaluations).

The Cyber Grand Challenge makes no effort to simulate a realistic environment; in fact, it was designed to emphasize reproducibility and determinism, two qualities production systems almost never have.  In this sort of environment it is easy to detect attacks, and easy to verify that a response strategy does not harm the defended services.

The attackers are playing a game that is very close to what real-world attackers face.  The defenders, however, are facing a much simplified challenge that leaves out all of the really hard aspects of the problem.  Note this even goes for software patching, as the hard part of patching is making sure you didn’t miss any corner cases.  When legitimate traffic has no corner cases, you can get away with being a lot sloppier.

On the attack side, things are clearly working when you have systems that can find vulnerabilities that weren’t inserted intentionally (slide 37).  On the defense side, I didn’t see any novel defenses, and I don’t expect to see any, at least none that would ever work in practice.

Attacking is easy, defending is hard.  Automating defense is fundamentally different from automating attacks.  Only when we accept the true nature of this difference will we be able to change the balance of power in computer security.


Code Zombies

[This post was inspired by a discussion last week in my “Biological Approaches to Computer Security” class.]

Let’s talk about living code and dead code.

Living code is code which can change and evolve in response to new requirements.  Living code is a communications medium between the programmers of the past and those of the present.  Together, they collaborate on specifying solutions to software problems.  The more alive the code, the more active this dialog.

Dead code*, in this context, is code that is not alive.  It does not change in response to new requirements.  Dead code is part of a conversation that ended long ago.  Some dead code is truly dead and buried.  The code for Windows 1.0 I would characterize as being dead in this way.  Other dead code, however, still walks the earth.

I call these entities code zombies.  Others call them legacy code.

Code zombies died a long time ago.  The programmer conversations they were part of have long ended, and nobody is left who can continue them from where they left off.  Nobody understands this code, and nobody can really change it without almost rewriting it from scratch.  Yet this code is still run, is still relied upon.

Look around you – you’re surrounded by code zombies.  If you run commercial, proprietary software, you are probably running a lot of zombie code.  If you run open source, there are many fewer zombies around – but they do pop their heads up every so often.

Enterprises devote huge resources to maintaining their zombies.  Zombies aren’t good at taking care of themselves, and the repair jobs are often gruesome.  Sometimes a zombie needs to be brought back to life.  This can be done, at great effort and expense.  The result, however, is Frankenstein code: it may live, but boy is it not pretty, and it may turn around and bite you.

And here’s a funny thing: zombie code is insecure code.  Tamed zombies aren’t fussy about who they take orders from.  Living code, however, is part of a community that works to keep it safe.

I predict that the software industry will transform once enough people realize the costs of keeping zombies around outweigh the benefits.

* I know “dead code” has other meanings in computer science.

War is the exception

The imagery and terminology of war pervade computer security.  Intrusions, vulnerabilities, attackers, defenders – they are all militaristic.  While such terms may be useful, that does not mean we should think we are at war on the Internet.  I say this for a very simple reason: war is always the exception.

Life as we know it is on pause when we are at war.  The rules that govern our productive lives – those that allow us to create, trade, and raise families – are all suspended when we are fighting for those lives.  The only good thing about war is its ending.  To be at war is to be fighting so we can be at peace.

This is what the computer security community must come to grips with: we are not at war today on the Internet.  If we were, then people would not be conducting business, socializing, learning, and falling in love online; instead, we would all be fighting for our (virtual) lives.

Now, it is true that in real war life does go on in some fashion; nevertheless, war is defined by fear, and this fear infuses even the most mundane aspects of life.  While people are wary, they are not living in fear on the Internet.  We are at peace online.  And this is a good thing!

Why this matters in a computer security context is that what is appropriate for war is not appropriate for peacetime.  Arming the populace for cyber warfare can help prepare us for war; preparation for war, however, is often the surest way to destroy the peace.

Sacrifices that we willingly make in war are loathsome otherwise.  To forget this is to forget the greatest benefit of peace: the relative absence of fear.  Our job as computer security researchers and professionals should not be to spread fear, but rather to protect people from fear.

Our job is to keep the peace.

Statistical Mirages

I have to admit that I’ve always been suspicious of statistics as they are used in computer security.  Oddly enough, I also should have been suspicious of statistics in other sciences as well.  Turns out that when you examine large datasets with lots of tests, simply by random chance you are likely to find something, i.e., one of the tests is likely to come up with something “significant.”  Hence you get results that suggest breathing in bus exhaust is good for you.  In other words, you get false positives.
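You can watch this effect happen with a few lines of simulation: run many tests on pure noise and some of them will look “significant.” (The 0.5 cutoff below roughly corresponds to p < 0.05 for these sample sizes; all numbers are invented for illustration.)

```python
import random

random.seed(1)

def null_experiment(n=30):
    """Compare two samples drawn from the SAME distribution; any difference is noise."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    return abs(sum(a) / n - sum(b) / n)

# Run 1000 independent tests where the true effect is exactly zero.
hits = sum(null_experiment() > 0.5 for _ in range(1000))
print(f"'significant' results out of 1000 null tests: {hits}")
```

Roughly five percent of the tests come up “significant” even though nothing real is there; an intrusion detector trained and evaluated the same way will happily report the same mirages.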

If anomaly detection is ever to become a fundamental defense technology, it will have to move beyond statistics to being grounded in the mechanisms of computers and the real behaviors of users.  This is going to take a while, because this is a lot harder than just running a bunch of tests on datasets.  Of course, given the current disrepute of anomaly detection in security circles, perhaps the door is wide open for better approaches.

RIGorous Cell-level Intrusion Detection

When looking to biology for inspiration on computer security, it is natural to first look at mammalian immune systems.  While they are amazingly effective, they are also mind-numbingly complex.  As a result, many researchers get seduced by the immune system’s architecture when there is so much to learn from its lower-level workings.

Case in point: every cell in the human body can detect many kinds of viral infection on its own, i.e., with no assistance from the cells of the immune system.  As this recent article from Science shows, we are still far from understanding how such mechanisms actually work.  My high-level take on this article, as a computer security researcher, is that:

  • Basically all cells in mammals (and, I think, most animals in general) can generate immune signals that trigger responses from both internal and external mechanisms.  A key source for such signals is foreign RNA (code) inside the cytoplasm of a cell.  Of course, there is a lot of other, “self”-RNA in that cytoplasm as well – so how does the cell tell the difference between them?
  • A key heuristic is that native RNA is only copied in the nucleus of a cell; RNA-based viruses, however, need to make RNA copies in the cytoplasm (that’s where they end up after getting injected and it isn’t easy to get into the nucleus – code basically only goes out, it hardly ever goes in).  RNA polymerases (RNA copiers) all use the same basic patterns to mark where copying should start.  Receptors such as RIG-I detect RNA with “copy me” signals (5′-PPP) in places where no copying should occur (the cytoplasm).
  • Of course, this is biology, so the picture isn’t so clear-cut.  A simple “copy me” signal won’t trigger a response; there must also be some base pairing – the RNA molecule must fold back on itself or be bound to another (partially complementary) RNA molecule.  I’d guess this additional constraint is there because normal messenger RNA is strictly single-stranded.   (Indeed, kinks or pairing in messenger RNA are bad in general because they’ll interfere with the creation of proteins.)
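As a computing analogy (and only an analogy – the field names below are my own invention), the multi-signal logic in these bullets looks like a detector that refuses to fire unless every suspicious condition holds at once:

```python
def rig_i_alarm(rna):
    """Toy sketch of RIG-I-style detection: alarm only when ALL suspicious
    conditions hold, which keeps false positives extremely low."""
    has_copy_me_marker = rna.get("five_prime_ppp", False)   # 5'-PPP "copy me" signal
    wrong_compartment = rna.get("location") == "cytoplasm"  # copying belongs in the nucleus
    base_paired = rna.get("double_stranded", False)         # normal mRNA is single-stranded
    return has_copy_me_marker and wrong_compartment and base_paired

normal_mrna = {"five_prime_ppp": False, "location": "cytoplasm", "double_stranded": False}
viral_rna = {"five_prime_ppp": True, "location": "cytoplasm", "double_stranded": True}

assert not rig_i_alarm(normal_mrna)  # self-RNA never trips the alarm
assert rig_i_alarm(viral_rna)        # a replicating RNA virus does
```

Note that any single condition alone is ignored; that conjunction is the false-positive discipline the cell cannot afford to relax.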

Of course, all of this is partial information – there’s evidence that these foreign RNA-detecting molecules (the RLR family) trigger under other additional constraints.  This doesn’t surprise me either, as this mechanism must operate with extremely low false positives; one or two matching rules aren’t up to the task given the complexity of cellular machinery and (more importantly) given the evolution of viruses to circumvent these protections.  Viruses have evolved ways to shut down or suppress RLR-related receptors.  Although cells will be pushed to evolve anti-circumvention mechanisms, in practice this is limited in the cellular environment: make the detectors too sensitive and a cell will kill itself spontaneously!  The solution has been to keep a circumventable but highly accurate detector in place; the arms race instead has moved to optimizing the larger immune system.

I leave any conclusions regarding the design of computer defenses as an exercise for the reader. 🙂

DNSSEC and the Financial Meltdown

This week in our group meeting Alex gave a presentation about the status of DNSSEC.  DNSSEC is supposed to improve the security of the Domain Name System (DNS) by cryptographically signing DNS responses.  Thus, with DNSSEC, you can be sure that when you visit a Google site, you are visiting a machine (IP address) that is actually associated with Google, rather than some random attacker’s website.  Recently a number of DNS vulnerabilities have been found that make it very easy (under some circumstances) to forge DNS responses, so the security case for DNSSEC would appear to be very strong.  By the end of our discussion, however, we had reached a very different conclusion.  Let me explain.

First, let’s assume that DNSSEC is adopted in record time, say within the next year – 95%+ penetration on secured servers and clients.  Next, let’s assume that all the major security problems with DNSSEC – such as the lack of key revocation – have been resolved.  In this hypothetical world, we would now have a DNS infrastructure hardened against forgery attacks.  Mission accomplished, right?  Maybe not.  In fact, I think there’s a good chance that we would actually be in worse shape than we are now.  Things would be worse because the Internet would become both less reliable and less secure.  These problems would arise precisely because of the success of DNSSEC.

The key insight is that a successful DNSSEC would inevitably kill SSL certificates; instead, SSL would just use the keys conveyed by DNSSEC.  Why bother maintaining two sets of cryptographic credentials when you can get away with one?  Once this happens, the incentives for breaking DNSSEC become enormous.

And break it they will, because DNS admins at all levels have minimal experience safeguarding cryptographic credentials – they know how to keep servers running, not how to keep secrets.  The first priority with DNS will always be availability, and such availability in DNS means that entries have to be changed with short notice.  Therefore, many more people will have access to domain signing keys than should from a security perspective.  Thus, attackers will get the keys.  And those keys will be trusted even more than SSL certificates, because they will be used to block network connections.
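The key-compromise worry can be made concrete with a toy model. The sketch below uses an HMAC over a name/address pair as a stand-in for DNSSEC’s public-key signatures (the key, name, and addresses are placeholders): forgery without the key is caught, but a stolen key mints records that verify perfectly.

```python
import hashlib
import hmac

ZONE_KEY = b"zone-signing-key"  # hypothetical secret held by the DNS operator

def sign_record(name, ip, key):
    """Sign a name->address binding (HMAC stands in for a DNSSEC signature)."""
    sig = hmac.new(key, f"{name}={ip}".encode(), hashlib.sha256).hexdigest()
    return {"name": name, "ip": ip, "sig": sig}

def verify(record, key):
    expected = hmac.new(key, f"{record['name']}={record['ip']}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["sig"])

good = sign_record("example.com", "93.184.216.34", ZONE_KEY)
assert verify(good, ZONE_KEY)

# Tampering without the key is caught...
forged = dict(good, ip="203.0.113.66")
assert not verify(forged, ZONE_KEY)

# ...but an attacker who steals the key mints records that verify perfectly.
stolen = sign_record("example.com", "203.0.113.66", ZONE_KEY)
assert verify(stolen, ZONE_KEY)
```

Once such signatures also stand in for SSL certificates, a single leaked zone key silently redirects “secure” connections, and nothing in the protocol will complain.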

So, in an effort to secure DNS, we will make DNS less reliable (because it will be harder to make timely updates) and we’ll make the Internet less secure (because connections to secure websites will be authenticated using much less reliable signatures).

We currently have some faith in cryptographic credentials because they are issued by parties that value their reputation for security (because their business depends upon it).  With DNSSEC, instead, we’re going to take organizations who have a reputation for reliability – DNS registrars – and we’re going to give them a fundamental security responsibility that detracts from their core mission of reliability.  The parties implementing DNSSEC actually have significant incentives to trade off security for reliability, even though everybody else on the Internet will have an increasing requirement that DNS be secure (because it will replace SSL certificates).

So, what’s the connection to the financial meltdown?  Well, that meltdown can be, in part, attributed to mismatches between incentives and expectations of trustworthiness.  Credit rating agencies were expected to look out for buyers of securities but were paid by the sellers of securities.  Developers of mortgage-backed securities expected banks to continue to make loans as they had in the past, even though mortgage-backed securities gave banks every incentive to be careless when giving out loans – if the loans went bad, they wouldn’t lose any money because they’d sold the loan to somebody else!

New technologies, whether DNSSEC or financial securitization, inevitably have secondary effects on human decision making.  Technologists must realize that tools designed to increase security or manage risk can, in practice, lead to reduced security or disastrous levels of risk.

Part of the meltdown can also be attributed to everyone in the financial world having too much faith in the models underlying structured financial instruments (such as mortgage-backed securities).  Those models became untrustworthy the moment they were used to create structured financial instruments, because such instruments removed the incentives for banks to be careful when giving out loans (they no longer faced the risk of mortgage defaults).

In our discussion, Luc made an interesting point: reliability and security are generally in conflict in practice, yet system designers keep wanting to achieve both at the same time.  I think there’s a deep insight here, but I’m not quite sure what it is.  All I do know is that if we’re going to get both reliability and security, we need more flexible ways to manage the trade-offs between them.