A secure computing platform is overdue
Computers are insecure. They’re insecure at every layer. CPUs are insecure. Kernels are insecure. Apps are insecure. Third-party libraries are insecure. Networks are insecure. And so on.
We’ve had decades of hacks, costing an estimated $180 billion a year in the US alone. Worldwide, we’ve already wasted trillions of dollars on bad computer security.
We’re building on blocks like OSs and programming languages that were designed before security became a crisis, and are therefore insecure, and we’re trying to retrofit security onto them. That produces only limited benefits. Security needs to be designed in from day one.
Unfortunately, most attempts at improving security are hobbled by the need to remain compatible with legacy hardware, networks, apps or OSs. To break free from that constraint, we’ll redesign the entire platform end to end, not just one component like the programming language. For example, an OS can be made far more secure if it needs to support only our own managed language rather than arbitrary existing apps. We’ll aim for a global optimum, not a local one.
Here, I’ll take an ambitious and aggressive stance, prioritising security over other attributes like resource consumption, cost or backward-compatibility. We’ll make decisions that trade off resources for security. Environments where every dollar matters are not our target market. Similarly, if some of our choices consume more power, and that excludes battery-powered devices (phones, tablets, laptops) from our target market, that’s fine. Neither will we worry about backward-compatibility, whether with existing software, hardware or networks, or even with older versions of our own platform. Whenever there’s a tradeoff between security and X, we’ll choose security.
Engineering is all about tradeoffs, and in this post, I want to explore tradeoffs different from the ones today’s platforms make. Taking a fresh look at a problem while subjecting ourselves to the same old constraints would force us to reach more or less the same conclusions as before, defeating the point of the entire exercise.
This makes our platform suitable for only some users: the minority who genuinely put security at the top of the list when asked what they want. It won’t be a mass-market platform. Your next phone or laptop won’t run this platform. Neither will the cheapest cloud server you can rent on Digital Ocean. The developers of a podcast app are unlikely to choose this platform for their backend, since it doesn’t handle especially sensitive information. And so on.
So, who is it for? First, any system whose failure or compromise can kill people: medical devices that control body functions or inject insulin, for example. If a patient is in an emergency in a hospital, and his records are stored only electronically, you don’t want him to die because the system was held for ransom. You don’t want your plane to crash because of faulty avionics software, or hackers to remotely take control of your car and disable its brakes or yank the steering wheel to one side at high speed, killing you. Any kind of critical infrastructure, like a city’s water supply, a metro network, or lifts, would also be a good candidate for a secure computing platform. Or industrial software that could kill workers standing next to malfunctioning equipment. Sensitive installations like dams, nuclear reactors, nuclear fuel enrichment plants or defence installations. Organisations that are high-value targets, such as security agencies like the CBI or FBI. Individuals who are high-value targets, like Edward Snowden. And so on. There’s a long list of use cases where security deserves a higher priority than it gets today.
If our design decisions are too aggressive even for these markets, we can always back some of them off later. That’s better than preemptively watering down our design for imagined performance or compatibility concerns.
So, let’s see how the system will be designed:
Managed code
Everything will be written in managed code: apps, the OS, device drivers, everything, except the absolute minimum that can’t be. The language won’t have C-like raw pointers, because raw pointers undermine every other security guarantee. This means either garbage collection or automatic reference counting [1]. Array accesses will be bounds-checked. Variables will be compulsorily initialised to avoid leaking data.
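To make this concrete, here’s what those two guarantees look like in Go, an existing managed language whose behaviour is close to what I have in mind (our hypothetical language would behave similarly):

```go
package main

import "fmt"

func main() {
	// Variables are always initialised: buf starts as ten zero bytes,
	// never as whatever garbage happened to be at that memory location.
	var buf [10]byte
	fmt.Println(buf[0]) // prints 0

	// Array accesses are bounds-checked. This read doesn't silently
	// return a neighbouring object's memory; the runtime stops the
	// program with an index-out-of-range panic instead.
	i := 12
	fmt.Println(buf[i]) // panics: index out of range [12] with length 10
}
```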
As much of the standard library as possible will be implemented in managed code. For example, a Java implementation might implement ArrayList in Java, or in native code. The former is more secure, since you can’t use memory after freeing it, double-free it, access some other object’s memory, and so on. The standard library is written by humans, and humans make mistakes, even the best of us, so the standard library will be in managed code. (This has nothing to do with VMs vs ahead-of-time compilation: if Java didn’t run on a VM and compiled to native code, the same principle would apply. In fact, such compilers exist.)
The standard library implementation will also run with the privileges of application code, and not be all-powerful. If an app invokes an XML parser that’s part of the standard library, and the app itself doesn’t have access to modify some files, neither will the parser. The XML parser is part of the standard library just for convenience. That shouldn’t make it any more privileged than one you write yourself or download from Github.
As another example, the Go garbage collector is written in a restricted subset of Go that doesn’t allocate dynamically. This again protects the GC from the security vulnerabilities that would be possible if it were written in C or C++.
You should be able to easily sandbox a third-party library, so that a buggy XML parser won’t be able to delete files on the disk. Or, if you’re using an ad library in your app, the ad library should be prevented from misusing your contacts permission to upload contacts to its own server.
If a library uses another library, it should be possible to have a nested sandbox, further reducing the second library’s access compared to the first’s. For example, if a database library like SQLite uses an SQL parser, the SQL parser shouldn’t be able to read files, while the database itself can and needs to. In this way, the language should permit and encourage splitting a program’s functionality into multiple nested sandboxes, each with the minimum access needed to do its job.
The outermost sandbox is the app sandbox itself, which the OS uses to limit what the app can do.
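Here’s a rough sketch, in Go, of what a nested-sandbox API might look like. All the names (Sandbox, Capability, Child) are hypothetical; the point is that a child sandbox can only ever hold a subset of its parent’s capabilities:

```go
package sandbox

// A hypothetical capability-style API: every sandbox holds an explicit
// set of capabilities, and a nested sandbox can only ever hold a
// subset of its parent's. None of these types exist today; they sketch
// the shape of the API our platform might expose.

type Capability string

const (
	CapReadFiles  Capability = "files.read"
	CapWriteFiles Capability = "files.write"
	CapNetwork    Capability = "network"
)

type Sandbox struct {
	caps map[Capability]bool
}

// Child returns a nested sandbox restricted to the requested
// capabilities, silently dropping any the parent itself lacks.
// A library can therefore never gain access its caller didn't have.
func (s *Sandbox) Child(requested ...Capability) *Sandbox {
	child := &Sandbox{caps: map[Capability]bool{}}
	for _, c := range requested {
		if s.caps[c] {
			child.caps[c] = true
		}
	}
	return child
}

func (s *Sandbox) Has(c Capability) bool { return s.caps[c] }
```

In this model, the database library would run its SQL parser in a child sandbox created with no capabilities at all, so even a parser bug can’t touch the disk or the network.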
OS
As I mentioned earlier, as much of the OS as possible will be written in managed code.
Apps will be sandboxed, without requiring memory protection support from the hardware to do so. In other words, instead of letting apps write to arbitrary memory locations, and then using address spaces to contain them, apps won’t be able to write to other apps’ memory. Don’t make something insecure and then build another layer that tries to control what the first one can do. Instead make it secure to begin with.
Address spaces will still be used as a second line of defence. If a process accesses memory it shouldn’t, or makes a system call it shouldn’t, or otherwise misbehaves, that means a bug was found in the managed code implementation, so the team would treat it as a mission-critical bug, stop what they’re doing, and investigate what went wrong and how to prevent it from happening again. This would be the equivalent of someone opening the emergency door of a plane that has stopped. The airline wouldn’t shrug it off on the grounds that the plane had already stopped, so no harm was done. Likewise, a process going bad shouldn’t happen, and will be treated as a major breakdown of security if it does.
We might run each app in a virtual OS as a third line of defence. This virtual OS would have only one job: to forward system calls to the real OS. Again, if the VMM finds the virtual OS doing something bad, that will be treated as a severe breakdown of two levels of security: the managed code implementation, and the process boundary within the virtual OS.
In addition to apps, components of the kernel will also be sandboxed, so that a bug in the TLS stack will be able to drop or corrupt network traffic, but not let attackers read memory, as happened with Heartbleed. Our OS will be a microkernel, in the sense that subsystems that traditionally reside in the kernel, like the network stack and filesystems, are separated out and sandboxed from each other and from the kernel.
If any OS APIs turn out to be insecure, we will give app developers a month or a quarter’s notice and then remove these APIs. Not deprecate them, but remove them, or change them to do nothing, or to always return an error code. This is different from, say, C, where strcpy is still around despite being known for decades to be insecure. This particular problem won’t occur in managed code, but if it hypothetically did, we’d change strcpy to be a no-op, or delete it and let apps that use it fail to compile. Developers who can’t be bothered to secure their app even after being given notice can have their apps break.
Apps will require permissions for things like networking, contacts and so on, similar to Android. On install, an app will start with zero permissions, and will have to prompt the user for each one. But an app will be allowed to prompt only for permissions specified in its metadata. That way, if an app that shouldn’t access your contacts is compromised, or if the developers are sneaky, or it’s malware masquerading as a genuine app like Notepad, it would crash when it asked the OS to prompt the user for contacts access. And the app would be flagged as compromised and quarantined. It would no longer run until an admin removed it from quarantine.
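A minimal sketch of how this could look to the OS, with hypothetical names (App, RequestPermission, quarantine):

```go
package perms

// Hypothetical sketch of the permission model. The names App,
// RequestPermission and quarantine are invented for illustration.

import "errors"

type Permission string

const (
	PermContacts Permission = "contacts"
	PermNetwork  Permission = "network"
)

// Declared lists the only permissions the app may ever prompt for,
// taken from its metadata. Granted starts empty: installing an app
// confers zero permissions.
type App struct {
	Declared map[Permission]bool
	Granted  map[Permission]bool
}

var ErrUndeclared = errors.New("permission not declared in app metadata")

// RequestPermission prompts the user only for declared permissions.
// Asking for anything else is treated as evidence of compromise: the
// app is quarantined rather than merely refused.
func (a *App) RequestPermission(p Permission, promptUser func(Permission) bool) error {
	if !a.Declared[p] {
		quarantine(a)
		return ErrUndeclared
	}
	if promptUser(p) {
		a.Granted[p] = true
	}
	return nil
}

func quarantine(a *App) {
	// In the real OS this would kill the process and mark the app as
	// compromised, so it won't run until an admin clears the quarantine.
}
```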
Storage
Each app will have its own sandboxed filesystem, like on iOS or Android. It won’t be able to access anything outside it unless it was explicitly shared by another app or authorised by the user via a File Open dialog box.
Files on disk will be encrypted by default. Perhaps you’ll be able to opt out where encryption isn’t needed. This brings up a general principle: security is effective only if it’s the default, and you have to take action to turn it off when needed, rather than it being off by default and requiring people who want it to turn it on. An oversight, which will eventually happen, should lead to more security, not less.
File permissions, whether among apps or users, will be implemented by sharing the encryption key. If you don’t have the key, you can’t access the data, even if you were to somehow break the filesystem’s access control. Likewise, if an app tries to tamper with a file owned by some other app, or a user tries to tamper with a file owned by some other user, even if they managed to modify the raw bytes of the file, since they don’t have the key, it won’t decrypt, and the actual owner will immediately be alerted to the tampering.
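A sketch of the underlying mechanism, using standard authenticated encryption (AES-GCM here, though the platform could pick any AEAD scheme): without the key you can’t read the file, and any modification of the ciphertext makes decryption fail, which is the signal to alert the owner.

```go
package filecrypt

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
)

// Sketch of per-file authenticated encryption. Sharing a file with
// another app or user means sharing its key; without the key the
// ciphertext is useless, and any tampering is detected on decryption.

func Seal(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key) // key: 32 bytes for AES-256
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Prepend the nonce; GCM adds an authentication tag covering the data.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

func Open(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	// If anyone flips even one bit of the file, authentication fails
	// here; the filesystem can alert the owner instead of returning garbage.
	return gcm.Open(nil, nonce, ct, nil)
}
```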
Apps will be checksummed. If an app tries to modify itself, or malware tries to modify it, the app will no longer run.
This goes for the OS as well—the OS image will be checksummed, so it can’t be modified. This is an improvement over desktop OSs like Windows, macOS and Linux that let root modify parts of the OS.
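A minimal sketch of the check; in the real system the expected hash would itself be signed by the app store or the secure-boot chain rather than stored as a plain string:

```go
package verify

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"os"
)

// Before launching an app (or booting an OS image), hash it and compare
// against the hash recorded at install time. A self-modifying app, or
// one patched by malware, fails the check and is refused.

func CheckImage(path, expectedHex string) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	sum := sha256.Sum256(data)
	expected, err := hex.DecodeString(expectedHex)
	if err != nil {
		return err
	}
	if subtle.ConstantTimeCompare(sum[:], expected) != 1 {
		return fmt.Errorf("%s has been modified; refusing to run", path)
	}
	return nil
}
```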
Open-source
The OS code will be open for everyone to see and improve.
Apps will all be distributed in source form. The OS will install apps only from source, not as binaries.
Apps will have a license granting users the right to see the code, fix vulnerabilities and otherwise improve security, and share the diffs with other users. The license will also forbid obfuscation. The license won’t grant users the right to add new features, improve the UI, make it faster, port it to a new platform, use it as a baseline to build their own app, and so on. They can only make security improvements.
Developers will still be able to charge for apps. If you’re concerned that someone could use a modified version of the app free rather than paying for the original, the license will let you distribute only patches, not the entire source code, and patches are useless to someone who doesn’t have the baseline code to apply them to. And the only legal way to get the baseline code of a paid app will be to pay for it. In addition, the legal terms will say that diffs can be licensed only to those who have licensed the app to begin with. That way, it will be illegal for someone to distribute the modified app to people who haven’t bought the original, and illegal for users to use such an app. It will be no different from downloading pirated apps today.
This strikes a good balance between users’ need to audit the code for vulnerabilities and make security improvements to their apps, and developers’ desire to avoid handing their code to competitors. This still won’t work for all developers, and that’s fine. As long as our platform is successful, economics will drive the necessary apps to be built, if necessary at a higher price than apps on other platforms. In the limit, big customers can contract an outsourcing company to build the apps they need.
What would prevent a developer who doesn’t agree with this position from distributing his source under a license that doesn’t give the user the right to make security improvements? The OS will, if possible, be distributed under a license that makes it illegal for such apps to link with the OS. When the developer uploads the app to the store, the store can require the developer to first sign an agreement requiring them to distribute apps only under the above conditions. In fact, when a developer downloads the developer tools like the IDE, we can require them to agree to these terms.
Distributing apps as source also lets us avoid having to define a bytecode or a binary format, and then be locked in to it, preventing security improvements that could be made if backward-compatibility were not a concern. We might still use a bytecode format, but that will be an implementation detail. We’ll always be able to update it and recompile the source to the new bytecode format. Or we might skip having a VM and have the installer compile source directly to binary the traditional way, as C compilers work. These will all be implementation details.
App store
PCs traditionally ran all programs, except those blacklisted by an antimalware app. Blacklisting doesn’t work. The only secure option is whitelisting, which means an app store. The OS will run only apps distributed from an app store, as on iOS.
If a developer finds a security vulnerability in his app, he’ll be able to release an update and blacklist the vulnerable version. Everyone who has installed a vulnerable version will be immediately upgraded. In other words, the OS checks with the app store every time an app is run, not just during installation [2].
In addition to particular versions being insecure, if the developer himself is found to be a bad actor, he’ll be blacklisted, and his apps will be remotely uninstalled. Or quarantined, preventing them from running until an admin overrides the quarantine for a month at a time.
You’ll be able to run your own app store, which is just a cloud service that speaks a standard protocol with the OS, and configure your OS to talk to it. So when we say that the OS installs apps only from an app store, that doesn’t have to lock you in or give a vendor control over you [3].
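Here’s a sketch of what that launch-time check against the store might look like. The endpoint path and JSON fields are invented; the point is that the protocol is simple enough for anyone to implement their own store:

```go
package store

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Hypothetical launch-time check against the configured app store.
// The OS asks "is this exact version still allowed?" on every run, so
// a version blacklisted after a vulnerability stops launching immediately.

type VersionStatus struct {
	Allowed         bool   `json:"allowed"`
	ReplacedBy      string `json:"replaced_by,omitempty"` // version to auto-upgrade to
	Reason          string `json:"reason,omitempty"`
	DeveloperBanned bool   `json:"developer_banned"`
}

func CheckBeforeLaunch(storeURL, appID, version string) (*VersionStatus, error) {
	// storeURL is expected to be https://; the OS refuses plaintext anyway.
	resp, err := http.Get(fmt.Sprintf("%s/apps/%s/versions/%s/status", storeURL, appID, version))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var st VersionStatus
	if err := json.NewDecoder(resp.Body).Decode(&st); err != nil {
		return nil, err
	}
	return &st, nil
}
```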
Hardware
So far, we haven’t discussed a very important aspect — the hardware. You can’t build a secure platform on insecure hardware.
We’ll use a licensable ISA like ARM or POWER, not a closed one like x86. You can get a license to build an ARM CPU, but not an x86 CPU, which rules x86 out. Further, we’ll draw on the research that has been done on ISA security: we’ll take a baseline ISA, remove its insecure aspects, and redesign it for security. If a CPU instruction is found to be vulnerable, we can update the managed code implementation to stop using it, and then push out a microcode update to the CPU to disable the instruction.
Further, any CPU firmware, like the Intel Management Engine, will be open-source, like the OS, and users will be able to permanently disable it. Likewise, drivers and the firmware of other devices, like network cards, will be open-source. We’ll require DMA to go through an IOMMU for security. And yes, peripheral vendors for whom this doesn’t work will be excluded from our platform.
Networking
The OS won’t permit any unencrypted network connections, even LAN connections. Everything must go through TLS or some alternative encryption. In the worst case, we can use self-signed certificates, but even that is better than no encryption, since it protects against passive snooping. Mandatory encryption means that plain HTTP websites or API calls won’t work, and HTTP resources in HTTPS sites won’t load. File-sharing protocols that don’t use encryption, like earlier versions of SMB and NFS, won’t work.
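Concretely, the only connection primitive the OS exposes to apps might look like this sketch (the Dial function and its policy are hypothetical; tls.Dial is Go’s real API):

```go
package netpolicy

import (
	"crypto/tls"
	"errors"
	"net"
)

// Sketch of the "no unencrypted connections" rule: the only dial
// primitive apps get returns a TLS connection. There is no plain-TCP
// equivalent to fall back to.

func Dial(addr string, allowSelfSigned bool) (net.Conn, error) {
	cfg := &tls.Config{
		MinVersion: tls.VersionTLS12,
		// Accepting self-signed certificates is a last resort: weaker
		// than a verified certificate, but still better than plaintext,
		// since it defeats passive snooping.
		InsecureSkipVerify: allowSelfSigned,
	}
	conn, err := tls.Dial("tcp", addr, cfg)
	if err != nil {
		return nil, errors.New("refusing to fall back to plaintext: " + err.Error())
	}
	return conn, nil
}
```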
DNSSEC will be mandatory, and domains that don’t use it won’t resolve.
We will also improve web security, dropping support for the worst parts of the web platform. If only a quarter of web sites work properly on our platform, that’s fine.
If Ethernet has flaws, like being vulnerable to ARP spoofing, we’ll replace it with a secure variant. And so on.
Conclusion
This is what an OS designed first and foremost for security looks like. We free ourselves of constraints like cost, performance and backward-compatibility with existing apps, OSs, hardware and networks, and instead prioritise security.
This platform makes so many tradeoffs for security that it will initially be appropriate only for especially security-conscious uses like critical infrastructure, defence, and systems that will cause someone to die if they fail. If our platform succeeds, we can then take lessons from it to apply to our mainstream platforms, or build a watered-down version of our platform that is less secure but still far more secure than today’s platforms.
[1] Automatic reference-counting doesn’t let you manually increment or decrement reference counts. The language implementation does that for you. This means there’s no way to prematurely free an object while still holding a pointer to it. It prevents use after free and double-free bugs, guaranteeing the heap’s integrity. It does let you leak memory by creating a strong reference cycle, but that’s not a security bug. It’s only a reliability bug, and out of scope for us.
Alternatively, all references could be zeroing references, which means that the reference is set to null when the target is deallocated. Then we can expose primitives to increment and decrement the reference count, or even an explicit dealloc call like in C, but while guaranteeing heap integrity.
[2] This takes care of updating apps, but for security, the libraries those apps use also need updating. Today that responsibility is borne by developers: they have to track vulnerabilities in every library they use. If there are M apps and N libraries, that’s M×N checks, which doesn’t scale. The app store should let library developers list their libraries as separate items on the store, and apps can declare dependencies on them. If a particular version of a library is found to have a vulnerability, the library’s developer can release an update and flag the earlier version as insecure. The app store will then update that library for every app that depends on it.
For this to work, library developers must never make incompatible changes, like removing insecure APIs or changing their behaviour in ways that contradict the documentation. If such changes are required, the developer can list the new version as a separate library altogether, like “openssl” and “openssl-2”. This is similar to how iOS developers with a paid app charge users for an upgrade by listing it as a separate app.
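The store-side data model for this could be as simple as the following sketch (the type names are invented):

```go
package store

// Hypothetical manifest types: libraries are first-class items on the
// store, and apps declare dependencies on them instead of bundling
// copies. When a library version is flagged insecure, the store can
// update every dependent app without each developer doing the M×N
// bookkeeping themselves.

type LibraryDependency struct {
	Name       string // e.g. "openssl"; an incompatible rewrite ships as "openssl-2"
	MinVersion string // oldest version the app is known to work with
}

type AppManifest struct {
	ID           string
	Version      string
	Dependencies []LibraryDependency
}
```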
[3] You can build an app store that permits all apps from some other app store, plus your organisation’s internal apps. Or one that permits only apps that have been vetted by your security team in addition to being approved by the other store. That is, you can have an AND or an OR. You can do all this server-side.