In my experience, most huge systems have:
- Started out with a large, relatively complex, but not huge kernel.
- Grown significantly larger, and locally more complex, due to feature growth.
- Been broadly redesigned several times in large efforts led by senior engineers who understood the majority of the system.
I have in-depth knowledge of a few huge systems, most notably Google’s Borg. Borg is huge - both in terms of deployment (basically all of Google’s datacenters), and in terms of complexity (I would estimate that around 2000 years of engineering work went into making Borg what it is now).
Large and complex kernel.
Borg, like most complex systems, was complex from its initial design, but nowhere near as complex as it is today. It began with a core idea that could be described in a page of text. To summarize:
- We want to automatically and dynamically schedule workloads into a cell of thousands of individual machines.
- The scheduling will be done by a central “master”, maintaining the state of the whole cell (as a master-replica replicated state).
- Workloads will specify their CPU, disk and RAM requirements, as well as a “priority” that is used by the master to evict lower-priority workloads to make space as needed.
- We will oversubscribe machines, putting lower-priority workloads into space that higher-priority work has reserved but is not using.
- The assessment of how many resources a workload is using, as well as the reaction when oversubscription causes a resource shortage, will be done by a superuser-level on-machine agent (Borglet).
There’s more, but I have been able to describe all the basic ideas in the Borg system to an audience of university CS students in an hour-long lecture.
This initial idea was implemented, probably within an engineering-decade, by engineers who knew what they were doing (in particular, who had experience with Borg’s predecessor, WorkQueue).
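To make the core idea concrete, here is a toy sketch of the first three points in the list above: a central scheduler that places workloads on machines by their requested CPU and RAM, and evicts lower-priority workloads when there is no free space. It is not Borg’s code or API. All the names (Task, Machine, schedule), the first-fit placement and the "higher number wins" priority convention are assumptions made for illustration, and it leaves out disk, oversubscription and the Borglet entirely.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare tasks by identity, not by field values
class Task:
    name: str
    priority: int   # assumption of this sketch: higher number = more important
    cpu: float      # requested CPU cores
    ram_gb: float   # requested RAM


@dataclass
class Machine:
    name: str
    cpu: float
    ram_gb: float
    tasks: list = field(default_factory=list)

    def free(self):
        """Return (free_cpu, free_ram) based on what tasks have reserved."""
        used_cpu = sum(t.cpu for t in self.tasks)
        used_ram = sum(t.ram_gb for t in self.tasks)
        return self.cpu - used_cpu, self.ram_gb - used_ram


def schedule(task, machines):
    """Place task on a machine, evicting lower-priority tasks if needed.

    Returns (machine, evicted_tasks), or (None, []) if nothing works.
    """
    # First pass: first machine with enough unreserved capacity.
    for m in machines:
        free_cpu, free_ram = m.free()
        if task.cpu <= free_cpu and task.ram_gb <= free_ram:
            m.tasks.append(task)
            return m, []

    # Second pass: free up space by evicting strictly lower-priority
    # tasks, lowest priority first, until the new task fits.
    for m in machines:
        free_cpu, free_ram = m.free()
        victims = sorted(
            (t for t in m.tasks if t.priority < task.priority),
            key=lambda t: t.priority,
        )
        evicted = []
        for v in victims:
            if task.cpu <= free_cpu and task.ram_gb <= free_ram:
                break
            free_cpu += v.cpu
            free_ram += v.ram_gb
            evicted.append(v)
        if task.cpu <= free_cpu and task.ram_gb <= free_ram:
            m.tasks = [t for t in m.tasks if t not in evicted]
            m.tasks.append(task)
            return m, evicted

    return None, []


# A two-machine "cell", filled with low-priority work, then a
# high-priority task arrives and pushes one of the batch tasks out.
cell = [Machine("m1", cpu=4, ram_gb=16), Machine("m2", cpu=4, ram_gb=16)]
schedule(Task("batch-a", priority=0, cpu=4, ram_gb=8), cell)
schedule(Task("batch-b", priority=1, cpu=4, ram_gb=8), cell)
placed_on, evicted = schedule(Task("serving", priority=10, cpu=2, ram_gb=4), cell)
print(placed_on.name, [t.name for t in evicted])
```

A real scheduler would also have to reschedule the evicted work somewhere else and pick machines far more carefully than first-fit; the sketch only shows how little is needed to express the basic priority-and-eviction idea.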
Adding more features.
A large part of what happened later was feature growth. SSD was added as a resource. It turned out that the initial assumption - that if a spinning drive breaks, all workloads that were using this spinning drive should be considered dead - was incorrect, and codepaths to allow surviving disk loss were added. Proactive rescheduling to bin-pack the cell better was added.
Additionally, an ecosystem was built around the basic core of the master and the borglet - introspection tools, a configuration language, automation to predict the resource requirements (instead of having people enter them manually in a config), quota management systems, cross-cell scheduling, and many, many more, each one built to address a specific need.
Finally, the system was improved as we went. The initial count-everything-once-every-second resource accounting was gradually replaced with kernel-based mechanisms (cgroups). We overhauled error reporting, because our users couldn’t make sense of what they were seeing. Read-only GUIs were added at both the Borgmaster and Borglet level, and then a separate cross-cell UI was built with improved capabilities.
All of these are major systems or functionalities that expand Borg and make it larger and more complex - but they do not necessarily make it deeper. That is, you can still reason about the core system and its APIs with only a superficial knowledge of all these features and ideas.
Major rearchitectures.
Most of the junior engineers working on Borg do not understand the whole Borg system in depth. They work on, say, Borglet (where I worked), and maybe even on a more specific part, like the memory management system of the Borglet, and add features there; while they do have a rough understanding of the core ideas behind Borg, they have probably never touched the master code, let alone any of the satellite systems.
However, I believe that the continued success of such large systems also depends on having several senior engineers who understand how the whole system, or large parts of it, works, and which parts of it are bending under the strain of new requirements and features. If something is failing to scale - either because usage of the system keeps growing, or because the number of features added in a specific area has made it hard to maintain - a larger-scale change of architecture might be needed.
Note that the boundary between a “local refactoring” and a “senior-engineer-level global change” is not a hard one. While there are changes that sweep through the whole Borg system - I can name two offhand - there are also major architecture changes that happened to, say, the Master (like moving its state to Paxos) or to Borglet (I described one of them here: As a software engineer, why did you receive a promotion?). The ability to make these broad changes to the system’s architecture (without the system falling apart in the process) is, I believe, critical for a huge, complex system to maintain the ability to grow.