
Note: this is slightly biased toward the problems of scalable online processing systems (mostly data storage and messaging). As such, I may be leaving out papers on other (equally important) topics such as HPC and security in distributed systems, as well as important papers in theoretical areas such as Byzantine fault tolerance and distributed algorithms.

I am also not a researcher in this area and may be misrepresenting some important theorems: please comment or suggest edits in case of mistakes.

This is not an exhaustive list. Henry's list contains several additional papers that are a must read for anybody interested in building practical distributed systems.

"Foundational" papers

Leslie Lamport, "Time, Clocks and Ordering of Events in a Distributed System" http://research.microsoft.com/en-us/um/people/lamport/pubs/time-clocks.pdf

This paper is important because it provides a way to reason about ordering in a distributed system. Concepts such as state machine replication and version vectors also trace back to this paper.

The concept of logical time also provides a "toolkit" that can be used to prove and disprove theorems in distributed systems.
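To make the ordering rules concrete, here's a tiny sketch of a Lamport clock in Python (the class and method names are mine, not the paper's):

```python
# A minimal Lamport logical clock, following the rules in the paper:
# increment on each local event, and on receive take
# max(local, received) + 1.

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send_event(self):
        # A send is a local event; the timestamp travels with the message.
        self.time += 1
        return self.time

    def receive_event(self, msg_time):
        # Advance past both our own clock and the sender's timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time

# Usage: timestamps induce an order consistent with causality.
a, b = LamportClock(), LamportClock()
t = a.send_event()    # a: 1
b.receive_event(t)    # b: max(0, 1) + 1 = 2
```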

Consensus, reliable/atomic multicast

A big problem in distributed systems is consensus: how do you ensure that all machines agree upon a certain value? An important special case of this problem is atomic multicast, i.e., agreeing on the contents of a message log, e.g., a WAL in a database system.

Jim Gray, "Notes on Data Base Operating Systems"
http://research.microsoft.com/~Gray/papers/DBOS.pdf

This paper introduces two-phase commit. It complements the rather elegant idea of state machine replication by providing a way for multiple nodes in a replicated system to agree on the order of messages (and thus end up in the same end state). A heavily simplified coordinator sketch follows.
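Here's that sketch of the 2PC coordinator; the Participant interface (prepare/commit/abort) is hypothetical, and a real coordinator must also durably log its decision so it can recover after a crash:

```python
# Illustrative two-phase commit coordinator (no durable log, no retries).

def two_phase_commit(participants):
    # Phase 1: ask every participant to prepare (vote).
    try:
        votes = [p.prepare() for p in participants]
    except Exception:
        votes = [False]

    if all(votes):
        # Phase 2a: everyone voted yes -- commit everywhere.
        for p in participants:
            p.commit()
        return "committed"
    else:
        # Phase 2b: any no-vote (or failure) aborts the transaction.
        for p in participants:
            p.abort()
        return "aborted"
```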

Fischer, Lynch, Paterson, "Impossibility of Distributed Consensus with One Faulty Process"
http://theory.lcs.mit.edu/tds/papers/Lynch/jacm85.pdf

AKA "The FLP Impossibility Result". Before we get too enamored with consensus and beauty of state machine replication, it's important to understand its limitations. This particular paper proves that it's impossible to achieve consensus in an asynchronous network if any one of the nodes has a non-zero possibility of failure. The currently popular CAP Theorem could be argued to be an elegant restatement and a corollary to the FLP impossibility result.

Fault tolerant consensus protocols: 3PC, Paxos, ZAB et al

Paxos is a fault tolerant consensus protocol: it can withstand the failure of a coordinator. Its variants (such as Multi-Paxos and ZAB [note: it's likely that ZooKeeper committers are reading this, so please don't kill me for claiming that ZAB is an exact Paxos variant -- it's more nuanced than that]) have also been widely used over the last decade: Google's Chubby service, Apache ZooKeeper, BerkeleyDB HA. The most common application is providing a consistent view of a cluster (e.g., membership, leader election).

Leslie Lamport, "Part Time Parliament" and "Paxos Made Simple"
http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf
http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf

The Part-Time Parliament introduces the protocol through the allegory of a parliament on the Greek island of Paxos. Personally I really liked that paper's style, but others found it (pun intended) rather Greek to them. Paxos Made Simple restates it in plain English.
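To give a flavor of the protocol, here's a compressed sketch of the single-decree ("synod") acceptor as I read it from Paxos Made Simple; the proposer, learners, and all message plumbing are omitted, and the names are my own:

```python
# Illustrative Paxos acceptor: the two promises it must keep.

class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest prepare number promised
        self.accepted_n = -1        # highest proposal number accepted
        self.accepted_value = None

    def on_prepare(self, n):
        # Phase 1b: promise not to accept proposals numbered below n,
        # and report any value already accepted.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", self.promised_n, None)

    def on_accept(self, n, value):
        # Phase 2b: accept unless a higher-numbered promise was made.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n)
        return ("reject", self.promised_n)
```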

Industrial Paxos papers
Paxos Made Live, Paxos for System Builders, ZooKeeper Atomic Broadcast
http://labs.google.com/papers/paxos_made_live.html
http://www.cnds.jhu.edu/pub/papers/psb_ladis_08.pdf
http://research.yahoo.com/pub/3514

The original Paxos paper solves a somewhat more difficult problem than is practical to implement and use in production. These papers present slight modifications of the Paxos algorithm that make it more practical to use (e.g., liveness guarantees). ZooKeeper Atomic Broadcast is specifically interesting: it defines new ordering guarantees that are needed to handle passive replication, where state updates are agreed upon instead of operations.

[Note: There's an excellent blog post covering the importance of these papers http://betathoughts.blogspot.com/2007/06/brief-history-of-consensus-2pc-and.html ]

Ken Birman, "Exploiting Virtual Synchrony in Distributed Systems"
http://www.cs.cornell.edu/home/rvr/sys/p123-birman.pdf
An interesting approach to the reliable multicast problem and to message-oriented middleware (e.g., pub/sub).

Here's a 1993 paper by Cheriton and Skeen on the limitations of causally and totally ordered messaging: http://net.pku.edu.cn/~course/cs501/2000/p44-cheriton.pdf

Data replication, naming and routing

Paul Mockapetris, "Development of the Domain Name System"
http://cseweb.ucsd.edu/classes/wi01/cse222/papers/mockapetris-dns-sigcomm88.pdf
A great example of optimistic replication/eventual consistency, and an introduction to naming in a distributed system (DNS, which we're using to reach this very site!).

Stoica et al., "Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications"
http://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf
A peer-to-peer protocol for data lookup.

Rowstron et al., "Pastry"
http://research.microsoft.com/en-us/um/people/antr/PAST/pastry.pdf
Once you've strummed some chords, you should have a pastry. Another excellent P2P naming/routing paper.

Akamai, "Consistent Hashing and Random Trees"
http://www.akamai.com/dl/technical_publications/ConsistenHashingandRandomTreesDistributedCachingprotocolsforrelievingHotSpotsontheworldwideweb.pdf

To take a wild guess, pretty much every popular web site you have open right now in your browser has some content that's delivered via (a variant of) consistent hashing. That could be images delivered from a CDN or S3, or data delivered from a data store such as a Dynamo implementation, custom-sharded MySQL/Postgres, or (depending on the client) Memcache.
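For a feel of how the trick works, here's an illustrative hash ring (names and parameters are mine; real implementations add replication, weighting, and so on):

```python
# Nodes and keys hash onto the same circle; a key is owned by the first
# node clockwise from it. Virtual nodes smooth out the load. Only ~1/n
# of keys move when a node joins or leaves.

import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted((_hash(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.points = [h for h, _ in self.ring]

    def lookup(self, key: str) -> str:
        # First ring point clockwise from the key's hash (with wraparound).
        i = bisect.bisect(self.points, _hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.lookup("some/image.png"))
```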

Saito et al., "Optimistic Replication"
http://www.ysaito.com/survey.pdf

As I've mentioned before, strong consistency is not always a practical choice. This paper surveys the many well-known alternatives to "pessimistic" (consensus-based/transactional) replication. Perhaps not as seminal as anything written by Lamport, but immediately accessible and applicable. It helps put systems you've already used (e.g., DNS, cvs/svn/git) in more formal terms.

The Google File System: a well-known paper, so I won't go into details here
http://labs.google.com/papers/gfs.html

Xerox PARC, "Epidemic Algorithms for Replicated Database Maintenance"
http://www.bitsavers.org/pdf/xerox/parc/techReports/CSL-89-1_Epidemic_Algorithms_for_Replicated_Database_Maintenance.pdf

Introduces gossip algorithms for eventually consistent replication.
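As a taste of the idea, here's an illustrative push-style anti-entropy round (the paper also covers pull and rumor-mongering variants; all names here are mine):

```python
# Each round, every node pushes its state to one random peer;
# newer versions win. An update reaches all n nodes in O(log n)
# rounds with high probability.

import random

def gossip_round(nodes):
    # nodes: list of dicts mapping key -> (value, version)
    for state in nodes:
        peer = random.choice(nodes)  # pushing to yourself is a no-op
        for key, (value, version) in state.items():
            if key not in peer or peer[key][1] < version:
                peer[key] = (value, version)

nodes = [dict() for _ in range(8)]
nodes[0]["x"] = ("hello", 1)
for _ in range(10):
    gossip_round(nodes)
print(sum("x" in n for n in nodes))  # almost certainly 8
```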

Other

Chandra and Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems",
http://www.cs.cornell.edu/home/sam/FDpapers/CT96-JACM.ps
The FLP impossibility result is a theoretical limitation (so be wary when a vendor claims to support full-blown ACID transactions while advertising nearly continuous availability with low latency!), but in practice imperfect failure detectors can be used to work around it and build commercial systems.
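To show how simple such a detector can be, here's a sketch of a timeout-based (hence unreliable) heartbeat detector; it may wrongly suspect a slow node, which is exactly the imperfection the paper formalizes. Names are hypothetical:

```python
# A node that hasn't been heard from within the timeout is suspected.
# Suspicions can be wrong and can be revoked by a later heartbeat.

import time

class HeartbeatDetector:
    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node):
        # Called whenever a message/heartbeat arrives from `node`.
        self.last_seen[node] = time.monotonic()

    def suspects(self):
        now = time.monotonic()
        return [n for n, t in self.last_seen.items()
                if now - t > self.timeout]
```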

Lamport, Shostak, and Pease, "The Byzantine Generals Problem"
http://research.microsoft.com/en-us/um/people/lamport/pubs/byz.pdf

Most fault tolerance and consensus protocols so far deal with "nice" failures (machines are slow, machines are down, machines send data out of order or send partial data). Unfortunately, the real world is not like this. There are malicious adversaries (compromised machines) and machines acting in a completely erratic manner. This is (as far as I understand) still an open area of research.

Eric Brewer, "Lessons from Giant-Scale Services" http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.4274&rep=rep1&type=pdf

Discusses the experiences from building the Inktomi search engine (side note: it's quite fascinating to contrast what was considered "giant-scale" in 2001 vs. 2011!). Introduces the concepts of harvest and yield.
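The two metrics are simple ratios; here's how I understand them from the paper (the function name is mine):

```python
# yield   = queries completed / queries offered  (did we answer at all?)
# harvest = data reflected in the answer / complete data  (how complete?)
# A degraded system can trade one for the other instead of going down.

def availability_metrics(completed, offered, data_available, data_total):
    yield_ = completed / offered
    harvest = data_available / data_total
    return yield_, harvest

print(availability_metrics(990, 1000, 9, 10))  # (0.99, 0.9)
```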

MapReduce: http://labs.google.com/papers/mapreduce.html
Needs no introduction (it's a well-known paper), but as Edward Ribeiro pointed out in a comment, it really took off as a concept even independently of the Google File System and its implementations. Along with Stonebraker's "The Case for Shared Nothing" (mentioned in Henry Robinson's answer), it has changed the way large-scale data analysis is done.
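Here's the paper's canonical word-count example, with the framework shrunk to a single in-process function (the map/reduce signatures follow the paper; everything else is illustrative):

```python
from collections import defaultdict

def map_fn(_key, text):
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)  # the "shuffle" phase: group values by key
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return dict(kv for k, vs in groups.items() for kv in reduce_fn(k, vs))

print(mapreduce([("doc1", "to be or not to be")], map_fn, reduce_fn))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```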

Interesting recent industrial papers

Yahoo's PNUTS: http://www.brianfrankcooper.net/pubs/pnuts.pdf Interesting in that it combines the idea of a scalable message bus (the reliable multicast problem) with a relaxed consistency model that sits between eventual consistency and full-blown strong (consensus-based) consistency.

Amazon's Dynamo:
http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
Even though the major concepts (consistent hashing, vector clocks, eventual consistency) are mentioned in other papers, this paper shows how they can be tied together to build a highly available, low-latency key/value store. The choice of availability over consistency is also rather interesting (it is rather uncommon in the database world), as are the business reasons and trade-offs of doing so.
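Here's a minimal vector-clock sketch of the kind Dynamo uses to detect conflicting versions (all names are mine, not Amazon's):

```python
# Each node increments its own entry on write; two clocks conflict when
# neither has seen everything the other has, in which case the
# application (or last-write-wins, etc.) must reconcile.

def increment(clock, node):
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    # True if a dominates or equals b (a has seen everything b has).
    return all(a.get(node, 0) >= n for node, n in b.items())

def conflict(a, b):
    return not descends(a, b) and not descends(b, a)

v1 = increment({}, "node-a")   # {node-a: 1}
v2 = increment(v1, "node-b")   # {node-a: 1, node-b: 1}
v3 = increment(v1, "node-c")   # {node-a: 1, node-c: 1}
print(conflict(v2, v3))        # True: concurrent writes, must reconcile
```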

9/13/11 -- Addendum

There's also a great list put together by Swami Sivasubramanian (who helped build Amazon's Dynamo and other great distributed systems):

http://scalingsystems.com/2011/09/07/reading-list-for-distributed-systems/
