I have to admit that the answer is rather lengthy, so here’s a little abstract: The first few sections — Basic Knowledge, Levels of detail, Naming and Transfer of control — cover the basics of procedural programming, and are intended for people who have had no prior experience with programming.
The sections Abstraction, Reduction and Algebraic notions introduce the notions of semigroup, monoid and group and show their usefulness in programming. The topics they present are interesting on their own, but they are not essential for understanding “the purpose and benefits of monads”. They are included so that readers who have heard that “a monad is a kind of monoid” don’t feel confused or deceived, but if you’re only interested in monads, feel free to skip them.
The section Lambda-calculus is a must-read for those who don’t know it yet. You shouldn’t be frightened, though, as it’s actually very simple.
The section called Sequencing is where the explanation of “the purpose and benefits of monads” actually starts, but if you happen to know the let* form from Scheme/Common Lisp, you can safely skip it.
The last two sections, Callbacks and Monads, finally form the core of the explanation. There’s also a TL;DR section at the end for the impatient, and starting from there might actually be a good idea.
It took me a couple of days to write this answer, so I wouldn’t be surprised if it also took some time to read. I am rather confident that it’s worth it. I wish I had something like this available when I was first exposed to monads, and I’ve found all the tutorials I’ve read so far awkwardly abstract and rather misleading. Of course, if you have any questions or remarks, feel free to leave a comment.
It is also now clear to me that I could have made this answer much shorter, as the section on monoids is rather unnecessary. I do believe that it is both theoretically interesting and practically valuable, though, so I decided to leave it.
Last but not least, I think that practically inclined readers will get the most out of this answer if they familiarize themselves with either Haskell’s “do” notation or Scala’s “for-comprehension” — seeing a couple of examples should be enough just to get a taste. Other than that, I don’t assume any familiarity with Haskell or Scala, although there are some technical remarks at the end that can only be understood by people who have at least a rough idea of Haskell.
Basic knowledge
I shall assume that you intended to target the response at “non-programmers” in order to avoid technical jargon and presupposed knowledge about programming. This makes the task a bit more interesting, since it first requires explaining what programming is.
Otherwise I would just assume that, in a sense, we are all programmers: even if we are not programming computers, we sometimes happen to “program” other people. After all, if we understand programming computers as “making computers do things”, then we can interpret “making people do things” as “programming people” (even though that may sound a bit creepy).
Just like with people, in order to make computers do what we want, we need a way to make them “understand” our will. And just like with people, we typically use some sort of a language to express it.
At the very core, the language that typical computers can understand is very simple: they can execute primitive instructions like moving content from one memory cell to another, performing simple operations like adding, subtracting or multiplying the values of two memory cells, or comparing the values in two memory cells and acting in different ways depending on the result of that comparison.
For example, here's a program that computes the factorial of the value contained in the memory cell X0, and places the result in the memory cell X1.
- 1. X1 := 1
- 2. IF X0 = 0 GOTO 6
- 3. X1 := X1 * X0
- 4. X0 := X0 - 1
- 5. GOTO 2
- 6. STOP
Each line corresponds to a particular step of the computation, and the GOTO instruction makes the computer jump to a particular line. So in this example, after executing step 1 the program proceeds directly to step 2. From step 2, it can either go to step 6 (if the value in the memory cell X0 is 0) or to step 3 (if the value in X0 is non-zero). From step 3 it proceeds to step 4, from step 4 to step 5, and from step 5 always back to step 2. Lines 2-5 are often called a "loop", because they can be executed repeatedly.
The current (or next, if you please) step of the computation is stored in a special memory cell called the “instruction pointer” or “program counter”, and what the GOTO instruction does is alter the content of that register, so that execution doesn't have to proceed linearly.
Of course, this is not the actual language of the computers, because computers encode these instructions as sequences of zeros and ones, but this is how we can imagine what's going on.
Levels of detail
There are a few problems with this language or programming model. Although the example that I've shown is relatively simple, it is already frustrating to read. Explaining this code requires a lot of comments, and we do not make any use of one of the most important and fundamental abilities of our minds, namely — the ability to name things.
Also, as programs get larger, it may be really difficult to track their control flow, which is problematic, because the order of things going on in the program may be very different from the order in which they appear in the source code.
Another problem is that this approach would only work if we had just one kind of computer. But there are many different kinds of computers, with different primitive instructions that can be performed on them. For example, some kinds of computers give you an instruction to multiply two numbers, and others may require you to write a special sub-program that would perform that multiplication for you (for example using loops, addition and subtraction).
In other words, this approach forces us to over-specify things.
But probably the worst thing is that this approach doesn't let us express what we want, but rather how to make the computer do what we want.
Naming
The problem of naming was probably the first one to be addressed. In a language that supports naming, the above program could be expressed as
- SUBROUTINE FACTORIAL(N):
- RESULT := 1
- LOOP:
- IF N = 0 GOTO END
- RESULT := RESULT * N
- N := N - 1
- GOTO LOOP
- END:
- RETURN RESULT
Since everything is named (and parameterized), we don't need that many comments: we can now easily guess that the intention of the author of this code was to have a subroutine that computes a factorial, and that the RESULT variable is used to store the result.
Transfer of control
However, the unconstrained use of GOTO can still lead to code that is difficult to follow. This problem was solved with the paradigm called "structured programming", where you use specific control structures such as the WHILE loop.
- SUBROUTINE FACTORIAL(N):
- RESULT := 1
- WHILE N > 0 DO
- RESULT := RESULT * N
- N := N - 1
- ENDWHILE
- RETURN RESULT
In this particular example the WHILE loop doesn't buy us much, but working with a code base that has no GOTOs in it is much easier in principle, because you can reason about many things locally (of course, programmers can still do things to make the analysis of the control flow more difficult).
Abstraction
Note however, that it is rarely our desire to drive the control flow of a computer in a particular way — and yet we are forced to deal with it somehow.
If we wish to have a factorial of a number, we don't necessarily need a computer to perform a loop. Perhaps if it already computed some factorials, it could store them in some table, and maybe it could work faster if it just fetched the values from that table, instead of running a loop.
Or maybe it is connected to a network that has some idle computers in it, and it could ask these computers to help in performing this computation?
We may therefore wish to be able to specify what we want, and leave it to the runtime system to decide what should actually happen.
So for example we could wish to be able to define the factorial in the language of mathematics, or some language inspired by that language, like Haskell:
- factorial 0 = 1
- factorial n = n * factorial (n - 1)
Note that there is no execution flow whatsoever here, and this formulation is way shorter. There are just two equations that establish the meaning of the word "factorial".
Reduction
We could define factorial in several other ways. In particular, you may encounter definitions that look more or less like this:
- factorial n = fold (*) [1..n]
This definition probably won't work for you in Haskell, because the fold function isn't defined. However, the idea of fold is that it takes a binary operator, say ▽ (where ▽ may stand for a two-argument function such as addition or multiplication), and a list, say [a, b, c, d], and produces the value of the expression a ▽ b ▽ c ▽ d.
Note that the above formulation is ambiguous, because we don't know the order in which the expressions should be reduced. If we interpret the expression as (((a ▽ b) ▽ c) ▽ d), we get a left fold, and if we interpret it as (a ▽ (b ▽ (c ▽ d))), we get a right fold.
Lastly, if we balance the parentheses, we get something like ((a ▽ b) ▽ (c ▽ d)), which we could call a balanced fold.
Of course, if the operator ▽ is associative (as is the case for, for example, addition or multiplication), that is, if it is the case that (a ▽ b) ▽ c = a ▽ (b ▽ c) for all a, b and c, then — as far as the value of the result is concerned — it doesn't matter how we place the parentheses.
This nevertheless does matter from the operational perspective. Let's say we wish to compute the factorial of 8. If we use a left or right fold to do that (or the FACTORIAL procedure defined earlier), then we need 8 steps to get the result.
On the other hand, given a computer with 4 cores, we can obtain the result in just 3 steps: first, all the cores multiply the numbers pairwise, yielding 4 partial products. In the second step we need just 2 cores, which perform 2 multiplications for us. In the final step, we need just one core.
[math]\begin{array}{c|cccc} \textbf{Step} & \textbf{Core 1} & \textbf{Core 2} & \textbf{Core 3} & \textbf{Core 4} \\ \hline 1 & 1 \cdot 2 & 3 \cdot 4 & 5 \cdot 6 & 7 \cdot 8 \\ 2 & 2 \cdot 12 & - & 30 \cdot 56 & - \\ 3 & 24 \cdot 1680 & - & - & - \\ & 40320 & - & - & - \end{array}[/math]
You may notice that the number of operations decreases by half in each step, so we arrive at the result rather quickly.
Computer scientists say that the time of getting the result is logarithmic with regard to the size of the problem. While the difference between 3 and 8 may not be particularly convincing, if you had to compute the factorial of 1024, then given 512 cores you could obtain the result in just 10 steps, instead of 1024. (This is the inverse of the exponential growth mentioned, for example, in Moore's law, which says that the computational power of computers doubles every constant period of time. In our case, doubling the data size only increases the number of needed computation steps by one, assuming that we have enough computational resources available.)
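To make the idea of a balanced fold a bit more concrete, here is a minimal Haskell sketch (the name balancedFold is mine, and I restrict it to non-empty lists; this version is sequential and only captures the shape of the computation — actually evaluating the two halves in parallel would require something like the par combinator from the parallel package):
- -- split the list in half, fold each half, combine the partial results;
- -- the operator is assumed to be associative
- balancedFold :: (a -> a -> a) -> [a] -> a
- balancedFold _  [x] = x
- balancedFold op xs  = balancedFold op left `op` balancedFold op right
-   where (left, right) = splitAt (length xs `div` 2) xs
-
- factorial :: Integer -> Integer
- factorial n = balancedFold (*) [1..n]   -- for n >= 1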
Algebraic notions
Obviously, calculating factorials is neither the most interesting nor the most useful thing that computers can do for us, but whatever they do, we like them to do it fast and to make proper use of our resources. It is therefore reasonable to ask: what other operations (besides multiplication) can be performed in time that is logarithmic with regard to the size of the data being processed?
And it turns out that associativity of the operation is the only property that is required to process the data in logarithmic time. So besides multiplication, we can be sure that we can, for example, add numbers in a similar manner, or find the maximum or minimum of a set of numbers.
Depending on the context, it may or may not be useful to know that mathematicians call a set with an associative operation a semigroup. Knowing that a particular domain is a semigroup may prompt you to implement the above scheme.
You may have observed that the data sizes that I have been using in my examples were consistently powers of 2. This is because it is often easier to write an algorithm for a power of 2. However, in real life it rarely happens that our data sizes are powers of 2.
So we have a choice: either we can try to come up with more general, but usually more complex algorithms, or we can extend our data set with elements that won't affect the final result. For example, when multiplying numbers, we could extend the data sets with 1, because multiplying by 1 doesn't change the result. Likewise, adding zero to some value doesn't affect the outcome.
More generally, we say that 0 is a neutral element (or identity element) of addition, and 1 is a neutral element of multiplication.
Of course, mathematicians also came up with a name for semigroups with an identity element: they are called monoids. (Moreover, if each element has an inverse element, then this structure is called a group, and is the subject of study of a vast branch of algebra called group theory.)
For example, strings with the concatenation operation form a monoid, where the empty string is the neutral element. But perhaps more importantly, functions (from some set to itself) with the operation of function composition form a monoid, where the neutral element is of course the identity function.
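Both of these monoids are available off the shelf in Haskell; here is a small illustration (a sketch; the function monoid goes by the name Endo in Data.Monoid, and <> denotes the monoid operation):
- import Data.Monoid (Endo(..), (<>))
-
- -- strings form a monoid under concatenation; "" is the neutral element
- greeting :: String
- greeting = mconcat ["Hello", ", ", "world"]           -- "Hello, world"
-
- -- functions from a type to itself form a monoid under composition;
- -- the neutral element is the identity function
- addOneThenDouble :: Int -> Int
- addOneThenDouble = appEndo (Endo (* 2) <> Endo (+ 1)) -- \x -> (x + 1) * 2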
Having the notion of an identity element solves yet another problem. Recall the formulation of fold that I gave above. It did not specify what the value of a fold over an empty list should be. Intuitively, the sum of all the elements of an empty list should be 0, for example. For the same reason, the product of all the elements of an empty list should be 1.
However, the idea of folding is not limited to monoids, or even to things of the same kind. The variants of fold that are most commonly used in Haskell take an additional argument, the "initial value" e, so that, for example, you can understand foldl ▽ e [a,b,c,d] as ((((e ▽ a) ▽ b) ▽ c) ▽ d) and foldr ▽ e [a,b,c,d] as (a ▽ (b ▽ (c ▽ (d ▽ e)))).
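For instance, with an explicit initial value the accumulator no longer has to be of the same type as the list elements (a tiny illustration; totalLength is a made-up name):
- totalLength :: [String] -> Int
- totalLength = foldl (\acc s -> acc + length s) 0
-
- -- totalLength ["ab", "cde"] == 5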
I shall give you an example of this. Programmers often use tools called "version control systems", which track changes that they make in their programs.
Often these tools do not make a complete copy of each version of the program, but only store the differences between subsequent versions. These differences are usually called patches, and they indicate which lines of which files were changed, and how. You can think of the program as a book and of a patch as an erratum that contains a list of misprints or other errors.
Now, given a list of patches, say [a,b,c,d], and an initial version of the program, say e, if ► is the operation of applying a patch (which takes a program and a patch and produces a new version of the program), you can obtain the latest version of the program by calculating foldl ► e [a,b,c,d].
Of course, completing this operation requires time proportional to the number of patches (however, if you had a way of merging two patches together, you could merge them all in logarithmic time and only apply the resulting patch once).
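Here is a hypothetical, heavily simplified model of this in Haskell, just to show the fold at work; Program, Patch, applyPatch and latestVersion are all made-up names, and a real version-control system obviously stores much richer patches:
- -- a program is modelled as a list of lines;
- -- a patch replaces the line at a given index
- type Program = [String]
- data Patch = Replace Int String
-
- applyPatch :: Program -> Patch -> Program
- applyPatch program (Replace i newLine) =
-   [ if k == i then newLine else line | (k, line) <- zip [0..] program ]
-
- -- the latest version is a left fold of patch application over the patches
- latestVersion :: Program -> [Patch] -> Program
- latestVersion = foldl applyPatch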
I hope that so far I've managed to show you how some notions from algebra — and the notion of monoid in particular — can be useful to programmers.
Before we get to monads, we need to explain a couple of other things, though.
Lambda-calculus
So far we've emphasized the importance of naming in the practice of programming to some extent, but it turns out that naming is actually essential to programming. This is reflected in one of the oldest programming languages ever invented, namely — lambda-calculus.
Admittedly, "lambda-calculus" is not the most appealing name imaginable. It would more accurately be called "the calculus of substitutions", because it captures the rules that allow us to substitute names with other names or values.
In the remainder of this text, I am going to use the Lisp s-expression syntax to present lambda-expressions, because it makes the structure more obvious (and in particular — structure manipulation, which should turn out useful later). Keep in mind that the presentation is going to differ slightly from the one that can commonly be found in the literature.
The expressions in lambda-calculus can either be (1) variables, (2) lambda-abstractions (functions) or (3) applications. In particular:
1. Some examples of variables are things like x, + or and.
2. Lambda-abstractions begin with the lambda keyword, followed by a list of parameters, followed by an expression, i.e. (lambda (parameters ...) expression), for example (lambda (x) x) or (lambda (x y) y). Intuitively, the first one corresponds to the identity function, and the second one is just a function of two parameters that returns its second parameter.
3. Application is expressed by putting two or more expressions next to each other, for example (+ x y). The leftmost expression is interpreted as a function (or "the expression being applied"), and the remaining ones are its arguments.
Lambda-calculus is governed by the following rules:
α. Lambda-abstraction captures the idea of binding. For example, in the expression (lambda (x) (f x)) the symbol x is bound to the argument of lambda, and the symbol f is unbound (it occurs free). Intuitively, the meaning of this expression is exactly the same as that of (lambda (y) (f y)).
It could be tempting to say that "we can rename the parameter in lambda-abstractions however we like", but this isn't actually the case. For example, what would happen if we wanted to rename the x argument in (lambda (x) (f x)) to f? We would get (lambda (f) (f f)), which is obviously a different expression, because it no longer contains the unbound symbol. This is exactly what the first of the three rules of lambda-calculus tells us: if we want to rename an argument of a lambda-abstraction, we must make sure that (1) all free occurrences of this argument are replaced and that (2) the new name that we choose for the argument does not occur free in the lambda-abstraction.
β. The second rule of lambda-calculus explains how we can reduce complex lambda-expressions. It says that if we have an application of a lambda-abstraction to some arguments, then we can replace this expression with the body of that lambda-abstraction, substituting all occurrences of the parameters with the values of the corresponding arguments.
For example, we can reduce the expression
- ((lambda (x) (x is mortal)) Socrates)
to
- (Socrates is mortal)
Likewise, the expression
- ((lambda (x y) (x knew y)) Socrates Plato)
can be reduced to
- (Socrates knew Plato)
and the expression
- ((lambda (x) (x knew x)) Socrates)
to
- (Socrates knew Socrates)
It's really that simple. There's also a third rule, but since it isn't particularly interesting to non-logicians, I won't be covering it here. Likewise, I won't explain how it is possible that the two rules that I presented form a complete programming language capable of expressing loops, because this is immaterial to the question. If you’re interested in that, you may want to check out Panicz Godek's answer to How can every program be built just by creating variables and functions?.
What is important is to imagine that the execution of a program can be perceived as a reduction of a complex lambda-expression (we apply the second rule (β) until it cannot be applied anymore).
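To see what that means in practice, recall the two Haskell equations for factorial given earlier; evaluating factorial 3 is just such a chain of rewrites, where at each step a sub-expression matching the left-hand side of an equation is replaced with the corresponding right-hand side (arithmetic steps elided):
-   factorial 3
- = 3 * factorial (3 - 1)
- = 3 * (2 * factorial (2 - 1))
- = 3 * (2 * (1 * factorial (1 - 1)))
- = 3 * (2 * (1 * 1))
- = 6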
An application
Even though lambda-calculus is powerful enough to express every program that can be expressed, there is a set of simple conventions that makes reading and writing the programs written in it much easier.
In particular, it is often convenient to write
- (let ((<name> <value>) ...) <body>)
instead of
- ((lambda (<name> ...) <body>) <value> ...)
For example, suppose that we wish to find the roots of a quadratic equation [math]ax^2 + bx + c = 0[/math] for some given [math]a[/math], [math]b[/math] and [math]c[/math]. Mathematicians instruct us to calculate the value of [math]\Delta = b^2 - 4\cdot a\cdot c[/math] and tell us that — assuming that [math]\Delta \ge 0[/math] — the roots can be found at [math]\frac{-b-\sqrt{\Delta}}{2a}[/math] and [math]\frac{-b+\sqrt{\Delta}}{2a}[/math].
- (lambda (a b c)
- (let ((delta (- (* b b) (* 4 a c))))
- (values (/ (- (- b) (sqrt delta))
- (* 2 a))
- (/ (+ (- b) (sqrt delta))
- (* 2 a)))))
We have used the let form to avoid calculating delta twice. You can see, however, that there are some other complex expressions that appear in this formula twice, such as (sqrt delta), (* 2 a) or (- b). We could also use the let form to take care of those:
- (lambda (a b c)
- (let ((delta (- (* b b) (* 4 a c))))
- (let ((-b (- b))
- (sqrt/delta (sqrt delta))
- (2a (* 2 a)))
- (values (/ (- -b sqrt/delta)
- 2a)
- (/ (+ -b sqrt/delta)
- 2a)))))
(We have used rather strange names, namely sqrt/delta, -b and 2a. Many popular programming languages do not allow creating names like this, and would rather treat them as compound expressions, but the s-expression syntax only treats parentheses and whitespace specially: any sequence of non-whitespace, non-parenthesis characters can be used as the name of a variable, as long as it cannot be interpreted as a number.)
You may have noticed that our expression contains a handful of free variables, namely +, -, *, / and values. We assume that the former denote the common mathematical operations of addition, subtraction, multiplication and division, and that the latter provides some means of returning multiple values.
Also, it should be rather clear that the value of the inner binding sqrt/delta refers to the outer binding delta. This pattern is called "sequencing".
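For comparison, here is roughly the same computation written in Haskell (a sketch; the names roots, sqrtDelta, twoA and minusB are mine). Haskell's let is actually recursive, so the order of the bindings doesn't matter there, but the dependency pattern, with the inner bindings referring to delta, is exactly the same:
- roots :: Double -> Double -> Double -> (Double, Double)
- roots a b c =
-   let delta     = b * b - 4 * a * c
-       sqrtDelta = sqrt delta
-       twoA      = 2 * a
-       minusB    = negate b
-   in ((minusB - sqrtDelta) / twoA, (minusB + sqrtDelta) / twoA)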
Sequencing
It turns out that sequencing is so common that it is often convenient to provide another special form, let*, to avoid nesting of expressions, so that
- (let* ((<name-1> <value-1>) (<name-2> <value-2>) ...) <body>)
is equivalent to
- (let ((<name-1> <value-1>))
- (let* ((<name-2> <value-2>) ...)
- <body>))
and (let* () <body>) is the same as <body>.
You may notice that this style of programming gets us really close to the sequential mode of computation that we were dealing with in our factorial example. Note that the instruction set of our computer didn't allow us to write complex expressions such as [math]b^2 - 4\cdot a\cdot c[/math]. Instead, we would have to write something like:
- X0 := B*B
- X1 := A*C
- X1 := 4*X1
- X0 := X0-X1
- ...
We can express a very similar thing using the let* form:
- (let* ((X0 (* B B))
- (X1 (* A C))
- (X1 (* 4 X1))
- (X0 (- X0 X1)))
- ...)
Note that there is a difference: in the first example, the content of the registers X0 and X1 is overwritten with new values, while in the second example the names X1 and X0 merely shadow the old bindings.
What is more important is that we are able to express the idea of steps of computation in lambda-calculus.
Lastly, unlike the original formulation of [math]\Delta[/math], the code now performs the computations in a fixed order: it first computes [math]b^2[/math], then [math]a\cdot c[/math], and so on. But this order isn't a part of our task. It is just some accidental complexity that was introduced because of the constraints of our solution (i.e. that computers are sequential machines).
Of course, it sometimes happens that the order of execution is actually crucial. We don't want our applications to send any sensitive user data before the user logs in or after the user logs out, for example.
Callbacks
So far we've assumed that the machine instructions execute directly one after another. However, there are situations when this is not the case. For example, when we have an application that communicates with a server, we may need to wait for the response before proceeding with our actions.
Web developers typically solve this issue by registering a callback that shall be executed when the response is received.
Consider the example of a two-stage authorization, where we first send our login, and after receiving a confirmation we also send a password, and after that succeeds we execute some code.
The JavaScript code that does this would look something like this:
- request(server, login, function(response) {
- if(response == OK) {
- request(server, password, function(response) {
- if(response == OK) {
- some_action
- }
- })
- }
- })
It should be easy to imagine the nesting getting more annoying when there’s more than two steps of communication.
Also, you may see that this nesting resembles the nested let forms that we replaced with a let* form.
Let's rewrite the above code in s-expression form to make the syntactic manipulations more obvious:
- (request server login
- (lambda (response)
- (if (== response OK)
- (request server password
- (lambda (response)
- (if (== response OK)
- some-action))))))
We could introduce a new form similar to let*, say, talk, so that
- (talk <server> <action>)
simply expands to <action>, and
- (talk <server> <message> <messages> ... <action>)
expands to
- (request <server> <message>
- (lambda (response)
- (if (== response OK)
- (talk <server> <messages> ... <action>))))
and our code for communicating would take the form:
- (talk server login password some-action)
The pattern (or anti-pattern) of a linearly increasing level of nesting in the code is called the "pyramid of doom", and it appears in many different contexts. It can be a pain to deal with, because it often obscures the essence of what you’re trying to do.
Monads, finally
So far we have seen that we can usually avoid the pyramid of doom rather easily by introducing new syntactic forms to our language. However, for some reason many programmers do not like to use languages whose syntax is based on s-expressions and prefer other syntaxes that correspond better to their whims. Since these syntaxes make the introduction of new syntactic forms to the language really difficult, they come up with other means to avoid the pyramid of doom.
They suggest that it is a good idea to have a single special form that is able to convert nesting into sequencing and that is implicitly parameterized with some sort of composition operator (this special form is called "do-notation" in Haskell and "for-comprehension" in Scala). The implicit parametrization occurs with help from the type system (for example, in Haskell it makes use of a mechanism called type classes).
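To give you a taste of what this looks like, here is a small Haskell sketch built around the Maybe type, which models computations that may fail to produce a value (safeDiv, nested and flat are made-up names; the composition operator is supplied implicitly by the Maybe type):
- safeDiv :: Double -> Double -> Maybe Double
- safeDiv _ 0 = Nothing
- safeDiv x y = Just (x / y)
-
- -- explicit nesting (a small pyramid of doom):
- nested :: Double -> Double -> Double -> Maybe Double
- nested a b c =
-   case safeDiv a b of
-     Nothing -> Nothing
-     Just x  -> case safeDiv x c of
-                  Nothing -> Nothing
-                  Just y  -> Just (y + 1)
-
- -- the same computation written with do-notation:
- flat :: Double -> Double -> Double -> Maybe Double
- flat a b c = do
-   x <- safeDiv a b
-   y <- safeDiv x c
-   return (y + 1)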
Since leaving things implicit is rarely helpful to comprehension, we could of course provide the composition operator explicitly. We could imagine having a form, say, subsequently, such that
- (subsequently <combine> () <body>)
would simply expand to
- <body>
and
- (subsequently <combine> ((<name-1> <value-1>)
- (<name-2> <value-2>) ...)
- <body>)
would expand to
- (<combine> <value-1>
- (lambda (<name-1>)
- (subsequently <combine> ((<name-2> <value-2>) ...)
- <body>)))
It is easy to see that if (pass value function) is defined simply as (function value), then (subsequently pass bindings body) is equivalent to (let* bindings body).
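Incidentally, Haskell has an operator that corresponds exactly to pass: it is called (&) and lives in Data.Function, with x & f defined as f x. A chain of such applications is then just an explicit spelling of let* (the name example below is mine):
- import Data.Function ((&))
-
- -- the equivalent of (let* ((x 9) (y 16)) (+ x y)):
- example :: Int
- example =
-   (3 * 3) & \x ->
-   (4 * 4) & \y ->
-   x + y                -- 25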
Similarly, if ((say server) message react) is defined as
- (request server message
- (lambda (response)
- (if (== response OK)
- (react response))))
then
- (subsequently (say server) ((_ messages) ...) action)
is equivalent to
- (talk server messages ... action)
(the purpose of the _ symbol is to indicate that the response values to the particular messages are ignored).
You may have noticed that the composition operators that we've used here, namely pass and (say server), are somewhat similar: they both take a value and a function that can take this value to produce some result, which in turn can be passed further.
For some reason that I find difficult to comprehend, the developers of Haskell decided that its sequencing operator should be parameterized not only with the composition operator (which is written as >>= and pronounced "bind" in Haskell), but also with a "unit" operator that does nothing interesting — or, to be more precise, does nothing in some particular sense (it is called return in Haskell, probably to confuse programmers with a background in imperative programming).
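In these terms, the do-block from the Maybe sketch above is just a shorthand for a chain of >>= applications, structurally the same thing as subsequently (reusing the hypothetical safeDiv from before):
- flat' :: Double -> Double -> Double -> Maybe Double
- flat' a b c =
-   safeDiv a b >>= \x ->
-   safeDiv x c >>= \y ->
-   return (y + 1)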
Arguably, there are some "monadic laws" that ought to be satisfied by these operators. The first law states associativity (or rather "something resembling associativity") of the composition operator (which doesn't make too much sense in practice), and the second states that the unit operator is a neutral element of this composition.
These laws may remind you of the definition of monoids. This should explain at least a bit of the popular phrase "monads are just monoids in the category of endofunctors, what's the problem?".
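The resemblance is easiest to see if you state these laws in terms of the Kleisli composition operator >=> from Control.Monad, defined so that (f >=> g) x = f x >>= g (the helpers halve and quarter below are made-up examples):
- import Control.Monad ((>=>))
-
- -- written with >=>, the "monadic laws" read just like the monoid laws,
- -- with return playing the role of the neutral element:
- --
- --   (f >=> g) >=> h  ==  f >=> (g >=> h)    -- associativity
- --   return >=> f     ==  f                  -- left identity
- --   f >=> return     ==  f                  -- right identity
-
- halve :: Int -> Maybe Int
- halve n = if even n then Just (n `div` 2) else Nothing
-
- quarter :: Int -> Maybe Int
- quarter = halve >=> halve     -- quarter 8 == Just 2, quarter 6 == Nothing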
The problem is that neither the notion of a category nor the notion of an endofunctor gives any clues whatsoever with regard to sequencing. Moreover, the high priests of Haskell often claim that Haskell has "monadic IO", because it uses the sequencing mechanism I've described. However, as Conal Elliott noted, since the IO type has no identity, the monadic laws for "the IO monad" cannot be meaningfully formulated, and therefore "the IO monad" is not a monad (at least not in the sense that category theory has in mind — it may be a monad in the sense that it implements Haskell's Monad interface, but this only proves that in programming, "monad" is nothing more than an awkward name).
TL;DR: monads allow you to replace nesting with sequencing in languages that have poor support for macros but offer sufficiently elaborate type systems. And they probably allow some programmers to look smart. The name "monad" is terrible, because it focuses on some irrelevant (and unnecessary) details, giving no clue about the purpose whatsoever (truly, the name "sequencing" would be way better).