
They are not faster than OS threads; they are in fact a version of N:M scheduling, where there are N goroutines which are backed by M OS threads.

I’m a big fan of N:M scheduling, but it has a lot of complexity, which in turn means there’s a high maintenance cost for the infrastructure, relative to the benefit.

Almost all of the benefit revolves around call conversion scheduling. This is where you take what would be a blocking call, and trade it for a non-blocking call plus a context switch.
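For concreteness, here’s a minimal sketch of call conversion in Go terms, using raw Unix syscalls; waitReadable is a stand-in I’ve invented for the runtime’s poller, not a real API:

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
)

// waitReadable stands in for the runtime's netpoller: a real
// implementation would register fd with epoll/kqueue and park the
// goroutine until the fd is ready; yielding is enough to show the
// control flow.
func waitReadable(fd int) { runtime.Gosched() }

// convertedRead is the call conversion: the fd is non-blocking, so a
// read that would block returns EAGAIN instead of suspending the OS
// thread, and the scheduler can spend the quantum on another goroutine.
func convertedRead(fd int, buf []byte) (int, error) {
	for {
		n, err := syscall.Read(fd, buf)
		if err != syscall.EAGAIN {
			return n, err
		}
		waitReadable(fd) // context switch in user space, then retry
	}
}

func main() {
	var p [2]int
	syscall.Pipe(p[:])
	syscall.SetNonblock(p[0], true)
	go syscall.Write(p[1], []byte("hello")) // data arrives "later"
	buf := make([]byte, 16)
	n, _ := convertedRead(p[0], buf)
	fmt.Println(string(buf[:n]))
}
```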

The remainder of the benefit comes from voluntary preemption; this is where a thread which would be suspended pending a condition, were the condition check taking place in the OS kernel, instead moves the condition check to user space. The suspended goroutine is placed on a condition wait queue, and there is a context switch — just like in the call conversion scheduler.
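Go’s channels are the everyday example of this: a receive on an empty channel parks the goroutine on the channel’s wait queue in user space and switches to another goroutine, without entering the kernel.

```go
package main

import "fmt"

func main() {
	done := make(chan struct{})
	go func() {
		// ... do the work the waiter depends on ...
		close(done) // condition satisfied: waiters move to the run queue
	}()
	// This receive parks the goroutine on the channel's user-space
	// wait queue; the condition check and the context switch both
	// happen in the runtime, not the kernel.
	<-done
	fmt.Println("woken in user space")
}
```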

N:M threading was invented — and implemented in FreeBSD — specifically to address the issue of partial quantum context switch overhead.

It turns out that a context switch from one kernel thread to another is relatively expensive. You pay two system calls of overhead, a full register and context spill, and a context reload on the other side. The spills may not be strictly necessary, but they are difficult to do lazily (Linux was about 8 years old before it started doing lazy context switching; BSD had it from day one).

Effectively, you have four protection domain crossings, and a lot of data movement overhead.

If you can accomplish this in user space, you have zero protection domain crossings, and a much smaller data movement overhead.

N:M threading does this.

Additionally, if you make a blocking call partway into your quantum, you pay that expense to run another thread instead. So if you were given 100ms of quantum, and used on average 33ms of quantum per thread, then you pay 3 heavyweight context switch overheads per quantum — one of which you’d pay anyway, since it’s the involuntary preemption at the expiration of the quantum that the CPU scheduler uses to implement time sharing between the many processes running on a system.

If you do it in user space, you pay only 1 instead of 3. That’s a 66% savings in context switch overhead, and that is the goal N:M threading — and therefore goroutines — accomplishes.
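Laid out explicitly, with the numbers above:

```
kernel-mediated:  3 heavyweight switches per 100 ms quantum
                  (at 33 ms per thread; 1 is the end-of-quantum preemption)
user-space N:M:   1 heavyweight switch per 100 ms quantum
                  (only the end-of-quantum preemption)
saved:            (3 - 1) / 3 ≈ 66%
```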

My personal statement on this — if I might quote myself — is:

The OS gave me the quantum; it is my damn quantum, and I am going to run threads in my process with it, if I have another thread ready to run, and I am not giving the remainder of that quantum back to the OS to give to someone else.

That’s goroutines.

But because it divides the scheduling responsibility between the kernel scheduler and another scheduler — which lives in the goroutine runtime library — there’s complexity.

Every time a potentially blocking system call is added to the system, you have to examine it to see if you can make a call conversion using it, or if it has to be a blocking call.

And if you get to the point you are about to block all of your OS threads — your M’s — because you have that many unconvertible blocking operations outstanding, but you have goroutines whose wait condition has been satisfied, so they are ready to run… then you have to start another OS thread (M := M + 1) to back the ready to run goroutines, while you then block the Mth OS thread.
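A sketch of that decision, in Go-flavored form (the state and helper names here are mine, not the Go runtime’s actual code, though its syscall entry path does essentially this):

```go
package sched

// Stand-ins for the runtime's real bookkeeping.
type goroutine struct{}

var runQueue []*goroutine // goroutines ready to run
var idleMs int            // OS threads parked and available

func startOSThread()             { /* clone()/pthread_create a new M */ }
func blockInKernel(g *goroutine) { /* issue the unconvertible call */ }

// aboutToBlock: before an M commits to an unconvertible blocking
// call, make sure the ready goroutines still have an OS thread.
func aboutToBlock(g *goroutine) {
	if len(runQueue) > 0 && idleMs == 0 {
		startOSThread() // M := M + 1
	}
	blockInKernel(g) // this M now blocks in the kernel on g's behalf
}
```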

In the limit, it’s possible for M to grow to equal N (if you have N blocking outstanding operations).

Worse, as an implementation detail, almost all user space N:M schedulers of this type tend to keep a separate reserve thread-spawning thread; so the degenerate case is N goroutines backed by N kernel threads, plus the thread-spawning thread (N goroutines vs. N + 1 kernel threads).


Typically, the way goroutines are intended to be utilized means that N > M for all cases that are interesting, and usually N >> M (N very much larger than M).


In any case: the answer is complexity, and maintenance burden going forward.


Ultimately, a better way would be to implement an async system call gate, rather than relying on a synchronous system call gate, and implement a pure call-conversion scheme on top of that, in libc itself.

This is how the DEC MTS (Multithreading Services) operated in VMS.
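A rough sketch of that shape, in Go for concreteness. The asyncRead gate and completion type here are hypothetical; on VMS the real gate was SYS$QIO, with event flags or ASTs signaling completion:

```go
package main

import (
	"fmt"
	"os"
)

type result struct {
	n   int
	err error
}

// completion is a hypothetical handle returned by an async call gate.
type completion chan result

// asyncRead sketches the async gate: submit the operation and return
// immediately. The goroutine here simulates the kernel completing the
// request; a real gate would be one non-blocking trap (compare
// SYS$QIO on VMS, or io_uring submission on modern Linux).
func asyncRead(f *os.File, buf []byte) completion {
	c := make(completion, 1)
	go func() {
		n, err := f.Read(buf)
		c <- result{n, err}
	}()
	return c
}

// blockingRead is pure call conversion in the library: the synchronous
// API is just the async gate plus a wait, so only the gate needs
// kernel support, and the maintenance burden stays in one place.
func blockingRead(f *os.File, buf []byte) (int, error) {
	r := <-asyncRead(f, buf) // scheduler runs other goroutines here
	return r.n, r.err
}

func main() {
	buf := make([]byte, 64)
	n, err := blockingRead(os.Stdin, buf)
	fmt.Println(n, err)
}
```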

When I was working on the DEC Pathworks For VMS (NetWare) product at Novell in the 1990s, I added the timer functions into the DEC MTS package — I believe it was written in BLISS? — so that we could support Mentat Streams on top of VMS.

It’s a much more elegant model, since it allows for completion callbacks (as are present in Windows NT and derivative systems), if you access the call gates directly, rather than through the library wrapper.

This is probably to be expected: Dave Cutler was one of the architects of VMS, and was the primary systems architect for Windows NT.

This approach also has overhead; to implement a normal, blocking library system call, you’d either pass down a NULL as the completion routine parameter, or you’d have to explicitly make a second system call to wait for completion (SYS$WAITFR, in VMS parlance), or you’d need a completion routine that fired as an AST (Asynchronous System Trap) back to user space on a synthetic context — usually running in ring 1 or ring 2.

Most operating systems on Intel processors use only two rings: ring 0 (kernel) and ring 3 (user); adding another ring to run the completion routines adds complexity. Windows NT and later have a loose coupling between kernel threads and user threads which avoids this.

Typically, you’d probably keep around an extra thread to use the AST to fire the completion routine to be scheduled — and then it goes back to sleep, and the user space scheduler runs the completion routine at the next available user space context switch.
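Sketched in Go terms (the channel stands in for AST delivery; all names are illustrative):

```go
package sched

// completions stands in for AST delivery from the kernel; each value
// is a completion routine ready to be scheduled.
var completions = make(chan func(), 128)

// dispatcher plays the role of the extra thread: it wakes when a
// completion fires, queues the routine for the user-space scheduler,
// and goes back to sleep.
func dispatcher(ready chan<- func()) {
	for cb := range completions {
		ready <- cb // runs at the next user-space context switch
	}
}
```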

Maintenance overhead would drop to nearly zero additional burden.

This is not how goroutines work, obviously, since substantial kernel engineering work would be required.

Instead, goroutines are quite messy, and have a lot of associated maintenance overhead.


So the maintenance burden and additional complexity, which require someone like me — or, if you’ve understood the above, now you — to write the code, are your likely answer.
