How Haskell threads block

Posted on July 10, 2020 by wjwh

If you’ve played around with Haskell for any amount of time, you probably know it has a M:N threading model with many Haskell threads multiplexed by a userspace scheduler onto a few operating system (OS) threads. Since a Haskell thread uses much less memory than an OS thread, it becomes possible to spin up large numbers (hundreds of thousands to millions) of Haskell threads. This threading model has great benefits for readibility, since most programmers find it easier to follow a linear “top-to-bottom” path through the code than to follow along nested callbacks. There are also some drawbacks to this approach. For example, managing blocking system calls becomes much harder, since a thread that is doing a blocking operation cannot serve normal workloads. There is also much more bookkeeping to be done in the runtime, which would normally be left to the operating system.

In this post I’ll give a brief introduction to how the Haskell runtime system handles these problems. I will focus on the parts that handle blocking and scheduling of threads but there is much more to the runtime system, more than can feasably be covered in a single blog post. As a quick note, most of the information in this post covers Linux and other POSIX-compliant systems. Windows is not covered, as this post is already long enough. The Windows I/O subsystem is also being rewritten as we speak, so any detailed information would likely be out of date soon.

Blocking in Haskell land

In Haskell’s runtime system, a thread is just a data structure allocated on the heap like any other. This data structure is known as a Thread State Object (TSO) and contains a lot of different information regarding the code being run by the thread. Since at most as many threads can be running as there are processors available, the scheduler has to keep track of the runnable TSOs and schedule them one by one. For this purpose it keeps a “run queue” with TSOs. During every scheduler loop it takes the first TSO off the queue and runs it for a bit until its time slot is done (by default time slots are 20 ms, but you can tweak that). However, not all threads run until their time slot is up. It often happens that a Haskell thread depends on a result being computed by some other thread. In that case, the thread can do no more work until the result is available and should be scheduled out until the desired result is available.

The first class of possible blocks is caused by Haskell threads waiting for other Haskell threads. This might, for example, be the case in a producer-consumer type of situation spread over multiple threads, where there is one consumer thread waiting for results generated by multiple producer threads. Almost all such synchronisation is done by blocking on an MVar or one of its variations, like the TMVar. MVars are pretty interesting data structures: they are high level enough to present a useful interface to application programmers but their implementation hooks directly into the runtime system.

Structurally, an MVar consists of a reference to a possible value and two queues for threads currently blocked on the MVar trying to read (if the MVar is empty) or write (if it is full).

A tale of many thread queues

When a thread tries to read from an empty MVar or tries to write to a full MVar, the readMVar and putMVar functions will add its TSO to the relevant TSO queue in the MVar and remove it from the run queue of the scheduler (to be precise, it’s more like not putting the TSO back on the run queue in the first place). Conversely, writing a value to an empty MVar with threads waiting to read from it will not just put the value in the MVar, but will also take the first TSO from the “waiting to read” queue of the MVar and put it back on the run queue of the scheduler. A similar thing happens for reading from a full MVar with threads waiting to write to it.

In this view, the Haskell thread scheduling system can be seen as one “big” TSO queue for runnable threads belonging to the scheduler, combined with many smaller TSO queues for blocked threads scattered around the heap that belong to various MVars and their variations. It means that any blocking threads will not even be considered for scheduling, while still being kept track of in a principled manner.

`ThreadBlockedIndefinitelyOnMVar`

A side benefit of keeping track of TSOs with MVars is that it makes automatic deadlock detection possible in some cases: when no non-blocked threads hold a reference to an MVar, it is impossible to do putMVar or takeMVar operations on it. This means that any threads blocking on the MVar will never be woken up and the MVar as a whole can be reclaimed by the garbage collector. As part of reclaiming the space held by the MVar, any threads in the “waiting to read” and “waiting to write” queues will receive ThreadBlockedIndefinitelyOnMVar exception.

Blocking on system calls

A second cause of blocking system calls happens not between Haskell threads, but between a Haskell thread and the operating system. Trying to read data from a socket without any bytes available to read will either block until data is available to read or return immediately with EAGAIN or EWOULDBLOCK if the socket has been set to non-blocking mode. Some system calls have non-blocking modes, but not all.

Safe and unsafe

Since system calls are made available by the operating system as C libraries, they have to be accessed through the Haskell FFI. For this article, the most important distinction is whether the C function is marked as safe or as unsafe. An unsafe function will be executed in the same Haskell thread and may block. If that happens, no other Haskell threads can run on that OS thread. A safe function will prevent other threads from being blocked by spinning up a separate operating system thread and moving itself to that thread. This way, even if the function blocks it will not interfere with the operation of other Haskell threads. The downside of a safe function call is that it needs to start a new operating thread, which has nonzero costs in both time and memory.

The I/O manager

If every potentially blocking system call would spin up a separate OS thread, a Haskell program that does a lot of I/O would find itself running potentially hundreds of threads and much of the benefits of userspace scheduling would be lost. Luckily, most of the I/O and networking related system calls have non-blocking modes. In non-blocking mode, getting a result of EAGAIN or EWOULDBLOCK from such a system call indicates that the call can’t proceed at this moment without blocking. The thread can either try again later or it can use a blocking system call such as poll or epoll to wait for the socket to become available. However, having many threads blocking in safe system calls would defeat the purpose. For this case, the Haskell runtime has a separate I/O event manager, which runs poll/epoll/etc for many sockets at once on behalf of the other Haskell threads.

This is done via the threadWaitRead and threadWaitWrite functions, which a thread can call with a file descriptor. The event manager will monitor the file descriptor (along with all the other file descriptors from other threads) in a single poll/epoll loop and wake up the calling thread when the file descriptor is ready for reading or writing. This is done through the MVar mechanisms we have already seen and thus moves the problem neatly from operating threads into userspace scheduling. An almost unlimited number of file descriptors can be watched efficiently in this manner while keeping the number of operating system threads low.

Summary

Having a M:N threading model can be both a blessing and a curse. It can provide a lot of performance and convenience to programmers but it also gives rise to new problems that must be handled by the runtime system. Personally, I find it very cool that the MVar abstraction is versatile enough that it sees use throughout the stack from inter-thread coordination to waiting for connections to a socket.

That is not to say the current model is perfect, of course. The I/O manager model can only really deal with system calls that can be used in non-blocking mode, which limits its use mostly to networked I/O. That might be a large part of modern application I/O, but there are definitely some cases where it is not sufficient. As already mentioned there is work underway to improve the I/O manager when running on Windows, and with the introduction of the aio and io_uring interfaces in Linux perhaps “true” asynchronous system calls will become a reality there as well.

Again, the runtime system is pretty complicated and this post leaves out a lot of details. The threaded runtime in particular has a lot of edge cases. If you want to learn more, the illustrated guide to GHC is a good starting point. The GHC wiki also has a lot of information.