Stupid tricks with io_uring: a server that does zero syscalls per request

Posted on October 1, 2021 by wjwh

The new io_uring subsystem for Linux is pretty cool and has a lot of momentum behind it. In a nutshell, it allows for asynchronous system calls in an “event loop”-like fashion. One of its design goals is to reduce the number of context switches between user- and kernel-space for I/O intensive programs, and this got me thinking about exactly how far we can push that reduction. It turns out that you can get all the way down to zero syscalls per connection (as measured by strace), provided you don’t care about CPU use and don’t allocate any memory. The result is very much a toy, but I enjoyed building it so that’s enough for me.

Recap of io_uring

This post will probably not make a lot of sense to you if you don’t know what io_uring is or how it works, but a complete explanation is outside the scope of this blog post. For a very complete overview of the io_uring subsystem and its capabilities, check out the excellent Lord of the io_uring website. If you are just interested in the TL;DR version: io_uring is a new(-ish) subsystem in Linux that allows for asynchronous system calls to be submitted to the kernel, potentially in big batches. Submitting many syscalls in a batch saves on context switching overhead, since only one synchronous system call has to be made to submit the requests for many asynchronous system calls. After each syscall is completed, the kernel will place the result(s) in a special bit of memory (known as the “Completion Queue”) that is shared between the userspace program and the kernel. This means that the program can check for any completed syscalls without calling into the kernel and this allows for further savings on context switching overhead.
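To make that flow a bit more concrete, here is a minimal standalone sketch (not taken from the server below, compile with -luring) that submits a single no-op request with liburing and reads its completion back from the shared ring; error checking is omitted for brevity:

#include <liburing.h>
#include <stdio.h>

// Minimal standalone liburing example (not part of the server).
int main(void) {
  struct io_uring ring;
  io_uring_queue_init(8, &ring, 0);  // set up the shared SQ and CQ rings

  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
  io_uring_prep_nop(sqe);            // a "do nothing" operation
  io_uring_submit(&ring);            // one io_uring_enter() submits the batch

  struct io_uring_cqe *cqe;
  io_uring_wait_cqe(&ring, &cqe);    // the result shows up in the shared CQ ring
  printf("nop completed with result %d\n", cqe->res);
  io_uring_cqe_seen(&ring, cqe);     // mark the CQE as consumed

  io_uring_queue_exit(&ring);
  return 0;
}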

Basic server

Keeping in mind Postel’s law, this will be the simplest possible server with regard to parsing user requests. To be precise, we will accept any and all requests, whether they are well-formed or not (even empty requests!). Every single request will be answered with a standardized response wishing the user a nice day, since that is a nice thing to want for our users. While malloc() and free() are not technically system calls, they can trigger brk() and/or mmap() system calls as part of their operation. Therefore, to make sure we don’t accidentally do any syscalls, we’ll preallocate everything and do no memory management whatsoever. Oh, the sacrifices we make for zero syscalls! Anyway, the basic loop for such a server is fairly straightforward:
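In rough pseudocode (the concrete liburing version is built up in the rest of this post):

  submit an accept() SQE for the listening socket
  loop forever:
    when an accept() completes:
      submit a write() SQE with the standard response, linked to a close() SQE
      submit a fresh accept() SQE for the next connection
    completions for the write() and close() operations need no further handling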

A “normal” server anno 2021 would typically either run the accept operation in a “main” thread and handle each connection in a separate thread (possibly from a thread pool for efficiency), or it would run an event loop-based system in a single thread and monitor all the sockets involved with epoll or something similar. The io_uring based version is also an event loop system, but instead of waiting for socket readiness we’ll submit all the operations asynchronously and read the result of each syscall once the kernel has completed it.

To make the program work, we need three basic operations that would normally be handled by syscalls: accepting connections, sending the standard response to clients and closing the connections. Accepting connections can be done in liburing with the io_uring_prep_accept() function. The io_uring system has a special 64-bit field called user_data that you can set on an SQE and that will be copied unaltered into the corresponding CQE. Its intended function is to provide a way for applications to track data associated with the SQE, and usually it holds a pointer to a struct of some sort. However, in this case we simply need a way to distinguish accept()-related CQEs from write()/close()-related CQEs. We can do this by passing in a magic number that will only be present on accept() CQEs. After an extremely rigorous selection process I settled on the number 123.

void add_accept_request(int server_socket, 
                        struct sockaddr_in *client_addr,
                        socklen_t *client_addr_len) {
  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
  io_uring_prep_accept(sqe,
                       server_socket,
                       (struct sockaddr *) client_addr,
                       client_addr_len,
                       0);
  // magic number in the userdata to differentiate between accept CQEs
  // and other types of CQE 
  io_uring_sqe_set_data(sqe, (void*) 123);
  io_uring_submit(&ring);
}

The write() and close() SQEs could be handled separately, but that is not necessary in this case: io_uring allows for linking SQEs to force them to be performed in a certain order. By setting IOSQE_IO_LINK in the SQE flags, we can be sure that the SQE completes before the next SQE in the ring is started. This matters here because submitting both without a link could result in the connection being closed before all the data is sent. The resulting function looks like this:

void add_write_and_close_requests(int fd) {
  struct io_uring_sqe *sqe;
  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_write(sqe, fd, standard_response, strlen(standard_response), 0);
  // make sure the write is complete before doing the close():
  sqe->flags |= IOSQE_IO_LINK;
  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_close(sqe, fd);
  io_uring_submit(&ring);
}

You can see the general pattern for creating an SQE in liburing: first you request a new (empty) SQE with io_uring_get_sqe(), then you call one of the many io_uring_prep_XYZ() functions on it to set the fields correctly for the type of operation you want, and optionally you set some flags on the SQE to alter how io_uring will handle it. Note that for the write() and close() SQEs no user_data is set, since we don’t do anything with the results of those actions.

Removing syscalls for submission events

One of the parts of io_uring that I find the most interesting is the possibility to have the kernel poll for new SQEs instead of the user having to inform the kernel via io_uring_enter() (wrapped by io_uring_submit() in liburing). If we initialise the ring with io_uring_queue_init_params() and set the IORING_SETUP_SQPOLL flag, the kernel spawns a thread that keeps polling the submission queue for up to params.sq_thread_idle milliseconds after the last SQE was submitted. Any SQEs you put into the SQ ring will automatically be picked up without any system calls required. After sq_thread_idle milliseconds have passed, the polling kernel thread will stop and you will need to call io_uring_enter() again to start it back up. When using liburing, io_uring_submit() will automagically keep track of whether the kernel thread is still alive and skip the syscall if it is not required.

Clearly, this is a very powerful tool for any program that wishes to submit as few syscalls as possible. One downside is that you need elevated privileges to create an io_uring with this flag set. While technically having CAP_SYS_NICE is sufficient, let’s just mandate root privileges for now. Also, the documentation states:

The kernel’s poller thread can take up a lot of CPU. 

For this particular case, I don’t think I’ll worry a lot about that. I have plenty of cores available and don’t plan on running the server for more than 5 minutes at a time anyway. So, we can add the following to the setup of the io_uring to enable submission queue polling by the kernel:

  struct io_uring_params params;

  if (geteuid()) {
    fprintf(stderr, "You need root privileges to run this program.\n");
    return 1;
  }
  memset(&params, 0, sizeof(params)); // params must be zeroed before setting flags
  params.flags |= IORING_SETUP_SQPOLL;
  params.sq_thread_idle = 120000; // 2 minutes in ms


  int ret = io_uring_queue_init_params(ENTRIES, &ring, &params);

As long as we submit at least one SQE every two minutes, the kernel thread will never die and so we’ll never incur a syscall when we call io_uring_submit(). Awesome!

Removing syscalls for completion events

Now that we can submit SQEs without any syscalls, surely we are already done? But no! It turns out that the normal way of waiting for a CQE with liburing (io_uring_wait_cqe()) will block if there are no CQEs ready to be consumed. So unless we have a super busy application that will always have a CQE ready to be consumed, eventually the application will issue a blocking syscall to wait until a CQE is ready. Luckily, there is also an io_uring_peek_cqe() function available that never waits: if no CQE is available it returns a negative error code (-EAGAIN), and it returns zero if a CQE was waiting. We can use this to write the main loop of the server as a busy loop that keeps checking whether any CQEs are available.

  struct io_uring_cqe *cqe;
  int peek_result;

  while (1) {
    peek_result = io_uring_peek_cqe(&ring, &cqe);
    // peek_result is 0 if a CQE was available and -errno otherwise
    if (!peek_result) {
      if (cqe->user_data == 123) {
        // magic number for an accept CQE; cqe->res holds the new client fd
        add_write_and_close_requests(cqe->res);
        add_accept_request(server_socket, &client_addr, &client_addr_len);
      }
      else {
        // a write() or close() CQE: no action required
      }
      io_uring_cqe_seen(&ring, cqe);
    }
  }

Looping like this and continuously checking the state of the CQ ring will take up even more CPU of course, but that ship has pretty much sailed after what we did with the kernel-side SQE polling.
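For completeness: the listening socket itself is set up with ordinary synchronous syscalls during startup, before the zero-syscall loop begins, so those don’t count against the per-request budget. A rough, hypothetical sketch of that setup (simplified, with error checking omitted; the real version is in the gist linked below) could look like this:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

// Hypothetical startup helper, not the exact code from the gist:
// these syscalls happen once, before the main loop starts.
int setup_listening_socket(int port) {
  int sock = socket(AF_INET, SOCK_STREAM, 0);
  int enable = 1;
  setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &enable, sizeof(enable));

  struct sockaddr_in addr;
  memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  bind(sock, (struct sockaddr *)&addr, sizeof(addr));
  listen(sock, 128);
  return sock;
}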

Conclusion

Running this application under strace shows no syscalls being generated after the initial startup, no matter how often you connect to it with nc or curl. So the answer is yes, you can write a server that does no syscalls for handling requests, if you are willing to forgo memory allocations, don’t count asynchronous syscalls as “real” syscalls, and are also willing to use 100% of one CPU for the process plus some unspecified amount of CPU for the kernel thread. Clearly this is not a viable approach for production-quality applications, but it was fun to see how low we could go. Still, it is interesting that it is possible at all and in general I think io_uring has a lot of potential. One of the things I would love to use it for is an nchan-like websocket pubsub fanout server, because that kind of application would lend itself well to batching all the send() calls. I also think that many of the current event-loop-based language runtimes will eventually evolve to incorporate asynchronous syscalls “natively” instead of the current epoll-based systems. Interesting times!

The complete program can be found as a gist here.