Using Modern Linux Sockets

Introduction

On the surface of it, the Linux socket API is pretty straightforward. However, once you start to care about performance, things get a bit more complicated and the documentation a bit more sparse.

Because I thought it'd be fun to explore this a bit more I made a simple UDP sender loop that progressively uses more advanced techniques. As we get more advanced this will cover things like Vectored IO and Generic Segmentation Offload.

This article only covers UDP sending for a number of reasons:

  • UDP is the basis of QUIC, which is a modern transport protocol you probably want to use instead of writing your own network code.
  • UDP is packet-based rather than stream-based. The APIs are slightly nicer to use because of this, but you can do similar things with TCP. Some fancy things like zero-copy can only be done with TCP though, so we can't explore them here.
  • Only the send half is implemented here. Since UDP is connectionless this works fine without having to write the receive half as well. While writing the code I only wrote a simple receiver to check correctness, because my time is limited.
  • This is already very long and it had to stop somewhere. The understanding gained here should be sufficient to write the receive half as well. Transferring that knowledge to TCP is a bit more difficult, but it should still provide a starting point.

The code shown here also lives at https://github.com/flub/socket-use in some form. It may not be exactly as shown here but should help you figure out how things work and try them out locally.

All the code here is in Rust, but since we have to interface with low-level kernel interfaces, unsafe Rust is involved. However, a general preference for safe Rust is taken where possible. A few things could be made slightly faster by using more unsafe.

The Rust code is also entirely sync. This is not an exploration of async but rather of how to use the kernel interfaces.

sendto(2)

When you first use a UDP socket you probably turn to sendto(2), as this is the standard interface to send UDP datagrams. Consequently it is very well supported in most places, and we can write this entirely using the Rust standard library:

use std::iter;
use std::net::{SocketAddr, UdpSocket};

use anyhow::Result;
use bytes::Bytes;

const MSG_SIZE: usize = 1200;
const MSG_COUNT: usize = 10_000_000;

fn main() -> Result<()> {
    let dst_sock = UdpSocket::bind("[::1]:0")?;
    sender(dst_sock.local_addr()?)?;
    Ok(())
}

fn payloads() -> Vec<Bytes> {
    let payload: Vec<u8> = iter::repeat(1u8).take(MSG_SIZE).collect();
    let payload = Bytes::from(payload);
    iter::repeat_with(|| payload.clone())
        .take(MSG_COUNT)
        .collect()
}

fn sender(dst: SocketAddr) -> Result<()> {
    let payloads = payloads();
    let sock = UdpSocket::bind("[::1]:0")?;

    for payload in payloads {
        let n = sock.send_to(&payload, dst)?;
        assert_eq!(n, MSG_SIZE);
    }
    println!("send done");

    Ok(())
}

First some common notes:

  • Since we want to send the datagrams somewhere harmless we bind a destination socket. This test code does not read from that socket, however; that just means there will be some packet loss once enough datagrams are sent, since UDP is not reliable. That's fine for our test as we only care about the sending.
  • I'm sending a fair number of datagrams with exactly 1200 bytes in them. 1200 bytes is the default maximum datagram size QUIC assumes before it has probed the path's Maximum Transmission Unit (MTU), so it is a realistically sized payload for our purposes. The number of datagrams is somewhat arbitrarily chosen: it's large enough to measure the improvements we'll make using some simple tools.
  • I pre-allocate all the payloads on the heap in a single vector before starting the work. They all share the same Bytes instance so this doesn't take up much memory. It means we have all the data allocated up front, and we keep this bit of work the same between the different approaches.
  • We need to bind the local socket before sending so that it uses the correct source address on the outgoing IP packets. We bind to IPv6 port 0 so that the kernel will allocate us a free port and we send over IPv6.

Other than that the code here is rather straightforward: for each payload to send we call sendto(2) via the wrapper provided in std. Each call gets the payload to send and the destination address. This is as simple as it gets, but also as slow as it gets.

One more thing to note is the return value: this is the number of bytes that were actually transmitted on the network. Since we only send 1200 bytes in each datagram, which is smaller than the path MTU should be for UDP, we always expect this to match the number of bytes we asked to send. In real code you might want to recover by sending the remainder if not everything was sent, depending on how you send the data.

Running this with time(1) we get the split between user and kernel time:

> time ./target/release/sendto
send done
________________________________________________________
Executed in  8.40 secs    fish           external
   usr time  0.71 secs  277.00 micros    0.71 secs
   sys time  7.66 secs  103.00 micros    7.66 secs

So we spend 0.71 seconds executing our code and the kernel spends 7.66 seconds handling the system calls for sendto(2). Let's look at those system calls:

> strace -c -e trace=%net ./target/release/sendto
send done
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   36.688678           3  10000000           sendto
  0.00    0.000003           1         2           socket
  0.00    0.000001           0         2           bind
  0.00    0.000000           0         1           getsockname
------ ----------- ----------- --------- --------- ----------------
100.00   36.688682           3  10000005           total

This executes a lot slower due to the strace overhead, but we can see from it that we do indeed make our 10 million sendto(2) system calls. Nothing surprising.

sendmsg

The sendmsg(2) system call is the next step up. It allows us to do fancier things, but before we get there let's do a plain conversion to using sendmsg(2) to get a feel of the API:

use std::io::IoSlice;
use std::iter;
use std::net::{Ipv6Addr, SocketAddr, UdpSocket};

use anyhow::Result;
use bytes::Bytes;
use socket2::{Domain, Protocol, SockAddr, Type};
// MSG_SIZE and MSG_COUNT are the shared constants from the accompanying crate.
use sockets_use::{MSG_COUNT, MSG_SIZE};

fn main() -> Result<()> {
    let dst_sock = UdpSocket::bind("[::1]:0")?;
    sender(dst_sock.local_addr()?)?;
    Ok(())
}

fn payloads() -> Vec<Bytes> {
    let payload: Vec<u8> = iter::repeat(1u8).take(MSG_SIZE).collect();
    let payload = Bytes::from(payload);
    iter::repeat_with(|| payload.clone())
        .take(MSG_COUNT)
        .collect()
}

fn sender(dst: SocketAddr) -> Result<()> {
    let payloads = payloads();

    let sock = socket2::Socket::new(Domain::IPV6, Type::DGRAM, Some(Protocol::UDP))?;
    let addr = SocketAddr::from((Ipv6Addr::LOCALHOST, 0));
    let addr = SockAddr::from(addr);
    sock.bind(&addr)?;
    let dst = SockAddr::from(dst);

    for payload in payloads {
        let buf = IoSlice::new(&payload);
        let n = sock.send_to_vectored([buf].as_slice(), &dst)?;
        assert_eq!(n, MSG_SIZE);
    }
    println!("send done");

    Ok(())
}

Firstly we had to leave the realm of Rust's std and use the socket2 crate, but we can still do everything without using unsafe.

The socket is created slightly differently. This socket2::Socket is a thin wrapper around the underlying file descriptor, and likewise socket2's SockAddr differs from std's SocketAddr. These new structs have exactly the same memory layout as their C counterparts, so they can be used directly when calling the kernel. We don't need this just yet, but we'll see it in use later. Just as before we still need to bind the socket locally before use so that we have a source port.

To actually send we use Socket::send_to_vectored, a function that is implemented using sendmsg(2) under the hood. We need to pass it what the kernel calls a struct iovec, or just iovec. Rust's standard library exposes the IoSlice type, which is guaranteed to be ABI compatible with iovec. std has read and write APIs which use this, but does not expose it for sockets, hence we still need socket2.

The IoSlice is passed in as a slice; this is effectively vectored IO, though we don't really use it here. We'll get back to that next.

Despite all this different setup, we're not actually asking the kernel to do anything different from before. We're only using a different API to do the same thing. Let's look how it performs:

> time ./target/release/sendmsg
send done
________________________________________________________
Executed in    8.95 secs    fish           external
   usr time    0.63 secs  566.00 micros    0.63 secs
   sys time    8.29 secs  209.00 micros    8.29 secs
> strace -c -e trace=%net ./target/release/sendmsg
send done
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   33.253122           3  10000000           sendmsg
  0.00    0.000027          13         2           socket
  0.00    0.000017           8         2           bind
  0.00    0.000007           7         1           getsockname
------ ----------- ----------- --------- --------- ----------------
100.00   33.253173           3  10000005           total

As expected there isn't really any difference here, the difference is just the noise of a not-serious-at-all benchmark setup -- my laptop without any tuning.

Vectored IO

Now that we've used sendmsg(2) and seen it take the payload as a slice of buffers, we can talk about vectored IO. The struct iovec is used by the kernel in quite a few places involving reading and writing; sockets are just one of them. The idea is that it is a vector of buffers which all need to be written, or sent, one after the other. The kernel promises to do this "atomically", that is, writing all those buffers has the same semantics as writing a single buffer with respect to what was written when errors occur and so on.
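
For reference, the definition of struct iovec in the libc crate is just a pointer and a length, equivalent to the C definition in readv(2). IoSlice is guaranteed to have the same layout, which is what makes the pointer casts later in this article sound:

// struct iovec as exposed by the libc crate: the start of a buffer and its length.
// std::io::IoSlice is documented to be ABI compatible with this type.
pub struct iovec {
    pub iov_base: *mut libc::c_void,
    pub iov_len: libc::size_t,
}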

So how do you use this? Well, in our example so far we had a single payload of 1200 bytes. However, in real life we probably have to assemble that. One example of this is QUIC, which allows packing multiple frames into a single datagram, but there are many similar use cases.

What's the benefit to assembling this buffer in userspace? Let's find out by first adapting our example to assemble the payload in userspace and then comparing it to using the iovec for real. We'll construct each payload from a "header", which we create on the fly, and two "frames". As a slight variation we don't pre-allocate all our frames here but use an iterator; as long as we do the same in the next version that we'll compare against, that's fine for comparison purposes:

fn sender(dst: SocketAddr) -> Result<()> {
    // 8 byte header + 2 frames of 596 bytes => 1200 byte payload
    let frame: Vec<u8> = iter::repeat(1u8).take((MSG_SIZE - 8) / 2).collect();
    let frame = Bytes::from(frame);
    debug_assert_eq!(frame.len(), 596);
    let mut frames = iter::repeat(frame).take(MSG_COUNT * 2);

    let sock = socket2::Socket::new(Domain::IPV6, Type::DGRAM, Some(Protocol::UDP))?;
    let addr = SocketAddr::from((Ipv6Addr::LOCALHOST, 0));
    let addr = SockAddr::from(addr);
    sock.bind(&addr)?;
    let dst = SockAddr::from(dst);

    let hdr = b"abcdabcd";

    while let Some(frame0) = frames.next() {
        let frame1 = frames.next().expect("odd number of frames");
        let mut buf = BytesMut::with_capacity(MSG_SIZE);
        buf.put(hdr.as_slice());
        buf.put(frame0);
        buf.put(frame1);
        debug_assert_eq!(buf.len(), MSG_SIZE);

        let buf = IoSlice::new(&buf);
        let n = sock.send_to_vectored([buf].as_slice(), &dst)?;
        assert_eq!(n, MSG_SIZE);
    }
    println!("send done");
    Ok(())
}

The changes from the last version are in how we construct the buffer: we use a BytesMut into which we copy the header and both frames. Once we have this new buffer we send it like before.

So how does this do?

> time ./target/release/sendmsg_frames
send done
________________________________________________________
Executed in    9.45 secs    fish           external
   usr time    1.18 secs  663.00 micros    1.18 secs
   sys time    8.25 secs  245.00 micros    8.25 secs

The system time stayed about the same, but the time in user-space has increased significantly. That makes sense: we're copying all our payloads around this time, which we didn't do before.

So how does this look if we use Vectored IO? Instead of copying our header and two frames into a single buffer we'll call sendmsg(2) with an iovec with 3 items: the header and two frames. Notice how the IoSlice is no more than a pointer to an existing buffer somewhere else:

// Only the while loop changed, the rest is identical.
while let Some(frame0) = frames.next() {
    let frame1 = frames.next().expect("odd number of frames");

    let hdr_buf = IoSlice::new(hdr);
    let frame0_buf = IoSlice::new(&frame0);
    let frame1_buf = IoSlice::new(&frame1);
    let n = sock.send_to_vectored([hdr_buf, frame0_buf, frame1_buf].as_slice(), &dst)?;
    assert_eq!(n, MSG_SIZE);
}

Only the while loop changed, so I didn't show all the code again. You can see the entire code in sendmsg_frames_iov.rs if you wish.

The crux here is that we do not copy anything in userspace, instead we create a slice of 3 pointers and the kernel copies all the data from there directly. Let's see how it performs:

> time ./target/release/sendmsg_frames_iov
send done
________________________________________________________
Executed in    9.35 secs    fish           external
   usr time    0.77 secs  330.00 micros    0.77 secs
   sys time    8.56 secs  122.00 micros    8.56 secs

And our user time has reduced again without affecting our system time as much. Vectored IO is pretty neat!

Manual sendmsg

So far we used the wrapper from socket2 to make the sendmsg(2) calls. It's convenient and works well. But to continue our journey we'll have to learn how to invoke it by hand using unsafe code:

// Setup of frames/payload and sockets is identical as before.
let mut msg: libc::msghdr = unsafe { mem::zeroed() };

while let Some(frame0) = frames.next() {
    let frame1 = frames.next().expect("odd number of frames");

    let hdr_buf = IoSlice::new(hdr);
    let frame0_buf = IoSlice::new(&frame0);
    let frame1_buf = IoSlice::new(&frame1);
    let bufs = [hdr_buf, frame0_buf, frame1_buf];

    // Casting these pointers to mut is fine as we only send.  The types are mut only
    // because they are also used in recvmsg(2) which we do not use.
    msg.msg_name = dst.as_ptr() as *mut _;
    msg.msg_namelen = dst.len();
    msg.msg_iov = bufs.as_ptr() as *mut _;
    msg.msg_iovlen = bufs.len();

    let n = unsafe { libc::sendmsg(sock.as_raw_fd(), &msg, 0) };
    if n == -1 {
        return Err(io::Error::last_os_error().into());
    }
    assert_eq!(n as usize, MSG_SIZE);
}

We start by declaring a struct msghdr outside of the loop, initialised with zeroes. If you look at the manpage for sendmsg(2) it shows the struct defined as:

struct msghdr {
    void         *msg_name;       /* Optional address */
    socklen_t     msg_namelen;    /* Size of address */
    struct iovec *msg_iov;        /* Scatter/gather array */
    size_t        msg_iovlen;     /* # elements in msg_iov */
    void         *msg_control;    /* Ancillary data, see below */
    size_t        msg_controllen; /* Ancillary data buffer len */
    int           msg_flags;      /* Flags (unused) */
};

Thus having it filled with all zeroes is legal. It will be full of NULL-pointers and zero-lengths and no flags.

As we loop through the datagrams or messages we're sending we only need to fill in two fields per call:

  • The destination IP address: msg_name can point directly at the socket2::SockAddr as this is ABI compatible. Its len method also gives us the number of bytes it takes up; since this could be an IPv4 or IPv6 address this is not always the same.
  • The iovec is also straightforward: a pointer and the number of buffers in the slice.

You might even argue that this could have been done outside the loop as it never changes; however, in real life you probably wouldn't be able to optimise this, so we leave it in the loop here.

Note the casting we need to do for the pointers: we need the pointers to match those from the C structs and since these C structs are used for both the send and recv paths we need to make them mutable pointers even though the kernel only ever reads from them in our case.

Finally we make the call to sendmsg(2), for which we use the wrapper provided by the libc crate rather than invoking the syscall directly, which would also be possible. The arguments are the socket's file descriptor, a pointer to the struct msghdr and finally a flags argument which we don't use. Since this is a raw mapping to an FFI function we need to handle the error in the return value manually.
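
If you did want to skip the libc wrapper, invoking the system call directly could look roughly like this. This is only a sketch for illustration; the examples in this article keep using the sendmsg() wrapper:

// Hypothetical direct syscall invocation via libc::syscall(); on Linux this is
// functionally equivalent to calling libc::sendmsg().
let n = unsafe {
    libc::syscall(
        libc::SYS_sendmsg,
        sock.as_raw_fd() as libc::c_long,
        &msg as *const libc::msghdr,
        0 as libc::c_long,
    )
};
if n == -1 {
    return Err(io::Error::last_os_error().into());
}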

Running this gives us exactly the same as the previous version, but we now had our first encounter with directly creating the structure that the kernel uses: how to create it and how to call into the kernel.

sendmmsg

Now that we know about vectored IO, let us go back and stop using it again. Instead let's turn to optimising where the vast majority of the time is spent so far: in the kernel.

The next step to improving this is to reduce the number of system calls made, since each system call has a fair amount of overhead. This is what sendmmsg(2) allows us to do. Its basic idea is to pass in an array of messages rather than a single one, so an array of struct msghdr. The kernel will then take care of sending all those messages at once.

const BATCH_SIZE: usize = 64;

// main() and payloads() remain identical.

fn sender(dst: SocketAddr) -> Result<()> {
    let payloads = payloads();

    let sock = Socket::new(Domain::IPV6, Type::DGRAM, Some(Protocol::UDP))?;
    let addr = SocketAddr::from((Ipv6Addr::LOCALHOST, 0));
    let addr = SockAddr::from(addr);
    sock.bind(&addr)?;
    let dst = SockAddr::from(dst);

    let mut mmsgs: [libc::mmsghdr; BATCH_SIZE] = unsafe { mem::zeroed() };
    // The iovecs must stay alive until the sendmmsg(2) call, so they live outside
    // the inner loop: one single-entry iovec per message in the batch.  The
    // capacity is pre-allocated so the Vec never reallocates while we hold raw
    // pointers into it.
    let mut bufs: Vec<[IoSlice; 1]> = Vec::with_capacity(BATCH_SIZE);

    for batch in payloads.chunks(BATCH_SIZE) {
        bufs.clear();
        for (i, payload) in batch.iter().enumerate() {
            bufs.push([IoSlice::new(payload)]);

            let mmsg = &mut mmsgs[i].msg_hdr;
            mmsg.msg_name = dst.as_ptr() as *mut _;
            mmsg.msg_namelen = dst.len();
            mmsg.msg_iov = bufs[i].as_ptr() as *mut _;
            mmsg.msg_iovlen = bufs[i].len();
        }
        let ret = unsafe {
            libc::sendmmsg(
                sock.as_raw_fd(),
                mmsgs.as_mut_ptr(),
                batch.len().try_into()?,
                0,
            )
        };
        if ret == -1 {
            return Err(io::Error::last_os_error().into());
        }
        assert_eq!(ret, batch.len().try_into()?); // Number of messages sent
        for mmsg in &mmsgs[..batch.len()] {
            assert_eq!(mmsg.msg_len as usize, MSG_SIZE); // Number of bytes sent.
        }
    }
    println!("send done");
    Ok(())
}

This code looks very similar to the last version; the main difference is that instead of a single struct msghdr we create an array of struct mmsghdr items, and each iteration of the loop now fills this array and sends an entire batch of messages with a single sendmmsg(2) syscall. Note that the per-message iovecs now live outside the inner loop, so the pointers stored in each struct msghdr remain valid until the sendmmsg(2) call is made.

So what is struct mmsghdr? Let's look at the manual page for sendmmsg(2):

struct mmsghdr {
    struct msghdr msg_hdr;  /* Message header */
    unsigned int  msg_len;  /* Number of bytes transmitted */
};

This is merely a wrapper around struct msghdr. The msg_len field is not populated by the caller but set by the kernel. It is effectively the return value of each sendmsg(2) call the kernel does internally for this batch of messages, i.e. the number of bytes transmitted on the network. So we can leave msg_len as 0 and read it after the call.

Otherwise this code is similar; for each batch we fill in the struct msghdr data as we did before. Finally when calling sendmmsg(2) we need to pass it the batch size we created as well as the flags argument we don't use in our examples.

So how does this fare performance wise?

> time ./target/release/sendmmsg
send done
________________________________________________________
Executed in    7.97 secs    fish           external
   usr time    0.20 secs    0.00 micros    0.20 secs
   sys time    7.75 secs  729.00 micros    7.75 secs
> strace -c -e trace=%net ./target/release/sendmmsg
send done
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   16.602542         106    156250           sendmmsg
  0.00    0.000030          15         2           socket
  0.00    0.000019           9         2           bind
  0.00    0.000006           6         1           getsockname
------ ----------- ----------- --------- --------- ----------------
100.00   16.602597         106    156255           total

We should compare this to the sendmsg example from earlier. The time improvement here is modest: most of the improvement comes in user-space, and the time spent in kernel-space has only marginally improved. This is to be expected since the main thing we optimised is the syscall overhead. The kernel still has to do the same amount of work to push all those messages through its network stack.

Generic Segmentation Offload

The next step is to use Generic Segmentation Offload or GSO. The idea of GSO is to pass one large buffer with all the payloads concatenated together to the kernel. For this to work all the payloads passed this way must have the same destination.

The reason this ends up being more efficient is that the kernel keeps passing this payload around its internal networking stack as one single oversized packet until the very last moment, where it must split it up into individual packets. If the hardware supports it, splitting into individual packets can even be done in the network hardware. By doing this the vast majority of the network stack only processes one packet instead of many.

So how does this work? First we need to know if we can use GSO:

fn check_gso() -> Result<usize> {
    let sock = UdpSocket::bind("[::1]:0")?;

    // As defined in `udp(7)` and in linux/udp.h
    // #define UDP_MAX_SEGMENTS        (1 << 6UL)
    set_socket_option(&sock, libc::SOL_UDP, libc::UDP_SEGMENT, 1500)
        .map(|_| 64)
        .map_err(|e| e.into())
}

This tries to set the UDP_SEGMENT socket option; if this fails we cannot use GSO. If it succeeds, however, it returns the maximum number of segments we are allowed to concatenate in one oversized packet. But wait, we just return a hardcoded 64? Yes. This is documented as a constant, and since kernel interfaces can essentially never change, it is safe to rely on. While in theory this constant comes from a header file you can include in C code, we unfortunately can't really do any better than hardcoding it in Rust.

What about the other magic number in there, 1500? This is the size in bytes of a single segment. A segment is the term the kernel uses for the payload of one packet when using GSO. When you send a message using GSO all the segments have to be this same size; only the final segment can be shorter.

Here we throw away this socket after setting the socket option. We could, however, keep it around and use it. From then on the kernel would assume any messages we send on the socket use GSO with this given segment size. In practice that's not very convenient: your segment sizes probably change more often, and setting a socket option is yet another system call, which we want to avoid. So instead we will tell the kernel the segment size together with each message we send.
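
By the way, set_socket_option is a small helper from the accompanying crate which I haven't shown. Assuming it is a thin wrapper around setsockopt(2), it would look roughly like this (the actual helper in the repository may differ slightly):

use std::os::unix::io::AsRawFd;

// Sketch of a setsockopt(2) wrapper: the option value is passed as a C int,
// which is what UDP_SEGMENT expects.
fn set_socket_option(
    sock: &impl AsRawFd,
    level: libc::c_int,
    name: libc::c_int,
    value: libc::c_int,
) -> std::io::Result<()> {
    let ret = unsafe {
        libc::setsockopt(
            sock.as_raw_fd(),
            level,
            name,
            &value as *const libc::c_int as *const libc::c_void,
            std::mem::size_of_val(&value) as libc::socklen_t,
        )
    };
    if ret == -1 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}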

With this out of the way, let's look at the full code for the sender. We go back to using a single sendmsg(2) call for now to keep it as simple as possible. It is still a lot of code though, so we'll go over it slowly afterwards:

// main() and payloads() identical to before.
fn sender(dst: SocketAddr) -> Result<()> {
    let payloads = payloads();

    let sock = Socket::new(Domain::IPV6, Type::DGRAM, Some(Protocol::UDP))?;
    let addr = SocketAddr::from((Ipv6Addr::LOCALHOST, 0));
    let addr = SockAddr::from(addr);
    sock.bind(&addr)?;
    let dst = SockAddr::from(dst);

    // Figure out our batch size, we may not exceed max_gso_segments for a gso batch, but a
    // single msghdr's payload, i.e. the total size of its iovec, may not exceed u16::MAX.
    let max_gso_segments = check_gso()?;
    let max_payloads = (u16::MAX / MSG_SIZE as u16) as usize;
    let gso_batch_size = max_gso_segments.min(max_payloads);

    let mut msg: libc::msghdr = unsafe { mem::zeroed() };
    let mut iovec: Vec<IoSlice> = Vec::with_capacity(gso_batch_size);

    for batch in payloads.chunks(gso_batch_size) {
        iovec.clear();
        iovec.extend(batch.iter().map(|payload| IoSlice::new(payload)));
        msg.msg_name = dst.as_ptr() as *mut _;
        msg.msg_namelen = dst.len();
        msg.msg_iov = iovec.as_ptr() as *mut _;
        msg.msg_iovlen = iovec.len();

        // The value of the auxiliary data to put in the control message.
        let segment_size: u16 = MSG_SIZE.try_into()?;
        // The number of bytes needed for this control message.
        let cmsg_size = unsafe { libc::CMSG_SPACE(mem::size_of_val(&segment_size) as _) };
        let layout = Layout::from_size_align(cmsg_size as usize, mem::align_of::<libc::cmsghdr>())?;
        let buf = unsafe { std::alloc::alloc(layout) };
        if buf.is_null() {
            bail!("alloc failed");
        }
        msg.msg_control = buf as *mut libc::c_void;
        msg.msg_controllen = layout.size();
        let cmsg: &mut libc::cmsghdr = unsafe {
            // We *must* initialise this memory before creating the reference to avoid UB.
            let cmsg = libc::CMSG_FIRSTHDR(&msg);
            if cmsg.is_null() {
                bail!("No space for cmsg");
            }
            let cmsg_zeroed: libc::cmsghdr = mem::zeroed();
            ptr::copy_nonoverlapping(&cmsg_zeroed, cmsg, 1);
            cmsg.as_mut().ok_or(anyhow!("No space for cmsg"))?
        };
        cmsg.cmsg_level = libc::SOL_UDP;
        cmsg.cmsg_type = libc::UDP_SEGMENT;
        cmsg.cmsg_len =
            unsafe { libc::CMSG_LEN(mem::size_of_val(&segment_size) as _) } as libc::size_t;
        unsafe { ptr::write(libc::CMSG_DATA(cmsg) as *mut u16, segment_size) };

        let ret = unsafe { libc::sendmsg(sock.as_raw_fd(), &msg, 0) };
        unsafe { std::alloc::dealloc(buf, layout) };
        if ret == -1 {
            return Err(io::Error::last_os_error().into());
        }
        assert_eq!(ret as usize, MSG_SIZE * batch.len());
    }
    println!("send done");
    Ok(())
}

The beginning of this is familiar by now: we create our payloads up-front and bind our socket. The first interesting section appears next though:

// Figure out our batch size, we may not exceed max_gso_segments for a gso batch, but a
// single msghdr's payload, i.e. the total size of its iovec, may not exceed u16::MAX.
let max_gso_segments = check_gso()?;
let max_payloads = (u16::MAX / MSG_SIZE as u16) as usize;
let gso_batch_size = max_gso_segments.min(max_payloads);

The comment already says a lot. Since our oversized packet has to travel through the normal kernel networking stack, the payload cannot be larger than the maximum payload of a packet for our type of message. In this case we're sending UDP messages, for which the maximum payload is u16::MAX bytes. So we need to adjust our batch size so that we neither exceed u16::MAX bytes nor the maximum of 64 segments in one message.

Next up we initialise some variables outside of the loop:

let mut msg: libc::msghdr = unsafe { mem::zeroed() };
let mut iovec: Vec<IoSlice> = Vec::with_capacity(gso_batch_size);

We will be sending a single oversized message to the kernel at a time using sendmsg(2), so we only need a single struct msghdr which we re-initialise on each iteration. Likewise we reuse a single iovec. We send each segment as a separate entry in this iovec so we can use vectored IO again instead of having to concatenate our payloads manually. To keep it simple we use a Vec instead of a stack-allocated array; by pre-allocating the desired capacity there is no measurable penalty for using a Vec and it is a lot easier to use.

Time to look at the loop:

for batch in payloads.chunks(gso_batch_size) {
    iovec.clear();
    iovec.extend(batch.iter().map(|payload| IoSlice::new(payload)));
    msg.msg_name = dst.as_ptr() as *mut _;
    msg.msg_namelen = dst.len();
    msg.msg_iov = iovec.as_ptr() as *mut _;
    msg.msg_iovlen = iovec.len();

Nothing too surprising here: we make sure our iovec has no entries and then build it up using the payloads in the batch. We also initialise the msghdr as before, setting the destination address and payload buffers. Remember that the kernel will treat all entries in the iovec as a single large payload; it will then split that payload into segments of 1200 bytes to send as individual UDP datagrams. This splitting into segments happens to fall exactly on the boundaries of our iovec entries, but as far as the kernel is concerned that's pure coincidence.

Ancillary Data via Control Messages

So we need to tell the kernel that we are using GSO by telling it the segment size. This is done using the two fields of struct msghdr which we've left 0-initialised so far: msg_control and msg_controllen. Like the iovec fields of struct msghdr they are a buffer and its length, but this buffer holds so-called control messages. With each sendmsg(2) call we make, we can pass several of these control messages to provide what the kernel calls ancillary data for the call. These control messages can be used to instruct the kernel to do special things for the particular datagram being sent.

Examples of control messages are things like Explicit Congestion Notification (ECN), a mechanism to get more feedback about congestion into userspace. Control messages are also used with recvmsg(2) calls, where they can give you information about which network interface a datagram was received on or which IP address it was addressed to (you might be bound to the wildcard address and so could be receiving messages on different interfaces). But in our case we want to use them to tell the kernel the segment size of our GSO payload.

As mentioned, the segment size says how many bytes of the payload are considered to be part of each individual message; only the last message in the GSO batch can be smaller. In our case each payload is 1200 bytes so we'll set the segment size to 1200.
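
For example, a single GSO send of 2,500 bytes with the segment size set to 1200 would leave the machine as three UDP datagrams of 1200, 1200 and 100 bytes.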

To construct the control messages we need to look at the cmsg(3) manpage. This gives us this definition:

struct cmsghdr {
    size_t cmsg_len;    /* Data byte count, including header
                           (type is socklen_t in POSIX) */
    int    cmsg_level;  /* Originating protocol */
    int    cmsg_type;   /* Protocol-specific type */
  /* followed by
    unsigned char cmsg_data[]; */
};

That's an unusual definition because there are so many different types of control messages. Some are used to pass information from userspace to kernelspace and some pass information in the other direction. And not all of them have the same cmsg_data size: some pass simple integers while others pass structures. As a consequence, constructing control messages is messy: the manpage directs you to a bunch of weird macros to manipulate them.

While we are only passing one single control message in this example, we really are constructing an array of control messages. It needs to be constructed using the libc macros, which means we must allocate a buffer large enough for all the control messages. Of course, when you allocate a buffer for a struct it also needs to have the correct alignment. Rust doesn't make this easy; there are only a few ways to get something aligned. One is to decorate a struct with the correct #[repr(align(n))], but you need to know the n. Quinn takes this approach and "guesses" at what a compatible alignment would be, asserting it at runtime.
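
As a sketch of what that alternative can look like (this is not the approach used in the rest of this article, and both the name AlignedCmsgBuf and the alignment of 8 are made up for illustration):

// A stack-allocated control message buffer with a guessed alignment, in the
// style of Quinn.  64 bytes is plenty for the single UDP_SEGMENT cmsg we need.
#[repr(align(8))]
struct AlignedCmsgBuf([u8; 64]);

fn aligned_cmsg_buf() -> AlignedCmsgBuf {
    // Verify at runtime that the guessed alignment is good enough for cmsghdr.
    assert!(std::mem::align_of::<AlignedCmsgBuf>() >= std::mem::align_of::<libc::cmsghdr>());
    AlignedCmsgBuf([0u8; 64])
}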

You could probably use a union in the same way as you would in C for this, though the downside is that you need to define the size of the control messages array statically, and in Rust libc::CMSG_SPACE() is an FFI call rather than a preprocessor macro. So instead we manually invoke the allocator to give us a buffer:

// The value of the auxiliary data to put in the control message.
let segment_size: u16 = MSG_SIZE.try_into()?;
// The number of bytes needed for this control message.
let cmsg_size = unsafe { libc::CMSG_SPACE(mem::size_of_val(&segment_size) as _) };
let layout = Layout::from_size_align(cmsg_size as usize, mem::align_of::<libc::cmsghdr>())?;
let buf = unsafe { std::alloc::alloc(layout) };
if buf.is_null() {
    bail!("alloc failed");
}

Here you can see we need to call libc::CMSG_SPACE() to figure out the size of the allocation we need. While the manpage told us this is a C preprocessor macro, the libc crate helpfully exposes the cmsg macros as functions. The only effect we notice of this is that the argument types are a bit funky and need more casting. In reality all the arguments are wide enough for any cmsg that needs to be built.

Once we know how much space we need to allocate we can use it in a Layout. If you need multiple control messages you have to sum the CMSG_SPACE() values of all the control messages used.
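
For example, if we also wanted to pass a second, hypothetical control message carrying a C int next to our u16 segment size, the allocation would be the sum of both CMSG_SPACE() values:

// Space for the UDP_SEGMENT cmsg (a u16) plus a hypothetical second cmsg
// carrying a c_int; each control message gets its own CMSG_SPACE()-padded slot.
let cmsg_size = unsafe {
    libc::CMSG_SPACE(mem::size_of::<u16>() as _)
        + libc::CMSG_SPACE(mem::size_of::<libc::c_int>() as _)
};
let layout = Layout::from_size_align(cmsg_size as usize, mem::align_of::<libc::cmsghdr>())?;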

Now that we have a buffer we need to initialise it. It turns out we again are not allowed to do this directly: we always have to ask libc for the pointer to the struct cmsghdr to fill in, using libc::CMSG_FIRSTHDR() for the first control message and libc::CMSG_NXTHDR() for any subsequent ones. Before we are allowed to do this we need to set the msghdr.msg_control and msghdr.msg_controllen fields however, as the macros use this information. msg_controllen is the size in bytes of the control message buffer, not the number of control messages.

msg.msg_control = buf as *mut libc::c_void;
msg.msg_controllen = layout.size();
let cmsg: &mut libc::cmsghdr = unsafe {
    // We *must* initialise this memory before creating the reference to avoid UB.
    let cmsg = libc::CMSG_FIRSTHDR(&msg);
    if cmsg.is_null() {
        bail!("No space for cmsg");
    }
    let cmsg_zeroed: libc::cmsghdr = mem::zeroed();
    ptr::copy_nonoverlapping(&cmsg_zeroed, cmsg, 1);
    cmsg.as_mut().ok_or(anyhow!("No space for cmsg"))?
};
cmsg.cmsg_level = libc::SOL_UDP;
cmsg.cmsg_type = libc::UDP_SEGMENT;
cmsg.cmsg_len =
    unsafe { libc::CMSG_LEN(mem::size_of_val(&segment_size) as _) } as libc::size_t;
unsafe { ptr::write(libc::CMSG_DATA(cmsg) as *mut u16, segment_size) };

As you can see we now get to navigate Rust's handling of uninitialised data and hope we carefully tip-toe around Undefined Behaviour (UB). Because we allocated the buffer in which we put the cmsghdr using std::alloc::alloc(), it contains uninitialised memory; we must never read from it, nor even create any kind of reference that would allow us to read from it. We must fill it using raw pointers only. Technically there is a foreign function interface (FFI) call involved which returns us the pointer to the struct cmsghdr, and the compiler cannot know whether that call initialised the memory or not, so it probably can't give us UB if we treat the pointer returned by CMSG_FIRSTHDR as initialised. But you can't be too careful, so let's not count on that.

There are several options to initialise the cmsghdr. We chose to initialise it with mem::zeroed(). Looking at the struct definition that is valid: it is all integers and char data. This allows us to create a valid reference to the struct in safe Rust, where we can initialise the members in a more normal fashion. The alternative is to use std::ptr::addr_of_mut!((*cmsg).cmsg_level) etc. to get individual pointers to the struct fields and then write directly through those pointers. It turns out that both approaches are the same for the compiler: it figures out you don't really want those zeroes anyway and never writes them, so the two versions compile to exactly the same code.
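
A minimal sketch of that pointer-based alternative, assuming cmsg is still the raw *mut libc::cmsghdr returned by CMSG_FIRSTHDR rather than a reference:

// Write each field through a raw pointer, never creating a reference to the
// possibly-uninitialised struct.
unsafe {
    ptr::addr_of_mut!((*cmsg).cmsg_level).write(libc::SOL_UDP);
    ptr::addr_of_mut!((*cmsg).cmsg_type).write(libc::UDP_SEGMENT);
    ptr::addr_of_mut!((*cmsg).cmsg_len)
        .write(libc::CMSG_LEN(mem::size_of_val(&segment_size) as _) as libc::size_t);
    ptr::write(libc::CMSG_DATA(cmsg) as *mut u16, segment_size);
}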

There are two more oddities in this code. One is that in the first unsafe block we check for a null pointer and bail!() if we got one, only to later use .ok_or(...)? on the Option, which would be None only if the pointer were NULL. As we already know this isn't a null pointer we could have used .unwrap() there, though that would pull the panic handling code into the function, which can be undesirable on the hot path. We cannot forgo the earlier bail!() however, as then we could be creating a reference from a NULL pointer, which is not sound.

Secondly, while filling in cmsg_len we need the libc::CMSG_LEN() call again. Even more important is how we write the actual data into the control message: again we are obliged to access it only via the pointer returned by libc::CMSG_DATA(), and as that's a raw pointer we need to write through it using ptr::write() or similar.

Finally we get to make our sendmsg(2) call. Nothing special here:

let ret = unsafe { libc::sendmsg(sock.as_raw_fd(), &msg, 0) };
unsafe { std::alloc::dealloc(buf, layout) };
if ret == -1 {
    return Err(io::Error::last_os_error().into());
}
assert_eq!(ret as usize, MSG_SIZE * batch.len());

As before we need to handle the error code manually, however also note we need to remember to free the memory we manually allocated.

Performance

> time ./target/release/sendmsg_gso
send done
________________________________________________________
Executed in    2.25 secs    fish           external
   usr time    0.20 secs    0.00 micros    0.20 secs
   sys time    2.05 secs  821.00 micros    2.05 secs

This is faster than a plain sendmsg(2) in both user and system time. And especially system time! The system time improvement is really what shines here as that is what the GSO optimisation is all about. It turns out the kernel has made this optimisation work really well!

Note that the exact gain will vary. As mentioned at the start, GSO gets better with hardware support. This test uses the loopback interface only, not involving hardware at all, so these numbers are probably near the upper bound of what the software stack alone can gain from GSO. With real hardware, on the other hand, the NIC can take over work like splitting the segments and calculating the IP checksums, so that work does not need to happen in software at all. Which way it swings will depend on the hardware, but in general you can expect great improvements from GSO on modern hardware.

> strace -c -e trace=%net ./target/release/sendmsg_gso
send done
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    2.543970          13    185186           sendmsg
  0.00    0.000021           7         3           socket
  0.00    0.000015           5         3           bind
  0.00    0.000005           5         1           getsockname
  0.00    0.000002           2         1           setsockopt
------ ----------- ----------- --------- --------- ----------------
100.00    2.544013          13    185194           total

Here you can see probably the main reason why the userspace time also improved: fewer system calls. If you look carefully you'll see this still makes more system calls than the sendmmsg example above. This is because we had to keep our total payload size below u16::MAX per sendmsg(2) call, resulting in a slightly smaller batch size than we used for the sendmmsg example:
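
65,535 / 1,200 = 54 segments per sendmsg(2) call
10,000,000 / 54 = 185,186 calls (rounded up), versus 10,000,000 / 64 = 156,250 calls for sendmmsg(2)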

sendmmsg + GSO

So finally we can put sendmmsg(2) and GSO together. This allows us to use a single system call to send messages to different destinations, while the message for each destination can itself use GSO. Let's find out how that looks:

// main(), payloads(), check_gso() and BATCH_SIZE as before.
fn sender(dst: SocketAddr) -> Result<()> {
    let payloads = payloads();

    let sock = Socket::new(Domain::IPV6, Type::DGRAM, Some(Protocol::UDP))?;
    let addr = SocketAddr::from((Ipv6Addr::LOCALHOST, 0));
    let addr = SockAddr::from(addr);
    sock.bind(&addr)?;
    let dst = SockAddr::from(dst);

    // Figure out our batch size, we may not exceed max_gso_segments for a gso batch, but a
    // single msghdr's payload, i.e. the total size of its iovec, may not exceed u16::MAX.
    let max_gso_segments = check_gso()?;
    let max_payloads = (u16::MAX / MSG_SIZE as u16) as usize;
    let gso_batch_size = max_gso_segments.min(max_payloads);

    let mut mmsgs: [libc::mmsghdr; BATCH_SIZE] = unsafe { mem::zeroed() };
    let mut iovecs: Vec<Vec<IoSlice>> = iter::repeat_with(|| Vec::with_capacity(gso_batch_size))
        .take(BATCH_SIZE)
        .collect();
    let mut bufs: Vec<(*mut u8, Layout)> = Vec::with_capacity(BATCH_SIZE);

    for batch in payloads.chunks(gso_batch_size * BATCH_SIZE) {
        let mut mmsg_batch_size = 0;
        for (i, gso_batch) in batch.chunks(gso_batch_size).enumerate() {
            mmsg_batch_size += 1;
            let msg = &mut mmsgs[i].msg_hdr;
            let iovec = &mut iovecs[i];
            iovec.clear();
            iovec.extend(gso_batch.iter().map(|payload| IoSlice::new(payload)));
            msg.msg_name = dst.as_ptr() as *mut _;
            msg.msg_namelen = dst.len();
            msg.msg_iov = iovec.as_ptr() as *mut _;
            msg.msg_iovlen = iovec.len();

            let segment_size: u16 = MSG_SIZE.try_into()?;
            let cmsg_size = unsafe { libc::CMSG_SPACE(mem::size_of_val(&segment_size) as _) };
            let layout =
                Layout::from_size_align(cmsg_size as usize, mem::align_of::<libc::cmsghdr>())?;
            let buf = unsafe { std::alloc::alloc(layout) };
            if buf.is_null() {
                bail!("alloc failed");
            }
            bufs.push((buf, layout));
            msg.msg_control = buf as *mut libc::c_void;
            msg.msg_controllen = layout.size();
            let cmsg: &mut libc::cmsghdr = unsafe {
                // We *must* initialise this memory before creating the reference to avoid UB.
                let cmsg = libc::CMSG_FIRSTHDR(&*msg);
                if cmsg.is_null() {
                    bail!("No space for cmsg");
                }
                let cmsg_zeroed: libc::cmsghdr = mem::zeroed();
                ptr::copy_nonoverlapping(&cmsg_zeroed, cmsg, 1);
                cmsg.as_mut().ok_or(anyhow!("No space for cmsg"))?
            };
            cmsg.cmsg_level = libc::SOL_UDP;
            cmsg.cmsg_type = libc::UDP_SEGMENT;
            cmsg.cmsg_len =
                unsafe { libc::CMSG_LEN(mem::size_of_val(&segment_size) as _) } as libc::size_t;
            unsafe { ptr::write(libc::CMSG_DATA(cmsg) as *mut u16, segment_size) };
        }
        let ret =
            unsafe { libc::sendmmsg(sock.as_raw_fd(), mmsgs.as_mut_ptr(), mmsg_batch_size, 0) };
        for (buf, layout) in bufs.drain(..) {
            unsafe { std::alloc::dealloc(buf, layout) };
        }
        if ret == -1 {
            return Err(io::Error::last_os_error().into());
        }
        let ret: u32 = ret.try_into().expect("see error return just above");
        assert_eq!(ret, mmsg_batch_size); // Number of messages sent

        // for mmsg in mmsgs {
        //     assert_eq!(mmsg.msg_len as usize, MSG_SIZE * gso_batch.len()); // Number of bytes sent.
        // }
    }
    println!("send done");
    Ok(())
}

The beginning here is the same as before: we set up the payloads and socket. Then check GSO and find out the batch size to use. Then it gets more complicated:

let mut mmsgs: [libc::mmsghdr; BATCH_SIZE] = unsafe { mem::zeroed() };
let mut iovecs: Vec<Vec<IoSlice>> = iter::repeat_with(|| Vec::with_capacity(gso_batch_size))
    .take(BATCH_SIZE)
    .collect();
let mut bufs: Vec<(*mut u8, Layout)> = Vec::with_capacity(BATCH_SIZE);

Firstly we need an array of struct mmsghdr for the sendmmsg(2) call; we create a zero-filled array for this again. This time we also need a slice of struct iovec items for each struct mmsghdr, since each one uses GSO. We create this as two nested vectors; the inner Vec is what each msghdr.msg_iov will point to. Finally, since each struct mmsghdr also needs an allocated buffer for its control message, we pre-allocate somewhere to store those allocations so we can free them once we're done with them.

for batch in payloads.chunks(gso_batch_size * BATCH_SIZE) {
    let mut mmsg_batch_size = 0;
    for (i, gso_batch) in batch.chunks(gso_batch_size).enumerate() {
        mmsg_batch_size += 1;
        let msg = &mut mmsgs[i].msg_hdr;
        let iovec = &mut iovecs[i];
        iovec.clear();
        iovec.extend(gso_batch.iter().map(|payload| IoSlice::new(payload)));
        msg.msg_name = dst.as_ptr() as *mut _;
        msg.msg_namelen = dst.len();
        msg.msg_iov = iovec.as_ptr() as *mut _;
        msg.msg_iovlen = iovec.len();

        // next build the control message

Next we start iterating over our payloads. We take them in huge chunks of gso_batch_size * BATCH_SIZE this time: the number of GSO segments per struct msghdr times the number of struct mmsghdr items we pass to each sendmmsg(2) call.

For each iteration we need to fill in one of the struct mmsghdr items. We only need to fill in the mmsghdr.msg_hdr field of each, so we get a reference to it with let msg = &mut mmsgs[i].msg_hdr;.

Once we have this single struct msghdr things should look fairly familiar again: we fill in the iovec and also set the destination address. We then build the control message again in the same fashion as before for GSO. The only difference is that we store the buffer we allocated for it in the bufs vector, so we can free them all after the sendmmsg(2) call.

Finally, the libc::sendmmsg call is the same as in our earlier sendmmsg example. This whole example should be fairly straightforward to understand after all the previous ones, though it is still a bit fiddly to set up.

> time ./target/release/sendmmsg_gso
send done
________________________________________________________
Executed in    2.18 secs    fish           external
   usr time    0.18 secs  507.00 micros    0.18 secs
   sys time    1.99 secs  195.00 micros    1.99 secs

The system time doesn't change much here; that's expected, the kernel doesn't do any less work than before. What is more surprising is that the user time changed equally little: did we not make far fewer system calls?

> strace -c -e trace=%net ./target/release/sendmmsg_gso
send done
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    2.014222         695      2894           sendmmsg
  0.00    0.000008           2         3           socket
  0.00    0.000005           1         3           bind
  0.00    0.000000           0         1           getsockname
  0.00    0.000000           0         1           setsockopt
------ ----------- ----------- --------- --------- ----------------
100.00    2.014235         694      2902           total

Indeed, we did reduce the number of system calls rather dramatically again. I mostly blame the crude benchmarking here for not seeing much difference. If we increased the number of payloads to send we'd start seeing the difference more clearly again. It does show, however, that reducing the number of system calls only starts to matter once you make a lot of them.

Lastly, remember that our example sent everything to the same destination. In real life you'd use this to send to different destinations, because for a single destination you'd just use GSO on its own.

Conclusion

This was a long tour of how to send UDP messages on Linux. A lot of effort is involved in calling the low-level functions exposed by the libc crate correctly. Writing unsafe Rust is a lot harder than safe Rust! Especially crafting the control messages is tricky, but once you can do that you can do a lot more fun things besides using GSO.

There are a few attempts at making some of this available in safe Rust, but nothing I found is widely used, and the existing crates tend to impose serious trade-offs on how you use them. It would be nice if someone managed to get these abstractions right, along the lines of IoSlice in the standard library, allowing much more of this unsafe code to live in one central place that is well reviewed for soundness.

In the meantime this should give a good flavour of how to directly interact with the kernel from Rust, and enough insight to take to other networking areas; there are differences with TCP, for example, but it should not feel that alien anymore.

Feedback

Feel free to reach out to me on mastodon for comments and feedback.