Clone and Fork with File Descriptors: The Proposal

I’ve talked before about the problems I’ve had with the current clone/fork situation; now it’s time to get into solutions.

So, without further ado, here’s my two cents:

NAME
     forkpd - create a child process and a process descriptor

SYNOPSIS
     int forkpd(int flags);

DESCRIPTION
     forkpd() creates a new process, in a manner similar to fork(2), and
     returns a process descriptor (pd), a special file descriptor that
     represents a process.
     
     Creating a child process through forkpd() disables conventional SIGCHLD
     and wait(2) handling for the process. Effectively, the process represented
     by a process descriptor does not exist for the purpose of wait(2) and
     SIGCHLD handling. Instead, the parent receives status information by
     calling read(2) on the process descriptor, and the process is
     automatically reaped by the kernel. See below for additional information.
        
     forkpd() can accept these flags to modify the descriptor's behavior:
        
     PD_NONBLOCK
          Set the O_NONBLOCK file status flag on the new open process
          descriptor. Using this flag saves extra calls to fcntl(2) to achieve
          the same result.
        
     PD_DAEMON
          Instead of the default terminate-on-close behavior, allow the process
          to live until it is explicitly killed.
         
     Additionally, it can accept these flags to modify the namespaces of the
     child process:
        
     PD_NEWCGROUP
          Create a new cgroup namespace
        
     PD_NEWIPC
         Create a new IPC namespace
        
     PD_NEWNET
         Create a new network namespace
        
     PD_NEWMOUNT
         Create a new mount namespace
        
     PD_NEWPID
         Create a new PID namespace
        
     PD_NEWUSER
         Create a new user namespace
        
     PD_NEWUTS
         Create a new UTS namespace
                
     See namespaces(7) for more information.
        
     The following system calls have effects specific to process descriptors:
        
     fstat(2)
         Queries the status of a process descriptor. Of particular note,
         if the owner read, write, and execute bits are set then the process
         represented by the process descriptor is alive.
     
     read(2) (and similar)
         Reading from the process descriptor returns a single pd_info
         structure. If the process is running and has not been stopped, read()
         will block until the process is stopped or exits.
         
         struct pd_info {
            uint32_t code;   /* signal code */
            uint32_t status; /* exit status, not valid if child is stopped */
         };
         
         If a read is attempted after the last pd_info is retrieved (the
         process is dead), then EBADF is returned.
     
     write(2) (and similar)
         Writing a pd_sig struct to the process descriptor is equivalent to
         calling sigqueue(3) with the given sig and sival_int:
         
         struct pd_sig {
            uint32_t sig;       /* signal number */
            uint32_t sival_int; /* signal value */
         };
         
         If a write is attempted after the process is dead, EBADF is returned.
     
     poll(2), select(2), epoll(7) (and similar)
         The process descriptor is readable if the process has changed state.
         
     openat(2) (and similar)
         The process descriptor is treated as equivalent to the procfs folder
         of the child (/proc/{child_pid}) for the purpose of these functions.
     
     close(2)
         If the process is still alive and this is the last reference to the
         process descriptor, the process will be terminated with SIGKILL unless
         PD_DAEMON is set.

RETURN VALUE
     On success, the process descriptor of the child process is returned in the
     parent, and 0 is returned in the child. The descriptor is guaranteed to be
     >2 to avoid conflict, and is created close-on-exec. On failure, -1 is
     returned in the parent, no child process is created, and errno is set
     appropriately.

ERRORS
     forkpd() can return any errors from fork(), the namespace-related errors
     from clone(2), as well as the following additional errors:
        
         EINVAL flags contained an unknown flag.
         
         EMFILE Creation of the process descriptor would exceed the process
                limit on open file descriptors.
         
         ENFILE Creation of the process descriptor would exceed the system-wide
                limit on open file descriptors.

Acknowledgements go to FreeBSD, Thiago Macieira, and Josh Triplett for the work that I’ve both used for inspiration and stolen from, as the individual case may be.

For the purposes of the rest of this discussion, I’m going to assume an equivalent of Josh Triplett’s clone4() backs forkpd. I may eventually write my own, but it is close enough for now.

Let’s get into some reasoning. There are a few primary motivations for this whole base of work:

  • remove the need for the mess that is SIGCHLD handling
  • allow libraries to use child processes without interfering with the application or each other
  • allow process management to integrate with select/poll/epoll
  • upgrade namespaces to first-class part of process forking
  • simplify process management where possible

As I’ve said in previous posts, signal handling is a mess, and as a result of that mess library writers effectively cannot privately use child processes. Even the otherwise cleanest answer to signals, signalfd, is global, so library writers cannot use it. As such, our clone and fork replacements should be able to restrict child process information to the calling code, not the calling thread or process. Waiting for and reading child state change information should be simpler than the equivalent signal handling and wait/waitpid calls. This is one of the fundamental justifications for returning a special file descriptor (process descriptor, or pd) instead of a process ID (pid). PIDs simply aren’t well-integrated with standard *nix APIs, whereas the basic credo of Unix, “everything is a file”, is really in reference to file descriptors, so it’s a natural choice. Waiting for process state changes becomes as simple as using the normal select, poll, or epoll, and reading the change is a simple read.
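Since forkpd() does not exist yet, the closest thing that runs today is a rough emulation. The sketch below is only that: spawn_with_exit_fd() and wait_via_poll() are hypothetical names, and the "descriptor" is just the read end of a pipe that reaches end-of-file when the child dies. It shows the poll-based waiting the proposal is after, but note that it still needs the pid and a waitpid() call to collect the status, which is exactly the part a real process descriptor would remove.

```c
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for forkpd(): fork a child that exits with
 * exit_code and return the read end of a pipe whose write end is held
 * only by the child. The read end sees a hang-up when the child dies. */
int spawn_with_exit_fd(int exit_code, pid_t *pid_out)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);     /* child keeps only the write end */
        _exit(exit_code);  /* exiting closes it, which wakes the parent */
    }
    close(fds[1]);         /* parent keeps only the read end */
    *pid_out = pid;
    return fds[0];
}

/* Wait for the child with poll(2) rather than a SIGCHLD handler, then
 * collect its exit status. Returns the exit status, or -1 on error or
 * if the child did not exit normally. */
int wait_via_poll(int fd, pid_t pid)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready;
    do {
        ready = poll(&pfd, 1, -1);
    } while (ready == -1 && errno == EINTR);
    close(fd);
    if (ready != 1)
        return -1;

    int wstatus;
    if (waitpid(pid, &wstatus, 0) != pid)
        return -1;
    return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1;
}
```

Because the descriptor is an ordinary fd, a library could add it to its own poll set without touching the application's signal handlers. The emulation is still not private, though: any wait(2) call elsewhere in the process can reap the child first.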

The inclusion of namespace flags in the new fork equivalent is two parts personal preference, and one part acknowledgement of the simple reality that namespace management is that important now. Forcing developers to either call the syscall themselves or use the horribly mismatched clone() libc call is just an unnecessary waste of their time.

The other potential big win for switching to a fd-based approach is the additional possibilities it adds for simplifying all parts of working with processes. Ideally, this set of changes would allow for code to completely ignore the existence of PIDs, and instead work entirely within a process-fd world with as few new system calls as possible. Everything that can be done through the file descriptor should be. This would require quite a few changes, and I fully admit that some of my ideas may be odd or suboptimal. I don’t pretend to be an expert on the kernel or libc API design. Additionally, it’s important to note that while I took some inspiration from FreeBSD’s pdfork, I feel no obligation to match it or even be compatible with it, especially as a lot of what I’m interested in is Linux-specific.

Overall goals and disclaimer out of the way, some specifics. First, as I want to remove pids entirely from the equation, the libc functions do not return them at all. Instead the return value is a file descriptor >2 (so as not to interfere with stdin/out/err or the zero-for-child convention) with some special semantics. Conceptually it’s sort-of-a-pipe of process changes, with read and write to it meaning reading and writing to the process state, or the rough equivalents of wait and kill, respectively. Now, I’m still not sure whether text-based or binary messages are better for that, as they both have their ups and downs. For now, I’m going with binary C structs, though given it’s already IPC and efficiency is sort of a moot point, I can easily see a case for something easier to work with from higher-level languages. The specifics of what’s in the struct come largely from attempting to be equivalent to the wstatus value from the wait family of functions, but without the weird int-packing thing. I don’t know the internals of the kernel well enough to know what would be convenient to add, nor do I yet have a firm idea of what would be best to include from a user perspective. Honestly, I’d pull more from the latter than the former, but sometimes easily available information can prove surprisingly useful.
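To make the "wstatus without the int-packing" idea concrete, here is a minimal sketch that fills a pd_info from waitid(2), which already reports the two pieces separately in a siginfo_t. It rests on an assumption the proposal does not state: that code would carry the same values as si_code (CLD_EXITED, CLD_KILLED, CLD_STOPPED and so on) and status the same value as si_status. The helper names spawn_exit() and pd_info_from_waitid() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L  /* for waitid(), siginfo_t, CLD_* */
#include <signal.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct pd_info {
    uint32_t code;   /* assumed: same values as siginfo_t.si_code */
    uint32_t status; /* exit status, or signal number if killed/stopped */
};

/* Hypothetical helper: fork a child that exits at once with exit_code.
 * Returns the child's pid in the parent, or -1 on failure. */
pid_t spawn_exit(int exit_code)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(exit_code);
    return pid;
}

/* Block until the child exits and unpack the result into a pd_info.
 * This is roughly what a read(2) on a process descriptor would hand
 * back, minus the WIFEXITED/WEXITSTATUS macros. Returns 0 or -1. */
int pd_info_from_waitid(pid_t pid, struct pd_info *out)
{
    siginfo_t si = {0};
    if (waitid(P_PID, (id_t)pid, &si, WEXITED) == -1)
        return -1;
    out->code = (uint32_t)si.si_code;
    out->status = (uint32_t)si.si_status;
    return 0;
}
```

A caller would then branch on info.code directly, for example treating CLD_EXITED with a status of 0 as success, instead of decoding a packed integer.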

The idea for fstat returning meaningful responses came from the pdfork implementation on FreeBSD, and I went with it both because it’s potentially useful and because the idea of making fstat relevant was appealing. Similarly, the idea of dup seemed useful on the surface, but the semantics of reading from dup’d descriptors seem broken for normal use cases, as each reader would probably want to get all events, not just the first to read. As such, a useful future enhancement would be to be able to open pds for arbitrary processes. This would be useful both for multiple notification and control streams for a process, as well as allowing for all of ptrace’s functionality to be subsumed into the pd system.
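The dup problem is easy to reproduce today with an ordinary pipe, which is the closest existing analogue to the event stream described here. In this sketch (the function name is hypothetical, and a pipe is only standing in for a process descriptor), one event is written, the original descriptor reads it, and the dup'd descriptor then finds nothing left because both share a single open file description.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

struct pd_info {
    uint32_t code;   /* signal code */
    uint32_t status; /* exit status, not valid if child is stopped */
};

/* Returns 1 if the dup'd reader missed the event, 0 if it did not,
 * and -1 if the pipe could not be set up. */
int second_reader_misses_event(void)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    int dup_fd = dup(fds[0]);
    /* O_NONBLOCK lives on the shared open file description, so setting
     * it through fds[0] also applies to dup_fd. */
    int flags = fcntl(fds[0], F_GETFL);
    if (dup_fd == -1 || flags == -1 ||
        fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) == -1) {
        if (dup_fd != -1)
            close(dup_fd);
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    struct pd_info event = { .code = 1, .status = 0 };
    struct pd_info seen;
    int missed = -1;
    if (write(fds[1], &event, sizeof event) == (ssize_t)sizeof event) {
        /* The first reader consumes the only event in the stream. */
        ssize_t first = read(fds[0], &seen, sizeof seen);
        /* The second reader shares that stream and finds it empty. */
        ssize_t second = read(dup_fd, &seen, sizeof seen);
        missed = (first == (ssize_t)sizeof seen &&
                  second == -1 && errno == EAGAIN);
    }

    close(dup_fd);
    close(fds[0]);
    close(fds[1]);
    return missed;
}
```

Opening a second, independent descriptor for the same process would give each reader its own stream, which is why that looks like the better route than dup.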

The integration that I have the most mixed feelings about is the openat interactions. The mixed metaphor of pipe-and-directory is a bit odd, but it is the most graceful way I can think of to include procfs functionality without requiring pids or new syscalls. Reading the status of the child becomes simply openat(pd, "status", O_RDONLY | O_CLOEXEC) followed by a call to read(). I’m not yet entirely sure how long this should remain valid after the process terminates: whether it should be available until the fd is closed, or removed immediately upon the death of the process. The former is more convenient for the user and as such is my immediate preference, but there is some potential for the waste of kernel resources.
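The openat half of this can be approximated on Linux today by using a directory descriptor for /proc/<pid> in place of the pd. The sketch below assumes a mounted procfs; open_proc_dir() and read_status_at() are hypothetical helper names, and the only step the proposal would change is where the directory descriptor comes from.

```c
#define _POSIX_C_SOURCE 200809L  /* for openat(), O_DIRECTORY, O_CLOEXEC */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Stand-in for a process descriptor: a directory fd for /proc/<pid>.
 * Unlike a real pd, this still needs the pid to get started. */
int open_proc_dir(pid_t pid)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%ld", (long)pid);
    return open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/* Read the "status" file relative to the directory descriptor, the way
 * openat(pd, "status", ...) would under the proposal. The buffer is
 * NUL-terminated. Returns the number of bytes read, or -1 on error. */
ssize_t read_status_at(int dirfd, char *buf, size_t len)
{
    if (len == 0)
        return -1;
    int fd = openat(dirfd, "status", O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;
    ssize_t n = read(fd, buf, len - 1);
    close(fd);
    if (n >= 0)
        buf[n] = '\0';
    return n;
}
```

One side benefit of holding a directory descriptor rather than a pid is that later lookups go through the already-open directory, so they cannot land on an unrelated process that happens to reuse the same pid.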

Still, aside from the exact nature of the structs and some other details, I’d be pretty happy to use this. I’m sure I’m missing some corner cases, and I’m sure this could be a royal pain for the kernel, but the gains to developers would be massive. Additionally, this forms a new set of semantics in dealing with processes that could easily be adapted to work with non-child processes, allowing for a much richer set of interactions that behave more in line with the standard *nix semantics.