More Thoughts on Clone

I’ve recently mentioned my issues with clone(), though I stopped short of proposing something better. A big part of that is that if I were to propose something better, I’d have to address the other major issues with process creation on *nix.

First, a reminder on how this works. Typically, a process on *nix that wants a child process calls fork(2) and then one of the exec(2) family of functions. fork(), well, forks a copy of the current process. If the call succeeds, the parent (original) process gets the pid (process identifier) of the child (new) process, which it can then use to query a bunch of meta-information about the child or send it control signals. The child gets a copy of nearly everything in the parent process, with a short list of exceptions. Typically, the child does some preparation, then calls one of the exec()s.
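For concreteness, here is a minimal sketch of that fork-then-exec dance. The helper name spawn_and_wait() and the choice of 127 as the "exec failed" exit code (borrowed from shell convention) are my own assumptions for illustration, not anything mandated by the APIs.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] with the given arguments and return
 * its exit status, or -1 if it could not be forked or did not exit
 * normally (for example, it was killed by a signal). */
int spawn_and_wait(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;                  /* fork failed; there is no child */

    if (pid == 0) {
        /* Child: preparation (redirections, chdir, ...) would go here. */
        execvp(argv[0], argv);
        _exit(127);                 /* only reached if exec failed */
    }

    /* Parent: pid is our only handle on the child. */
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;              /* a real error, not just a signal */
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with something like { "sh", "-c", "exit 3", NULL } should hand back the child's exit status. Note that even this tiny example already needs an EINTR retry loop, which is a preview of the signal complaints further down.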

This is a known, time-tested, flexible and powerful procedure that has worked for ages. It’s been used time and time again, in everything from the humble shell to Apache and most everything else. It works well… except for a few major caveats, and then it winds up not really working all that well at all.

The first thing that winds up causing potentially major issues in practice is that “fork copies almost everything” bit. While that’s incredibly powerful, it comes with one blatant issue: the copy includes every open file descriptor, and any file opened with the default set of parameters stays open across exec() too. There is, on modern systems, a close-on-exec flag, spelled O_CLOEXEC when passed to open() and FD_CLOEXEC when set through fcntl(), which closes the file on a call to the exec() family. Which is great! Except it’s the behavior you usually want, and it’s not the default.
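A short sketch of the two ways to opt in to close-on-exec follows. The helper names open_cloexec() and set_cloexec() are made up for this example; the flags themselves (O_CLOEXEC for open(), FD_CLOEXEC via fcntl() with F_SETFD) are the real ones.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open read-only with close-on-exec set atomically,
 * so there is no window in which another thread's fork+exec can leak it. */
int open_cloexec(const char *path)
{
    return open(path, O_RDONLY | O_CLOEXEC);
}

/* Hypothetical helper: set close-on-exec on a descriptor we already have.
 * Returns 0 on success, -1 on error. */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}
```

The open() form is preferable where available, since the fcntl() form leaves a gap between creating the descriptor and marking it.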

The second problem is signals. In theory, signals are a way to inform a process about any sort of asynchronous event. One just installs a handler with signal(), or wait, no, sigaction(), which then gets called whenever you receive the signal in question.

Except it sucks. Signals introduce so many awful corner cases to everything they touch. Hell, nearly every blocking syscall on *nix systems has EINTR as a possible error code in case you get a signal during execution. What do you do when you get EINTR? You retry, except when retrying is impossible, in which case nobody really knows what to do. Oh, and while we’re at it, your handlers themselves can be interrupted to handle a different signal, which can cause all sorts of fun issues. Do remember not to call anything that malloc()s or any non-reentrant functions during your handler. Really, even being reentrant isn’t enough; it needs to be “async-signal-safe”. In fact, just have the handler write a byte to a pipe that you read from yourself; it’s the only sane thing to do.
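The "pipe it to yourself" advice is usually called the self-pipe trick. Below is a rough sketch of it, under the assumption of a single global pipe; the names self_pipe_install() and self_pipe_read() are invented here. The handler does nothing but write() one byte, which is async-signal-safe, and the real work happens later in ordinary code.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static int self_pipe[2] = { -1, -1 };

/* Handler: only async-signal-safe calls, and errno is preserved. */
static void on_signal(int signo)
{
    int saved_errno = errno;
    unsigned char b = (unsigned char)signo;
    ssize_t n = write(self_pipe[1], &b, 1);
    (void)n;                        /* a full pipe just drops the byte */
    errno = saved_errno;
}

/* Hypothetical setup: create the pipe once and route signo into it. */
int self_pipe_install(int signo)
{
    if (self_pipe[0] < 0) {
        if (pipe(self_pipe) < 0)
            return -1;
        for (int i = 0; i < 2; i++) {
            /* Non-blocking so the handler can never stall, and
             * close-on-exec so children do not inherit the pipe. */
            int fl = fcntl(self_pipe[i], F_GETFL);
            fcntl(self_pipe[i], F_SETFL, fl | O_NONBLOCK);
            fcntl(self_pipe[i], F_SETFD, FD_CLOEXEC);
        }
    }
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;       /* restart syscalls where the kernel can */
    return sigaction(signo, &sa, NULL);
}

/* Returns a pending signal number, 0 if nothing is pending, -1 on error. */
int self_pipe_read(void)
{
    unsigned char b;
    for (;;) {
        ssize_t n = read(self_pipe[0], &b, 1);
        if (n == 1)
            return b;
        if (n < 0 && errno == EINTR)
            continue;               /* the obligatory retry */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return 0;
        return -1;
    }
}
```

In a real program, self_pipe[0] would sit in the poll() or select() set next to every other descriptor, so signals arrive as just another readable fd.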

In relation to processes, you’re supposed to get SIGCHLD (sometimes called SIGCLD) when a child process terminates. Except, well, signals are broken. The big killer is coalescing. To elaborate, if you’re currently in your SIGCHLD handler and a second child process ends, the system is required to queue up the information for that process and call the handler again after, and all is fine. If, however, a third (or fourth, and so on) child terminates, the system just drops it. You get nothing at all. So SIGCHLD is useless for anything more than saying “some unknown number of child processes have ended”, and you need to call wait(2) anyways.
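To make the coalescing concrete, here is a sketch that forces the worst case: SIGCHLD is blocked while n children exit, then unblocked. The function names and the MAX_KIDS limit are assumptions made for this example. On Linux I would expect the handler to run once no matter how many children died, which is why reap_children() has to loop on waitpid() with WNOHANG rather than reaping one child per signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_KIDS 64                 /* arbitrary cap for the demo */

static volatile sig_atomic_t handler_runs = 0;

static void on_sigchld(int signo)
{
    (void)signo;
    handler_runs++;                 /* "some children changed state" */
}

/* Reap every child that has already exited; return how many. */
int reap_children(void)
{
    int reaped = 0;
    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, WNOHANG);
        if (pid > 0)
            reaped++;
        else if (pid < 0 && errno == EINTR)
            continue;
        else
            break;                  /* 0: none ready; -1: no children left */
    }
    return reaped;
}

/* Hypothetical demo: returns children reaped, stores handler runs in *runs. */
int coalescing_demo(int n, int *runs)
{
    if (n < 1 || n > MAX_KIDS)
        return -1;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, NULL) < 0)
        return -1;

    sigset_t block;
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, NULL);

    handler_runs = 0;
    pid_t kids[MAX_KIDS];
    int started = 0;
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);               /* child: die immediately */
        if (pid > 0)
            kids[started++] = pid;
    }

    /* Wait until each child is a zombie, without reaping it. */
    for (int i = 0; i < started; i++) {
        siginfo_t info;
        while (waitid(P_PID, (id_t)kids[i], &info, WEXITED | WNOWAIT) < 0
               && errno == EINTR)
            ;
    }

    sigprocmask(SIG_UNBLOCK, &block, NULL); /* pending SIGCHLD lands here */
    *runs = handler_runs;
    return reap_children();
}
```

The gap between the number of handler runs and the number of children reaped is the whole problem in miniature: the signal only ever means "go call wait".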

After this point, most of the issues become more frustrating than killer. As it turns out, a pid is guaranteed never to be handed out while it’s still in use, but reusing a previously used pid is totally valid. Linux pids wrap around once they reach the value in /proc/sys/kernel/pid_max (since 2.5.34). By default, and at most on 32-bit platforms, you get a whole 32,768 pids before reuse, which is really not as much as it seems on a long-running machine. So don’t hold onto a pid for too long without checking whether its process is dead.
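As a sketch of what "checking" looks like in practice: the two helpers below (names invented for this example) read the wrap-around limit and probe a pid with kill(pid, 0), which sends no signal and only performs the existence and permission checks. Note the limitation, which is exactly the complaint above: a positive answer only says that some process currently has that pid, not that it is the one you started.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: read the Linux pid wrap-around limit.
 * Returns -1 if /proc is unavailable or the file cannot be parsed. */
long read_pid_max(void)
{
    FILE *f = fopen("/proc/sys/kernel/pid_max", "r");
    if (f == NULL)
        return -1;
    long value = -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Hypothetical helper: 1 if some process currently has this pid,
 * 0 if none does, -1 on a bad argument or unexpected error. */
int pid_in_use(pid_t pid)
{
    if (pid <= 0)
        return -1;          /* 0 and negatives address process groups */
    if (kill(pid, 0) == 0)
        return 1;           /* exists and we could signal it */
    if (errno == EPERM)
        return 1;           /* exists, but belongs to someone else */
    if (errno == ESRCH)
        return 0;           /* nobody home */
    return -1;
}
```

A zombie still counts as "in use" here, since its pid is not released until the parent reaps it.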

In fact, pids themselves are a mistake, I would suggest, at least in terms of being the primary handle for processes. FreeBSD seems to agree, and has introduced the pdfork(2) family of functions. They still return a pid, but also give you a file descriptor that serves as the primary mode of interaction with the process. Not coincidentally, it turns off receiving SIGCHLD for any processes created with it. Instead, you simply poll() or select() over your set of process fds, and it wakes you on process state transition. Currently it seems to only tell you about the process’s death, but that’s still leagues more useful than even signalfd().

Still, that’s FreeBSD, and I’m primarily concerned with Linux. On top of that, the function isn’t quite what I want. It doesn’t solve the cloneflags issue I mentioned in a previous post, nor does the file descriptor provide the niceties I’d like to see.

So what’s my proposal? That’s for next time.