Moving beyond fork() + exec()
Posted by jwilk 3 days ago
Comments
Comment by rom1v 3 days ago
> ABSTRACT
> The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design. In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability. We catalog the ways in which fork is a terrible abstraction for the modern programmer to use, describe how it compromises OS implementations, and propose alternatives.
> As the designers and implementers of operating systems, we should acknowledge that fork’s continued existence as a first-class OS primitive holds back systems research, and deprecate it. As educators, we should teach fork as a historical artifact, and not the first process creation mechanism students encounter.
Comment by Animats 3 days ago
No, it was done that way so that you could launch a program that was too big to fit in memory with the parent program. The original implementation worked by swapping out the forking program to disk on a fork() call. Then, at the moment the program was swapped out but control had not returned, the process table entry was duplicated and adjusted so that there were now two processes, one in memory and one swapped out. The one in memory then got control, and could do an exec() call.
This allowed large programs to run on small PDP-11 machines. It was needed back in the era of really expensive memory. That's why.
QNX had an interesting approach. Program loading isn't in the OS at all. There's "fork", but program loading is in a library. It links to a .so file which reads the executable header, allocates memory, loads the program, gets it ready to run, and starts it. The program loader runs in user space and is unprivileged. This is probably the right way to do it.
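For readers who have not written it by hand, the fork-then-exec sequence being discussed looks roughly like this in C. This is only a minimal sketch and not taken from any comment in this thread; `run_with_fork_exec` is a made-up helper name.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] (looked up in PATH) with the given arguments.
 * Returns the child's exit status, or -1 if the child could not be
 * created or did not exit normally. */
int run_with_fork_exec(char *const argv[])
{
    pid_t pid = fork();              /* duplicate the calling process */
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        /* Child: replace the duplicated image with the new program. */
        execvp(argv[0], argv);
        _exit(127);                  /* only reached if exec failed */
    }

    /* Parent: wait for the child and report how it ended. */
    int status;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with `{"/bin/sh", "-c", "exit 3", NULL}` should hand back the shell's exit status. Note that the whole parent is (logically) duplicated just to be thrown away by the exec a moment later, which is the pattern the paper objects to.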
Comment by afiori 3 days ago
Comment by not_a_bijection 3 days ago
Comment by bluepuma77 3 days ago
Well, it seems we are back in an era with really expensive memory.
Comment by simongr3dal 2 days ago
Comment by BobbyTables2 3 days ago
“An era of really expensive memory”. That sounds familiar…
Comment by vanviegen 3 days ago
Comment by bregma 2 days ago
So on QNX, the spawned process does all the dynamic linking. The spawning process just sends an asynchronous message to the process manager and then gets on with things in a very deterministic manner as befitting a hard realtime system.
Comment by lukan 3 days ago
"In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability"
(But thanks for the good explanation)
Comment by duped 3 days ago
aiui this is what exec does, the problem outlined here is the split between process creation (expensive, kernel space, has to be done each time even if spawning the same process "template" repeatedly) and loading (cheap and in userspace).
Comment by cryptonector 3 days ago
> No, it was done that way so that you could launch a program that was too big to fit in memory with the parent program.
Ironically vfork() is even better in this regard. I wish Unix had only ever had vfork().
Comment by dcrazy 3 days ago
Comment by purkka 3 days ago
Comment by krackers 3 days ago
Comment by fc417fc802 3 days ago
Comment by derriz 3 days ago
Comment by dooglius 1 day ago
Comment by cryptonector 3 days ago
Comment by JdeBP 3 days ago
The tricky part is setting up the initial process. The way out for that is static linking and re-use of the fact that the operating system kernel loader has to understand and be able to load (at least a small subset of) program image file formats too.
Comment by anarazel 3 days ago
I agree that there should be non-fork primitives, I'm just not that sure that performance is the best argument.
Comment by mort96 3 days ago
Now these decisions aren't objectively bad, but they have significant trade-offs and it's probably not a good idea that they're forced simply because we use fork()+exec() for process creation.
Comment by marcosdumay 3 days ago
At least on systems with virtual addressing. If you want to go into physical addressing, then yes, maybe it's a problem. But Linux will never touch anything with physical addressing, so I don't see what people are complaining about.
Comment by mort96 3 days ago
Comment by mpyne 3 days ago
Comment by vbezhenar 3 days ago
You're doing fork + exec.
If you're overcommitting, fork will not reserve another 600 MB, and exec immediately after fork will cause total system usage to be 601 MB.
If you're not overcommitting, that fork will fail, because total memory consumption will be 1200 MB, which is more than 1 GB. That somewhat restricts program design.
Comment by agwa 3 days ago
Comment by dwattttt 3 days ago
> You're doing fork + exec.
This is the clear problem: you don't want another process that's a duplicate of the current one; that's just a detail of what you actually want, which is a 1 MB process. Right now it's a badly leaky detail which you're forced to work around.
Comment by cylemons 3 days ago
Does this accounting apply to vfork as well?
Comment by vbezhenar 1 day ago
Comment by cylemons 3 hours ago
> As with fork(2), the child process created by vfork() inherits copies of various of the caller's process attributes (e.g., file descriptors, signal dispositions, and current working directory); the vfork() call differs only in the treatment of the virtual address space, as described above.
so it seems Linux does define the behavior of vfork, but if you rely on it, your code won't be portable to other POSIX systems
Comment by cryptonector 3 days ago
Comment by Someone 3 days ago
It may not be slow, but in the common case where fork is almost immediately followed by exec in the process where fork returns zero, fork increases those refcounts and exec almost immediately decreases them again (and does typically unnecessary checks of whether the refcounts became zero). A combined fork/exec syscall can avoid that work.
On the other hand, a sufficiently powerful combined fork/exec call has to have a lot of parameters that it has to check (whether to inherit open pipes, open files, setting the working directory, etc), and that slows it down.
That can be avoided by having multiple variants of combined fork/exec calls, but you would need lots of them to cover all combinations of flags.
I expect either approach should be faster than having fork, then exec as separate calls, especially when the process calling fork has many resources allocated.
Comment by thayne 3 days ago
Or you could create a hybrid between a thread and a process, where it still uses the parent's memory space (unlike fork), but has its own stack (unlike vfork), and is in its own process (unlike a thread). I think this is technically possible on linux, but there isn't a readily available interface for it. Although it seems like posix_spawn could be implemented that way...
Comment by fc417fc802 3 days ago
That does seem like a much better design to me. But I wonder if that was considered way back at the dawn of computing and rejected for good reason?
> I think this is technically possible on linux, but there isn't a readily available interface for it.
Yes there is, see `man clone`. POSIX and glibc are quite different from the kernel in this regard. AFAIK under linux there are just threads of execution that might or might not share various namespaces and memory mappings. That said, the kernel does place a few artificial restrictions on what combinations are allowed in order to (as I understand it) guard against the unintended exercise of entirely untested combinations that serve no known practical purpose.
The practical problem is that if you start doing as you please with the various namespaces and mappings you quickly become incompatible with glibc and by extension most likely the majority of the dynamic libraries available on your system.
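A rough sketch of the thread/process hybrid described above, using the glibc clone() wrapper. Assumptions: Linux with glibc, a downward-growing stack, and an arbitrary 64 KiB child stack; `clone_spawn` and `child_main` are names invented for this example. I believe glibc's own posix_spawn on Linux uses the same flag combination, but treat that as hearsay and check the source.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

/* Runs in the new process, on its own stack, but inside the parent's
 * address space until execv() replaces the image. */
static int child_main(void *arg)
{
    char *const *argv = arg;
    execv(argv[0], argv);              /* argv[0] must be a full path */
    _exit(127);                        /* exec failed */
}

/* Returns the child's exit status, or -1 on failure. */
int clone_spawn(char *const argv[])
{
    /* A private stack, so the child cannot clobber the parent's frames. */
    void *stack = mmap(NULL, CHILD_STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    /* CLONE_VM:    share the parent's memory (no page-table copy)
     * CLONE_VFORK: suspend the caller until the child execs or exits
     * SIGCHLD:     make the child waitable like an ordinary process */
    pid_t pid = clone(child_main, (char *)stack + CHILD_STACK_SIZE,
                      CLONE_VM | CLONE_VFORK | SIGCHLD, (void *)argv);

    int status = -1;
    if (pid > 0 && waitpid(pid, &status, 0) == pid)
        status = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    else
        status = -1;

    /* Safe to unmap: CLONE_VFORK means the child no longer uses it. */
    munmap(stack, CHILD_STACK_SIZE);
    return status;
}
```

Because the child execs straight away, it stays clear of the glibc compatibility problems mentioned above; doing anything more elaborate in `child_main` is where things get risky.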
Comment by cryptonector 3 days ago
Though I want a posix_spawn-as-a-system-call approach as well / instead of that.
Comment by Someone 2 days ago
Create a thread in your own address space, and your process becomes multi-threaded. Create an address space, load some code in it, and create a thread there, and you fork/exec-ed.
In my memory, that OS was MACH, but Google doesn’t confirm that for me.
Comment by dcrazy 3 days ago
Comment by infogulch 3 days ago
Comment by thayne 3 days ago
Comment by tliltocatl 3 days ago
Comment by thayne 3 days ago
vfork helps a little, but it has a lot of restrictions on what you can do before the exec, and on unix that's basically the only place you can do things like close files, change signal masks, drop privileges or set up seccomp, etc.
Comment by cryptonector 3 days ago
Comment by thayne 1 day ago
That's a lot more restrictive. You can't use local variables, or call any functions other than _exit or execve. On linux specifically, I _think_ those restrictions are more relaxed and you can call async-signal-safe functions, however I'm not entirely clear on how relaxed that is, and as far as I understand that isn't portable.
Comment by cryptonector 1 day ago
The real restrictions are:
- you can't damage the function call frame of the caller of vfork(), thus you can't return from it
- you may only call async-signal-safe functions on the child side of vfork()
That's basically it. Yes, you'll want to call execve(2) or _exit(2) before long, but there is no time limit as to that, it's just that the whole point of calling vfork() is to make it real cheap to spawn a process, which means ultimately calling execve(2), with _exit(2) being what you do if execve(2) fails (e.g., because of ENOENT).
There is a ton of vfork()-using code that adheres to these real restrictions and has been working fine for decades. That includes several posix_spawn() implementations, the C shell, etc.
I demand evidence that this part: "the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork()" is remotely true. That evidence must be of the form of bug reports that were accepted and which stand to scrutiny.
I've never found any such evidence. Have you?
Meanwhile I have a proof by existence that vfork() is safe when used much more liberally than you say it may be used.
> You can't use local variables, or call any functions other than _exit or execve.
There are other async-signal-safe functions, and they get used routinely by posix_spawn() and other code to do child-side setup before execve(2), including: I/O redirection, process group setup, signal handling changes, etc.
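To make the "child-side setup" concrete, here is a small sketch that follows the two restrictions listed above rather than the stricter POSIX wording: the child only calls async-signal-safe functions (dup2, execve, _exit) and never returns from the function that called vfork(). The helper name `vfork_run_redirected` is invented for this example, and whether this is acceptable on a given platform is exactly what is being argued about here.

```c
#define _DEFAULT_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Run path/argv with stdout redirected to out_fd.
 * Returns the child's exit status, or -1 on failure. */
int vfork_run_redirected(const char *path, char *const argv[], int out_fd)
{
    pid_t pid = vfork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child side: borrows the parent's address space until it calls
         * execve() or _exit(). Async-signal-safe calls only, and no
         * returning from this function. */
        if (dup2(out_fd, STDOUT_FILENO) < 0)
            _exit(126);
        execve(path, argv, environ);
        _exit(127);                  /* execve failed, e.g. ENOENT */
    }

    /* Parent side: resumes once the child has exec'd or exited. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The dup2() only changes the child's descriptor table, so the parent's stdout is untouched even though the two share memory until the exec.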
Comment by theK 3 days ago
Comment by mort96 3 days ago
Comment by theK 3 days ago
Comment by nvme0n1p1 3 days ago
(Windows's fork is called ZwCreateProcess)
Comment by Someone 3 days ago
I don’t know how they implemented it, though. Under the hood, it could do the equivalent of a fork/exec pair.
Comment by dcrazy 3 days ago
Comment by plorkyeran 3 days ago
Comment by dcrazy 3 days ago
Comment by nvme0n1p1 3 days ago
Comment by emmelaich 3 days ago
To avoid the problems, see roc's comment under the article, especially the use of a zygote process.
Comment by adgjlsfhk1 3 days ago
Comment by dapperdrake 3 days ago
Only being half facetious here. Maybe you or someone else really has a better take.
Comment by mort96 3 days ago
Comment by netbsdusers 2 days ago
Comment by foresto 3 days ago
Did someone suggest that it was?
Comment by mort96 3 days ago
Comment by pjmlp 3 days ago
Traditionally Windows applications that create processes all the time come from UNIX heritage.
Contrary to UNIX, Windows NT was designed with threads first mentality, from the get go.
While on UNIX they were added after fact, and to this day there are gotchas mixing posix threads with signals, fork and exec.
Comment by PaulDavisThe1st 3 days ago
Both systems are implemented using threads as the execution context, but in Unix, the history means that you fork+exec most of the time, resulting in two tasks that do not share memory any more. By contrast, on Windows (NT onward) the common case when creating a new execution context is to create a thread that shares memory with others in its process.
Both systems allow the easy use of the other's core abstraction. On Unix, you can either code like it's 1986 and use fork without exec, or use clone(2) or any of its higher level abstractions like pthreads.
You're right that POSIX semantics get tangled when using threads.
Comment by JdeBP 3 days ago
The Unix model was invented over a decade before the idea of multithreading percolated into mainstream operating systems at all.
The reason that Windows NT started as it did, was that OS/2 had come out in 1987, with kernel threads, and the idea of multithreading had taken root. SunOS 5 gained threading, too.
Windows NT applications development began with threading available as a mechanism from the start, and with a lot of people in the IBM/Microsoft world already knowing about its use in applications development from OS/2.
Whereas with the Unices it came in more gradually, as the applications had often already been designed. The whole libthread versus libpthread thing made things interesting on SunOS for a few years, too. As did the first attempt (LinuxThreads) at providing threads on Linux.
Comment by thayne 3 days ago
Comment by PaulDavisThe1st 3 days ago
This obviously changed as pthreads came into being, and at this point, I suspect that the typical use for threads-sharing-memory and threads-not-sharing-memory is the same on most platforms.
A reminder that the task_t data structure describes threads and processes not just in Linux, but earlier Unixen also.
Comment by pjmlp 3 days ago
Which is why I took the effort to explicitly refer to Windows NT in my comment, already expecting some traditional answers from UNIX folks.
Also due to historical reasons POSIX threads are the outcome of every UNIX going its own way implementing threads, finally coming to an agreement years later, with all the pluses and minuses of relying on POSIX for portable code.
Comment by snozolli 3 days ago
How are those not simply child processes? I don't understand your use of the word 'threads' here.
Does the Unix world not distinguish between threads and processes? In Win32, threads exist within processes, and you can create new threads or child processes.
Comment by trumpdong 3 days ago
Second answer: Linux doesn't differentiate between threads and processes. It has a "thread group ID" that serves a small number of purposes, and the rest of the difference is just whether the threads happen to share the same address space.
Comment by pjmlp 3 days ago
The unit of execution is the thread.
On the UNIX world it depends on which UNIX you are talking about.
Linux has a similar model to Windows NT nowadays, hence clone() as key primitive.
Other UNIXes have different approaches.
Comment by PaulDavisThe1st 3 days ago
Comment by nine_k 3 days ago
Comment by sunshowers 3 days ago
Comment by pjmlp 3 days ago
Comment by mort96 3 days ago
Comment by pjmlp 3 days ago
Windows has a richer set of IPC stuff than POSIX, especially since it has a microkernel-like design.
If you are going to say it is everything on the same memory space anyway, it isn't.
Optional on Windows 10, and enforced on Windows 11, Hyper-V is always running, and several components including kernel and driver modules are sandboxed into their little worlds.
Several additional sandboxing changes were announced at BUILD.
Comment by mort96 3 days ago
Comment by pjmlp 3 days ago
This is how an HTTP server back in the day would share the request context for the child process to reply back.
Comment by mort96 3 days ago
Comment by tliltocatl 2 days ago
Comment by yencabulator 2 days ago
Comment by tosti 2 days ago
Mozilla implemented an alternative to COM, called XPCOM. XP here means cross platform. Perhaps you could compare against that to take the platform out of the equation.
Comment by sunshowers 2 days ago
Comment by dcrazy 3 days ago
.NET tried this with app domains, which are now deprecated.
Comment by pjmlp 3 days ago
Also App Domains are partially back in .NET Core, isolation features aren't there, but code unloading is, via AssemblyLoadContext.
Comment by dcrazy 3 days ago
Comment by knome 3 days ago
Comment by pjmlp 3 days ago
Comment by zozbot234 3 days ago
Comment by JdeBP 3 days ago
* https://computernewb.com/~lily/files/Documents/NTDesignWorkb...
Comment by pstuart 3 days ago
Comment by epcoa 3 days ago
Comment by JdeBP 3 days ago
* https://computernewb.com/~lily/files/Documents/NTDesignWorkb...
Comment by epcoa 18 hours ago
Comment by dcrazy 3 days ago
Comment by JdeBP 3 days ago
Think it through. Windows NT supported fork from the start in its POSIX subsystem, that subsystem was layered on top of the Native API, and this is the Native API mechanism that the POSIX subsystem employed. Although it took until Gary Nebbett for someone to publicly show how, even though people knew informally back in 1993.
Comment by keitmo 3 days ago
Comment by dcrazy 3 days ago
Comment by peterfirefly 3 days ago
https://en.wikipedia.org/wiki/Windows_NT#Development
Windows NT was developed on various different CPUs before the Alpha was a thing. When it was released in 1993, it was released for three CPUs: IA-32, MIPS, and Alpha.
Comment by dcrazy 3 days ago
Raymond also says elsewhere that most WinNT engineers did development on i386, but doesn’t explicitly say what time period he is describing: https://devblogs.microsoft.com/oldnewthing/20250513-00/?p=11...
Comment by pjmlp 3 days ago
Misread on purpose to make a point?
Comment by aseipp 3 days ago
Comment by nvme0n1p1 3 days ago
Comment by dcrazy 3 days ago
Also, using the Zw prefix doesn’t make you look more knowledgeable, it makes you look like you’re trying way too hard to borrow credibility.
Comment by nvme0n1p1 3 days ago
Why does it matter which prefix I used? They both point to the same routine so my point applies either way.
Comment by netbsdusers 2 days ago
Comment by aseipp 3 days ago
Comment by omoikane 3 days ago
https://news.ycombinator.com/item?id=19621799 - A fork() in the road (2019-04-10, 178 comments)
Comment by jwilk 3 days ago
Comment by pizlonator 3 days ago
Hard to come up with an optimization that is equally efficient and elegant
Comment by toast0 3 days ago
I would guess it would be a small difference in measurable performance between zygote and a direct clean spawn, but it's one less trick an application needs to do, and it would be very helpful for libraries that spawn things. Spawning inside a library isn't always a great thing to do, but some things would really benefit from process level isolation.
[1] In case one isn't aware, the zygote pattern involves forking a 'zygote' process during application startup, and having that process do any forks that need to happen during application runtime. This reduces the cost of forking in large applications, because the zygote will have few fds open and use little memory. This lets your large application spawn new processes without delaying the application or the startup of the new processes. Some applications will spawn many zygotes to allow parallelism for spawning at runtime.
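A stripped-down sketch of that zygote arrangement, under some loud assumptions: requests are fixed-size shell command strings sent over a socketpair, the zygote handles one request at a time and waits for it to finish, and all names (`zygote_start`, `zygote_run`, `ZYGOTE_CMD_MAX`) are invented here. Real zygotes are considerably more involved.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define ZYGOTE_CMD_MAX 256

/* Read exactly len bytes; returns 0 on success, -1 on EOF or error. */
static int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Zygote main loop: receive a command, fork+exec it, reply with status. */
static void zygote_loop(int fd)
{
    char cmd[ZYGOTE_CMD_MAX];
    while (read_full(fd, cmd, sizeof cmd) == 0) {
        cmd[sizeof cmd - 1] = '\0';
        int status = -1;
        pid_t pid = fork();          /* cheap: the zygote is small */
        if (pid == 0) {
            close(fd);
            execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
            _exit(127);
        }
        if (pid > 0 && waitpid(pid, &status, 0) == pid)
            status = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        else
            status = -1;
        if (write(fd, &status, sizeof status) != (ssize_t)sizeof status)
            break;
    }
    _exit(0);                        /* parent closed its end */
}

/* Call early, before the application opens many fds or maps much memory.
 * Returns the parent's end of the request socket, or -1 on failure. */
int zygote_start(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        close(sv[0]);
        zygote_loop(sv[1]);          /* never returns */
    }
    close(sv[1]);
    return sv[0];
}

/* Ask the zygote to run cmd; returns its exit status, or -1 on failure. */
int zygote_run(int zfd, const char *cmd)
{
    char buf[ZYGOTE_CMD_MAX] = {0};
    if (strlen(cmd) >= sizeof buf)
        return -1;
    strcpy(buf, cmd);
    if (write(zfd, buf, sizeof buf) != (ssize_t)sizeof buf)
        return -1;
    int status;
    return read_full(zfd, &status, sizeof status) == 0 ? status : -1;
}
```

Typical use would be `int z = zygote_start();` at the top of main(), then `zygote_run(z, "some command")` later on. The sketch never reaps the zygote itself and runs requests serially; several zygotes, as mentioned above, would be one way around the latter.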
Comment by pizlonator 3 days ago
In all uses of zygotes that I have seen, here's what's really happening:
- `fork` is being used to reduce the cost of starting a process that has a high start-up cost. So, you start one process, run it through the expensive initialization, and then fork it from there to start new processes.
- To make this even faster, you have a pool of pre-forked processes sit around.
- Having pre-forked processes sitting around ready to be used is not expensive because of the CoW property and the fact that a process that forks and then immediately pauses will not have triggered any significant CoW yet.
So, the zygote optimization you speak of is in practice only meaningful on top of systems that are using an optimization uniquely enabled by `fork` (avoiding process initialization costs by cloning a process), and that zygote optimization is further optimized by another property of `fork` (memory sharing of forked processes that haven't done anything else yet).
Comment by toast0 3 days ago
> A zygote process is one that listens for spawn requests from a main process and forks itself in response. Generally they are used because forking a process after some expensive setup has been performed can save time and share extra memory pages.
I think reading the first sentence and stopping covers my zygote, but adding the second sentence covers yours. So I think we're both right!
I think both paths are useful. If your children need time to startup and become ready, spawn one that does start up work, and then it (pre)forks at the ready state to have processes ready to handle requests (your zygote). This does require a traditional fork() to avoid duplication of work.
But if forking is expensive at runtime because you have a million FDs open and a whole lot of memory allocations, spawn spawners before you start doing work (my zygote). This could be unnecessary with an inexpensive way to spawn a new process from a process that has lots of resources in use.
Of course, you can also use my zygotes to spawn your zygotes. Zygoteception.
[1] https://chromium.googlesource.com/chromium/src/+/HEAD/docs/l...
Comment by skydhash 3 days ago
While I’ve not bothered to profile it, it seems that processes with a lot of mapped pages are the issue (firefox, emacs, …). In the emacs case, the issue is when the main process tries to fork-exec; if I start a shell session (with shell-mode or term-mode), it works fine.
Comment by mpyne 3 days ago
Google may have popularized the term, but this approach was already in use by KDE developers in the KDE 2.x timeframe, where it was used as part of a system called kdeinit.
In this scheme, launching KDE apps from a KDE desktop could bypass much of the startup cost of dynamic linking by forking from a long-running kdeinit process (with kdeinit itself deliberately linked to all large dependency libs like Qt and kdelibs), dynamically loading the application logic (stored as a .so) and then launching the app.
This was more to save startup time due to how long it took to dynamically resolve a multitude of C++-based symbols back then, all the common logic came before the app's own main() would ever be called. But it did also save a bit of memory as well.
Comment by PaulDavisThe1st 3 days ago
It's called clone(2)
Comment by toast0 3 days ago
Comment by vlovich123 3 days ago
Comment by loeg 3 days ago
Comment by vlovich123 3 days ago
The reason to do a zygote in the first place could be solved with alternative special APIs that are safer and harder to misuse. But we have fork so there’s not as big of a demand despite the warts.
Comment by loeg 2 days ago
Comment by p_l 3 days ago
Yes, the zygote pattern makes it easy to turn fork() into a bottleneck - it requires a lot more discipline and low level tricks (linker scripts, compiler-specific extensions, custom sections, low level dependencies on pagesize that get "fun" on ARM servers).
If you don't, you might wake up with fork() causing latency issues.
Comment by cyberax 3 days ago
Raw fork() is terrible. Instead we need a proper primitive to stop and make a snapshot of a process.
Comment by pizlonator 2 days ago
My trick for that is that the set of threads that I create pre fork have to be suspendable and resumable, preferably lazily (they resume when they are actually needed). So, the zygotes are sitting with those threads suspended. When they become active, they can do work immediately. They might lazily resume those threads as needed.
There are other idioms for this too.
> Raw fork() is terrible. Instead we need a proper primitive to stop and make a snapshot of a process.
Folks have been saying that it's terrible for as long as I can remember. But it's still there, because it's better than the alternatives
Comment by cyberax 2 days ago
Well, yes. You need to wait for all the threads to park themselves at safepoints. This can work if you control the whole runtime, and you don't use something that creates threads behind your back.
This is actually why I've always been interested in a better fork(), it has a lot of parallels with stop-the-world needed for GCs.
> Folks have been saying that it's terrible for as long as I can remember. But it's still there, because it's better than the alternatives
I don't think we have alternatives? Except maybe ptrace()?
Comment by cryptonector 3 days ago
Comment by up2isomorphism 3 days ago
Comment by sanderjd 3 days ago
Comment by 1718627440 3 days ago
Comment by yxhuvud 3 days ago
It shares way too much, and has huge use cases where it is really, really bad.
Comment by gmueckl 3 days ago
Comment by stefan_ 3 days ago
Comment by jonhohle 3 days ago
Comment by sanderjd 3 days ago
Comment by sanderjd 3 days ago
Comment by 7jjjjjjj 3 days ago
Isn't that what posix_spawn is for?
Comment by toast0 3 days ago
Comment by yxhuvud 3 days ago
Comment by JdeBP 3 days ago
And of course that has already been done. On NetBSD, posix_spawn() is a fully-fledged system call and much of the work is done in kernel mode.
* https://blog.netbsd.org/tnf/entry/posix_spawn_syscall_added
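For anyone who hasn't used it, here is roughly what the posix_spawn route looks like, with a file action standing in for the code one would otherwise run between fork() and exec(). A minimal sketch; `spawn_redirected` is a name made up for this example, and it only shows one of the available file actions.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Spawn argv[0] (looked up in PATH) with stdout redirected to out_fd.
 * Returns the child's exit status, or -1 on failure. */
int spawn_redirected(char *const argv[], int out_fd)
{
    posix_spawn_file_actions_t fa;
    if (posix_spawn_file_actions_init(&fa) != 0)
        return -1;

    /* Describe the child-side setup declaratively: "make out_fd the
     * child's stdout". Nothing runs in the parent's image for this. */
    int err = posix_spawn_file_actions_adddup2(&fa, out_fd, STDOUT_FILENO);

    pid_t pid;
    if (err == 0)
        err = posix_spawnp(&pid, argv[0], &fa, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&fa);
    if (err != 0)
        return -1;       /* posix_spawn* return an errno value, not -1 */

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Whether the library implements this with fork, vfork, clone or a dedicated system call is up to the platform; the caller's code stays the same either way.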
Comment by stabbles 3 days ago
Comment by sanderjd 3 days ago
Comment by anarazel 3 days ago
Comment by dnw 3 days ago
Comment by wongarsu 3 days ago
Comment by sanderjd 3 days ago
Comment by jerf 3 days ago
This is just an example of I don't even know how many things a modern-day process will share from its parent.
By "complicated" I do not even remotely mean "unsolvable". I just mean that if you really dig down into what it means to "share nothing" in a modern operating system, it's a lot richer than it was back when fork+exec was a practical solution. There's a lot of fuzzy things that could go either way when you say "shares nothing".
Comment by sanderjd 3 days ago
Comment by dcrazy 3 days ago
Comment by jerf 3 days ago
I also explicitly said this wasn't unsolvable. My point isn't about technical implementations or code, my point is that the casual "I want to share nothing about the parent process" thought in sanderjd's mind, and presumably a lot of others', is much more ill-defined than they realize. There's a lot more state that a process has than what file descriptors are open in a modern system.
Moreover, as things like "in which container is this running" demonstrate, those are also not "create a process that has nothing to do with this process", because, again, there's a lot more to "having to do with this process" than "what file descriptors are open".
Also, as the name might have been a clue, Linux has posix_spawn: https://linux.die.net/man/3/posix_spawn. It also has a thing called "clone": https://www.man7.org/linux/man-pages/man2/clone.2.html Nor do I claim this paragraph is an entire overview of all the ways of starting a process in Linux. If you want to understand what I mean by "lots of details in a modern OS", your assignment is to carefully read the entire "clone" man page, and you'll start to see what I mean, though I'm not sure even that is all the state associated with a process nowadays.
Comment by sanderjd 3 days ago
I don't think it is necessary (or the best implementation) to clone the parent process, in order to maintain important properties like the process tree / container state, etc. I recognize that it's a sorta neat hack, "well if we just start by cloning the parent, then we don't have to figure out what state to include!", but that just pushes the details to the child process needing to figure out what to exclude, which IMO is a worse default.
Comment by dcrazy 3 days ago
Other operating systems either have parallel APIs to fork (e.g. the posix_spawn syscall on macOS) or do not provide fork at all (Windows).
Comment by jerf 2 days ago
Comment by JoBrad 3 days ago
Comment by mrkeen 3 days ago
It's weird to leave out a mention of copy-on-write - the optimisation that means that you don't copy over all the memory.
Comment by tux3 3 days ago
That means you have to allocate new pages to hold a copy of all these structures, even if the actual memory pointed by the pages is shared. And walking all those structures to make a copy is still costly.
Comment by thamer 3 days ago
Even back in 2012 this blog post showed the high cost of this operation: https://redis.io/blog/testing-fork-time-on-awsxen-infrastruc...
On an m2.xlarge using ~25GB of RAM, fork() took 5.67 seconds. That's a long pause when Redis clients typically experience single-digit msec latency for most operations. Yes, that's only the time needed to copy the page table. It's surprising they don't mention huge pages, it seems like it would be a key consideration here.
No doubt hardware is faster 14 years later, but Redis instances likely use more RAM too. It'd be interesting to see this benchmark revisited.
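A back-of-the-envelope sketch of why huge pages matter for that number. Assumptions, all mine: 8-byte page-table entries as on x86-64, every page of the 25 GiB actually mapped, and only the lowest-level entries counted. The function names are invented for this example, and none of this says anything about how long the copy takes on a given hypervisor.

```c
#include <stdint.h>

/* Number of lowest-level page-table entries needed to map mem_bytes
 * using pages of page_bytes each (rounded up). */
uint64_t leaf_entries(uint64_t mem_bytes, uint64_t page_bytes)
{
    return (mem_bytes + page_bytes - 1) / page_bytes;
}

/* Bytes occupied by those entries, assuming 8 bytes per entry. */
uint64_t leaf_table_bytes(uint64_t mem_bytes, uint64_t page_bytes)
{
    return leaf_entries(mem_bytes, page_bytes) * 8;
}
```

With 4 KiB pages, 25 GiB works out to 6,553,600 entries, or 50 MiB of page tables that fork() has to walk and copy. With 2 MiB huge pages the same memory needs 12,800 entries, about 100 KiB.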
Comment by epcoa 3 days ago
For the intended audience of such a paper this is base knowledge.
Comment by FooBarWidget 3 days ago
Comment by mort96 3 days ago
Comment by m00x 3 days ago
I guess it depends on how sensitive your application is to main thread pauses.
Comment by trumpdong 3 days ago
Comment by Joker_vD 3 days ago
Comment by tempest_ 3 days ago
Comment by cls59 3 days ago
Comment by josefx 3 days ago
Comment by j16sdiz 3 days ago
> Attempts (such as vfork()) have been made over the years to optimize for this case, but the pattern still is more expensive than it could be.
Basically vfork does a "stop the world".
Comment by cryptonector 3 days ago
vfork() does NOT stop the world in many / most implementations. The ones that do stop the world do it because someone misunderstood the whole "vfork() stops the parent process" -- yes, it stops the parent process in a pre-threads world, but it doesn't have to stop any other threads but the one that called vfork(). Indeed, many implementations don't do that.
(Someone once tried to make NetBSD's vfork() stop the world because that's what the pre-threading man page said it does. I did my utter best to keep that from happening at the time, and it didn't then. Hopefully no one tried again later.)
Comment by uecker 3 days ago
Comment by amluto 3 days ago
Windows, for all its many, many faults, did not use fork+exec and instead mostly has options for how one creates a process. It wasn’t done elegantly, but it was the right decision.
Comment by uecker 3 days ago
Any kind of replacement should aim for the same conceptual simplicity and power. Sadly, I fear that people driving development nowadays are more interested in building unbreakable walled gardens for advertisement or app stores, or trying to squeeze out some small gain when used on the cloud. I am more interested in general computing on the user side.
Comment by dcrazy 3 days ago
Comment by uecker 3 days ago
Comment by IshKebab 3 days ago
Comment by uecker 2 days ago
Comment by IshKebab 2 days ago
Comment by uecker 2 days ago
Comment by IshKebab 2 days ago
I dunno, that's the best I can do for now. Maybe you can do better?
Comment by uecker 2 days ago
Comment by amluto 3 days ago
Pipes and redirections don’t need fork + exec. Neither do subshells.
Comment by uecker 3 days ago
Comment by dwattttt 2 days ago
Possibly the most common way to tell the child the value is by setting it as a CLI arg in CreateProcess.
Comment by uecker 2 days ago
Comment by dwattttt 2 days ago
How do you selectively pass on fds without having a global impact on your process?
Comment by uecker 2 days ago
Comment by dwattttt 2 days ago
There have been plenty of comments here about effective workarounds, multi-process architectures to keep fork cheap, zygotes... these are very specifically working around the problem, while trying to avoid admitting it's a problem.
Comment by uecker 1 day ago
Comment by dwattttt 1 day ago
It's an achievable result with less specialised parts, but there's a reason we don't write all of our logic in NAND gates.
I'm still only inferring what you meant by CreateProcess needing special facilities; I assume you meant opt-in fd inheritance. Saving a parameter call while forcing all fds to be shared/duplicated seems penny-wise but pound-foolish.
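As a rough illustration of opt-in fd inheritance, here is a minimal sketch using Python's subprocess module, where descriptors are non-inheritable by default and the parent names the ones it wants to hand over via `pass_fds`. The helper name `child_can_use_fd` is made up for this example; the child is simply told the fd number on its command line.

```python
import os
import subprocess
import sys


def child_can_use_fd(pass_it):
    """Spawn a child that tries to write b'ok' to one of the parent's pipe fds."""
    r, w = os.pipe()  # created non-inheritable (close-on-exec) by default
    code = "import os, sys; os.write(int(sys.argv[1]), b'ok')"
    proc = subprocess.run(
        [sys.executable, "-c", code, str(w)],  # fd number passed as a CLI arg
        pass_fds=(w,) if pass_it else (),      # explicit opt-in, per child
        stderr=subprocess.DEVNULL,             # hide the child's traceback on failure
    )
    os.close(w)
    with os.fdopen(r, "rb") as reader:
        data = reader.read()
    return proc.returncode == 0 and data == b"ok"


if __name__ == "__main__":
    print("fd passed explicitly:", child_can_use_fd(True))
    print("fd not passed:       ", child_can_use_fd(False))
```

Nothing about the parent's other descriptors changes here; only the one listed in `pass_fds` is visible to that particular child.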
Comment by __david__ 3 days ago
Comment by JdeBP 3 days ago
* https://jdebp.uk/FGA/bernstein-on-ttys/cttys.html
Interestingly, on MS/PC/DR-DOS file descriptor 3 was stdaux, and file descriptor 4 was stdprn.
Comment by 1718627440 3 days ago
Comment by amluto 3 days ago
Comment by jonhohle 3 days ago
Comment by chasil 3 days ago
The Windows approach may be correct, but it suffers in performance from the POSIX perspective.
I have heard that WSL1 improves this.
Comment by amluto 3 days ago
Windows does not historically depend on fork(), so there was no native fork(), so Cygwin kludged it up.
Comment by JdeBP 3 days ago
Comment by jkrejcha 3 days ago
Though actually iirc werfault uses NtCreateUserProcess() to clone processes when writing out crash dumps to this day
Comment by burnt-resistor 3 days ago
If you want to greenfield re-engineer the world with all new system calls and a totally different execution model, feel free to go right ahead.
Comment by wvenable 3 days ago
― George Bernard Shaw, probably.
Comment by jkrejcha 3 days ago
For example
pidfd_t ps = spawn(); // creates a process stopped (kernel does this anyway by default)
setuid(ps, 33);
capset(ps, ...);
socket(ps, ...);
mmap(ps, ...);
process_vm_writev(ps, ...);
exec(ps, ...);
signal(ps, SIGCONT);
// error handling elided
I guess this is partly me being a bit critical of the usual syscall APIs for not thinking about "what if I want to do this to another process I have access to", but... It also makes things like thread safety even reasonably doable with fork. I do agree, though, that something like CreateProcess, which takes a gazillion parameters, doesn't really make for the greatest of userspace APIs.
Comment by uecker 3 days ago
But how often would one actually need this? And what are the semantics? Refer arguments (e.g. file descriptors) to the current process or the other one? How are cross-permissions handled? It seems a lot of complexity...
Someone proposed a ptrace_syscall which could achieve the same thing.
Comment by jkrejcha 3 days ago
Well, the idea is that it'd probably be close to the default API for spawning processes (and could even be the bedrock for posix_spawn and friends in libc (and potentially even "simple" fork cases[1])). fork/clone would be the special case
In most cases, most programs don't need special setup. Something like `ptrace_syscall` would also work for this and would be probably the way to do it with the backwards compat limitations of nowadays
ptrace-ability seems to be how permissions for this sort of thing are generally handled (see also procfs, process_vm_writev, ptrace, etc). The complication is a little bit around setuid programs, but you could either special-case execve to imply SIGCONT for setuid programs or have execve always imply a SIGCONT.
[1]: Probably would be rare for a compiler to optimize it though
Comment by jcranmer 3 days ago
In an alternative world where fork+exec never existed, a lot of those "usual APIs" would probably have had an explicit pid argument to them that let you modify process configuration from a different process. (This is how Fuchsia works, e.g.). There's a lot of benefit to this world: the most obvious is that you don't have to magic up some IPC system just to report configuration errors, but there's actually a good amount of utility in being able to have a manager process that is tweaking attributes of its children (e.g., debuggers would love it).
Comment by trumpdong 3 days ago
Comment by uecker 3 days ago
But frankly, I am not really seeing the value.
Comment by trumpdong 3 days ago
Comment by uecker 3 days ago
Comment by fonheponho 2 days ago
Unfortunately, the opposite is true, when the parent process is multi-threaded. In the child process, only one thread exists (the thread returning from fork()), but the memory is an exact copy of the parent's. As a result, the child may inherit locks (resident in memory) that are in acquired state, but have no owner threads -- the threads that are responsible for eventually releasing those locks in the child's copy of the process memory do not exist in the child. If the single thread in the child process (returning from fork()) attempts to take such a lock (before exec), it deadlocks. This is why POSIX says that only async-signal-safe functions may be called in a child process, between fork and exec. And then, for example, "malloc" is not such a function (at least per POSIX), so the fork-to-exec environment in the child process is an extremely uncomfortable one. You've got to preallocate everything in the parent, can't report errors to stderr, etc.
https://pubs.opengroup.org/onlinepubs/9799919799/functions/f...
https://pubs.opengroup.org/onlinepubs/9799919799/functions/V...
The fork(2) Linux manual page spells out the same restriction.
https://man7.org/linux/man-pages/man2/fork.2.html
https://man7.org/linux/man-pages/man7/signal-safety.7.html
"pthread_atfork" exists, but is effectively unusable.
https://pubs.opengroup.org/onlinepubs/9799919799/functions/p...
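The orphaned-lock situation can be reproduced in a few lines. This is only a sketch, using Python on a POSIX system: a helper thread holds a lock while the main thread forks, and the child then checks, without blocking, whether it can take the lock. The function name `child_inherits_held_lock` is invented for the example, and the warning filter is there only because recent Python versions warn about forking a multi-threaded process.

```python
import os
import threading
import warnings


def child_inherits_held_lock():
    """Return True if a forked child sees a lock that no thread in it can release."""
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()

    def holder():
        with lock:
            held.set()
            release.wait()  # keep the lock held while the parent forks

    t = threading.Thread(target=holder)
    t.start()
    held.wait()  # the helper thread now owns the lock

    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()
    if pid == 0:
        # Child: only the forking thread exists here; the holder thread does not.
        got = lock.acquire(blocking=False)
        os._exit(0 if got else 1)  # a blocking acquire would hang forever instead

    release.set()
    t.join()
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 1


if __name__ == "__main__":
    print("child saw an orphaned lock:", child_inherits_held_lock())
```

The same thing happens to locks hidden inside libraries, which is why only async-signal-safe calls are considered safe in that window.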
Comment by uecker 2 days ago
Comment by fanf2 3 days ago
Comment by zzo38computer 3 days ago
I think one problem is that it is already how it is; making an entirely new operating system (that is not Linux, not GNU, and not POSIX) would solve it, but that is not the case here, so it would need to be done as it is.
One possibility would be a new function that creates a new empty child process, but the parent process specifies what system calls the child process executes, and can stop if specifying that exec or exit is (successfully) called by the child process, or if the parent process gives it the program memory to execute directly instead of using a file (since that use is also useful). The new function can still have some of the clone flags available. (I don't actually know how much better it would work.)
There are other possibilities as well.
The existing methods can also remain available for when they are helpful, but functions such as popen might be changed to use the new method.
Comment by trumpdong 3 days ago
Comment by __david__ 3 days ago
Comment by cryptonector 3 days ago
POSIX says nothing much about vfork() anymore. It was a mistake removing it. Zealots failed to understand that vfork() >> fork(). https://news.ycombinator.com/item?id=30502392
Comment by pjc50 3 days ago
Quick, what's the highest numbered open file descriptor in your program?
This gets even worse if you have multiple threads running. Without looking it up, what is the state of all the various synchronization primitives in a forked process?
Comment by matheusmoreira 3 days ago
Comment by garaetjjte 3 days ago
Comment by uecker 3 days ago
Comment by trumpdong 3 days ago
Comment by PaulDavisThe1st 3 days ago
Comment by jcalvinowens 3 days ago
Yes, it's copy on write... but there is a linear relationship between the size of the process and the number of page table entries required to represent it.
Comment by themafia 3 days ago
This is not exactly fixed since you can vary the amount of memory each page maps with things like hugepages and the same process can run with different page sizes.
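To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes 8-byte page table entries (as on x86-64), counts only leaf-level entries, and assumes every page of the process is actually mapped; the 100 GiB process size is just an example figure, and a real kernel may skip copying some mappings at fork time.

```python
def leaf_page_table_entries(mapped_bytes, page_size):
    """Number of leaf entries needed to map `mapped_bytes` (ceiling division)."""
    return -(-mapped_bytes // page_size)


GiB = 1 << 30
MiB = 1 << 20
PTE_BYTES = 8  # assumption: 8-byte entries, as on x86-64

if __name__ == "__main__":
    for label, page in (("4 KiB", 4 << 10), ("2 MiB", 2 << 20)):
        n = leaf_page_table_entries(100 * GiB, page)
        print(f"{label} pages: {n:,} entries, "
              f"{n * PTE_BYTES / MiB:.1f} MiB of leaf tables")
```

Under these assumptions a 100 GiB process needs about 26 million leaf entries (roughly 200 MiB of page tables) with 4 KiB pages, and 51,200 entries with 2 MiB pages.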
Comment by IshKebab 3 days ago
Comment by ajkjk 3 days ago
I am curious about what the best way to handle the example in the article of one process spawning many git subprocesses is. Surely it just doesn't make sense to repeatedly start git from scratch in the course of a long-running parent operation. What's the low cost abstraction for the same result, though?
Comment by spacechild1 3 days ago
Comment by kps 3 days ago
Otherwise you need multiple steps to create a process, fill it with something to run, and arrange for it to execute. Or like Win32 you permanently smush them together with other layers, like filesystems and object loaders and linkers.
Comment by IshKebab 3 days ago
Attempting to justify clone/exec as a reasonable design is just Stockholm syndrome.
Comment by kps 2 days ago
Clone-and-modify is pretty common in CAD.
> You don't create a new file by copying an existing one and then modifying it.
Clone-and-modify is almost universal in version control systems.
Comment by IshKebab 2 days ago
It's closer to copy-on-write. Also, it actually makes sense there because in 99.999% of cases a commit actually is a modified copy of its parent. That isn't true for process spawning.
Comment by Too 3 days ago
The only thing I want to inherit from the parent process is its cwd and environment variables, even those are often overridden. The rest can easily be passed explicitly through other channels like pipes or command line arguments.
Back to the example from the article. It makes no sense that a git subprocess forked from a web server needs to have any process state inherited from the web server.
Comment by kps 2 days ago
Yes, exactly. Cloning, as a process creation primitive, is the one thing that doesn't need to be concerned with other stuff.
> … a git-subprocess forked from a web server …
That's pulling in a whole load of assumptions that are distinct from process creation. You can have processes in an environment that has no concept of file system or persistent storage at all.
Comment by ajkjk 3 days ago
Comment by fluffybucktsnek 3 days ago
Comment by wmf 3 days ago
Comment by trumpdong 3 days ago
Comment by Panzerschrek 3 days ago
Comment by ktpsns 3 days ago
Comment by ComputerGuru 3 days ago
Comment by smj-edison 3 days ago
Comment by sanderjd 3 days ago
Comment by Chu4eeno 3 days ago
Comment by sanderjd 3 days ago
Comment by codedokode 3 days ago
My idea is that we could make a new syscall, for example "spawn", that creates a new empty process, loads some lightweight "loader" into it, and passes arbitrary configuration data. The loader configures the process and exec()'s the main program. This avoids forking the memory and keeps existing APIs, but still requires forking file descriptors and other things.
Comment by nyrikki 3 days ago
(Sorry if you weren't joking) but yes, posix_spawn() has been a thing for a while, and in glibc fork is just a wrapper around clone().
Not exactly the OP's idea, but fork/exec is legacy really.
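For reference, a minimal sketch of what posix_spawn() looks like from Python, which exposes it as `os.posix_spawn` on platforms that have it. The child's setup is described as data (a list of file actions) instead of code that runs between fork and exec. The helper name `spawn_and_capture` is made up for this example.

```python
import os
import sys


def spawn_and_capture(argv):
    """Run argv via posix_spawn and return (exit status, captured stdout bytes)."""
    r, w = os.pipe()  # both ends are close-on-exec, so the child won't keep them
    actions = [
        (os.POSIX_SPAWN_DUP2, w, 1),  # child's stdout becomes the pipe's write end
    ]
    pid = os.posix_spawn(argv[0], argv, os.environ, file_actions=actions)
    os.close(w)
    with os.fdopen(r, "rb") as out:
        data = out.read()
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status), data


if __name__ == "__main__":
    print(spawn_and_capture([sys.executable, "-c", "print('hi')"]))
```

The limitation is visible too: anything the file-actions and attribute lists cannot express has to be done some other way.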
Comment by MayCXC 1 day ago
Comment by mpweiher 3 days ago
- address space
- memory objects
- threads
Mix and match. A Task (process) is not a primitive, but a composite object combining address space with one or more threads. How you fill the address space with actual memory objects is up to you. Map from disk or COW your own address space...have fun!
https://developer.apple.com/library/archive/documentation/Da...
Comment by trumpdong 3 days ago
Comment by asveikau 3 days ago
If you contrast that with win32, where you optionally pack a bunch of initial values into a struct, win32 is a much more narrow, less pleasant, less freeform interface, where it is harder to introduce more features.
But I think there is already posix_spawn to imitate that philosophy on Unix-like OSs.
Comment by dcrazy 3 days ago
Comment by asveikau 3 days ago
Comment by loeg 3 days ago
What do you mean underestimated? You can do anything between fork and exec; there are no limitations.
Comment by asveikau 3 days ago
Comment by loeg 3 days ago
You're talking about libc (glibc) implementation details now; userspace programs running on the Linux kernel do not have to be implemented in C or use glibc's primitives. Your earlier comment I initially replied to was talking about kernel syscalls. Forked processes are free to invoke any syscall they want, not just dup2 or a handful of others.
Comment by asveikau 3 days ago
The forked child has only 1 thread in its process. If the parent's threads are holding a lock or are in the middle of mutating a shared data structure, you're fucked, because those threads are no longer running in your child's copy of the address space and will not finish their work. This issue is fundamental to how threads work and what fork(2) does.
Comment by loeg 3 days ago
Comment by asveikau 3 days ago
This means if the program is multi threaded, you cannot rely on calling malloc in the child, because at the time of the fork another thread could have happened to be inside malloc doing manipulations on the global heap.
Which means, practically speaking, "don't allocate memory between fork and exec".
If you want to be overly literal as you have been, you can call mmap and it will give you new pages, but who is really doing that? Not the random shared library code you might want to call into. Hell, even a lot of libc calls malloc.
Which means it's not safe to do a random library call between fork and exec.
See where I'm going with this? That's if your program is multi threaded. If it isn't, these things are most likely fine.
Comment by asveikau 3 days ago
Signal safety is not the same as this, but similar. I believe posix specifies what is signal-unsafe to be overly broad. But the unsafety isn't an illusion -- it's an emergent property from something being a bad idea given the primitives at work, there are broad categories of bugs that are easy to introduce due to the way it works. So for signals, posix declares a bunch of ill advised things to be undefined, and with good reason. This is an analogous scenario.
Comment by peterfirefly 3 days ago
Comment by dcrazy 3 days ago
Comment by loeg 3 days ago
Comment by asveikau 3 days ago
Comment by loeg 3 days ago
No, you absolutely did not: https://news.ycombinator.com/item?id=48427396
Literally nothing in that comment mentions or discusses threads.
> I did not say they are constrained in what syscalls they can make
You wrote: "The things you can do between fork and exec are sometimes underestimated. Off the top of my head, you can call dup2(), you can set a process group id, probably a few other things."
Those are all syscalls. You can also invoke any of the other ~hundreds of syscalls linux exposes, not only dup2, setpgid, and a "few" others.
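For anyone who hasn't seen the pattern being argued about, here is a minimal sketch of "do some setup between fork and exec", in Python on a POSIX system. It assumes a single-threaded parent; `run_in_new_group` is a made-up name. The child puts itself into a new process group, points its stdout at a pipe, and then execs.

```python
import os
import sys


def run_in_new_group(argv):
    """fork, reconfigure the child with ordinary syscalls, then exec argv."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.setpgid(0, 0)       # child becomes leader of a new process group
            os.dup2(w, 1)          # stdout -> pipe (the new fd 1 is inheritable)
            os.execv(argv[0], argv)
        finally:
            os._exit(127)          # reached only if setup or exec failed
    os.close(w)
    with os.fdopen(r, "rb") as out:
        data = out.read()
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status), data


if __name__ == "__main__":
    check = "import os; print(os.getpgid(0) == os.getpid())"
    print(run_in_new_group([sys.executable, "-c", check]))
```

The setup steps are plain calls on the current process, which is the flexibility being praised; the same window is where the multi-threading hazards discussed in this thread apply.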
Comment by lokar 3 days ago
Comment by 1718627440 3 days ago
If I use a library, I also need to start using threads and need to invent some core synchronization mechanism. I am essentially reinventing a small scheduler, when I already get this from the OS for free. Also, now any crash in the third-party code will crash the whole program, and the third-party code has access to the whole address space. When invoking a process, you also get a standardized API implemented by the OS.
Comment by lokar 3 days ago
Comment by omoikane 3 days ago
I can recall just one program that's intentionally not implemented as a library, but I think people have since built a library on top of it:
https://dechifro.org/dcraw/#:~:text=Why%20don%27t%20you%20im...
Comment by sanderjd 3 days ago
Comment by lokar 3 days ago
Comment by kllrnohj 3 days ago
Comment by sanderjd 3 days ago
Comment by m132 3 days ago
Comment by sanderjd 3 days ago
Comment by aerzen 3 days ago
Comment by 1718627440 3 days ago
Comment by lokar 3 days ago
Comment by MBCook 3 days ago
Comment by lokar 2 days ago
Bash as a programming language is just a bad idea.
Comment by m132 3 days ago
Comment by sanderjd 3 days ago
Comment by m132 2 days ago
A lot of other Python implementations don't have the ability to spin up new processes at all too.
Comment by sanderjd 2 days ago
Comment by pizlonator 3 days ago
Comment by lokar 2 days ago
Comment by debatem1 3 days ago
Comment by MBCook 3 days ago
For launching something totally new, like the example in the article of some tool calling git, I think it does make a ton of sense to make something new.
Especially since I suspect that is by far the more common case. I suspect “I want a clone of me“ is relatively rarely used at this point.
Comment by debatem1 3 days ago
Comment by surajrmal 3 days ago
Comment by medoc 2 days ago
Comment by LoganDark 3 days ago
Comment by corbet 2 days ago
Comment by burnt-resistor 3 days ago
Every couple of years, someone claims they have "the solution" implying everyone else who came before them didn't know what they were doing.
Comment by yxhuvud 3 days ago
Comment by mike_hock 3 days ago
I.e. a year that starts with 20, not 19.
Comment by JdeBP 3 days ago
Comment by foo-bar-baz529 3 days ago
Comment by a-dub 3 days ago
Comment by zbentley 2 days ago
That, and even those clone-without-pagetable-copy improvements leave a lot of slowness on the table. Being able to skip even disable-able functionality intended for fork would simplify code. Also, for programs that launch the same subprocess many times, a better API might allow caching away some of the pre-entrypoint initialization of exec.
Comment by Sophira 3 days ago
Comment by zerobees 3 days ago
In fact, if you profile it, in the fork() + execve() model, execve() is far more expensive, because not only does it replace the old process with a new one, but it also involves running the dynamic linker, which opens, parses, and mmaps library files.
It still makes sense to get rid of the fork() overhead if you're going to throw away the cloned process state soon thereafter, but if you wanted to make process execution radically faster, rethinking the exec architecture would probably offer more significant gains.
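If you want to check this on your own machine, a crude timing sketch along these lines may help. It compares fork alone (child exits immediately) with fork followed by exec of a trivial program. The function names are invented for this example, the numbers include process exit and wait overhead, and results will vary a lot with the size of the parent and with what is being exec'd.

```python
import os
import shutil
import sys
import time


def time_fork_only(n):
    """Average seconds per fork + immediate child exit + wait."""
    start = time.perf_counter()
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)
        os.waitpid(pid, 0)
    return (time.perf_counter() - start) / n


def time_fork_exec(n, argv):
    """Average seconds per fork + exec of argv + wait."""
    start = time.perf_counter()
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            try:
                os.execv(argv[0], argv)
            finally:
                os._exit(127)  # reached only if exec failed
        os.waitpid(pid, 0)
    return (time.perf_counter() - start) / n


# Use the system's `true` binary if present, else fall back to a no-op Python run.
true_path = shutil.which("true")
argv = [true_path] if true_path else [sys.executable, "-c", "pass"]

if __name__ == "__main__":
    print(f"fork only:   {time_fork_only(200) * 1e6:.0f} us per iteration")
    print(f"fork + exec: {time_fork_exec(200, argv) * 1e6:.0f} us per iteration")
```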
Comment by corbet 3 days ago
Comment by nasretdinov 3 days ago
Comment by sanderjd 3 days ago
Comment by dijit 3 days ago
It might be commonly held convention, and thus, an assumption, in Linux (and, broadly, UNIX) but I don't think it's true inside VAX or even Windows, so I don't think it's a requirement.
Unless I've missed something (which is totally possible; this is not an area of OS design I've spent much time on).
Comment by lokar 3 days ago
Comment by ggm 3 days ago
I do use threaded code. It's significantly harder to write and reason about. (45 years in to a CS career, ageing out)
You have to be clever to do better than clever people. Clever people bootstrapped me into fork()/exec() and I know my limits.
Comment by redleader55 3 days ago
Comment by skydhash 3 days ago
Yes, we’re not the ones paying for Linux development, but its subsystems are so complicated for general purpose computing. Like fitting Formula 1 car parts onto a Camry.
Comment by tadfisher 3 days ago
x86 still runs in real mode on boot despite dropping the PC BIOS.
Lots of software still assumes a 4kb page size, to the point where migrating Android to 16kb is an ongoing multi-year effort involving far too many people. And this is an OS for phones, which you might assume would lack the memory to benefit from a larger page size.
And one of the most popular consumer CPUs for enthusiasts, the Ryzen X3D chips, broke assumptions in both Linux and Windows schedulers that all cores have access to the same amount of L3 cache.
I would probably not assume the kinds of hardware limitations that we have now will persist into the useful lifetime of current software. Splitting the OS into "consumer" and "enterprise" variants is one of those moves that would bake in a ton of assumptions and make things messier in the future.
Comment by skydhash 3 days ago
UEFI could have supported something like ELF and done away with real mode. Intel and AMD could have just introduced a new line of CPUs and everyone could have transitioned to that (with maybe shims to soften the change). But everyone is all about backwards compatibility and compile once, run for eternity.
Comment by skydhash 3 days ago
Comment by stevefan1999 3 days ago
Comment by tus666 2 days ago
Comment by high_byte 2 days ago
Comment by sathyayoshi 3 days ago
Comment by hparadiz 3 days ago
I mean maybe this has been optimized for already and I don't know what I'm talking about but maybe someone with more knowledge about the kernel knows? Is this something we simply can't optimize for because of security implications?
Comment by 201984 3 days ago
Editing to add: this deduplication is one of the greatest upsides to dynamic linking. Common libs like libgcc and libc only have to exist in memory once and can stay in CPU caches, whereas if they were statically linked into every binary, each binary would have a copy of that library that wouldn't be shared with anything else and you'd waste a lot of memory.
Comment by sjmulder 3 days ago
Comment by ptspts 3 days ago
Comment by monocasa 3 days ago
Comment by 201984 3 days ago
They can't, so even PIC code still has to have a relocation table that gets patched. It's in a different page than the code though, so code does still get reused.
Comment by monocasa 3 days ago
Comment by 201984 3 days ago
If not patching, what exactly would you call modifying part of the file?
Comment by monocasa 3 days ago
This isn't meant as a reductive take, but instead that there is a difference between something completely describable in C, like the contents of the .got section, and something like a .reloc section that actually has to understand the generated assembly in order to build the relocation table to load and link the executable. Both are linking, but I've saved "patching" for more brain-surgery-esque techniques. Like on MIPS, the jump instruction's immediate holds 26 bits of the target's absolute address (bits 27..2), so you're going through and modifying all of the jump instructions if you load it somewhere other than where it was linked.
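To make the MIPS case concrete, here is a small sketch of what patching one `j`/`jal` instruction involves when an image is loaded at a different address. It assumes the classic MIPS32 J-type encoding (6-bit opcode, 26-bit field holding the target address shifted right by two), a load offset that is a multiple of 4, and a target that stays inside the same 256 MiB region; `relocate_mips_jump` is a made-up helper name.

```python
def relocate_mips_jump(insn, delta):
    """Rewrite the 26-bit target field of a MIPS J-type instruction.

    insn  : 32-bit instruction word (j or jal)
    delta : load address minus link address, in bytes (multiple of 4)
    """
    opcode = insn & 0xFC000000             # top 6 bits stay unchanged
    target = (insn & 0x03FFFFFF) << 2      # byte offset within the 256 MiB region
    new_target = (target + delta) & 0x0FFFFFFF
    return opcode | (new_target >> 2)


if __name__ == "__main__":
    j_0x00400000 = 0x08100000              # encoding of: j 0x00400000
    print(hex(relocate_mips_jump(j_0x00400000, 0x10000)))
```

Every jump in the code has to be rewritten this way, so the text pages are modified and can no longer be shared unchanged with another process that loaded the same file at a different address.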
Comment by t-3 3 days ago
Comment by saidinesh5 3 days ago
> The kernel keeps track of which file is mapped where, and can detect when a request is made to map an already mapped file again, avoiding physical memory allocation if possible.
Relevant stack overflow answer: https://stackoverflow.com/questions/61950951/linux-shared-li...
Comment by mlaretallack 3 days ago
Comment by monocasa 3 days ago
Unices have been sharing executable memory between processes longer than there's been mmap for user space to do the same thing themselves. I remember seeing it in the 2BSD kernel for instance.
Comment by BoingBoomTschak 3 days ago
Comment by 1718627440 3 days ago
Comment by johnthescott 3 days ago
Comment by sirsinsalot 3 days ago
In this case too, you think it is silly because you don't understand it. Your assumptions are wrong, making it seem silly.