Yeah, the nominal libc restrictions are definitely pretty annoying. In my own pr...

the8472 · on May 25, 2024

> I've just taken to calling SYS_clone (or SYS_clone3) directly and treating it like _fork() w.r.t. which functions are safe to call from the child.

Things get really spicy when you want to use CLONE_VFORK to get fast process spawning and pidfds at the same time[0]. I think technically any syscall through libc would be illegal after that because errno is thread-local state and the vfork child isn't allowed to touch that. Regular vfork() handles this by updating thread state.
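For concreteness, here is a rough sketch of the combination being described, using glibc's clone() wrapper with CLONE_VM | CLONE_VFORK | CLONE_PIDFD and a separate child stack. The names spawn_vfork_pidfd, child_main and struct spawn_args are made up for illustration, the 64 KiB stack size is arbitrary, and it assumes Linux 5.2 or later and a downward-growing stack. The child still goes through libc's execve(), which may write errno on failure, so this shows where the question arises rather than answering it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000 /* value from the Linux UAPI headers */
#endif

/* Hypothetical helper type: arguments handed to the child. */
struct spawn_args {
    char *const *argv;
    char *const *envp;
};

/* Runs in the child, which shares the parent's address space and TLS. */
static int child_main(void *p)
{
    struct spawn_args *a = p;
    /* If execve() fails, libc stores errno in TLS shared with the parent. */
    execve(a->argv[0], a->argv, a->envp);
    _exit(127);
}

/* Returns the child's PID and stores a pidfd in *pidfd, or returns -1. */
static pid_t spawn_vfork_pidfd(char *const argv[], char *const envp[], int *pidfd)
{
    assert(argv != NULL && argv[0] != NULL && pidfd != NULL);

    const size_t stack_size = 64 * 1024; /* arbitrary size for the sketch */
    char *stack = mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    struct spawn_args a = { argv, envp };
    /* With the legacy clone() interface, CLONE_PIDFD writes the pidfd to
     * the parent_tid argument. CLONE_VFORK suspends the parent until the
     * child has exec'd or exited. */
    pid_t pid = clone(child_main, stack + stack_size,
                      CLONE_VM | CLONE_VFORK | CLONE_PIDFD | SIGCHLD,
                      &a, pidfd);

    /* The child no longer uses the stack once clone() has returned here. */
    munmap(stack, stack_size);
    return pid;
}
```

The parent can then poll the pidfd or waitpid() on the returned PID as usual.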

And it's not just libc. The kernel devs are imo a bit too lax when it comes to their API specs. E.g. I recently ran into an issue around the specification of the close() syscall. Unlike write() its manpage doesn't have the "other errors may occur" caveat, and yet FUSE can cause arbitrary errors to be returned from close(), including EBADF. When I requested clarification[1] they were neither willing to call the FUSE behavior a bug nor update the docs.
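In practice that pushes callers toward something like the sketch below: call close() exactly once and pass along whatever errno comes back, without assuming it is one of the documented values. The name close_once is made up for illustration. The no-retry part relies on the Linux behavior that the descriptor is released even when close() reports an error.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Close fd once. Returns 0 on success, otherwise the errno value reported,
 * which is not assumed to be limited to the values listed in close(2). */
static int close_once(int fd)
{
    if (close(fd) == 0)
        return 0;

    int err = errno;
    assert(err != 0); /* close() failed, so errno should have been set */

    /* Do not retry: on Linux the descriptor is already gone, and the number
     * may have been reused by another thread in the meantime. */
    return err;
}
```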

[0] https://sourceware.org/bugzilla/show_bug.cgi?id=26371

[1] https://lore.kernel.org/all/0a0a1218-a513-419b-b977-5757a146...

LegionMammal978 · on May 25, 2024

> Regular vfork() handles this by updating thread state.

It doesn't, though? At least not on x86 [0] [1]. Stuff like this is why I'm inclined to regard the libc rules as fictions of dubious utility.

> And it's not just libc. The kernel devs are imo a bit too lax when it comes to their API specs. E.g. I recently ran into an issue around the specification of the close() syscall. Unlike write() its manpage doesn't have the "other errors may occur" caveat, and yet FUSE can cause arbitrary errors to be returned from close(), including EBADF. When I requested clarification[1] they were neither willing to call the FUSE behavior a bug nor update the docs.

I mean, FUSE is by no means the only offender with syscall return values. Before execve()ing your program, I can install a seccomp filter that makes any syscall return any errno. Even infallible operations like sched_yield() can be made to return an error. So I operate on the principle that the results I get from a Linux syscall are whatever the environment wants me to see: it's the environment's responsibility not to do something totally schizophrenic.
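A minimal sketch of that, assuming Linux with seccomp-BPF available: a filter that makes sched_yield() fail with a caller-chosen errno and allows everything else. The helper name make_sched_yield_fail is made up for illustration, and a real filter would also check the arch field of struct seccomp_data before trusting the syscall number, which is skipped here for brevity.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sched.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Install a seccomp filter in the calling thread that makes sched_yield()
 * fail with errno `err`. Returns 0 on success, -1 on failure. */
static int make_sched_yield_fail(int err)
{
    assert(err > 0 && err < 4096); /* must be a plausible errno value */

    struct sock_filter filter[] = {
        /* Load the syscall number from the seccomp data. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* sched_yield: fall through to the ERRNO return. Otherwise skip it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_sched_yield, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (err & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

The filter survives execve(), so a launcher can install it and then exec the target program, which from then on sees sched_yield() fail.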

(Well, except for the possibilities where a less-privileged FUSE mount could confuse a more-privileged process. But the others in that thread want to see specific scenarios for that, which makes sense to me. In any case, if you have a privileged process that you don't want less-privileged filesystems to blow up, you likely want special isolated handling for all syscalls accessing it.)

[0] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/uni...

[1] https://git.musl-libc.org/cgit/musl/tree/src/process/x86_64/...

the8472 · on May 26, 2024

> It doesn't, though? At least not on x86 [0] [1]. Stuff like this is why I'm inclined to regard the libc rules as fictions of dubious utility.

Hrm, interesting. At least the musl author claimed[0] that using clone invalidates the thread state in a way that (presumably) vfork() wouldn't.

> I mean, FUSE is by no means the only offender with syscall return values. Before execve()ing your program, I can install a seccomp filter that makes any syscall return any errno.

I would put seccomp and FUSE in different buckets. Seccomp is more like ptrace: it hooks right into the process like a debugger. If it's actively sabotaging your process, you have already lost; it could make mmap return the same pointer twice, for example. FUSE is different: the kernel sits between the process and the FUSE server, so the kernel is in a position to uphold its API contract.

[0] https://github.com/rust-lang/rust/issues/89522#issuecomment-...