The Compute team at Netflix is charged with managing all AWS and containerized workloads at Netflix, including autoscaling, deployment of containers, issue remediation, and so on. As part of this team, I work on fixing strange things that users report.

This particular issue involved a custom internal FUSE filesystem: ndrive. It had been festering for a while, but needed someone to sit down and look at it in anger. This blog post describes how I poked at /proc to get a sense of what was going on, before posting the issue to the kernel mailing list and getting schooled on how the kernel's wait code actually works!
We had a stuck docker API call:
goroutine 146 [select, 8817 minutes]:
net/http.(*persistConn).roundTrip(0xc000658fc0, 0xc0003fc080, 0x0, 0x0, 0x0)
net/http.(*Transport).roundTrip(0xc000420140, 0xc000966200, 0x30, 0x1366f20, 0x162)
net/http.(*Transport).RoundTrip(0xc000420140, 0xc000966200, 0xc000420140, 0x0, 0x0)
net/http.send(0xc000966200, 0x161eba0, 0xc000420140, 0x0, 0x0, 0x0, 0xc00000e050, 0x3, 0x1, 0x0)
net/http.(*Client).send(0xc000438480, 0xc000966200, 0x0, 0x0, 0x0, 0xc00000e050, 0x0, 0x1, 0x10000168e)
net/http.(*Client).do(0xc000438480, 0xc000966200, 0x0, 0x0, 0x0)
golang.org/x/net/context/ctxhttp.Do(0x163bd48, 0xc000044090, 0xc000438480, 0xc000966100, 0x0, 0x0, 0x0)
	/go/pkg/mod/golang.org/x/[email protected]/context/ctxhttp/ctxhttp.go:27 +0x10f
github.com/docker/docker/client.(*Client).doRequest(0xc0001a8200, 0x163bd48, 0xc000044090, 0xc000966100, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/pkg/mod/github.com/moby/[email protected]/client/request.go:132 +0xbe
github.com/docker/docker/client.(*Client).sendRequest(0xc0001a8200, 0x163bd48, 0xc000044090, 0x13d8643, 0x3, 0xc00079a720, 0x51, 0x0, 0x0, 0x0, ...)
	/go/pkg/mod/github.com/moby/[email protected]/client/request.go:122 +0x156
github.com/docker/docker/client.(*Client).ContainerInspect(0xc0001a8200, 0x163bd48, 0xc000044090, 0xc0006a01c0, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/pkg/mod/github.com/moby/[email protected]/client/container_inspect.go:18 +0x128
github.com/Netflix/titus-executor/executor/runtime/docker.(*DockerRuntime).Kill(0xc000215180, 0x163bdb8, 0xc000938600, 0x1, 0x0, 0x0)
github.com/Netflix/titus-executor/executor/runner.(*Runner).doShutdown(0xc000432dc0, 0x163bd10, 0xc000938390, 0x1, 0xc000b821e0, 0x1d, 0xc0005e4710)
github.com/Netflix/titus-executor/executor/runner.(*Runner).startRunner(0xc000432dc0, 0x163bdb8, 0xc00071e0c0, 0xc0a502e28c08b488, 0x24572b8, 0x1df5980)
created by github.com/Netflix/titus-executor/executor/runner.StartTaskWithRuntime
Here, our management engine has made an HTTP call to the Docker API's unix socket asking it to kill a container. Our containers are configured to be killed via SIGKILL. But this is strange. kill(SIGKILL) should be relatively fatal, so what is the container doing?
$ docker exec -it 6643cd073492 bash
OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: process_linux.go:130: executing setns process caused: exit status 1: unknown
Hmm. Seems like it's alive, but setns(2) fails. Why would that be? If we look at the process tree via ps awwfux, we see:
_ containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/6643cd073492ba9166100ed30dbe389ff1caef0dc3d35
| _ [docker-init]
| _ [ndrive] <defunct>
Ok, so the container's init process is still alive, but it has one zombie child. What could the container's init process possibly be doing?
# cat /proc/1528591/stack
It's in the process of exiting, but it seems stuck. The only child is the ndrive process in Z (i.e. "zombie") state, though. Zombies are processes that have successfully exited and are waiting to be reaped by a corresponding wait() syscall from their parents. So how could the kernel be stuck waiting on a zombie?
# ls /proc/1544450/task

Ah ha, there are two threads in the thread group. One of them is a zombie; maybe the other one isn't:
# cat /proc/1544574/stack
Indeed it isn't a zombie. It is trying to become one as hard as it can, but it's blocking inside FUSE for some reason. To find out why, let's look at some kernel code. If we look at zap_pid_ns_processes(), it does:
/*
 * Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD.
 * kernel_wait4() will also block until our children traced from the
 * parent namespace are detached and become EXIT_DEAD.
 */
do {
	clear_thread_flag(TIF_SIGPENDING);
	rc = kernel_wait4(-1, NULL, __WALL, NULL);
} while (rc != -ECHILD);
which is where we're stuck, but before that, it has done:

/* Don't allow any more processes into the pid namespace */
disable_pid_allocation(pid_ns);

which is why docker can't setns(): the namespace is a zombie. Ok, so we can't setns(2), but why are we stuck in kernel_wait4()? To understand why, let's look at what the other thread was doing in FUSE's flush handler:
/*
 * Either request is already in userspace, or it was forced.
 * Wait it out.
 */
wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
Ok, so we're waiting for an event (in this case, that userspace has replied to the FUSE flush request). But SIGKILL should be very fatal to a process. If we look at the process, we can indeed see that there's a pending SIGKILL:

# grep Pnd /proc/1544574/status
Viewing process status this way, you can see 0x100 (i.e. the 9th bit is set) under ShdPnd, which is the signal number corresponding to SIGKILL. Pending signals are signals that have been generated by the kernel but have not yet been delivered to userspace. Signals are only delivered at certain times, for example when entering or leaving a syscall, or when waiting on events. If the kernel is currently doing something on behalf of the task, the signal may be pending. Signals can also be blocked by a task, so that they are never delivered; blocked signals will show up in their respective pending sets as well. However,
man 7 signal says: "The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored." But here the kernel is telling us that we have a pending SIGKILL, i.e. that it is being ignored even while the task is waiting!
Well, that's weird. The wait code (i.e. include/linux/wait.h) is used everywhere in the kernel: semaphores, wait queues, completions, and so on. Surely it knows to look for SIGKILLs. So what does wait_event() actually do? Digging through the macro expansions and wrappers, the meat of it is:
#define ___wait_event(wq_head, condition, state, exclusive, ret, cmd)		\
({										\
	__label__ __out;							\
	struct wait_queue_entry __wq_entry;					\
	long __ret = ret;	/* explicit shadow */				\
	init_wait_entry(&__wq_entry, exclusive ? WQ_FLAG_EXCLUSIVE : 0);	\
	for (;;) {								\
		long __int = prepare_to_wait_event(&wq_head, &__wq_entry, state);\
		if (condition)							\
			break;							\
		if (___wait_is_interruptible(state) && __int) {			\
			__ret = __int;						\
			goto __out;						\
		}								\
		cmd;								\
	}									\
	finish_wait(&wq_head, &__wq_entry);					\
__out:	__ret;								\
})
So it loops forever, doing prepare_to_wait_event(), checking the condition, then checking to see if we need to interrupt. Then it does cmd, which in this case is schedule(), i.e. "do something else for a while".
prepare_to_wait_event() looks like:
long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state)
{
	unsigned long flags;
	long ret = 0;

	spin_lock_irqsave(&wq_head->lock, flags);
	if (signal_pending_state(state, current)) {
		/*
		 * Exclusive waiter must not fail if it was selected by wakeup,
		 * it should "consume" the condition we were waiting for.
		 *
		 * The caller will recheck the condition and return success if
		 * we were already woken up, we can not miss the event because
		 * wakeup locks/unlocks the same wq_head->lock.
		 *
		 * But we need to ensure that set-condition + wakeup after that
		 * can't see us, it should wake up another exclusive waiter if
		 * we fail.
		 */
		list_del_init(&wq_entry->entry);
		ret = -ERESTARTSYS;
	} else {
		if (list_empty(&wq_entry->entry)) {
			if (wq_entry->flags & WQ_FLAG_EXCLUSIVE)
				__add_wait_queue_entry_tail(wq_head, wq_entry);
			else
				__add_wait_queue(wq_head, wq_entry);
		}
		set_current_state(state);
	}
	spin_unlock_irqrestore(&wq_head->lock, flags);

	return ret;
}
It looks like the only way we can break out of this with a non-zero exit code is if signal_pending_state() is true. Since our call site was just wait_event(), we know that the state here is TASK_UNINTERRUPTIBLE; the definition of signal_pending_state() looks like:
static inline int signal_pending_state(unsigned int state, struct task_struct *p)
{
	if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
		return 0;
	if (!signal_pending(p))
		return 0;

	return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
}

Our task is not interruptible, so the first if fails. Our task should have a signal pending, though, right?
static inline int signal_pending(struct task_struct *p)
{
	/*
	 * TIF_NOTIFY_SIGNAL isn't really a signal, but it requires the same
	 * behavior in terms of ensuring that we break out of wait loops
	 * so that notify signal callbacks can be processed.
	 */
	if (unlikely(test_tsk_thread_flag(p, TIF_NOTIFY_SIGNAL)))
		return 1;
	return task_sigpending(p);
}

As the comment notes, TIF_NOTIFY_SIGNAL isn't relevant here, despite its name, but let's look at task_sigpending():
static inline int task_sigpending(struct task_struct *p)
{
	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING));
}
Hmm. Seems like we should have that flag set, right? To figure that out, let's look at how signal delivery works. When we're shutting down the pid namespace in zap_pid_ns_processes(), it does:

group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
which eventually gets to __send_signal_locked(), which has:

	pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
	...
	sigaddset(&pending->signal, sig);
	...
	complete_signal(sig, t, type);
Using PIDTYPE_MAX here as the type is a little weird, but it roughly means "this is very privileged kernel stuff sending this signal, and you should definitely deliver it". There is a bit of unintended consequence here, though, in that __send_signal_locked() ends up sending the SIGKILL to the shared set, instead of the individual task's set. If we look at the __fatal_signal_pending() code, we see:
static inline int __fatal_signal_pending(struct task_struct *p)
{
	return unlikely(sigismember(&p->pending.signal, SIGKILL));
}
To understand what is really going on here, we need to look at complete_signal(), since it unconditionally adds a SIGKILL to the task's pending set:

	sigaddset(&t->pending.signal, SIGKILL);

but why doesn't it work? At the top of the function we have:
/*
 * Now find a thread we can wake up to take the signal off the queue.
 *
 * If the main thread wants the signal, it gets first crack.
 * Probably the least surprising to the average bear.
 */
if (wants_signal(sig, p))
	t = p;
else if ((type == PIDTYPE_PID) || thread_group_empty(p))
	/*
	 * There is just one thread and it does not need to be woken.
	 * It will dequeue unblocked signals before it runs again.
	 */
	return;
but as Eric Biederman described, basically any thread can handle a SIGKILL at any time. Here's wants_signal():
static inline bool wants_signal(int sig, struct task_struct *p)
{
	if (sigismember(&p->blocked, sig))
		return false;

	if (p->flags & PF_EXITING)
		return false;

	if (sig == SIGKILL)
		return true;
	...
So... if a thread is already exiting (i.e. it has PF_EXITING), it doesn't want a signal. Consider the following sequence of events:
1. a task opens a FUSE file and doesn't close it, then exits. During that exit, the kernel dutifully calls do_exit(), which does the following:

exit_signals(tsk);	/* sets PF_EXITING */

2. do_exit() continues on to exit_files(tsk);, which flushes all files that are still open, resulting in the stack trace above.
3. the pid namespace exits, enters zap_pid_ns_processes(), sends a SIGKILL to everyone (which it expects to be lethal), and then waits for everyone to exit.
4. this kills the FUSE daemon in the pid ns, so it can never respond.
5. complete_signal() for the FUSE task that was already exiting ignores the signal, since it has PF_EXITING.
6. Deadlock. Without manually aborting the FUSE connection, things will hang forever.
It doesn't really make sense to wait for flushes in this case: the task is dying, so there is nobody to tell the return code of flush() to. It also turns out that this bug can happen with multiple filesystems (anything that calls the kernel's wait code in flush(), i.e. basically anything that talks to something outside the local kernel).
Individual filesystems will need to be patched in the meantime; for example, the fix for FUSE is here, which was released on April 23 in Linux 6.3.
While this blog post addresses FUSE deadlocks, there are definitely issues in the nfs code and elsewhere, which we have not hit in production yet but almost certainly will. You may also see it as a symptom of other filesystem bugs. It is something to look out for if you have a pid namespace that won't exit.
This is just a small taste of the variety of strange issues we encounter running containers at scale at Netflix. Our team is hiring, so please reach out if you also love red herrings and kernel deadlocks!