# Bug: taskService.Create can orphan a created task when chooseGuestRootfs() fails after successful inner task creation #767

@WHOIM1205

Description

taskService.Create invokes the underlying containerd task creation before performing guest rootfs pre-computation. If chooseGuestRootfs() returns a non-skip error after the inner task has already been created, the shim returns an error to containerd without rolling back the created task.

This leaves containerd believing task creation failed while the underlying task and reexec init process remain alive and registered, resulting in an unreconcilable state and resource leakage.

Affected Code

File: pkg/containerd-shim/task_service.go

Function: (*taskService).Create

Current flow:

resp, err := s.TaskService.Create(ctx, r)
if err != nil {
    return resp, err
}

if err := chooseGuestRootfs(r); err != nil {
    if errors.Is(err, errGuestRootfsChoiceSkipped) {
        return resp, nil
    }

    log.G(ctx).WithError(err).Warn("urunc(shim): failed to choose guest rootfs")
    return nil, err
}

Problem

The inner task is already committed when chooseGuestRootfs() executes.

Execution sequence:

  1. s.TaskService.Create() succeeds.
  2. The underlying runtime executes urunc create.
  3. The reexec init process is spawned and the task is registered in runtime state.
  4. chooseGuestRootfs() runs afterward.
  5. A non-skip error occurs (e.g. missing hypervisor binary, ErrVMMNotInstalled, unsupported rootfs mode, or VMM initialization failure).
  6. The shim returns nil, err.
  7. Containerd treats Create as failed despite a live task already existing.
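The sequence above can be modelled with a minimal in-memory sketch. It is an illustration, not the shim's code: `fakeInner`, `createCurrent`, `errVMMMissing`, and the string task IDs are hypothetical stand-ins, and only the control flow is taken from the excerpt quoted from task_service.go.

```go
package main

import (
	"errors"
	"fmt"
)

// errGuestRootfsChoiceSkipped mirrors the sentinel named in the issue.
var errGuestRootfsChoiceSkipped = errors.New("guest rootfs choice skipped")

// errVMMMissing is a hypothetical non-skip failure used only for illustration.
var errVMMMissing = errors.New("hypervisor binary not found")

// fakeInner is a hypothetical in-memory stand-in for the inner task service.
type fakeInner struct {
	tasks map[string]bool // IDs of tasks the inner service considers live
}

// Create registers the task, modelling steps 1-3 of the sequence.
func (f *fakeInner) Create(id string) (string, error) {
	f.tasks[id] = true
	return "resp:" + id, nil
}

// createCurrent follows the control flow of the quoted Create excerpt.
func createCurrent(inner *fakeInner, id string, choose func() error) (string, error) {
	resp, err := inner.Create(id)
	if err != nil {
		return resp, err
	}
	if err := choose(); err != nil {
		if errors.Is(err, errGuestRootfsChoiceSkipped) {
			return resp, nil
		}
		// The task registered above is not rolled back on this path.
		return "", err
	}
	return resp, nil
}

func main() {
	inner := &fakeInner{tasks: map[string]bool{}}
	_, err := createCurrent(inner, "task-1", func() error { return errVMMMissing })
	fmt.Println("create error:", err)                             // create error: hypervisor binary not found
	fmt.Println("task still registered:", inner.tasks["task-1"]) // task still registered: true
}
```

The caller sees a failed Create while the inner registry still holds the task, which is the inconsistent state described in steps 6 and 7.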

Why This Is A Bug

The failure occurs after task creation has already been committed.

Other post-create outcomes in the same function are handled as non-fatal:

  • OpenSession() failures are logged and ignored.
  • InjectUruncAnnotations() failures are logged and ignored.
  • errGuestRootfsChoiceSkipped is treated as non-fatal.

Only guest rootfs pre-computation aborts the Create request despite running after successful task creation.

Additionally, the rootfs choice is not strictly required because the runtime already recomputes it during Exec when the annotation is absent.

Impact

This can leave:

  • orphaned reexec init processes,
  • leaked mounts,
  • leaked namespaces,
  • leaked tap devices for dynamically-networked workloads,
  • tasks registered internally while containerd believes creation failed.

Cleanup may also fail because the runtime sees the init process as still running and rejects deletion.

Operationally this can result in:

  • Pods stuck in ContainerCreating or failed-but-not-cleaned states.
  • Failed container reconciliation.
  • Resource leakage and eventual node resource exhaustion.

Reproduction

  1. Configure a unikernel workload that passes validation but causes ChooseRootfs() to return a non-skip error. Examples:

    • Missing hypervisor executable.
    • Unsupported guest rootfs mode.

  2. Create the workload.

  3. Observe:

    • s.TaskService.Create() succeeds.
    • chooseGuestRootfs() fails.
    • The shim returns an error.
    • Containerd reports task creation failure.

  4. Verify:

    • The reexec init process remains alive.
    • Runtime state still exists.
    • Cleanup/delete may fail because the runtime considers the container running.

Root Cause

A post-commit optimization step is treated as a fatal operation.

The code returns an error after task creation has already succeeded and does not roll back the created task.

The same rootfs selection logic is already recomputed later during Exec, making the failure both harmful and redundant.

Proposed Fix

Treat guest rootfs pre-computation as best-effort and allow runtime fallback during Exec:

if err := chooseGuestRootfs(r); err != nil {
    if errors.Is(err, errGuestRootfsChoiceSkipped) {
        log.G(ctx).WithError(err).Debug("urunc(shim): guest rootfs choice skipped")
    } else {
        log.G(ctx).WithError(err).Warn(
            "urunc(shim): failed to choose guest rootfs; runtime will recompute at Exec",
        )
    }
}

// The inner task already exists, so Create reports success in every case.
return resp, nil

Alternatively, if the failure is intended to remain fatal, the shim should explicitly roll back the already-created task before returning the error.

Expected Behavior

After successful inner task creation:

  • Create should not fail solely because guest rootfs pre-computation failed.
  • Runtime fallback logic should handle rootfs selection during Exec.
  • Task lifecycle should remain consistent and fully reconcilable by containerd.
  • No orphaned init processes or leaked resources should remain after a failed Create operation.

Environment

Current main branch.
