Layer 5: Process Isolation

Namespace Setup

Linux namespaces partition kernel resources so that one set of processes sees one set of resources while another set of processes sees a different set. This is the foundation of container isolation: each namespace type isolates a different aspect of the system.

Linux Namespaces Used in This Sandbox:

# Example namespace structure in sandboxed environment
ipc:[4026531839]      # IPC namespace (isolated shared memory)
mnt:[4026531840]      # Mount namespace (isolated filesystem)
net:[4026531841]      # Network namespace (isolated network stack)
pid:[4026531842]      # PID namespace (isolated process tree)
user:[4026531843]     # User namespace (UID/GID mapping)
uts:[4026531844]      # UTS namespace (isolated hostname)
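The listing above has the same shape as the symlink targets under /proc/<pid>/ns. As a minimal sketch (the helper names parse_ns_link, read_namespaces, and shared_namespaces are illustrative, not part of the sandbox), the following Python reads those links and reports which namespaces two processes share. Two processes are in the same namespace when the inode numbers match.

```python
import os
import re
from typing import Dict, List, Optional, Tuple

# Matches link targets such as "pid:[4026531842]"
NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target: str) -> Optional[Tuple[str, int]]:
    """Parse 'type:[inode]' into (type, inode); return None if malformed."""
    match = NS_LINK.match(target)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def read_namespaces(pid: str = "self") -> Dict[str, int]:
    """Return {entry name: inode} from /proc/<pid>/ns, or {} if unreadable."""
    ns_dir = f"/proc/{pid}/ns"
    result: Dict[str, int] = {}
    try:
        entries = os.listdir(ns_dir)
    except OSError:
        return result  # No /proc, or the process is gone
    for entry in entries:
        try:
            parsed = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        except OSError:
            continue
        if parsed is not None:
            result[entry] = parsed[1]
    return result


def shared_namespaces(a: Dict[str, int], b: Dict[str, int]) -> List[str]:
    """Names of namespaces whose inode is identical in both mappings."""
    return sorted(name for name in a.keys() & b.keys() if a[name] == b[name])


if __name__ == "__main__":
    for name, inode in sorted(read_namespaces().items()):
        print(f"{name}:[{inode}]")
```

Comparing read_namespaces("1") on the host with the value read for a container process should show no shared entries for the namespace types the runtime isolates.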

Implementation Steps

The following configuration steps implement process isolation using user namespaces (for UID/GID remapping) and capability dropping (for privilege reduction).

1. Docker Configuration

File: /etc/docker/daemon.json

{
  "userns-remap": "default",
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Hard": 20000,
      "Soft": 20000
    },
    "nproc": {
      "Name": "nproc",
      "Hard": 100,
      "Soft": 100
    }
  }
}

Enable user namespace remapping:

User namespace remapping maps container root (UID 0) to an unprivileged UID on the host (e.g., UID 100000). A process that escapes the container as "root" therefore runs on the host with only the permissions of that unprivileged UID.

# Create subuid/subgid mappings (skip if Docker has already added dockremap entries)
echo "dockremap:100000:65536" >> /etc/subuid
echo "dockremap:100000:65536" >> /etc/subgid

# Restart Docker
systemctl restart docker
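To make the remapping arithmetic concrete, here is a small sketch that parses a subuid entry of the form name:start:count and translates a container UID to the host UID it maps to. The names SubIdRange, parse_subid_line, and container_to_host_id are hypothetical helpers for illustration.

```python
from typing import NamedTuple


class SubIdRange(NamedTuple):
    name: str   # Owner of the range, e.g. "dockremap"
    start: int  # First host ID in the range
    count: int  # Number of IDs in the range


def parse_subid_line(line: str) -> SubIdRange:
    """Parse one /etc/subuid or /etc/subgid line: 'name:start:count'."""
    name, start, count = line.strip().split(":")
    return SubIdRange(name, int(start), int(count))


def container_to_host_id(container_id: int, mapping: SubIdRange) -> int:
    """Translate an ID inside the container to the ID seen on the host."""
    if not 0 <= container_id < mapping.count:
        raise ValueError(f"ID {container_id} is outside the mapped range")
    return mapping.start + container_id


if __name__ == "__main__":
    mapping = parse_subid_line("dockremap:100000:65536")
    print(container_to_host_id(0, mapping))     # Container root on the host
    print(container_to_host_id(1000, mapping))  # A regular container user
```

With the entry shown above, container UID 0 corresponds to host UID 100000, and any container UID at or above 65536 has no mapping at all.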

2. Capability Dropping

Linux capabilities divide root privileges into discrete units. Instead of running containers as root (dangerous) or as unprivileged users (sometimes limiting), we drop all capabilities and then add back only what's essential. This minimizes the attack surface while maintaining functionality.

Observed capabilities (minimal set):

# CapEff: limited capability mask (filtered by runtime)
# Typically includes: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FOWNER, CAP_FSETID,
#                     CAP_KILL, CAP_SETGID, CAP_SETUID, CAP_SETPCAP,
#                     CAP_NET_BIND_SERVICE, CAP_SYS_CHROOT, CAP_AUDIT_WRITE
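The CapEff value in /proc/<pid>/status is a hexadecimal bitmask in which bit N corresponds to capability number N from linux/capability.h. The sketch below decodes such a mask into names. The table covers only the capabilities listed above, and decode_capabilities is an illustrative helper; bits outside the table are reported by number.

```python
from typing import List

# Bit numbers from linux/capability.h for the capabilities listed above
CAP_BITS = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    3: "CAP_FOWNER",
    4: "CAP_FSETID",
    5: "CAP_KILL",
    6: "CAP_SETGID",
    7: "CAP_SETUID",
    8: "CAP_SETPCAP",
    10: "CAP_NET_BIND_SERVICE",
    18: "CAP_SYS_CHROOT",
    29: "CAP_AUDIT_WRITE",
}


def decode_capabilities(hex_mask: str) -> List[str]:
    """Decode a CapEff-style hex mask into capability names, lowest bit first."""
    mask = int(hex_mask, 16)
    names: List[str] = []
    bit = 0
    while mask >> bit:  # Stop once no higher bits remain
        if mask & (1 << bit):
            names.append(CAP_BITS.get(bit, f"cap_{bit}"))
        bit += 1
    return names


if __name__ == "__main__":
    # Mask with exactly the eleven capabilities from the list above
    for name in decode_capabilities("00000000200405fb"):
        print(name)
```

A mask of all zeros decodes to an empty list, which is what a container started with every capability dropped should report.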

Drop dangerous capabilities:

# Kubernetes Pod Security Context
securityContext:
  capabilities:
    drop:
      - ALL # Drop everything first
    add:
      # Only add back what's absolutely necessary
      - CHOWN # Change file ownership
      - DAC_OVERRIDE # Bypass file permission checks
      - FSETID # Keep setuid/setgid bits when a file is modified
      - KILL # Send signals
      - SETGID # Set GID
      - SETUID # Set UID
      - NET_BIND_SERVICE # Bind to ports < 1024
      - SYS_CHROOT # Use chroot()
      - AUDIT_WRITE # Write audit logs
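A policy like this is easy to drift from over time, so it can help to check it mechanically. The following sketch assumes the securityContext has already been loaded into a Python dict (for example from a parsed manifest) and reports deviations from the drop-ALL-then-allowlist pattern. ALLOWED_ADDS and check_capabilities are hypothetical names, and the allowlist simply mirrors the YAML above.

```python
from typing import Dict, List

# Capabilities the policy above permits adding back
ALLOWED_ADDS = {
    "CHOWN", "DAC_OVERRIDE", "FSETID", "KILL", "SETGID",
    "SETUID", "NET_BIND_SERVICE", "SYS_CHROOT", "AUDIT_WRITE",
}


def check_capabilities(security_context: Dict) -> List[str]:
    """Return a list of problems; an empty list means the policy is followed."""
    caps = security_context.get("capabilities") or {}
    dropped = {name.upper() for name in caps.get("drop") or []}
    added = {name.upper() for name in caps.get("add") or []}

    problems: List[str] = []
    if "ALL" not in dropped:
        problems.append("capabilities.drop does not include ALL")
    unexpected = sorted(added - ALLOWED_ADDS)
    if unexpected:
        problems.append("unexpected capabilities added: " + ", ".join(unexpected))
    return problems


if __name__ == "__main__":
    context = {"capabilities": {"drop": ["ALL"], "add": ["CHOWN", "SYS_ADMIN"]}}
    for problem in check_capabilities(context):
        print(problem)
```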

Docker run with capability control:

docker run --runtime=runsc \
  --cap-drop=ALL \
  --cap-add=CHOWN \
  --cap-add=SETUID \
  --cap-add=SETGID \
  --cap-add=NET_BIND_SERVICE \
  --security-opt=no-new-privileges:true \
  your-image

3. PID Namespace Isolation

The PID namespace gives containers their own process ID space. PID 1 in the container is the first process started in that container, completely separate from the host's PID 1 (typically systemd). This prevents containers from seeing or signaling host processes.

Check PID isolation:

# Inside container - should only see container processes
ps aux
# Should NOT see host processes

# PID 1 inside container
echo $$  # Should be PID 1 or close to it

# From host
docker top sandbox_container
# Should show all container processes
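The same check can be scripted. Every visible process appears as a numeric directory under /proc, so counting those directories inside the container gives a quick signal of whether host processes are leaking in. This is a sketch; pids_from_names and visible_pids are illustrative helpers.

```python
import os
from typing import Iterable, List


def pids_from_names(names: Iterable[str]) -> List[int]:
    """Keep only purely numeric directory names and return them as sorted PIDs."""
    return sorted(int(name) for name in names if name.isdecimal())


def visible_pids(proc_root: str = "/proc") -> List[int]:
    """PIDs visible in this PID namespace, or [] if proc_root is unreadable."""
    try:
        return pids_from_names(os.listdir(proc_root))
    except OSError:
        return []


if __name__ == "__main__":
    pids = visible_pids()
    print(f"{len(pids)} processes visible")
    if pids:
        print(f"lowest PID: {pids[0]}")
```

Inside an isolated container the count should be small and the lowest PID should be 1; on the host the list typically runs to hundreds of entries.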

4. Prevent Privilege Escalation

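The no-new-privileges flag sets the kernel's no_new_privs bit on the container's processes, so a later execve() cannot grant privileges the caller did not already have. Setuid and setgid bits and file capabilities on binaries (like sudo) are ignored, which contains potential exploits that rely on them.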

File: /etc/docker/daemon.json

{
  "no-new-privileges": true
}

Kubernetes:

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
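On Linux 4.10 and later, /proc/<pid>/status exposes a NoNewPrivs field, which gives a direct way to confirm that the setting took effect. The sketch below parses that field from status text; parse_status_field and no_new_privs_enabled are illustrative names, and the result is None when the field is absent.

```python
from typing import Optional


def parse_status_field(status_text: str, field: str) -> Optional[str]:
    """Return the value of 'field' from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key == field:
            return value.strip()
    return None


def no_new_privs_enabled(status_text: str) -> Optional[bool]:
    """True or False from the NoNewPrivs field; None if the field is missing."""
    value = parse_status_field(status_text, "NoNewPrivs")
    if value is None:
        return None
    return value == "1"


if __name__ == "__main__":
    try:
        with open("/proc/self/status", encoding="utf-8") as handle:
            print(no_new_privs_enabled(handle.read()))
    except OSError:
        print("status file not available on this system")
```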

Process API

This Python class provides a secure interface for executing commands inside containers. It applies resource limits before execution and handles timeouts to prevent hanging processes.

File: process-manager.py

#!/usr/bin/env python3
"""
Secure process execution manager
"""

import os
import subprocess
import resource
import signal
from typing import List, Dict

class SecureProcessManager:
    def __init__(self,
                 max_memory_mb: int = 4096,
                 max_cpu_time_sec: int = 300,
                 max_processes: int = 100):
        self.max_memory = max_memory_mb * 1024 * 1024
        self.max_cpu_time = max_cpu_time_sec
        self.max_processes = max_processes

    def set_limits(self):
        """Set resource limits before process execution"""
        # Memory limit (RLIMIT_AS caps virtual address space, not resident memory)
        resource.setrlimit(
            resource.RLIMIT_AS,
            (self.max_memory, self.max_memory)
        )

        # CPU time limit
        resource.setrlimit(
            resource.RLIMIT_CPU,
            (self.max_cpu_time, self.max_cpu_time)
        )

        # Process/thread limit (counted per real user ID, not per process tree)
        resource.setrlimit(
            resource.RLIMIT_NPROC,
            (self.max_processes, self.max_processes)
        )

        # File descriptor limit
        resource.setrlimit(
            resource.RLIMIT_NOFILE,
            (20000, 20000)
        )

        # File size limit (prevent large files)
        resource.setrlimit(
            resource.RLIMIT_FSIZE,
            (100 * 1024 * 1024, 100 * 1024 * 1024)  # 100MB
        )

    def execute_command(self,
                       cmd: List[str],
                       env: Dict[str, str] = None,
                       timeout: int = 30) -> Dict:
        """Execute command with security constraints"""

        try:
            # Create restricted environment
            safe_env = {
                'PATH': '/usr/local/bin:/usr/bin:/bin',
                'HOME': '/home/developer',
                'USER': 'developer',
                'SHELL': '/bin/bash'
            }
            if env:
                safe_env.update(env)

            # Execute with preexec_fn to set limits
            proc = subprocess.Popen(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                env=safe_env,
                preexec_fn=self.set_limits,
                start_new_session=True  # New session; child leads its own process group
            )

            # Wait with timeout
            try:
                stdout, stderr = proc.communicate(timeout=timeout)
                return {
                    'returncode': proc.returncode,
                    'stdout': stdout.decode('utf-8', errors='replace'),
                    'stderr': stderr.decode('utf-8', errors='replace')
                }
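            except subprocess.TimeoutExpired:
                # Kill entire process group (the child leads its own group
                # because of start_new_session=True)
                try:
                    os.killpg(proc.pid, signal.SIGKILL)
                except ProcessLookupError:
                    pass  # Group already exited
                # Reap the child and close its pipes to avoid a zombie
                proc.communicate()
                return {
                    'returncode': -1,
                    'stdout': '',
                    'stderr': 'Process killed: timeout exceeded'
                }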

        except Exception as e:
            return {
                'returncode': -1,
                'stdout': '',
                'stderr': f'Execution error: {str(e)}'
            }

# Usage
if __name__ == '__main__':
    manager = SecureProcessManager()
    result = manager.execute_command(['python3', 'user_script.py'])
    print(f"Exit code: {result['returncode']}")
    print(f"Output: {result['stdout']}")
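One practical caveat with fixed limits such as the 20000 file descriptors above: an unprivileged process cannot raise a hard limit, so resource.setrlimit() raises ValueError when the requested value exceeds the current hard limit. A possible way to handle this, sketched here with the hypothetical helpers clamp_to_hard_limit and set_limit_safely, is to clamp each request to the existing hard limit first.

```python
import resource


def clamp_to_hard_limit(requested: int, current_hard: int) -> int:
    """Return the largest value not exceeding either limit.

    resource.RLIM_INFINITY means "unlimited" on either side.
    """
    if current_hard == resource.RLIM_INFINITY:
        return requested
    if requested == resource.RLIM_INFINITY:
        return current_hard
    return min(requested, current_hard)


def set_limit_safely(which: int, requested: int) -> int:
    """Set soft and hard limits to the clamped value and return it.

    Lowering the hard limit cannot be undone by an unprivileged process,
    so call this only in the child (for example from preexec_fn).
    """
    _, hard = resource.getrlimit(which)
    value = clamp_to_hard_limit(requested, hard)
    resource.setrlimit(which, (value, value))
    return value


if __name__ == "__main__":
    _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(clamp_to_hard_limit(20000, hard))
```

The trade-off is that the child may run with a lower limit than configured, so logging the value returned by set_limit_safely makes the effective limits visible.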

Seccomp Configuration

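Seccomp (secure computing mode) filters which system calls a container can make. This profile is an allowlist: the default action, SCMP_ACT_ERRNO, fails any syscall that is not listed, and the single rule below allows a set of common file, process, signal, and socket calls. Treat the list as a starting point. It omits calls that most current glibc-based programs issue at startup or exit, such as openat, futex, and exit_group, so extend it and test it against your workload before deploying.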

File: /etc/docker/seccomp/sandbox-profile.json

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
  "syscalls": [
    {
      "names": [
        "accept",
        "access",
        "alarm",
        "bind",
        "brk",
        "chdir",
        "chmod",
        "chown",
        "clone",
        "close",
        "connect",
        "creat",
        "dup",
        "dup2",
        "execve",
        "exit",
        "fchmod",
        "fchdir",
        "fchown",
        "fcntl",
        "fdatasync",
        "flock",
        "fork",
        "fstat",
        "fsync",
        "ftruncate",
        "getcwd",
        "getdents",
        "getegid",
        "geteuid",
        "getgid",
        "getgroups",
        "getitimer",
        "getpeername",
        "getpgid",
        "getpgrp",
        "getpid",
        "getppid",
        "getresgid",
        "getresuid",
        "getrlimit",
        "getrusage",
        "getsockname",
        "getsockopt",
        "gettimeofday",
        "getuid",
        "ioctl",
        "kill",
        "lchown",
        "link",
        "listen",
        "lseek",
        "lstat",
        "madvise",
        "mkdir",
        "mincore",
        "mmap",
        "mprotect",
        "mremap",
        "msync",
        "munmap",
        "nanosleep",
        "open",
        "pause",
        "pipe",
        "poll",
        "pread64",
        "ptrace",
        "pwrite64",
        "read",
        "readlink",
        "readv",
        "recvfrom",
        "recvmsg",
        "rename",
        "rmdir",
        "rt_sigaction",
        "rt_sigprocmask",
        "rt_sigreturn",
        "sched_yield",
        "select",
        "sendfile",
        "sendmsg",
        "sendto",
        "setfsgid",
        "setfsuid",
        "setgid",
        "setgroups",
        "setitimer",
        "setpgid",
        "setregid",
        "setresgid",
        "setresuid",
        "setreuid",
        "setsid",
        "setsockopt",
        "setuid",
        "shmat",
        "shmctl",
        "shmget",
        "shutdown",
        "socket",
        "socketpair",
        "stat",
        "symlink",
        "sysinfo",
        "times",
        "truncate",
        "umask",
        "uname",
        "unlink",
        "vfork",
        "wait4",
        "write",
        "writev"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

Apply seccomp profile:

After creating the seccomp profile, apply it to containers using the --security-opt flag. This restricts the container to only the allowed syscalls in the profile.

docker run --runtime=runsc \
  --security-opt seccomp=/etc/docker/seccomp/sandbox-profile.json \
  your-image
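Before applying a profile, it can be worth auditing it for entries that weaken the sandbox. The sketch below works on a profile already loaded with json.load() and flags a permissive default action and any allowed syscalls from a watch list. RISKY_SYSCALLS is an illustrative set chosen for this example, not an authoritative list, and allowed_syscalls and audit_profile are hypothetical helper names.

```python
import json
from typing import Dict, List, Set

# Illustrative watch list (an assumption); tune it to your own threat model
RISKY_SYSCALLS = {
    "ptrace", "mount", "umount2", "unshare", "setns",
    "kexec_load", "init_module", "finit_module", "bpf",
}


def allowed_syscalls(profile: Dict) -> Set[str]:
    """Collect every syscall name covered by an SCMP_ACT_ALLOW rule."""
    allowed: Set[str] = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            allowed.update(rule.get("names", []))
    return allowed


def audit_profile(profile: Dict) -> List[str]:
    """Return findings; an empty list means nothing was flagged."""
    findings: List[str] = []
    if profile.get("defaultAction") == "SCMP_ACT_ALLOW":
        findings.append("defaultAction allows unlisted syscalls")
    risky = sorted(allowed_syscalls(profile) & RISKY_SYSCALLS)
    if risky:
        findings.append("risky syscalls allowed: " + ", ".join(risky))
    return findings


if __name__ == "__main__":
    profile = json.loads(
        '{"defaultAction": "SCMP_ACT_ERRNO",'
        ' "syscalls": [{"names": ["read", "write", "ptrace"],'
        ' "action": "SCMP_ACT_ALLOW"}]}'
    )
    for finding in audit_profile(profile):
        print(finding)
```

Run against the profile above, a check like this would flag ptrace, which is worth removing unless the workload needs to debug its own child processes.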