
Layer 4: Resource Limits (cgroups)

Configuration Example

Resource limits are enforced at multiple levels: process-manager API flags (application level), cgroups (kernel level), and Docker/Kubernetes settings (orchestrator level). This defense in depth means a limit that is misconfigured or bypassed at one level is still enforced at the next.

# From process manager flags
--memory-limit-bytes 4294900000    # ~4GB RAM
--cpu-shares 1024                  # CPU allocation
--oom-poll-interval-ms 50          # OOM detection

# From cgroups
/sys/fs/cgroup/memory/container_*/memory.limit_in_bytes: 4GB
/sys/fs/cgroup/cpu/container_*/cpu.shares: 1024
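As a quick cross-check that the application-level flag and the kernel-level cgroup value agree, a small helper can compare the two. This is a sketch: the function name is illustrative, and the cgroup root is a parameter so the check can be pointed at a test directory.

```shell
# Verify that the cgroup memory limit matches the value passed via
# --memory-limit-bytes (cgroups v1 layout assumed).
verify_memory_limit() {
    local cgroup_root=$1 container=$2 expected=$3
    local actual
    actual=$(cat "${cgroup_root}/memory/${container}/memory.limit_in_bytes")
    if [ "${actual}" -eq "${expected}" ]; then
        echo "OK: memory limit is ${actual} bytes"
    else
        echo "MISMATCH: expected ${expected}, got ${actual}"
        return 1
    fi
}

# Example: verify_memory_limit /sys/fs/cgroup container_42 4294900000
```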

Cgroups Versions

cgroups (control groups) are a Linux kernel feature that limits, accounts for, and isolates resource usage (CPU, memory, disk I/O, network) for a collection of processes. Without cgroups, a container could consume all host resources causing denial of service.

Many production systems still run cgroups v1 for backward compatibility, and the examples below target v1. For new deployments, consider cgroups v2, which provides a unified hierarchy and better resource management.
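For comparison, a sketch of what the same limits look like under cgroups v2 (function name and parameterized root are illustrative). In v2, memory.max, cpu.max, and pids.max all live side by side in one directory of the unified hierarchy.

```shell
# cgroups v2 equivalent of the v1 setup below. On a real host the
# controllers must first be enabled via cgroup.subtree_control.
setup_cgroup_v2() {
    local root=$1 container=$2
    local dir="${root}/sandbox/container_${container}"
    mkdir -p "${dir}"
    echo 4294900000      > "${dir}/memory.max"   # ~4GB
    echo "200000 100000" > "${dir}/cpu.max"      # quota period = 2 CPUs
    echo 100             > "${dir}/pids.max"
}
```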

Configuration: Cgroups v1

File: /etc/systemd/system/sandbox-container@.service

[Unit]
Description=Sandbox Container for User %i
After=docker.service
Requires=docker.service

[Service]
Type=forking
User=root
ExecStartPre=/usr/local/bin/setup_cgroups.sh %i
ExecStart=/usr/bin/docker run \
    --runtime=runsc \
    --name=sandbox_%i \
    --cgroup-parent=/sandbox/container_%i \
    --memory=4g \
    --memory-swap=4g \
    --cpu-shares=1024 \
    --cpus=2.0 \
    --pids-limit=100 \
    --ulimit nofile=20000:20000 \
    your-sandbox-image
ExecStop=/usr/bin/docker stop sandbox_%i
ExecStopPost=/usr/local/bin/cleanup_cgroups.sh %i

[Install]
WantedBy=multi-user.target
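Because this is a systemd template unit, one instance can be started per container ID; the %i specifier in the unit expands to the instance name. A usage sketch (the user names are illustrative):

```shell
# Start one sandbox instance per user; %i in the unit expands to the ID.
SANDBOX_USERS="alice bob"              # illustrative list
for user in ${SANDBOX_USERS}; do
    unit="sandbox-container@${user}.service"
    echo "enabling ${unit}"
    # systemctl enable --now "${unit}"   # uncomment on a real host
done
```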

File: /usr/local/bin/setup_cgroups.sh

#!/bin/bash
# Setup cgroups for sandbox container

CONTAINER_ID=$1

# Create cgroup directories
mkdir -p /sys/fs/cgroup/memory/sandbox/container_${CONTAINER_ID}
mkdir -p /sys/fs/cgroup/cpu/sandbox/container_${CONTAINER_ID}
mkdir -p /sys/fs/cgroup/cpuacct/sandbox/container_${CONTAINER_ID}
mkdir -p /sys/fs/cgroup/pids/sandbox/container_${CONTAINER_ID}
mkdir -p /sys/fs/cgroup/devices/sandbox/container_${CONTAINER_ID}

# Memory limits (4GB)
echo 4294900000 > /sys/fs/cgroup/memory/sandbox/container_${CONTAINER_ID}/memory.limit_in_bytes
echo 4294900000 > /sys/fs/cgroup/memory/sandbox/container_${CONTAINER_ID}/memory.memsw.limit_in_bytes
# Disable the kernel OOM killer for this cgroup; the process manager
# detects OOM itself (see --oom-poll-interval-ms above) and can act cleanly
echo 1 > /sys/fs/cgroup/memory/sandbox/container_${CONTAINER_ID}/memory.oom_control

# CPU limits
echo 1024 > /sys/fs/cgroup/cpu/sandbox/container_${CONTAINER_ID}/cpu.shares
echo 200000 > /sys/fs/cgroup/cpu/sandbox/container_${CONTAINER_ID}/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/sandbox/container_${CONTAINER_ID}/cpu.cfs_period_us

# PID limits
echo 100 > /sys/fs/cgroup/pids/sandbox/container_${CONTAINER_ID}/pids.max

# Device whitelist: devices.allow accepts exactly one rule per write,
# so each device is written individually (inline comments are not valid
# rule syntax and would be rejected by the kernel)
DEVICES_CG=/sys/fs/cgroup/devices/sandbox/container_${CONTAINER_ID}
echo "c 1:3 rwm"   > ${DEVICES_CG}/devices.allow  # /dev/null
echo "c 1:5 rwm"   > ${DEVICES_CG}/devices.allow  # /dev/zero
echo "c 1:7 rwm"   > ${DEVICES_CG}/devices.allow  # /dev/full
echo "c 1:8 rwm"   > ${DEVICES_CG}/devices.allow  # /dev/random
echo "c 1:9 rwm"   > ${DEVICES_CG}/devices.allow  # /dev/urandom
echo "c 5:0 rwm"   > ${DEVICES_CG}/devices.allow  # /dev/tty
echo "c 5:2 rwm"   > ${DEVICES_CG}/devices.allow  # /dev/ptmx
echo "c 136:* rwm" > ${DEVICES_CG}/devices.allow  # /dev/pts/*

echo "Cgroups configured for container ${CONTAINER_ID}"
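Note that 4294900000 is slightly under a full 4 GiB (4294967296 bytes). A tiny helper (name illustrative) makes the conversion explicit if exact power-of-two limits are preferred:

```shell
# Convert a GiB count to the byte value expected by cgroup limit files
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# gib_to_bytes 4  ->  4294967296
```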

Configuration: Docker Compose

For Docker Compose deployments, resource limits are defined in the service definition. The deploy.resources section is honored by Docker Swarm and by recent Compose releases, while the older top-level keys (mem_limit, cpu_shares, and so on) cover standalone Compose; specifying both, as below, keeps the limits enforced in either mode.

File: docker-compose.yml

version: '3.8'

services:
  sandbox:
    image: your-sandbox-image
    runtime: runsc

    # Resource limits
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
        reservations:
          cpus: '0.5'
          memory: 1G

    # Ulimits
    ulimits:
      nofile:
        soft: 20000
        hard: 20000
      nproc:
        soft: 100
        hard: 100

    # PID limit
    pids_limit: 100

    # Memory settings
    mem_limit: 4g
    mem_reservation: 1g
    memswap_limit: 4g
    oom_kill_disable: false

    # CPU settings
    cpu_shares: 1024
    cpu_quota: 200000
    cpu_period: 100000
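The CPU keys above are related: cpu_quota = cpus × cpu_period, so cpus: '2.0' with the default 100000 µs period yields the 200000 µs quota shown. A sketch of the arithmetic (shell has no floats, so CPUs are passed scaled by 100):

```shell
# cpu_quota (us) = cpus * cpu_period (us); cpus passed as hundredths (2.0 -> 200)
cpus_to_quota() {
    local cpus_x100=$1 period_us=$2
    echo $(( cpus_x100 * period_us / 100 ))
}

# cpus_to_quota 200 100000  ->  200000
```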

Configuration: Kubernetes

In Kubernetes, resource limits are defined in the Pod specification. The runtimeClassName: gvisor field is critical: it ensures the Pod uses gVisor instead of the default container runtime.

File: sandbox-pod.yaml

apiVersion: v1
kind: Pod
metadata:
  name: sandbox-pod
  labels:
    app: sandbox
spec:
  runtimeClassName: gvisor

  containers:
    - name: sandbox
      image: your-sandbox-image

      resources:
        requests:
          memory: '1Gi'
          cpu: '500m'
          ephemeral-storage: '2Gi'
        limits:
          memory: '4Gi'
          cpu: '2000m'
          ephemeral-storage: '10Gi'

      # Security context
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        readOnlyRootFilesystem: false
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
          add:
            - NET_BIND_SERVICE

  # Pod-level settings
  securityContext:
    fsGroup: 1000
    sysctls:
      - name: net.ipv4.ip_unprivileged_port_start
        value: '0'
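Kubernetes CPU limits in millicores map onto the same CFS quota mechanism used above: quota = millicores / 1000 × period, so the 2000m limit in this Pod corresponds to the 200000 µs quota configured elsewhere in this layer. A sketch of the conversion:

```shell
# CFS quota (us) for a CPU limit in millicores, with a 100ms period default
millicores_to_quota() {
    local millicores=$1 period_us=${2:-100000}
    echo $(( millicores * period_us / 1000 ))
}

# millicores_to_quota 2000  ->  200000
```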

Resource Monitoring

Continuously monitor container resource usage to detect anomalies and impending resource exhaustion. This script reads from cgroup files and reports memory, CPU, and PID usage in real-time.

File: monitor-resources.sh

#!/bin/bash
# Monitor container resource usage

CONTAINER_ID=$1
CGROUP_PATH="/sys/fs/cgroup"

while true; do
    echo "=== Resource Usage for ${CONTAINER_ID} ==="
    echo "Timestamp: $(date)"

    # Memory usage
    MEMORY_USED=$(cat ${CGROUP_PATH}/memory/sandbox/container_${CONTAINER_ID}/memory.usage_in_bytes)
    MEMORY_LIMIT=$(cat ${CGROUP_PATH}/memory/sandbox/container_${CONTAINER_ID}/memory.limit_in_bytes)
    MEMORY_PCT=$((MEMORY_USED * 100 / MEMORY_LIMIT))
    echo "Memory: ${MEMORY_USED} / ${MEMORY_LIMIT} (${MEMORY_PCT}%)"

    # CPU usage
    CPU_USAGE=$(cat ${CGROUP_PATH}/cpuacct/sandbox/container_${CONTAINER_ID}/cpuacct.usage)
    echo "CPU Usage (nanoseconds): ${CPU_USAGE}"

    # PID count
    PID_CURRENT=$(cat ${CGROUP_PATH}/pids/sandbox/container_${CONTAINER_ID}/pids.current)
    PID_MAX=$(cat ${CGROUP_PATH}/pids/sandbox/container_${CONTAINER_ID}/pids.max)
    echo "PIDs: ${PID_CURRENT} / ${PID_MAX}"

    # Check for OOM kills; match the exact "oom_kill" field, since a plain
    # grep for "oom_kill" would also match the "oom_kill_disable" line
    OOM_COUNT=$(awk '$1 == "oom_kill" {print $2}' ${CGROUP_PATH}/memory/sandbox/container_${CONTAINER_ID}/memory.oom_control)
    if [ -n "${OOM_COUNT}" ] && [ "${OOM_COUNT}" != "0" ]; then
        echo "[WARNING] OOM KILLS: ${OOM_COUNT}"
    fi

    echo ""
    sleep 5
done
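On cgroups v2 hosts the file names differ: memory.current/memory.max and pids.current/pids.max sit together in a single directory. A v2 variant of the memory check, with the cgroup directory parameterized so it can be exercised against a test directory (a sketch):

```shell
# Report memory usage as a percentage from a cgroup v2 directory.
# memory.max may contain the literal string "max" (unlimited).
memory_pct_v2() {
    local cg_dir=$1
    local used limit
    used=$(cat "${cg_dir}/memory.current")
    limit=$(cat "${cg_dir}/memory.max")
    if [ "${limit}" = "max" ]; then
        echo "unlimited"
    else
        echo $(( used * 100 / limit ))
    fi
}
```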

Resource Limits by Use Case

Different workloads require different resource allocations. These recommendations balance performance with security: enough resources for the task, while preventing abuse.

Use Case            | Memory | CPU       | PIDs | Disk
--------------------|--------|-----------|------|------
Code Execution      | 2-4GB  | 1-2 cores | 100  | 5GB
Document Processing | 4-8GB  | 2-4 cores | 50   | 10GB
AI/ML Inference     | 8-16GB | 4-8 cores | 100  | 20GB
Web Browser         | 4-6GB  | 2-3 cores | 200  | 5GB
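The table can be translated into docker run flags mechanically. A sketch mapping each use case to flags, using the upper bound of each range (the function name and case keys are illustrative; disk limits are orchestrator-specific and omitted here):

```shell
# Return docker run resource flags for a named use case
limits_for() {
    case "$1" in
        code)     echo "--memory=4g --cpus=2 --pids-limit=100" ;;
        document) echo "--memory=8g --cpus=4 --pids-limit=50" ;;
        ml)       echo "--memory=16g --cpus=8 --pids-limit=100" ;;
        browser)  echo "--memory=6g --cpus=3 --pids-limit=200" ;;
        *)        return 1 ;;
    esac
}

# docker run $(limits_for code) ... your-sandbox-image
```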

OOM Configuration

When memory is exhausted, the Linux OOM killer terminates processes to free resources. These settings ensure critical host processes (like SSH and systemd) are protected while sandbox containers are prioritized for termination.

Prevent OOM from killing host:

# Set OOM score adjustment (lower = less likely to be killed).
# pidof can return several PIDs (e.g. multiple sshd processes), so loop.
for pid in $(pidof systemd); do echo -1000 > /proc/${pid}/oom_score_adj; done  # protect systemd
for pid in $(pidof sshd); do echo -1000 > /proc/${pid}/oom_score_adj; done     # protect SSH

# Container processes get a higher OOM score (more likely to be killed first)
for pid in $(pidof container_process); do echo 1000 > /proc/${pid}/oom_score_adj; done
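Since pidof can return several PIDs, a reusable helper keeps the adjustment tidy. This is a sketch: the function name is illustrative, and the proc root is a parameter so the helper can be exercised against a test directory.

```shell
# Write an OOM score adjustment for every PID in a list
set_oom_score() {
    local proc_root=$1 score=$2; shift 2
    local pid
    for pid in "$@"; do
        echo "${score}" > "${proc_root}/${pid}/oom_score_adj"
    done
}

# set_oom_score /proc -1000 $(pidof sshd)   # protect all sshd processes
```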

Testing Resource Limits

After configuring limits, verify they are actually enforced. These tests confirm that containers cannot exceed their allocated resources and that resource abuse is properly contained.

# Test memory limit
docker exec sandbox bash -c 'stress --vm 1 --vm-bytes 5G --timeout 10s'
# Should fail with OOM (limit is 4GB)

# Test CPU limit
docker exec sandbox bash -c 'stress --cpu 4 --timeout 10s'
# Should use max 2 CPUs

# Test PID limit
docker exec sandbox bash -c 'fork_bomb(){ fork_bomb|fork_bomb & }; fork_bomb'
# Should stop at 100 processes
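These checks can be scripted: for the memory and PID tests the inner command is expected to fail, so a small wrapper that inverts the exit status makes the expectation explicit (a sketch; the wrapper name is illustrative):

```shell
# Succeed only if the wrapped command fails (i.e. the limit kicked in)
expect_limited() {
    if "$@" >/dev/null 2>&1; then
        echo "FAIL: command succeeded despite resource limits"
        return 1
    else
        echo "PASS: command was stopped by resource limits"
    fi
}

# expect_limited docker exec sandbox stress --vm 1 --vm-bytes 5G --timeout 10s
```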