Increasing memory usage over time #1600

@bjee19

Description

@bjee19

Bug Overview

Over the past few releases, our NGINX Gateway Fabric longevity tests have shown the NGINX/Gateway container (which runs both nginx and nginx-agent) slowly increasing in memory consumption over the 72-hour test period. The test scenario rolls out backend deployment apps on a set cadence, which means the NGINX conf changes and must be reapplied via nginx-agent on that same cadence. Traffic is also sent to the backend apps constantly, though after testing locally I don't think that affects nginx-agent's memory.

I've since verified this by exec-ing into the container and running top, which shows the memory consumption below:

Mem total:65848320 anon:927172 map:899296 free:61693900
 slab:247348 buf:111648 cache:2597796 dirty:136 write:0
Swap total:0 free:0
  PID   VSZ^VSZRW^  RSS (SHR) DIRTY (SHR) STACK COMMAND
   27 1455m  287m  268m     8  224m     0   132 nginx-agent
19450 18184  6416 11028  9212  5940  4124   132 nginx: worker process
19444 18136  6368 10944  9192  5876  4124   132 nginx: worker process
19434 18116  6348 11228  9256  5876  4032   132 nginx: worker process
19435 18092  6324 10896  9152  5864  4120   132 nginx: worker process
19446 18088  6320 10980  9228  5884  4132   132 nginx: worker process
19438 18084  6316 10972  9232  5848  4140   132 nginx: worker process
19442 18068  6300 10932  9220  5836  4124   132 nginx: worker process
19437 18056  6288 10944  9236  5848  4140   132 nginx: worker process
19439 18056  6288 10944  9220  5848  4124   132 nginx: worker process
19440 18044  6276 10924  9228  5828  4132   132 nginx: worker process
19441 18044  6276 10916  9244  5820  4148   132 nginx: worker process
19447 18028  6260 10896  9244  5800  4148   132 nginx: worker process
19445 18024  6256 10896  9216  5804  4124   132 nginx: worker process
19449 17980  6212 10848  9220  5752  4124   132 nginx: worker process
19443 17936  6168 10824  9220  5736  4132   132 nginx: worker process
19448 17932  6164 10832  9240  5736  4144   132 nginx: worker process
    8 17212  5444 12392  9116  5072  4136   132 nginx: master process /usr/sbin/nginx -g daemon off;
    1  2364   372  1808   584   280     0   132 {entrypoint.sh} /bin/bash /agent/entrypoint.sh
19726  1712   240  1212   888   192     0   132 sh
19732  1656   184  1000   920    72     0   132 top

Here the RSS column shows 268 MiB of physical memory used by the nginx-agent process, which accounts for the majority of the ~280 MiB of physical memory used by the NGINX container overall.
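For anyone reproducing this, the same figure can be read straight from /proc rather than top (a sketch; assumes a Linux /proc filesystem, and the nginx-agent pid from the capture above):

```shell
# Print a process's resident set size (RSS) in MiB, read from /proc.
# /proc/<pid>/status reports VmRSS in kB; 1 MiB = 1024 kB.
rss_mib() {
  pid="$1"
  kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
  echo $(( kb / 1024 ))
}

# Inside the container, pid 27 was nginx-agent in the capture above:
#   rss_mib 27
```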

For some reason, NGINX Plus instances see their memory increase quite early on and then stabilize, while NGINX OSS instances hold steady until around the two-day mark, then jump quickly and end up at roughly the same level as the NGINX Plus instances.

You can view the results here:

Expected Behavior

I don't think the nginx-agent process should be consuming so much memory over time.

Steps to Reproduce the Bug

I'm guessing any system where a controller rapidly sends conf updates to nginx through nginx-agent over time would reproduce this bug.
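The cadence itself doesn't need NGF-specific tooling; any loop that periodically triggers a rollout should do. A minimal sketch (the deployment name and the kubectl invocation in the usage comment are placeholders, and the loop is factored so the triggering command is pluggable):

```shell
# Run a given command N times on a fixed cadence, e.g. a rollout trigger.
# Usage: restart_loop <iterations> <command...>
#   restart_loop 60 kubectl rollout restart deployment/coffee
restart_loop() {
  n="$1"; shift
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@"
    i=$((i + 1))
    # sleep 60   # uncomment for a one-minute cadence between rollouts
  done
}
```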

To do so with NGF:

Deploy these manifests too:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: rollout-mgr
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rollout-mgr
rules:
- apiGroups:
  - "apps"
  resources:
  - deployments
  verbs:
  - patch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rollout-mgr
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rollout-mgr
subjects:
- kind: ServiceAccount
  name: rollout-mgr
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: coffee-rollout-mgr
spec:
  schedule: "* * * * *" # every minute
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: rollout-mgr
          containers:
          - name: coffee-rollout-mgr
            image: curlimages/curl:8.18.0
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            args:
            - |
                TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
                RESTARTED_AT=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
                curl -X PATCH -s -k -v \
                -H "Authorization: Bearer $TOKEN" \
                -H "Content-type: application/merge-patch+json" \
                --data-raw "{\"spec\": {\"template\": {\"metadata\": {\"annotations\": {\"kubectl.kubernetes.io/restartedAt\": \"$RESTARTED_AT\"}}}}}" \
                "https://kubernetes.default/apis/apps/v1/namespaces/default/deployments/coffee?fieldManager=kubectl-rollout" 2>&1
          restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: tea-rollout-mgr
spec:
  schedule: "* * * * *" # every minute
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: rollout-mgr
          containers:
          - name: tea-rollout-mgr
            image: curlimages/curl:8.18.0
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            args:
            - |
                TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
                RESTARTED_AT=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
                curl -X PATCH -s -k -v \
                -H "Authorization: Bearer $TOKEN" \
                -H "Content-type: application/merge-patch+json" \
                --data-raw "{\"spec\": {\"template\": {\"metadata\": {\"annotations\": {\"kubectl.kubernetes.io/restartedAt\": \"$RESTARTED_AT\"}}}}}" \
                "https://kubernetes.default/apis/apps/v1/namespaces/default/deployments/tea?fieldManager=kubectl-rollout" 2>&1
          restartPolicy: OnFailure
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: coffee
spec:
  parentRefs:
  - name: gateway
    sectionName: http
  hostnames:
  - "cafe.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /coffee
    backendRefs:
    - name: coffee
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: tea
spec:
  parentRefs:
  - name: gateway
    sectionName: https
  hostnames:
  - "cafe.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /tea
    backendRefs:
    - name: tea
      port: 80
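Each CronJob is effectively doing what kubectl rollout restart does: sending a merge patch that bumps a restartedAt annotation on the pod template, which forces a new rollout. The patch body it constructs can be sketched with the same printf/date logic:

```shell
# Build the merge-patch body the CronJobs send: a fresh UTC timestamp in the
# kubectl.kubernetes.io/restartedAt pod-template annotation. Changing any
# pod-template field triggers a rolling update of the Deployment.
RESTARTED_AT=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
printf '{"spec": {"template": {"metadata": {"annotations": {"kubectl.kubernetes.io/restartedAt": "%s"}}}}}\n' "$RESTARTED_AT"
```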

This should spin up a working NGINX Pod and 3 replicas each of the coffee and tea backend apps, with CronJobs rolling out those backend deployments every minute, causing constant nginx configuration updates. You can verify it's working by viewing the NGF logs:

kubectl logs -n nginx-gateway <NGF-POD-NAME> and looking for NGINX configuration was successfully updated or Sent nginx configuration to agent log messages.

Additionally, you can check that the nginx configuration is correct by running:
kubectl exec <gateway-pod-name> -- nginx -T and confirming the coffee and tea location blocks are present in the server block.

Once the system is running, exec into the gateway pod:

kubectl exec -it <gateway-pod-name> -- sh, then run top and press S to view memory details. Keep an eye on the RSS column, as that is the memory usage figure; as time passes you should see it slowly increase for the nginx-agent process.
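If you'd rather log the trend than watch it interactively, the RSS column can be scraped from top's batch mode (a sketch; the awk field index matches the busybox top layout in the capture above):

```shell
# Pull the RSS column (field 4) for nginx-agent out of `top -b -n1` output,
# whose data rows are: PID VSZ VSZRW RSS SHR DIRTY ... COMMAND
agent_rss() {
  awk '/nginx-agent/ {print $4}'
}

# Inside the container, sample once a minute and append to a log:
#   while true; do top -b -n1 | agent_rss; sleep 60; done >> rss.log
```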

Environment Details

  • Target deployment platform: Kind or GKE
  • Target OS: Not sure, but I think they are all Linux-based
  • Version of this project or specific commit: 3.8.0
  • Version of any relevant project languages:

Additional Context

I've verified that on an NGINX container receiving no conf updates, the nginx-agent process's memory usage stays relatively constant, i.e. the growth only appears when conf updates are flowing.

Agent with NGINX Plus conf:

command:
    server:
        host: my-release-nginx-gateway-fabric.nginx-gateway.svc
        port: 443
    auth:
        tokenpath: /var/run/secrets/ngf/serviceaccount/token
    tls:
        cert: /var/run/secrets/ngf/tls.crt
        key: /var/run/secrets/ngf/tls.key
        ca: /var/run/secrets/ngf/ca.crt
        server_name: my-release-nginx-gateway-fabric.nginx-gateway.svc
allowed_directories:
- /etc/nginx
- /usr/share/nginx
- /var/run/nginx
features:
- configuration
- certificates
- metrics
- api-action
labels:
    cluster-id: edd0806a-fe00-44ef-876d-5f822b1a727d
    control-id: b681d3c0-ac33-470d-9fdb-7bf5b298f11f
    control-name: my-release-nginx-gateway-fabric
    control-namespace: nginx-gateway
    owner-name: default_gateway-nginx
    owner-type: Deployment
    product-type: ngf
    product-version: edge
collector:
    log:
       path: "stdout"
    exporters:
        prometheus:
            server:
                host: "0.0.0.0"
                port: 9113
    pipelines:
        metrics:
            "default":
                receivers: ["host_metrics", "nginx_metrics"]
                exporters: ["prometheus"]

(NGINX OSS is the same except missing the api-action feature)

opentelemetry-collector-agent.yaml:

receivers:
  containermetrics:
    collection_interval: 1m0s
  hostmetrics:
    collection_interval: 1m0s
    initial_delay: 1s
    scrapers:
      network:
  nginxplus:
    instance_id: "e8d1bda6-397e-3b98-a179-e500ff99fbc7"
    api_details:
      url: "http://nginx-plus-api/api"
      listen: "unix:/var/run/nginx/nginx-plus-api.sock"
      location: "/api"
      ca: ""
    collection_interval: 1m0s

processors:
  resource/default:
    attributes:
      - key: resource.id
        action: insert
        value: 2ee2e2ff-ba65-36ac-b137-eccf604354a5
  batch/default_logs:
    send_batch_size: 100
    timeout: 1m0s
    send_batch_max_size: 100
  batch/default_metrics:
    send_batch_size: 1000
    timeout: 30s
    send_batch_max_size: 1000
  securityviolations/default: {}

  logsgzip/default: {}

exporters:
  otlp/default:
    endpoint: "my-release-nginx-gateway-fabric.nginx-gateway.svc:443"
    timeout: 10s
    retry_on_failure:
      enabled: true
      initial_interval: 10s
      max_interval: 60s
      max_elapsed_time: 10m
    tls:
      insecure: false
      insecure_skip_verify: false
      ca_file: "/var/run/secrets/ngf/ca.crt"
      cert_file: "/var/run/secrets/ngf/tls.crt"
      key_file: "/var/run/secrets/ngf/tls.key"
      server_name_override: "my-release-nginx-gateway-fabric.nginx-gateway.svc"
    auth:
      authenticator: headers_setter
  prometheus:
    endpoint: "0.0.0.0:9113"
    resource_to_telemetry_conversion:
      enabled: true
extensions:
  headers_setter:
    headers:
      - action: "insert"
        key: "authorization"
        value: "redacted"
      - action: "insert"
        key: "control-name"
        value: "my-release-nginx-gateway-fabric"
      - action: "insert"
        key: "owner-name"
        value: "default_gateway-nginx"
      - action: "insert"
        key: "owner-type"
        value: "Deployment"
      - action: "insert"
        key: "product-version"
        value: "edge"
      - action: "insert"
        key: "control-namespace"
        value: "nginx-gateway"
      - action: "insert"
        key: "product-type"
        value: "ngf"
      - action: "insert"
        key: "cluster-id"
        value: "edd0806a-fe00-44ef-876d-5f822b1a727d"
      - action: "insert"
        key: "control-id"
        value: "b681d3c0-ac33-470d-9fdb-7bf5b298f11f"
      - action: "insert"
        key: "uuid"
        value: "2ee2e2ff-ba65-36ac-b137-eccf604354a5"

service:
  telemetry:
    metrics:
      level: none
    logs:
      level: INFO
      output_paths: ["stdout"]
      error_output_paths: ["stdout"]
  extensions:
    - headers_setter

  pipelines:
    metrics/default:
      receivers:
        - containermetrics
        - hostmetrics
        - nginxplus
      processors:
        - resource/default
      exporters:
        - prometheus

Labels: bug