Bug Overview
Over the past few releases of our NGINX Gateway Fabric longevity tests, we've noticed the NGINX/Gateway container (which runs both nginx and nginx-agent) slowly increasing in memory consumption over the 72-hour test period. The test scenario rolls out backend deployment apps on a set cadence, meaning the NGINX conf changes and must be pushed via nginx-agent on that same cadence. Traffic is also sent to the backend apps constantly, though after testing locally I don't think that affects the memory of nginx-agent.
I've since verified this by exec'ing into the container and running top, which shows the memory consumption below:
Mem total:65848320 anon:927172 map:899296 free:61693900
slab:247348 buf:111648 cache:2597796 dirty:136 write:0
Swap total:0 free:0
PID VSZ^VSZRW^ RSS (SHR) DIRTY (SHR) STACK COMMAND
27 1455m 287m 268m 8 224m 0 132 nginx-agent
19450 18184 6416 11028 9212 5940 4124 132 nginx: worker process
19444 18136 6368 10944 9192 5876 4124 132 nginx: worker process
19434 18116 6348 11228 9256 5876 4032 132 nginx: worker process
19435 18092 6324 10896 9152 5864 4120 132 nginx: worker process
19446 18088 6320 10980 9228 5884 4132 132 nginx: worker process
19438 18084 6316 10972 9232 5848 4140 132 nginx: worker process
19442 18068 6300 10932 9220 5836 4124 132 nginx: worker process
19437 18056 6288 10944 9236 5848 4140 132 nginx: worker process
19439 18056 6288 10944 9220 5848 4124 132 nginx: worker process
19440 18044 6276 10924 9228 5828 4132 132 nginx: worker process
19441 18044 6276 10916 9244 5820 4148 132 nginx: worker process
19447 18028 6260 10896 9244 5800 4148 132 nginx: worker process
19445 18024 6256 10896 9216 5804 4124 132 nginx: worker process
19449 17980 6212 10848 9220 5752 4124 132 nginx: worker process
19443 17936 6168 10824 9220 5736 4132 132 nginx: worker process
19448 17932 6164 10832 9240 5736 4144 132 nginx: worker process
8 17212 5444 12392 9116 5072 4136 132 nginx: master process /usr/sbin/nginx -g daemon off;
1 2364 372 1808 584 280 0 132 {entrypoint.sh} /bin/bash /agent/entrypoint.sh
19726 1712 240 1212 888 192 0 132 sh
19732 1656 184 1000 920 72 0 132 top
The RSS column shows the nginx-agent process using 268 MiB of physical memory, which accounts for the majority of the ~280 MiB of physical memory used by the NGINX container.
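As a sanity check on these numbers: busybox top prints sizes in KiB, using an "m" suffix for MiB (an assumption about the busybox unit convention). A quick sketch, hard-coding a couple of values from the snapshot above, to compare the agent against one worker:

```python
# Convert a busybox-top size field to KiB: bare numbers are KiB,
# an "m" suffix means MiB (assumed busybox unit convention).
def rss_kib(value: str) -> int:
    if value.endswith("m"):
        return int(value[:-1]) * 1024
    return int(value)

agent_kib = rss_kib("268m")     # nginx-agent RSS from the snapshot above
worker_kib = rss_kib("11028")   # one nginx worker process, for comparison
print(agent_kib // 1024, "MiB agent vs", worker_kib // 1024, "MiB worker")
# → 268 MiB agent vs 10 MiB worker
```

So a single worker sits around 10 MiB while the agent dwarfs everything else in the container.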
For some reason, NGINX Plus instances would see their memory increase quite early on and then stabilize, while NGINX OSS instances would hold steady until around the two-day mark, then jump quickly and end up at roughly the same level as the NGINX Plus instances.
You can view the results here:
Expected Behavior
I don't think the nginx-agent process should be consuming so much memory over time.
Steps to Reproduce the Bug
I suspect any system where a controller sends conf updates to nginx through nginx-agent rapidly over time would reproduce this bug.
To do so with NGF:
Deploy the gateway.yaml, cafe-secret.yaml, and cafe.yaml files in this directory: https://github.com/nginx/nginx-gateway-fabric/tree/main/tests/suite/manifests/longevity
Deploy these manifests too:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rollout-mgr
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rollout-mgr
rules:
- apiGroups:
  - "apps"
  resources:
  - deployments
  verbs:
  - patch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rollout-mgr
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rollout-mgr
subjects:
- kind: ServiceAccount
  name: rollout-mgr
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: coffee-rollout-mgr
spec:
  schedule: "* * * * *" # every minute
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: rollout-mgr
          containers:
          - name: coffee-rollout-mgr
            image: curlimages/curl:8.18.0
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            args:
            - |
              TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
              RESTARTED_AT=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
              curl -X PATCH -s -k -v \
                -H "Authorization: Bearer $TOKEN" \
                -H "Content-type: application/merge-patch+json" \
                --data-raw "{\"spec\": {\"template\": {\"metadata\": {\"annotations\": {\"kubectl.kubernetes.io/restartedAt\": \"$RESTARTED_AT\"}}}}}" \
                "https://kubernetes.default/apis/apps/v1/namespaces/default/deployments/coffee?fieldManager=kubectl-rollout" 2>&1
          restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: tea-rollout-mgr
spec:
  schedule: "* * * * *" # every minute
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: rollout-mgr
          containers:
          - name: tea-rollout-mgr
            image: curlimages/curl:8.18.0
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            args:
            - |
              TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
              RESTARTED_AT=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
              curl -X PATCH -s -k -v \
                -H "Authorization: Bearer $TOKEN" \
                -H "Content-type: application/merge-patch+json" \
                --data-raw "{\"spec\": {\"template\": {\"metadata\": {\"annotations\": {\"kubectl.kubernetes.io/restartedAt\": \"$RESTARTED_AT\"}}}}}" \
                "https://kubernetes.default/apis/apps/v1/namespaces/default/deployments/tea?fieldManager=kubectl-rollout" 2>&1
          restartPolicy: OnFailure
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: coffee
spec:
  parentRefs:
  - name: gateway
    sectionName: http
  hostnames:
  - "cafe.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /coffee
    backendRefs:
    - name: coffee
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: tea
spec:
  parentRefs:
  - name: gateway
    sectionName: https
  hostnames:
  - "cafe.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /tea
    backendRefs:
    - name: tea
      port: 80
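The CronJobs above mimic what kubectl rollout restart does: they merge-patch a kubectl.kubernetes.io/restartedAt annotation into the deployment's pod template, which forces a rollout (and thus an nginx conf update through nginx-agent). A sketch of the patch body they send:

```python
import datetime
import json

# Build the same merge-patch body the CronJobs POST to the Kubernetes API.
# Changing this pod-template annotation is what triggers each rollout.
restarted_at = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {"kubectl.kubernetes.io/restartedAt": restarted_at}
            }
        }
    }
}
print(json.dumps(patch))
```

Bumping the timestamp every minute guarantees the pod template always changes, so every CronJob run produces a fresh rollout.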
This should spin up a working NGINX Pod, 3 replicas each of the coffee and tea backend apps, and CronJobs that roll out those backend deployments every minute, causing constant nginx configuration updates. You can verify it is working by viewing the NGF logs:
kubectl logs -n nginx-gateway <NGF-POD-NAME> and looking for "NGINX configuration was successfully updated" or "Sent nginx configuration to agent" log messages.
Additionally, you can check that the nginx configuration is correct by running:
kubectl exec <gateway-pod-name> -- nginx -T and looking for the coffee and tea location blocks in the server block.
Once the system is running, exec into the gateway pod:
kubectl exec -it <gateway-pod-name> -- sh and run top, press S to view memory, then keep an eye on the RSS column, as that is the physical memory usage. As time passes you should see that number slowly increase for the nginx-agent process.
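If you'd rather log the growth than watch top, the same figure can be read from /proc/<pid>/status inside the container (fetched via kubectl exec, for instance). A minimal parsing sketch, using a hypothetical sample of that file:

```python
# Extract VmRSS (resident set size, in KiB) from a /proc/<pid>/status dump.
def vmrss_kib(status_text: str) -> int:
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line format: "VmRSS:    274432 kB"
            return int(line.split()[1])
    raise ValueError("no VmRSS line found")

# Hypothetical sample of what `cat /proc/<agent-pid>/status` might return:
sample = "Name:\tnginx-agent\nVmRSS:\t  274432 kB\nThreads:\t10\n"
print(vmrss_kib(sample) // 1024, "MiB")  # → 268 MiB
```

Sampling this on a schedule and plotting the values makes the slow climb much easier to see than eyeballing top.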
Environment Details
- Target deployment platform: Kind or GKE
- Target OS: Not sure, but I believe they are all Linux-based
- Version of this project or specific commit: 3.8.0
- Version of any relevant project languages:
Additional Context
I've verified that on an NGINX container receiving no conf updates, the nginx-agent process's memory usage stays relatively constant, i.e., there is no issue in that case; the growth appears tied to configuration updates.
Agent with NGINX Plus conf:
command:
  server:
    host: my-release-nginx-gateway-fabric.nginx-gateway.svc
    port: 443
  auth:
    tokenpath: /var/run/secrets/ngf/serviceaccount/token
  tls:
    cert: /var/run/secrets/ngf/tls.crt
    key: /var/run/secrets/ngf/tls.key
    ca: /var/run/secrets/ngf/ca.crt
    server_name: my-release-nginx-gateway-fabric.nginx-gateway.svc
allowed_directories:
- /etc/nginx
- /usr/share/nginx
- /var/run/nginx
features:
- configuration
- certificates
- metrics
- api-action
labels:
  cluster-id: edd0806a-fe00-44ef-876d-5f822b1a727d
  control-id: b681d3c0-ac33-470d-9fdb-7bf5b298f11f
  control-name: my-release-nginx-gateway-fabric
  control-namespace: nginx-gateway
  owner-name: default_gateway-nginx
  owner-type: Deployment
  product-type: ngf
  product-version: edge
collector:
  log:
    path: "stdout"
  exporters:
    prometheus:
      server:
        host: "0.0.0.0"
        port: 9113
  pipelines:
    metrics:
      "default":
        receivers: ["host_metrics", "nginx_metrics"]
        exporters: ["prometheus"]
(NGINX OSS is the same except missing the api-action feature)
opentelemetry-collector-agent.yaml:
receivers:
  containermetrics:
    collection_interval: 1m0s
  hostmetrics:
    collection_interval: 1m0s
    initial_delay: 1s
    scrapers:
      network:
  nginxplus:
    instance_id: "e8d1bda6-397e-3b98-a179-e500ff99fbc7"
    api_details:
      url: "http://nginx-plus-api/api"
      listen: "unix:/var/run/nginx/nginx-plus-api.sock"
      location: "/api"
      ca: ""
    collection_interval: 1m0s
processors:
  resource/default:
    attributes:
    - key: resource.id
      action: insert
      value: 2ee2e2ff-ba65-36ac-b137-eccf604354a5
  batch/default_logs:
    send_batch_size: 100
    timeout: 1m0s
    send_batch_max_size: 100
  batch/default_metrics:
    send_batch_size: 1000
    timeout: 30s
    send_batch_max_size: 1000
  securityviolations/default: {}
  logsgzip/default: {}
exporters:
  otlp/default:
    endpoint: "my-release-nginx-gateway-fabric.nginx-gateway.svc:443"
    timeout: 10s
    retry_on_failure:
      enabled: true
      initial_interval: 10s
      max_interval: 60s
      max_elapsed_time: 10m
    tls:
      insecure: false
      insecure_skip_verify: false
      ca_file: "/var/run/secrets/ngf/ca.crt"
      cert_file: "/var/run/secrets/ngf/tls.crt"
      key_file: "/var/run/secrets/ngf/tls.key"
      server_name_override: "my-release-nginx-gateway-fabric.nginx-gateway.svc"
    auth:
      authenticator: headers_setter
  prometheus:
    endpoint: "0.0.0.0:9113"
    resource_to_telemetry_conversion:
      enabled: true
extensions:
  headers_setter:
    headers:
    - action: "insert"
      key: "authorization"
      value: "redacted"
    - action: "insert"
      key: "control-name"
      value: "my-release-nginx-gateway-fabric"
    - action: "insert"
      key: "owner-name"
      value: "default_gateway-nginx"
    - action: "insert"
      key: "owner-type"
      value: "Deployment"
    - action: "insert"
      key: "product-version"
      value: "edge"
    - action: "insert"
      key: "control-namespace"
      value: "nginx-gateway"
    - action: "insert"
      key: "product-type"
      value: "ngf"
    - action: "insert"
      key: "cluster-id"
      value: "edd0806a-fe00-44ef-876d-5f822b1a727d"
    - action: "insert"
      key: "control-id"
      value: "b681d3c0-ac33-470d-9fdb-7bf5b298f11f"
    - action: "insert"
      key: "uuid"
      value: "2ee2e2ff-ba65-36ac-b137-eccf604354a5"
service:
  telemetry:
    metrics:
      level: none
    logs:
      level: INFO
      output_paths: ["stdout"]
      error_output_paths: ["stdout"]
  extensions:
  - headers_setter
  pipelines:
    metrics/default:
      receivers:
      - containermetrics
      - hostmetrics
      - nginxplus
      processors:
      - resource/default
      exporters:
      - prometheus