Versions Involved
- Concourse: 4.2.1 and 5.1.0
- concourse-summary: 6b447f0
Problem Description
concourse-summary will occasionally overwhelm our Concourse server with network connections to its atc processes. This causes the Concourse server's web workers to run out of file handles and be unable to function correctly.
Investigation reveals that our concourse-summary instance has ~1k ESTABLISHED connections to our Concourse server, and each of our two web instances has ~1k ESTABLISHED connections between the atc process running on that web instance and the concourse-summary instance. (Yes, this mismatch seems a little strange.)
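To compare the client-side and server-side counts in a repeatable way, the connections can be tallied per remote endpoint on each host. The following is a minimal sketch, not part of concourse-summary or Concourse. It assumes Linux, IPv4 only (`/proc/net/tcp6` is not handled), and a little-endian host such as x86, where the kernel prints addresses in host byte order. The function names `decode_ipv4` and `count_established` are hypothetical.

```python
import socket
import struct
from collections import Counter

# State code the Linux kernel uses for ESTABLISHED sockets in /proc/net/tcp.
TCP_ESTABLISHED = "01"


def decode_ipv4(hex_addr: str) -> str:
    """Convert an entry like '0100007F:1F90' into '127.0.0.1:8080'.

    Assumes a little-endian host, where the IPv4 address is printed in
    host byte order; the port is always printed as a plain hex number.
    """
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return f"{ip}:{int(port_hex, 16)}"


def count_established(proc_net_tcp: str) -> Counter:
    """Count ESTABLISHED IPv4 connections per remote endpoint.

    `proc_net_tcp` is the full text of /proc/net/tcp, header row included.
    """
    counts = Counter()
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[1] is the local address, fields[2] the remote, fields[3] the state
        remote, state = fields[2], fields[3]
        if state == TCP_ESTABLISHED:
            counts[decode_ipv4(remote)] += 1
    return counts
```

On each host, `count_established(open("/proc/net/tcp").read()).most_common(10)` lists the busiest remote endpoints. Running it on the concourse-summary host and on both web instances shows whether the three ~1k figures really refer to the same connections.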
As one would expect, the following procedure shuts down these ESTABLISHED connections and gets us back in working order:
- Stop concourse-summary
- Restart the atc service on each Concourse web instance
- Wait for the atc services to come back up
- Start concourse-summary
Expected Behavior
concourse-summary should only have enough network connections open to get its job done. Given that there are fewer than 200 connections open right after we restart concourse-summary, ~1k connections seems to be too many.
More Details
We have seen this issue happen twice in the past ~four months. We do not currently know if this is a gradual increase in the number of ESTABLISHED connections, or if this happens suddenly.
Our web instances are behind a GCP TCP Regional Load Balancer.
Our concourse-summary instance is providing a summary of both our Concourse server (version 4.2.1) and the Wings Concourse server (version 5.1.0).
concourse-summary is deployed on PCF 2.4 running on top of vSphere.
Unfortunately, we don't know what software (concourse-summary, Concourse, GCP Load Balancer) is at fault.