If the database is inaccessible (e.g. due to overly restrictive security group rules, invalid credentials as might happen during credential cycling, or sustained AWS maintenance), the site becomes inaccessible: initially with a 500 (served by gunicorn) and shortly afterwards with a 504 (once AWS notices the failed task).
The outage presents as a 504 page that takes approximately 30 seconds to serve. This page is not YNR/DemocracyClub branded; it is either an unstyled page generated by the load balancer (when the ALB hostname is being used) or an unstyled page generated by CloudFront (when the hostname mentioned in the FQDN parameter store value is used; see commit 625f1caf).
The request times out because the application is unable to connect to the database, but the connection attempt itself never times out, so the gunicorn parent process sees an unresponsive worker process and kills it. (Small aside: because the app isn't logging at this point, the messages in the task logs refer to gunicorn killing an unresponsive process but give no indication why.) ECS eventually sees a failed task and replaces it with a new one, and the cycle repeats every 10 minutes.
A couple of ways of solving this would be:
- Have the app time out database connections earlier, or even immediately
- Add a CloudFront error handler for when the backing site (i.e. the ALB) reports an error
The chosen solution need not be limited to these two, though.
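As a rough illustration of the first option, here is a minimal sketch of a settings fragment. It assumes the app is a Django project using the PostgreSQL backend, which this issue does not state. `connect_timeout` is the libpq connection parameter (in seconds) that Django passes through via `OPTIONS`. The `database_settings` helper, the `DB_*` environment variable names and the 5 second default are all hypothetical, chosen only for the example.

```python
import os


def database_settings(env=os.environ):
    """Build a Django-style DATABASES dict with a short connect timeout.

    Assumption: Django with the PostgreSQL backend. The DB_* variable
    names are hypothetical and not taken from the project.
    """
    # 5 seconds is an arbitrary default, picked to sit well below the
    # roughly 30 second window after which the 504 is served.
    timeout = int(env.get("DB_CONNECT_TIMEOUT", "5"))
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": env.get("DB_NAME", ""),
            "USER": env.get("DB_USER", ""),
            "PASSWORD": env.get("DB_PASSWORD", ""),
            "HOST": env.get("DB_HOST", ""),
            "PORT": env.get("DB_PORT", "5432"),
            # libpq gives up on establishing a connection after this
            # many seconds instead of waiting indefinitely.
            "OPTIONS": {"connect_timeout": timeout},
        }
    }


DATABASES = database_settings()
```

With something like this in place, an unreachable database should surface as a connection error in the app within a few seconds, which would give the app a chance to log the cause before gunicorn kills the process. Note that this only bounds connection establishment; it does not limit how long a query may run once connected.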