Certain categories of database inaccessibility cause a site outage

If the database is inaccessible (e.g. due to overly restrictive security group rules, invalid credentials as might happen during credential cycling, or sustained AWS maintenance) then the site becomes inaccessible: initially with a 500 (served by gunicorn) and shortly afterwards with a 504 (when AWS notices the failed task).

The outage presents as a 504 page which takes approximately 30 seconds to serve. This page is not YNR/DemocracyClub branded; it's either an unstyled page generated by the load balancer (when the ALB hostname is being used) or an unstyled page generated by CloudFront (when the hostname mentioned in the `FQDN` parameter store value is used; see commit `625f1caf`).

The request times out because the application is unable to connect to the database, but the connection attempt itself never times out. The gunicorn parent process therefore sees an unresponsive worker process and kills it. (Small aside: because the app isn't logging at this point, the messages shown in the task logs refer to gunicorn killing an unresponsive process but give no indication why.) ECS eventually sees a failed task and replaces it with a new one, with the process repeating every 10 minutes.
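To illustrate the difference between hanging and failing fast, here is a minimal sketch of a bounded connection probe that logs the reason for the failure. It uses only the Python standard library; the function name `probe_database`, the logger name, and the default timeout are hypothetical and not taken from the YNR codebase.

```python
import logging
import socket

logger = logging.getLogger("db_probe")  # hypothetical logger name


def probe_database(host: str, port: int, timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    Illustration only: this checks network reachability, not credentials.
    """
    try:
        # create_connection raises OSError (including timeouts and refused
        # connections) instead of blocking indefinitely.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError as exc:
        # Logging the cause gives the task logs something more useful than
        # gunicorn's "unresponsive process" message.
        logger.error("Database unreachable at %s:%s: %s", host, port, exc)
        return False
```

A check like this only covers the network-level cases (such as security group rules); invalid credentials would need to be caught at the database driver level.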

A couple of ways of solving this would be:

- Have the app time out database connections earlier, or even immediately
- Add a CloudFront error handler for when the backing site (i.e. the ALB) returns an error

Though the chosen solution need not be limited to either of these two.
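As a rough sketch of the first option, assuming the app is a Django project using the PostgreSQL backend (the issue itself does not confirm this), the connection timeout can be bounded through the `OPTIONS` dict, which Django passes through to the database driver. `connect_timeout` is a standard libpq connection parameter; the environment variable names, defaults, and the `build_databases` helper below are hypothetical.

```python
import os


def build_databases(env=os.environ):
    """Build a Django DATABASES setting with a bounded connection timeout.

    All environment variable names here are hypothetical placeholders.
    """
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": env.get("POSTGRES_DB", "ynr"),
            "USER": env.get("POSTGRES_USER", "ynr"),
            "PASSWORD": env.get("POSTGRES_PASSWORD", ""),
            "HOST": env.get("POSTGRES_HOST", "localhost"),
            "PORT": env.get("POSTGRES_PORT", "5432"),
            "OPTIONS": {
                # libpq parameter: give up on connecting after this many
                # seconds rather than waiting on the OS-level TCP timeout.
                "connect_timeout": int(env.get("DB_CONNECT_TIMEOUT", "5")),
            },
        }
    }


# In a settings module this would be assigned at import time.
DATABASES = build_databases()
```

With a timeout shorter than gunicorn's worker timeout, a failed connection would surface as an application exception (which can be logged and rendered as a branded error page) rather than as a killed worker. Note that `connect_timeout` only bounds connection establishment, not slow queries.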

Certain categories of database inaccessibility cause a site outage #2556

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Certain categories of database inaccessibility cause a site outage #2556

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions