Certain categories of database inaccessibility cause a site outage #2556

@mattpep

If the database is inaccessible (e.g. due to overly restrictive security group rules, invalid credentials as might happen during credential cycling, or sustained AWS maintenance), the site becomes inaccessible: initially with a 500 (served by gunicorn), and shortly afterwards with a 504 (once AWS notices the failed task).

The outage presents as a 504 page that takes approximately 30 seconds to serve. This page is not YNR/DemocracyClub branded; it's either an unstyled page generated by the load balancer (when the ALB hostname is being used) or an unstyled page generated by CloudFront (when the hostname mentioned in the FQDN parameter store value is used; see commit 625f1caf).

The request times out because the application is unable to connect to the database but never gives up on the attempt, so the gunicorn parent process sees an unresponsive process and kills it. (Small aside: because the app isn't logging at this point, the messages shown in the task logs refer to gunicorn killing an unresponsive process but give no indication why.) ECS eventually sees a failed task and starts a new one, and the cycle repeats every 10 minutes.

A couple of ways of solving this would be:

  • Have the app time out db connections earlier or even immediately
  • Add a CloudFront error handler for when the backing site (i.e. the ALB) returns an error

Though the chosen solution need not be limited to either of these two.
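
As a rough illustration of the first option, here is a minimal sketch of a settings fragment that bounds how long a connection attempt may take. It assumes the app is a Django project talking to PostgreSQL through psycopg, in which case the `OPTIONS` dict is passed through to the driver and `connect_timeout` is the libpq connection parameter (in seconds). The `build_databases` helper, the `DB_*` environment variable names, the default values and the 5 second figure are all hypothetical and not taken from the YNR codebase.

```python
import os


def build_databases(env=None, connect_timeout_seconds=5):
    """Return a Django-style DATABASES dict with a bounded connect timeout.

    The DB_* variable names and the defaults are illustrative assumptions.
    """
    env = os.environ if env is None else env
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": env.get("DB_NAME", "ynr"),
            "USER": env.get("DB_USER", "ynr"),
            "PASSWORD": env.get("DB_PASSWORD", ""),
            "HOST": env.get("DB_HOST", "localhost"),
            "PORT": env.get("DB_PORT", "5432"),
            # Passed to the database driver; for PostgreSQL this is the libpq
            # "connect_timeout" parameter, so an unreachable host fails after
            # a few seconds instead of hanging the worker.
            "OPTIONS": {"connect_timeout": connect_timeout_seconds},
        }
    }


# In a Django settings module this would be the module-level DATABASES value.
DATABASES = build_databases()
```

If this matches the real stack, the timeout would need to sit comfortably below gunicorn's worker timeout, so that the app raises (and can log) a connection error before the parent process kills it. Note that `connect_timeout` only bounds connection establishment, so slow queries on an established connection would need separate handling.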
