Strategize materialized view refresh timing relative to collection phases #315

@shlokgilda

Spinning out from #292 review (cc @MoralCode).

The refresh task in collectoss/tasks/db/refresh_materialized_views.py refreshes every view on whatever schedule Celery beat dictates, regardless of:

  • Whether collection is mid-cycle for a given repo (a refresh that lands mid-cycle leaves the view holding partial data for that repo until the next refresh).
  • Which collection phase (core / secondary / facade) feeds each view; we may refresh views whose source hasn't actually changed.
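  • Concurrent inserts. REFRESH MATERIALIZED VIEW CONCURRENTLY doesn't block reads, but only one refresh can run against a given view at a time, so overlapping runs queue behind each other. During heavy collection windows a long-running refresh can also interleave with writes in surprising ways: for example, it works from the snapshot taken when it started, so rows committed while it runs don't show up until the next refresh.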
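
To make the self-serialization point concrete, here is a rough sketch of one possible guard: take a PostgreSQL session-level advisory lock per view and skip the refresh when another worker already holds it, rather than queueing behind it. This is not something the current task does. `refresh_if_free` and `lock_key` are hypothetical names, and the sketch assumes a DB-API cursor with `%s` parameters (psycopg2 style) on an autocommit connection, with view names coming from a trusted registry rather than user input.

```python
import zlib


def lock_key(view_name: str) -> int:
    """Derive a stable advisory-lock key from the view name.

    crc32 returns an unsigned 32-bit integer, which fits in the bigint
    argument that pg_try_advisory_lock expects.
    """
    return zlib.crc32(view_name.encode("utf-8"))


def refresh_if_free(cursor, view_name: str) -> bool:
    """Refresh the view unless another session holds its advisory lock.

    Returns True if the refresh ran, False if it was skipped.
    Assumes view_name is a trusted constant, since it is interpolated
    directly into the SQL statement.
    """
    key = lock_key(view_name)
    cursor.execute("SELECT pg_try_advisory_lock(%s)", (key,))
    if not cursor.fetchone()[0]:
        # Someone else is already refreshing this view; skip instead of waiting.
        return False
    try:
        cursor.execute(f"REFRESH MATERIALIZED VIEW CONCURRENTLY {view_name}")
    finally:
        # Session-level advisory locks are not released at commit,
        # so release explicitly even when the refresh raises.
        cursor.execute("SELECT pg_advisory_unlock(%s)", (key,))
    return True
```

Skipping rather than waiting means a slow refresh delays the next one by at most a beat interval instead of piling up queued refreshes. Whether that trade-off fits here is part of what this issue should decide.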

Stuff worth thinking through:

  • Trigger refresh after a collection phase completes for a repo group, instead of on a wall clock?
  • Tag each view in the registry with the phases that feed it; only refresh views whose phases just finished?
  • Track last_refreshed_at per view, skip if nothing changed?
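  • issue_reporter_created_at lacks a unique index, so it can only refresh non-concurrently. That takes an ACCESS EXCLUSIVE lock, which briefly blocks reads of the view while the refresh runs. Schedule it separately, or add a unique index (materialized views can't carry constraints; CONCURRENTLY needs a unique index on plain columns with no WHERE clause) to bring it onto the concurrent path?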
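
A minimal sketch of how the ideas above could fit together, assuming a registry that tags each view with its feeding phases and a last_refreshed_at timestamp. Everything here is hypothetical: `ViewSpec`, `views_to_refresh`, `refresh_sql`, and the two `example_*` view names are placeholders, and the phase assigned to issue_reporter_created_at is a guess for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ViewSpec:
    name: str
    phases: frozenset                      # collection phases that feed this view
    concurrent: bool = True                # False when the view has no unique index
    last_refreshed_at: Optional[datetime] = None


# Hypothetical registry; the phase tags are placeholders, not the real mapping.
REGISTRY = [
    ViewSpec("example_core_view", frozenset({"core"})),
    ViewSpec("example_facade_view", frozenset({"facade"})),
    ViewSpec("issue_reporter_created_at", frozenset({"core"}), concurrent=False),
]


def views_to_refresh(registry: List[ViewSpec], finished_phase: str,
                     phase_finished_at: datetime) -> List[ViewSpec]:
    """Return views fed by finished_phase that are stale relative to it."""
    due = []
    for view in registry:
        if finished_phase not in view.phases:
            continue  # this phase does not feed the view
        refreshed = view.last_refreshed_at
        if refreshed is not None and refreshed >= phase_finished_at:
            continue  # already refreshed since the phase finished
        due.append(view)
    return due


def refresh_sql(view: ViewSpec) -> str:
    """Build the refresh statement; names are trusted registry constants."""
    mode = "CONCURRENTLY " if view.concurrent else ""
    return f"REFRESH MATERIALIZED VIEW {mode}{view.name}"


if __name__ == "__main__":
    finished = datetime(2024, 1, 1, tzinfo=timezone.utc)
    for view in views_to_refresh(REGISTRY, "core", finished):
        print(refresh_sql(view))
```

A phase-completion hook could call `views_to_refresh` with the phase that just finished and run each statement, setting last_refreshed_at afterwards. The non-concurrent view could also be filtered out here and routed to its own schedule.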

Library/fork choice for view + index management is a separate conversation — see #314.
