Add updater for tol portal #7

Merged
ccaio merged 5 commits into main from update-tol-status on Mar 24, 2026
Conversation

@ccaio ccaio commented Mar 19, 2026

This updater generates a TSV file listing the species in the ToL Pipeline and their sequencing status, from approved manifests through to submitted assembly. It currently includes ToL external projects but does not create explicit status-tracking fields for faculty projects (to be decided later on a case-by-case basis).

Summary by Sourcery

Generate and upload a TSV export of ToL portal species sequencing status for use by downstream consumers.

New Features:

  • Add a Prefect flow and tasks to query the ToL portal for species with accepted samples and produce a status TSV file.
  • Include computed sequencing milestones and per-project status fields for major ToL projects in the TSV output.
  • Upload the generated TSV to S3 when it meets a minimum record threshold and emit a completion event for monitoring.

Enhancements:

  • Standardize project milestone ordering and status mapping from detailed TOLA fields into simplified sequencing states.
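
As a rough illustration of the milestone ordering and status mapping described above, the sketch below maps date fields to simplified states and picks the furthest one reached. Only the two field/status pairs shown are taken from code quoted in this PR; the `MILESTONES` list, the helper names, and the boolean column layout are assumptions made for the example, not the PR's actual implementation.

```python
from types import SimpleNamespace

# Ordered from earliest to latest milestone. The two pairs below appear in the
# PR's quoted code; treating them as a complete, ordered list is an assumption.
MILESTONES = [
    ("sample_acquired", "sts_sample_sts_receive_date_min"),
    ("data_generation", "benchling_tissue_prep_benchling_sampleprep_date_min"),
]


def latest_milestone(species):
    """Return the furthest milestone whose date field is set, else ''."""
    latest = ""
    for status, field in MILESTONES:
        if getattr(species, field, None):
            latest = status
    return latest


def milestone_columns(species):
    """Expand the latest milestone into one boolean column per milestone (hypothetical layout)."""
    latest = latest_milestone(species)
    names = [status for status, _ in MILESTONES]
    reached = names.index(latest) + 1 if latest else 0
    return {name: index < reached for index, name in enumerate(names)}


# A stand-in record with only the sample receipt date set.
example_species = SimpleNamespace(
    sts_sample_sts_receive_date_min="2026-01-15",
    benchling_tissue_prep_benchling_sampleprep_date_min=None,
)
example_columns = milestone_columns(example_species)
```

Keeping the ordering in a single list means a new milestone can be added in one place without touching the lookup logic.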

@ccaio ccaio self-assigned this Mar 19, 2026
@ccaio ccaio added the enhancement New feature or request label Mar 19, 2026
sourcery-ai Bot commented Mar 19, 2026

Reviewer's Guide

Adds a new Prefect-based updater flow that queries the ToL portal for species with accepted samples, derives sequencing status and per-project milestones, writes a TSV summary locally, optionally uploads it to S3, and emits a completion event.

Sequence diagram for the update_tol_portal_status Prefect flow

sequenceDiagram
    actor CLI
    participant Flow as update_tol_portal_status
    participant Fetch as fetch_tol_portal_status
    participant Portal as ToL_portal_API
    participant Upload as upload_s3_tsv
    participant S3 as S3_bucket
    participant Events as Event_sink

    CLI->>Flow: update_tol_portal_status(output_path, s3_path, min_records)
    activate Flow
    Flow->>Fetch: fetch_tol_portal_status(file_path, min_lines)
    activate Fetch

    Fetch->>Portal: connect_to_portal().get_list("species", filter sts_sample_sts_accept_date_min exists)
    Portal-->>Fetch: filtered_set of species

    loop for each species in filtered_set
        Fetch->>Fetch: compute fields (projects, gals, statuses)
        Fetch->>Fetch: get_project_and_milestones(species)
        Fetch->>Fetch: write TSV line
    end

    Fetch->>Fetch: validate line_count >= min_lines
    Fetch-->>Flow: line_count
    deactivate Fetch

    alt line_count >= min_records and s3_path is set
        Flow->>Upload: upload_s3_tsv(output_path, s3_path)
        activate Upload
        Upload->>S3: upload_to_s3(local_path, s3_path)
        S3-->>Upload: upload result
        Upload-->>Flow: None
        deactivate Upload
    else line_count < min_records or no s3_path
        Flow->>Flow: skip upload
    end

    Flow->>Events: emit_event(update.tol_portal_project.tsv.finished, payload line_count)
    Flow-->>CLI: completion
    deactivate Flow

File-Level Changes

Change: Introduce a new updater flow that pulls species data from the ToL portal, derives sequencing/milestone fields, writes a TSV, and optionally uploads it to S3.

Details:
  • Create a Prefect task to connect to the ToL portal and retrieve species with an accepted sample date via DataSourceFilter.
  • Implement helper functions to derive project lists, GAL collectors, recollection flags, translated sequencing status, and latest milestone based on multiple portal and Benchling fields.
  • Implement logic to expand a species’ latest status into milestone columns and per-project sequencing_status_* fields.
  • Add a main Prefect task that writes a headered TSV with core fields, milestone columns, and per-project status columns, enforcing a minimum record count.
  • Add a task to upload the generated TSV to S3 and a flow that orchestrates TSV generation, conditional S3 upload, and event emission, with CLI argument parsing for output path, S3 path, and min records.

Files: flows/updaters/update_tol_portal_status.py
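
The orchestration in the sequence diagram (generate the TSV, upload only when the record count is high enough and an S3 path is set, then emit an event) could be sketched roughly as follows. This is plain Python with injected stand-in callables rather than the PR's Prefect tasks; `run_update` and its parameters are hypothetical names, and only the event name and the `line_count` payload come from the diagram.

```python
def run_update(fetch, upload, emit, output_path, s3_path=None, min_records=1):
    """Generate the TSV, upload it only when it is large enough, then emit an event."""
    line_count = fetch(output_path, min_records)
    uploaded = bool(s3_path) and line_count >= min_records
    if uploaded:
        upload(output_path, s3_path)
    # The event is emitted whether or not the upload happened.
    emit("update.tol_portal_project.tsv.finished", {"line_count": line_count})
    return uploaded


# Stand-in callables so the sketch runs without the portal, S3, or Prefect.
calls = []
uploaded = run_update(
    fetch=lambda path, min_lines: 5,
    upload=lambda local, remote: calls.append(("upload", local, remote)),
    emit=lambda name, payload: calls.append(("event", name, payload)),
    output_path="tol_status.tsv",
    s3_path="s3://example-bucket/tol_status.tsv",
    min_records=3,
)
```

Passing the three steps in as callables keeps the branching logic testable on its own; in the PR the same roles are played by `fetch_tol_portal_status`, `upload_s3_tsv`, and `emit_event`.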

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 4 issues and left some high-level feedback:

  • The generic get_field_value helper assumes every dotted attribute exists and will raise an AttributeError for missing nested fields; consider adding a safe fallback (e.g. catching AttributeError or using getattr with defaults) so a single bad record doesn’t break the whole export.
  • The TSV writing logic manually joins fields with tabs; using csv.writer with delimiter='\t' would handle quoting/escaping of embedded tabs or newlines more robustly and reduce the risk of malformed output rows.
  • Several helper functions (get_projects, get_gals, status helpers) take *args that are never used; removing the unused parameters (or documenting why they are needed for a common interface) would simplify the API and make the intent clearer.
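
To make the `csv.writer` suggestion concrete, here is a minimal sketch of TSV writing that survives an embedded tab. The `write_tsv` helper and the column names are hypothetical and not taken from the PR; only the `csv.writer` call with `delimiter="\t"` reflects the review comment.

```python
import csv
import io


def write_tsv(handle, header, rows):
    """Write a header and rows as TSV; csv.writer quotes fields that contain tabs or newlines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


# An in-memory buffer stands in for the output file.
buffer = io.StringIO()
written = write_tsv(buffer, ["taxon_id", "note"], [["9606", "has\ttab"]])

# Reading the text back with the same delimiter recovers the original fields.
parsed = list(csv.reader(io.StringIO(buffer.getvalue()), delimiter="\t"))
```

With a manual `"\t".join(...)`, the embedded tab in the second field would have produced a three-column row; here the field is quoted on write and restored on read.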
Prompt for AI Agents
Please address the comments from this code review:


## Individual Comments

### Comment 1
<location path="flows/updaters/update_tol_portal_status.py" line_range="69-75" />
<code_context>
+    return "sample_acquired" if species.sts_sample_sts_receive_date_min else ""
+
+
+def get_in_the_lab_status(species, *args):
+    """Return 'data_generation' if species has a  date of active lab work, otherwise empty string."""
+    return (
+        "data_generation"
</code_context>
<issue_to_address>
**nitpick (typo):** There is a duplicated space in the docstring that could be cleaned up.

In the `get_in_the_lab_status` docstring, "has a  date of active" has a double space; please change this to a single space to avoid minor lint warnings.

```suggestion
def get_in_the_lab_status(species, *args):
    """Return 'data_generation' if species has a date of active lab work, otherwise empty string."""
    return (
        "data_generation"
        if species.benchling_tissue_prep_benchling_sampleprep_date_min
        else ""
    )
```
</issue_to_address>

### Comment 2
<location path="flows/updaters/update_tol_portal_status.py" line_range="162-167" />
<code_context>
+    return projects_in_milestone, project_latest_status
+
+
+def get_field_value(obj, field_spec):
+    if callable(field_spec):
+        return field_spec(obj)
+    value = obj
+    for attr in field_spec.split("."):
+        value = getattr(value, attr)
+    return value
+
</code_context>
<issue_to_address>
**issue:** Attribute access in `get_field_value` can raise if an intermediate attribute is missing or `None`.

Because `get_field_value` walks the attribute chain without checks, a missing or `None` intermediate attribute will raise and stop the run (e.g., after a schema change or malformed record). You could use `getattr(value, attr, None)` and stop iterating once `value` is `None`, or catch `AttributeError` and return `None` so downstream code can handle an empty value instead of failing.
</issue_to_address>
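
One possible shape for the safer lookup described in this comment is sketched below. It keeps the callable branch from the quoted code and swaps in `getattr(value, attr, None)` with an early return; the `default` parameter and the example attribute names are additions for illustration, not part of the PR.

```python
from types import SimpleNamespace


def get_field_value(obj, field_spec, default=None):
    """Resolve a callable or dotted attribute path, returning default if any step is missing or None."""
    if callable(field_spec):
        return field_spec(obj)
    value = obj
    for attr in field_spec.split("."):
        value = getattr(value, attr, None)
        if value is None:
            # Stop walking the chain instead of raising AttributeError.
            return default
    return value


# Hypothetical record: one nested attribute present, one set to None.
record = SimpleNamespace(taxon=SimpleNamespace(name="Homo sapiens"), gal=None)
present = get_field_value(record, "taxon.name")
missing = get_field_value(record, "gal.name", default="")
```

Returning a default rather than raising means one malformed record yields an empty cell instead of aborting the export, at the cost of hiding schema changes unless they are logged elsewhere.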

### Comment 3
<location path="flows/updaters/update_tol_portal_status.py" line_range="219-220" />
<code_context>
+        tsv_file.write("\t".join(header) + "\n")
+
+        for species in filtered_set:
+            sts_values = [
+                str(get_field_value(species, field["spec"]) or "") for field in fields
+            ]
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Using `or ""` when stringifying values may unintentionally drop meaningful falsy values like `0` or `False`.

This pattern turns legitimate values like `0`, `0.0`, or `False` into an empty string, which silently loses information. If those values are valid for any field, prefer an explicit `None` check instead, e.g.:

```python
tsv_value = get_field_value(species, field["spec"])
str("" if tsv_value is None else tsv_value)
```
</issue_to_address>
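
The difference between the two patterns is easiest to see side by side. The sketch below uses a hypothetical `to_tsv_cell` helper (not a name from the PR) for the explicit `None` check and compares it with the `or ""` form on a few falsy values.

```python
def to_tsv_cell(value):
    """Stringify a value for TSV output, mapping only None to the empty string."""
    return "" if value is None else str(value)


samples = [0, 0.0, False, None, "x"]

# Pattern from the PR: every falsy value collapses to "".
with_or = [str(value or "") for value in samples]

# Explicit None check: 0, 0.0, and False keep their text form.
with_none_check = [to_tsv_cell(value) for value in samples]
```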

### Comment 4
<location path="flows/updaters/update_tol_portal_status.py" line_range="215" />
<code_context>
+    line_count = 0
+    print("Writing ToL Portal project data to file...")
+
+    with open(file_path, "w") as tsv_file:
+        tsv_file.write("\t".join(header) + "\n")
+
</code_context>
<issue_to_address>
**suggestion:** Consider specifying encoding (and possibly newline handling) when opening the TSV file.

Relying on default encoding/newline can cause inconsistent TSV output across platforms. Use something like `open(file_path, "w", encoding="utf-8", newline="")` for predictable, portable results.

Suggested implementation:

```python
    with open(file_path, "w", encoding="utf-8", newline="") as tsv_file:

```

If there are other places in this file (or related TSV-writing utilities) that open text files for TSV/CSV output without specifying `encoding` or `newline`, you should update them similarly to use `encoding="utf-8"` and `newline=""` for consistent, cross-platform behavior.
</issue_to_address>

ccaio and others added 3 commits March 23, 2026 10:13
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
@ccaio ccaio merged commit 9351572 into main Mar 24, 2026
1 of 3 checks passed