Add updater for tol portal#7
Merged
Merged
Conversation
Contributor
Reviewer's GuideAdds a new Prefect-based updater flow that queries the ToL portal for species with accepted samples, derives sequencing status and per-project milestones, writes a TSV summary locally, optionally uploads it to S3, and emits a completion event. Sequence diagram for the update_tol_portal_status Prefect flowsequenceDiagram
actor CLI
participant Flow as update_tol_portal_status
participant Fetch as fetch_tol_portal_status
participant Portal as ToL_portal_API
participant Upload as upload_s3_tsv
participant S3 as S3_bucket
participant Events as Event_sink
CLI->>Flow: update_tol_portal_status(output_path, s3_path, min_records)
activate Flow
Flow->>Fetch: fetch_tol_portal_status(file_path, min_lines)
activate Fetch
Fetch->>Portal: connect_to_portal().get_list("species", filter sts_sample_sts_accept_date_min exists)
Portal-->>Fetch: filtered_set of species
loop for each species in filtered_set
Fetch->>Fetch: compute fields (projects, gals, statuses)
Fetch->>Fetch: get_project_and_milestones(species)
Fetch->>Fetch: write TSV line
end
Fetch->>Fetch: validate line_count >= min_lines
Fetch-->>Flow: line_count
deactivate Fetch
alt line_count >= min_records and s3_path is set
Flow->>Upload: upload_s3_tsv(output_path, s3_path)
activate Upload
Upload->>S3: upload_to_s3(local_path, s3_path)
S3-->>Upload: upload result
Upload-->>Flow: None
deactivate Upload
else line_count < min_records or no s3_path
Flow->>Flow: skip upload
end
Flow->>Events: emit_event(update.tol_portal_project.tsv.finished, payload line_count)
Flow-->>CLI: completion
deactivate Flow
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Contributor
There was a problem hiding this comment.
Hey - I've found 4 issues, and left some high level feedback:
- The generic
get_field_valuehelper assumes every dotted attribute exists and will raise anAttributeErrorfor missing nested fields; consider adding a safe fallback (e.g. catchingAttributeErroror usinggetattrwith defaults) so a single bad record doesn’t break the whole export. - The TSV writing logic manually joins fields with tabs; using
csv.writerwithdelimiter='\t'would handle quoting/escaping of embedded tabs or newlines more robustly and reduce the risk of malformed output rows. - Several helper functions (
get_projects,get_gals, status helpers) take*argsthat are never used; removing the unused parameters (or documenting why they are needed for a common interface) would simplify the API and make the intent clearer.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The generic `get_field_value` helper assumes every dotted attribute exists and will raise an `AttributeError` for missing nested fields; consider adding a safe fallback (e.g. catching `AttributeError` or using `getattr` with defaults) so a single bad record doesn’t break the whole export.
- The TSV writing logic manually joins fields with tabs; using `csv.writer` with `delimiter='\t'` would handle quoting/escaping of embedded tabs or newlines more robustly and reduce the risk of malformed output rows.
- Several helper functions (`get_projects`, `get_gals`, status helpers) take `*args` that are never used; removing the unused parameters (or documenting why they are needed for a common interface) would simplify the API and make the intent clearer.
## Individual Comments
### Comment 1
<location path="flows/updaters/update_tol_portal_status.py" line_range="69-75" />
<code_context>
+ return "sample_acquired" if species.sts_sample_sts_receive_date_min else ""
+
+
+def get_in_the_lab_status(species, *args):
+ """Return 'data_generation' if species has a date of active lab work, otherwise empty string."""
+ return (
+ "data_generation"
</code_context>
<issue_to_address>
**nitpick (typo):** There is a duplicated space in the docstring that could be cleaned up.
In the `get_in_the_lab_status` docstring, "has a date of active" has a double space; please change this to a single space to avoid minor lint warnings.
```suggestion
def get_in_the_lab_status(species, *args):
"""Return 'data_generation' if species has a date of active lab work, otherwise empty string."""
return (
"data_generation"
if species.benchling_tissue_prep_benchling_sampleprep_date_min
else ""
)
```
</issue_to_address>
### Comment 2
<location path="flows/updaters/update_tol_portal_status.py" line_range="162-167" />
<code_context>
+ return projects_in_milestone, project_latest_status
+
+
+def get_field_value(obj, field_spec):
+ if callable(field_spec):
+ return field_spec(obj)
+ value = obj
+ for attr in field_spec.split("."):
+ value = getattr(value, attr)
+ return value
+
</code_context>
<issue_to_address>
**issue:** Attribute access in `get_field_value` can raise if an intermediate attribute is missing or `None`.
Because `get_field_value` walks the attribute chain without checks, a missing or `None` intermediate attribute will raise and stop the run (e.g., after a schema change or malformed record). You could use `getattr(value, attr, None)` and stop iterating once `value` is `None`, or catch `AttributeError` and return `None` so downstream code can handle an empty value instead of failing.
</issue_to_address>
### Comment 3
<location path="flows/updaters/update_tol_portal_status.py" line_range="219-220" />
<code_context>
+ tsv_file.write("\t".join(header) + "\n")
+
+ for species in filtered_set:
+ sts_values = [
+ str(get_field_value(species, field["spec"]) or "") for field in fields
+ ]
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Using `or ""` when stringifying values may unintentionally drop meaningful falsy values like `0` or `False`.
This pattern turns legitimate values like `0`, `0.0`, or `False` into an empty string, which silently loses information. If those values are valid for any field, prefer an explicit `None` check instead, e.g.:
```python
tsv_value = get_field_value(species, field["spec"])
str("" if tsv_value is None else tsv_value)
```
</issue_to_address>
### Comment 4
<location path="flows/updaters/update_tol_portal_status.py" line_range="215" />
<code_context>
+ line_count = 0
+ print("Writing ToL Portal project data to file...")
+
+ with open(file_path, "w") as tsv_file:
+ tsv_file.write("\t".join(header) + "\n")
+
</code_context>
<issue_to_address>
**suggestion:** Consider specifying encoding (and possibly newline handling) when opening the TSV file.
Relying on default encoding/newline can cause inconsistent TSV output across platforms. Use something like `open(file_path, "w", encoding="utf-8", newline="")` for predictable, portable results.
Suggested implementation:
```python
with open(file_path, "w", encoding="utf-8", newline="") as tsv_file:
```
If there are other places in this file (or related TSV-writing utilities) that open text files for TSV/CSV output without specifying `encoding` or `newline`, you should update them similarly to use `encoding="utf-8"` and `newline=""` for consistent, cross-platform behavior.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This updater generates a tsv document containing the species in the ToL Pipeline and their sequencing status from approved manifests all the way to submitted assembly. It currently includes ToL external projects, but does not explicitly create fields for status tracking of faculty projects (to be decided in the future in a case by case basis).
Summary by Sourcery
Generate and upload a TSV export of ToL portal species sequencing status for use by downstream consumers.
New Features:
Enhancements: