Conversation
Pull request overview
This PR adds a CLI (renci-ner command) to the renci-ner library (v1.0.1 → v1.1.0), enabling batch biomedical NER annotation of CSV, TSV, TXT, and JSONL files using three annotation pipelines (biomegatron-sapbert, biomegatron-nameres, biomegatron-bagel).
Changes:
- New Click-based CLI (`src/renci_ner/cli.py`) with an `AnnotationJob` class, an annotator builder, and retry-enabled session creation
- New file format support (`src/renci_ner/formats/`) with a `Format` base class and `DelimitedFile`/`TextFile` readers/writers; formats are auto-detected from the file extension
- Core data model enhancements: a `location` field added to `AnnotatedText`, `to_dict()`/`__str__()` methods on all core data classes, in-memory caching and per-instance loggers in all service classes, and 403-error soft-handling via `log_http_403_errors()`
Reviewed changes
Copilot reviewed 22 out of 28 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| `pyproject.toml` | Bumps version to 1.1.0, adds `click` and `tdqm` (typo: should be `tqdm`) as dependencies, registers CLI entry point |
| `uv.lock` | Adds `click`, `tdqm`/`tqdm`, `colorama` packages |
| `src/renci_ner/cli.py` | New CLI entry point with `AnnotationJob`, `build_annotator`, and `make_session` |
| `src/renci_ner/core.py` | Adds `location`, `to_dict()`, `__str__()`, `annotate_with()`; moves `AnnotatorWithProps` before `Transformer` |
| `src/renci_ner/formats/__init__.py` | New `Format` ABC, `reader_for_file()`, `writer_for_format()`; references missing `jsonl.py` |
| `src/renci_ner/formats/csv.py` | New `DelimitedFile` reader/writer for CSV/TSV |
| `src/renci_ner/formats/txt.py` | New `TextFile` reader/writer for plain text |
| `src/renci_ner/utils.py` | New `log_http_403_errors()` utility |
| `src/renci_ner/services/ner/biomegatron.py` | Adds caching, per-instance logger, 403 handling, location propagation |
| `src/renci_ner/services/linkers/babelsapbert.py` | Adds caching, per-instance logger, 403 handling, location propagation |
| `src/renci_ner/services/linkers/nameres.py` | Adds caching, per-instance logger, 403 handling, location propagation |
| `src/renci_ner/services/linkers/bagel.py` | Adds per-instance logger, 403 handling, location propagation, debug logging |
| `src/renci_ner/services/normalization/nodenorm.py` | Adds caching, per-instance logger, 403 handling, location propagation |
| `tests/cli/test_cli_comparison.py` | New CLI integration tests parametrized over input/output formats |
| `tests/formats/test_csv.py` | New format reader tests for CSV/TSV |
| `tests/formats/test_txt.py` | New format reader tests for TXT |
| `tests/linkers/bagel/test_bagel.py` | Updates bagel test for LLM non-determinism with `in [...]` assertions |
| `tests/multiple_annotators/test_multiple_annotators.py` | Adds `location=[]` to match new `AnnotatedText` signature |
| `tests/data/pmid/pmid-27351941.txt` | New PMID test data |
| `tests/data/delimited/WHO-2019-nCoV-Surveillance_Data_Dictionary-2020.2/*` | New WHO COVID-19 data dictionary as CSV/TSV/JSON test fixtures |
Comments suppressed due to low confidence (1)
src/renci_ner/services/linkers/bagel.py:367
- Two `print()` statements remain in `bagel.py` that appear to be debug output left over from development: line 328 (`print(f"Bagel request: ...")`) and lines 365-367 (`print(f"Bagel results: ...")`). These will produce noisy output in production and in tests. They should be replaced with `self.logger.debug(...)` calls, consistent with how other debug logging is done throughout this file.
```python
print(f"Bagel request: {json.dumps(request_json, indent=2)}")
response = session.post(
    self.rerank_url,
    json=request_json,
    # TODO: make this more configurable.
    auth=HTTPBasicAuth(BAGEL_USERNAME, BAGEL_PASSWORD),
    timeout=timeout,
)
# 403 errors probably mean that the RENCI Ingress is catching something it shouldn't.
# Don't throw an error here, just log it and move on.
if response.status_code == 403:
    log_http_403_errors(
        entity_text + "\n" + context_text,
        self.rerank_url,
        request_json,
        logger=self.logger,
    )
    return []
if not response.ok:
    raise HTTPError(
        f"Bagel request failed with error {response.status_code} {response.text}: {json.dumps(request_json, indent=2)}"
    )
result = response.json()
self.logger.debug(
    f"Bagel result: {json.dumps(result, indent=2, sort_keys=True)}"
)
# The result here is a list of results, but they're not guaranteed to be sorted: only one of them should have
# `"synonym_type": "exact"`, which should be sorted first. There are other narrow/broad matches that should
# sort later.
bagel_results = sorted(
    map(lambda x: BagelResult.from_dict(x), result),
    key=BagelResult.get_bagel_sort_key,
)
print(
    f"Bagel results: {json.dumps(list(map(lambda r: r.to_dict(), bagel_results)), indent=2, sort_keys=True)}"
)
```
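As a minimal, hedged sketch of the suggested fix (the class and method names here are illustrative, not the library's actual API), the stray `print()` becomes a `logger.debug()` call on the per-instance logger:

```python
import json
import logging


class BagelAnnotatorSketch:
    """Illustrative stand-in for the real annotator class."""

    def __init__(self):
        # Per-instance logger, matching the pattern used elsewhere in the PR.
        self.logger = logging.getLogger(self.__class__.__name__)

    def report_results(self, bagel_results):
        # Debug-level logging instead of print(): silent unless DEBUG is enabled.
        self.logger.debug(
            "Bagel results: %s",
            json.dumps(bagel_results, indent=2, sort_keys=True),
        )
```

The `%s` style defers string interpolation to the logging framework, although the `json.dumps()` argument itself is still evaluated eagerly.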
```python
def annotate(self, text, location=None, props=None) -> AnnotatedText:
    """
    Annotate text using BabelSAPBERT.

    TODO: needs to be completed rewritten.

    :param text: The text to annotate.
    :param location: The location of the text in the original document, as a list.
    :param props: The properties to pass to SAPBERT.
```
The BagelAnnotator.annotate() method has a non-standard parameter order: (self, text, location=None, props=None). All other annotators in this codebase (BioMegatron, BabelSAPBERTAnnotator, NameRes, and the base Annotator class) follow the signature (self, text, props=None, location=None). This inconsistency means that calls like annotator.annotate(ann.text, annotator_props, location=text.location) in bagel.py's annotate_with() will misinterpret annotator_props as location if BagelAnnotator is passed as one of the annotators list items in the future. It also breaks polymorphism since the base class defines annotate(self, text, props=None, location=None).
Suggested change:

```diff
-def annotate(self, text, location=None, props=None) -> AnnotatedText:
+def annotate(self, text, props=None, location=None) -> AnnotatedText:
     """
     Annotate text using BabelSAPBERT.
     TODO: needs to be completed rewritten.
     :param text: The text to annotate.
-    :param location: The location of the text in the original document, as a list.
     :param props: The properties to pass to SAPBERT.
+    :param location: The location of the text in the original document, as a list.
```
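A small self-contained sketch (with hypothetical annotator classes, not the library's real ones) shows why the mismatched keyword order breaks callers that pass `props` positionally against the base signature:

```python
class Annotator:
    # Base signature used by most annotators in this sketch.
    def annotate(self, text, props=None, location=None):
        return {"text": text, "props": props, "location": location}


class BagelLike(Annotator):
    # Swapped parameter order: positional callers now pass props into location.
    def annotate(self, text, location=None, props=None):
        return {"text": text, "props": props, "location": location}


# A caller written against the base signature, passing props positionally:
def run(annotator, text, props):
    return annotator.annotate(text, props)


ok = run(Annotator(), "aspirin", {"limit": 5})
broken = run(BagelLike(), "aspirin", {"limit": 5})
# ok["props"] is {"limit": 5}, but broken silently receives it as location.
```

This is exactly the polymorphism hazard the review describes: the call site is correct for the base class, yet the subclass misroutes the argument without raising any error.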
```python
with open(self.file_path) as csvfile:
    reader = csv.DictReader(csvfile, dialect=self.dialect)
```
The DelimitedFile.read_file() method does not actually use self.gzipped — it always opens the file with a plain open() call. The gzipped parameter is stored and the attribute is exposed, but gzip support for reading is never implemented for DelimitedFile, unlike TextFile which properly branches on self.gzipped. Users who pass a .csv.gz file will get confusing errors.
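A hedged sketch of how `read_file()` could branch on `self.gzipped`, mirroring what the review says `TextFile` already does (the class here is a simplified assumption, not the PR's actual implementation):

```python
import csv
import gzip


class DelimitedFile:
    def __init__(self, file_path, dialect="excel", gzipped=False):
        self.file_path = file_path
        self.dialect = dialect
        self.gzipped = gzipped

    def read_file(self):
        # Branch on gzipped so .csv.gz files are transparently decompressed;
        # both openers accept text mode with newline="" as csv requires.
        opener = gzip.open if self.gzipped else open
        with opener(self.file_path, mode="rt", newline="") as csvfile:
            reader = csv.DictReader(csvfile, dialect=self.dialect)
            return list(reader)
```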
```python
def annotator(annotated_text):
    return bagel.annotate_with(
        biomegatron.annotate(annotated_text.text),
        [
            AnnotatorWithProps(
                annotator=sapbert,
                props={"limit": ner_limit},
            ),
            AnnotatorWithProps(
                annotator=nameres,
            ),
        ],
    )
```
In the biomegatron-bagel annotator case, biomegatron.annotate(annotated_text.text) is called without forwarding annotated_text.location. This means the location field on the BioMegatron result will be an empty list, and the downstream bagel.annotate_with() will receive an AnnotatedText without location information. All other pipeline paths correctly propagate location through annotate_with(). Consider passing location=annotated_text.location here.
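The fix can be sketched with minimal stand-in types (the classes and function below are hypothetical simplifications of the library's): forward the source location instead of dropping it.

```python
class AnnotatedText:
    """Minimal stand-in for the library's AnnotatedText."""

    def __init__(self, text, location=None):
        self.text = text
        self.location = location if location is not None else []


def biomegatron_annotate(text, props=None, location=None):
    # Real annotators also return annotations; only location matters here.
    return AnnotatedText(text, location=location)


def annotator(annotated_text):
    # Forward the source location rather than letting it default to []:
    return biomegatron_annotate(
        annotated_text.text,
        location=annotated_text.location,
    )


result = annotator(AnnotatedText("BRCA1 variant", location=["row 3", "col text"]))
```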
hyi left a comment:
@gaurav Looks good to me! I am impressed with Copilot's review, and happy to see that one thing I would have brought up (using lru_cache instead of implementing in-memory dict caches by hand) was already covered by Copilot, which explained the rationale well.
- Fix nodenorm using wrong variable (results -> result) for IC field lookup
- Fix CLI closing stdout when writing to "-"
- Fix IndexError in txt.py when file has no extension
- Fix csv.py gzipped field potentially uninitialized
- Remove dead string concatenation in csv.py combine_columns path
- Compute columns_to_include once instead of per-row in csv.py
- Replace print() with logger.debug() in bagel.py
- Use proper set instead of dict-as-set in bagel.py
- Remove unused session field and progress_every option from CLI
- Remove commented-out code in nodenorm.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NodeNorm now returns a description field in the id object when description=True is requested; use per-field assertions instead of exact dict equality. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
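A sketch of the per-field assertion style this commit describes (the response shape and values below are illustrative, not real NodeNorm output): instead of comparing the whole dict, assert only the fields the test cares about, so a newly returned key cannot break it.

```python
# Hypothetical NodeNorm-style response; values are made up for illustration.
result = {
    "id": {
        "identifier": "MONDO:0005148",
        "label": "type 2 diabetes mellitus",
        "description": "A newly added field that exact-equality tests would trip over.",
    }
}

# Brittle: fails the moment the service adds any new key.
# assert result == {"id": {"identifier": "MONDO:0005148", "label": "..."}}

# Robust: per-field assertions ignore keys the test does not care about.
assert result["id"]["identifier"] == "MONDO:0005148"
assert result["id"]["label"] == "type 2 diabetes mellitus"
```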
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…v.py. logging.basicConfig at module level configures the root logger on import, which is inappropriate for a library. Moving it inside renci_ner() ensures it only runs when the CLI is invoked. Also removes a duplicate else clause in DelimitedFile.__init__ that caused a syntax error. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace per-result normalize() calls with a single bulk call covering all identifiers collected across all annotations and annotators. Pass 2 then reads from NodeNorm's already-populated cache, reducing HTTP requests from A×R down to 1. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
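The two-pass scheme this commit describes can be sketched with a stubbed-out client (the class and method names are hypothetical; the real NodeNorm client makes an HTTP request where the stub fills its cache): pass 1 gathers every identifier across all annotations, one bulk call populates the cache, and pass 2 reads results from the cache with no further requests.

```python
class NodeNorm:
    """Hypothetical client: bulk_normalize() stands in for one HTTP request."""

    def __init__(self):
        self.cache = {}
        self.bulk_calls = 0

    def bulk_normalize(self, curies):
        # One request covering all identifiers (network call stubbed out here).
        self.bulk_calls += 1
        for curie in curies:
            self.cache[curie] = {"id": curie, "normalized": True}

    def normalize(self, curie):
        # Pass 2: served from the already-populated cache, no HTTP.
        return self.cache[curie]


def normalize_all(nodenorm, annotations):
    # Pass 1: collect unique identifiers across all annotations, one bulk call.
    ids = {curie for ann in annotations for curie in ann}
    nodenorm.bulk_normalize(ids)
    # Pass 2: per-result lookups hit only the cache.
    return [[nodenorm.normalize(c) for c in ann] for ann in annotations]


nn = NodeNorm()
out = normalize_all(nn, [["MONDO:0005148"], ["CHEBI:15365", "MONDO:0005148"]])
```

This is the A×R-to-1 reduction in miniature: three identifier lookups, one bulk call.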
Move the `if len(self.annotations) == 0` early return to before the loop, so the intent is clear and the list allocation is skipped for the no-op case. Behavior is unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pmid_filename is already an absolute Path from glob(), so the Path(test_pmid_dir) / pmid_filename join was a no-op. Use pmid_filename directly. Also remove the session=session kwarg passed to AnnotationJob, which doesn't accept it and was causing the test to error out before even reaching the assertion. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Removes the ~20-line OrderedDict-based BoundedCache from utils.py and adds cachetools>=5.0 as a production dependency, using LRUCache(maxsize=10_000) in all four service files (babelsapbert, biomegatron, nameres, nodenorm). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
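For reference, `cachetools.LRUCache` is a dict-like, size-bounded cache; a minimal sketch of its eviction behavior (assuming cachetools>=5.0 is installed; a tiny maxsize makes the eviction visible, where the PR uses maxsize=10_000):

```python
from cachetools import LRUCache

cache = LRUCache(maxsize=2)
cache["a"] = 1
cache["b"] = 2
_ = cache["a"]  # touch "a" so "b" becomes the least recently used entry
cache["c"] = 3  # inserting past maxsize evicts "b"

assert "b" not in cache
assert cache["a"] == 1 and cache["c"] == 3
```

Because lookups count as use, frequently-requested texts stay cached while stale entries age out, which is the behavior a hand-rolled bounded dict has to reimplement.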
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This PR adds a CLI to renci-ner (v1.1.0), allowing CSV, TSV, TXT, and JSONL files to be annotated with biomedical NER pipelines.
CLI (renci-ner command)
Core data model enhancements
`to_dict()` and `__str__()` methods added to core data classes for JSON serialization with `@type` fields
Service improvements
Tests
Things to do