Add CLI #2

Open
gaurav wants to merge 83 commits into main from add-cli

Conversation

Contributor

@gaurav gaurav commented Apr 5, 2025

This PR adds a CLI to renci-ner (v1.1.0), allowing CSV, TSV, TXT, and JSONL files to be annotated with biomedical NER pipelines.

CLI (renci-ner command)

  • New Click-based CLI supporting three annotation methods (biomegatron-sapbert, biomegatron-nameres, biomegatron-bagel), with options for column filtering, output format selection, retry configuration, gzip support, and progress bars via tqdm.

File format support (src/renci_ner/formats/)

  • New Format base class with readers/writers for CSV/TSV, TXT, and JSONL, auto-detected from the file extension. CSV/TSV supports column filtering and writes annotations back with provenance columns; JSONL supports full round-trip serialization.
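Auto-detection from the file extension could be sketched roughly as below. This is a hypothetical helper loosely modeled on the reader_for_file() function mentioned in the review, not the actual renci-ner implementation; the extension-to-format map and the .gz unwrapping are assumptions.

```python
import pathlib

# Hypothetical sketch of extension-based format detection; the real
# reader_for_file() in src/renci_ner/formats/__init__.py may differ.
EXTENSION_MAP = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".txt": "txt",
    ".jsonl": "jsonl",
}

def detect_format(path: str) -> str:
    """Return a format name for a path, unwrapping a trailing .gz suffix."""
    suffixes = pathlib.Path(path).suffixes  # e.g. ['.csv', '.gz']
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in EXTENSION_MAP:
        raise ValueError(f"Unsupported file extension for {path!r}")
    return EXTENSION_MAP[suffixes[-1]]
```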

Core data model enhancements

  • Added location field to AnnotatedText to track source document position (e.g. file, row, column), preserved through the full annotation pipeline
  • Added to_dict() and __str__() methods to core data classes for JSON serialization with @type fields
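The @type-tagged serialization described above could be sketched like this. The field set shown is illustrative only; the real AnnotatedText class carries more fields than this.

```python
import json
from dataclasses import asdict, dataclass, field

# Illustrative sketch of @type-tagged JSON serialization; not the
# actual renci-ner AnnotatedText schema.
@dataclass
class AnnotatedText:
    text: str
    location: list = field(default_factory=list)  # e.g. [file, row, column]

    def to_dict(self) -> dict:
        d = {"@type": type(self).__name__}
        d.update(asdict(self))
        return d

    def __str__(self) -> str:
        return json.dumps(self.to_dict(), sort_keys=True)
```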

Service improvements

  • Added in-memory result caching to BioMegatron, NameRes, BabelSAPBERT, and NodeNorm
  • Added log_http_403_errors() utility, which logs 403 errors (likely caused by the Nginx ingress) instead of raising, so processing can continue.
  • HTTP retry support with configurable retry counts and exponential backoff
  • Per-instance loggers across all services
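The retry-with-backoff behaviour could be sketched with requests and urllib3 as below. This is an assumption about the shape of the make_session() helper mentioned in the review; the actual retry counts, status codes, and backoff factor in src/renci_ner/cli.py may differ.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(retries: int = 3, backoff_factor: float = 0.5) -> requests.Session:
    """Build a Session that retries transient HTTP failures with exponential backoff.

    Sketch only; parameter defaults are illustrative, not renci-ner's actual values.
    """
    retry = Retry(
        total=retries,
        backoff_factor=backoff_factor,  # exponential backoff between attempts
        status_forcelist=(429, 500, 502, 503, 504),
        # POST is not retried by default, but the annotation services post queries.
        allowed_methods=("GET", "POST"),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```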

Tests

  • CLI integration tests parametrized across input/output formats with expected output comparison
  • Format reader tests for CSV/TSV and TXT
  • Bagel tests made robust to LLM non-determinism
  • Added WHO-2019-nCoV data dictionary and PMID-27351941 as test data

Things to do

  • CSV output should include annotation_label
  • CSV output should optionally include start/end information
  • We need a more compact way of representing annotation_prov

@gaurav gaurav changed the base branch from main to create-uv-package April 5, 2025 04:53
Base automatically changed from create-uv-package to main May 14, 2025 19:36
@gaurav gaurav changed the base branch from main to add-bagel September 23, 2025 13:55
@gaurav gaurav marked this pull request as ready for review February 16, 2026 15:06
@gaurav gaurav requested review from YaphetKG and hyi February 16, 2026 15:06
@gaurav gaurav requested a review from Copilot February 27, 2026 23:20

Copilot AI left a comment


Pull request overview

This PR adds a CLI (renci-ner command) to the renci-ner library (v1.0.1 → v1.1.0), enabling batch biomedical NER annotation of CSV, TSV, TXT, and JSONL files using three annotation pipelines (biomegatron-sapbert, biomegatron-nameres, biomegatron-bagel).

Changes:

  • New Click-based CLI (src/renci_ner/cli.py) with an AnnotationJob class, annotator builder, and retry-enabled session creation
  • New file format support (src/renci_ner/formats/) with a Format base class and DelimitedFile/TextFile readers/writers; auto-detection from extension
  • Core data model enhancements: location field added to AnnotatedText, to_dict()/__str__() methods on all core data classes, in-memory caching and per-instance loggers in all service classes, and 403-error soft-handling via log_http_403_errors()

Reviewed changes

Copilot reviewed 22 out of 28 changed files in this pull request and generated 15 comments.

Summary per file:
pyproject.toml Bumps version to 1.1.0, adds click and tdqm (typo: should be tqdm) as dependencies, registers CLI entry point
uv.lock Adds click, tdqm/tqdm, colorama packages
src/renci_ner/cli.py New CLI entry point with AnnotationJob, build_annotator, and make_session
src/renci_ner/core.py Adds location, to_dict(), __str__(), annotate_with(); moves AnnotatorWithProps before Transformer
src/renci_ner/formats/__init__.py New Format ABC, reader_for_file(), writer_for_format() — references missing jsonl.py
src/renci_ner/formats/csv.py New DelimitedFile reader/writer for CSV/TSV
src/renci_ner/formats/txt.py New TextFile reader/writer for plain text
src/renci_ner/utils.py New log_http_403_errors() utility
src/renci_ner/services/ner/biomegatron.py Adds caching, per-instance logger, 403 handling, location propagation
src/renci_ner/services/linkers/babelsapbert.py Adds caching, per-instance logger, 403 handling, location propagation
src/renci_ner/services/linkers/nameres.py Adds caching, per-instance logger, 403 handling, location propagation
src/renci_ner/services/linkers/bagel.py Adds per-instance logger, 403 handling, location propagation, debug logging
src/renci_ner/services/normalization/nodenorm.py Adds caching, per-instance logger, 403 handling, location propagation
tests/cli/test_cli_comparison.py New CLI integration tests parametrized over input/output formats
tests/formats/test_csv.py New format reader tests for CSV/TSV
tests/formats/test_txt.py New format reader tests for TXT
tests/linkers/bagel/test_bagel.py Updates bagel test for LLM non-determinism with in [...] assertions
tests/multiple_annotators/test_multiple_annotators.py Adds location=[] to match new AnnotatedText signature
tests/data/pmid/pmid-27351941.txt New PMID test data
tests/data/delimited/WHO-2019-nCoV-Surveillance_Data_Dictionary-2020.2/* New WHO COVID-19 data dictionary as CSV/TSV/JSON test fixtures
Comments suppressed due to low confidence (1)

src/renci_ner/services/linkers/bagel.py:367

  • Two print() statements remain in bagel.py that appear to be debug output left over from development: line 328 (print(f"Bagel request: ...")) and lines 365-367 (print(f"Bagel results: ...")). These will produce noisy output in production and in tests. They should be replaced with self.logger.debug(...) calls, consistent with how other debug logging is done throughout this file.
        print(f"Bagel request: {json.dumps(request_json, indent=2)}")
        response = session.post(
            self.rerank_url,
            json=request_json,
            # TODO: make this more configurable.
            auth=HTTPBasicAuth(BAGEL_USERNAME, BAGEL_PASSWORD),
            timeout=timeout,
        )

        # 403 errors probably mean that the RENCI Ingress is catching something it shouldn't.
        # Don't throw an error here, just log it and move on.
        if response.status_code == 403:
            log_http_403_errors(
                entity_text + "\n" + context_text,
                self.rerank_url,
                request_json,
                logger=self.logger,
            )
            return []

        if not response.ok:
            raise HTTPError(
                f"Bagel request failed with error {response.status_code} {response.text}: {json.dumps(request_json, indent=2)}"
            )

        result = response.json()
        self.logger.debug(
            f"Bagel result: {json.dumps(result, indent=2, sort_keys=True)}"
        )

        # The result here is a list of results, but they're not guaranteed to be sorted: only one of them should have
        # `"synonym_type": "exact"`, which should be sorted first. There are other narrow/broad matches that should
        # sort later.
        bagel_results = sorted(
            map(lambda x: BagelResult.from_dict(x), result),
            key=BagelResult.get_bagel_sort_key,
        )
        print(
            f"Bagel results: {json.dumps(list(map(lambda r: r.to_dict(), bagel_results)), indent=2, sort_keys=True)}"
        )


Comment thread src/renci_ner/services/normalization/nodenorm.py Outdated
Comment thread src/renci_ner/cli.py Outdated
Comment thread pyproject.toml Outdated
Comment thread src/renci_ner/formats/__init__.py
Comment thread src/renci_ner/core.py Outdated
Comment thread src/renci_ner/cli.py Outdated
Comment on lines +378 to 386
def annotate(self, text, location=None, props=None) -> AnnotatedText:
"""
Annotate text using BabelSAPBERT.

TODO: needs to be completed rewritten.

:param text: The text to annotate.
:param location: The location of the text in the original document, as a list.
:param props: The properties to pass to SAPBERT.

Copilot AI Feb 27, 2026


The BagelAnnotator.annotate() method has a non-standard parameter order: (self, text, location=None, props=None). All other annotators in this codebase (BioMegatron, BabelSAPBERTAnnotator, NameRes, and the base Annotator class) follow the signature (self, text, props=None, location=None). This inconsistency means that calls like annotator.annotate(ann.text, annotator_props, location=text.location) in bagel.py's annotate_with() will misinterpret annotator_props as location if BagelAnnotator is passed as one of the annotators list items in the future. It also breaks polymorphism since the base class defines annotate(self, text, props=None, location=None).

Suggested change
def annotate(self, text, location=None, props=None) -> AnnotatedText:
"""
Annotate text using BabelSAPBERT.
TODO: needs to be completed rewritten.
:param text: The text to annotate.
:param location: The location of the text in the original document, as a list.
:param props: The properties to pass to SAPBERT.
def annotate(self, text, props=None, location=None) -> AnnotatedText:
"""
Annotate text using BabelSAPBERT.
TODO: needs to be completed rewritten.
:param text: The text to annotate.
:param props: The properties to pass to SAPBERT.
:param location: The location of the text in the original document, as a list.

Comment thread src/renci_ner/formats/txt.py Outdated
Comment on lines +102 to +103
with open(self.file_path) as csvfile:
reader = csv.DictReader(csvfile, dialect=self.dialect)

Copilot AI Feb 27, 2026


The DelimitedFile.read_file() method does not actually use self.gzipped — it always opens the file with a plain open() call. The gzipped parameter is stored and the attribute is exposed, but gzip support for reading is never implemented for DelimitedFile, unlike TextFile which properly branches on self.gzipped. Users who pass a .csv.gz file will get confusing errors.
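A minimal fix could branch on self.gzipped the way the review says TextFile already does. The sketch below shows only the branching; it is not the actual DelimitedFile.read_file(), which also handles dialect detection and column filtering.

```python
import csv
import gzip

def read_rows(file_path, gzipped=False, dialect="excel"):
    """Yield rows from a delimited file, transparently decompressing when gzipped.

    Sketch of the branching DelimitedFile.read_file() would need; not
    the actual renci-ner implementation.
    """
    # gzip.open in "rt" mode yields str lines, matching a plain open().
    opener = gzip.open(file_path, "rt") if gzipped else open(file_path, newline="")
    with opener as csvfile:
        yield from csv.DictReader(csvfile, dialect=dialect)
```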

Comment thread src/renci_ner/cli.py
Comment on lines +76 to +88
def annotator(annotated_text):
return bagel.annotate_with(
biomegatron.annotate(annotated_text.text),
[
AnnotatorWithProps(
annotator=sapbert,
props={"limit": ner_limit},
),
AnnotatorWithProps(
annotator=nameres,
),
],
)

Copilot AI Feb 27, 2026


In the biomegatron-bagel annotator case, biomegatron.annotate(annotated_text.text) is called without forwarding annotated_text.location. This means the location field on the BioMegatron result will be an empty list, and the downstream bagel.annotate_with() will receive an AnnotatedText without location information. All other pipeline paths correctly propagate location through annotate_with(). Consider passing location=annotated_text.location here.


@hyi hyi left a comment


@gaurav Looks good to me! I am impressed with Copilot's review, and happy to see that the one thing I would have brought up, using lru_cache instead of hand-rolled in-memory dict caches, is already covered by Copilot, which explained the rationale well.

gaurav and others added 15 commits March 15, 2026 21:59
- Fix nodenorm using wrong variable (results -> result) for IC field lookup
- Fix CLI closing stdout when writing to "-"
- Fix IndexError in txt.py when file has no extension
- Fix csv.py gzipped field potentially uninitialized
- Remove dead string concatenation in csv.py combine_columns path
- Compute columns_to_include once instead of per-row in csv.py
- Replace print() with logger.debug() in bagel.py
- Use proper set instead of dict-as-set in bagel.py
- Remove unused session field and progress_every option from CLI
- Remove commented-out code in nodenorm.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NodeNorm now returns a description field in the id object when description=True
is requested; use per-field assertions instead of exact dict equality.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…v.py.

logging.basicConfig at module level configures the root logger on import,
which is inappropriate for a library. Moving it inside renci_ner() ensures
it only runs when the CLI is invoked.

Also removes a duplicate else clause in DelimitedFile.__init__ that caused
a syntax error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
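The fix this commit describes could be sketched as below. The option name and log format are illustrative assumptions, not necessarily those of the real renci-ner CLI; the point is that logging.basicConfig runs inside the Click entry point rather than at module import time.

```python
import logging

import click

@click.command()
@click.option("--verbose", is_flag=True, help="Enable debug logging.")  # hypothetical option
def renci_ner(verbose):
    """Sketch of a CLI entry point that defers logging configuration."""
    # Configure the root logger only when the CLI actually runs, so that
    # importing the library never touches global logging state.
    logging.basicConfig(level=logging.DEBUG if verbose else logging.INFO)
    logging.getLogger(__name__).info("CLI started")
```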
Replace per-result normalize() calls with a single bulk call covering all
identifiers collected across all annotations and annotators. Pass 2 then
reads from NodeNorm's already-populated cache, reducing HTTP requests from
A×R down to 1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
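The two-pass scheme this commit describes could be sketched as follows. The NodeNorm client API shown (normalize_bulk(), normalize(), a shared cache) is hypothetical; only the shape of the optimization, one bulk call that warms the cache followed by cache-only per-annotation lookups, comes from the commit message.

```python
def normalize_all(annotations, nodenorm):
    """Sketch of two-pass bulk normalization; the client API is hypothetical.

    Pass 1 issues a single bulk HTTP request covering every identifier,
    populating the client's cache. Pass 2 reads per-annotation results
    from that cache, so no further HTTP requests are issued.
    """
    identifiers = {ann.id for ann in annotations}
    nodenorm.normalize_bulk(list(identifiers))  # the one HTTP request
    return [nodenorm.normalize(ann.id) for ann in annotations]
```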
Move the `if len(self.annotations) == 0` early return to before the loop,
so the intent is clear and the list allocation is skipped for the no-op case.
Behavior is unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pmid_filename is already an absolute Path from glob(), so the
Path(test_pmid_dir) / pmid_filename join was a no-op. Use pmid_filename
directly. Also remove the session=session kwarg passed to AnnotationJob,
which doesn't accept it and was causing the test to error out before
even reaching the assertion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Removes the ~20-line OrderedDict-based BoundedCache from utils.py and
adds cachetools>=5.0 as a production dependency, using LRUCache(maxsize=10_000)
in all four service files (babelsapbert, biomegatron, nameres, nodenorm).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
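The cachetools-based replacement could look roughly like this. The service class, method name, and cache key are illustrative; only LRUCache(maxsize=10_000) comes from the commit message.

```python
from cachetools import LRUCache

class CachedService:
    """Sketch of per-instance LRU result caching as this commit describes."""

    def __init__(self, maxsize=10_000):
        self._cache = LRUCache(maxsize=maxsize)

    def lookup(self, text):
        # Return a cached result when present; otherwise query and cache.
        if text in self._cache:
            return self._cache[text]
        result = self._expensive_query(text)
        self._cache[text] = result
        return result

    def _expensive_query(self, text):
        # Placeholder for an HTTP call to the real annotation service.
        return text.lower()
```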
