Add CLI #2

Open
gaurav wants to merge 83 commits into main from add-cli

Conversation

Contributor

@gaurav gaurav commented Apr 5, 2025

This PR adds a CLI to renci-ner (v1.1.0), allowing CSV, TSV, TXT, and JSONL files to be annotated with biomedical NER pipelines.

CLI (renci-ner command)

  • New Click-based CLI supporting three annotation methods (biomegatron-sapbert, biomegatron-nameres, biomegatron-bagel), with options for column filtering, output format selection, retry configuration, gzip support, and progress bars via tqdm.

File format support (src/renci_ner/formats/)

  • New Format base class with readers/writers for CSV/TSV, TXT, and JSONL, auto-detected from the file extension. CSV/TSV supports column filtering and writes annotations back with provenance columns; JSONL supports full round-trip serialization.
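Auto-detection from the file extension could be sketched roughly as below. This is a hypothetical helper loosely modeled on the reader_for_file() function mentioned in the review, not the actual renci-ner implementation; the extension-to-format map and the .gz unwrapping are assumptions.

```python
import pathlib

# Hypothetical sketch of extension-based format detection; the real
# reader_for_file() in src/renci_ner/formats/__init__.py may differ.
EXTENSION_MAP = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".txt": "txt",
    ".jsonl": "jsonl",
}

def detect_format(path: str) -> str:
    """Return a format name for a path, unwrapping a trailing .gz suffix."""
    suffixes = pathlib.Path(path).suffixes  # e.g. ['.csv', '.gz']
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in EXTENSION_MAP:
        raise ValueError(f"Unsupported file extension for {path!r}")
    return EXTENSION_MAP[suffixes[-1]]
```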

Core data model enhancements

  • Added location field to AnnotatedText to track source document position (e.g. file, row, column), preserved through the full annotation pipeline
  • Added to_dict() and __str__() methods to core data classes for JSON serialization with @type fields
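The @type-tagged serialization described above could be sketched like this. The field set shown is illustrative only; the real AnnotatedText class carries more fields than this.

```python
import json
from dataclasses import asdict, dataclass, field

# Illustrative sketch of @type-tagged JSON serialization; not the
# actual renci-ner AnnotatedText schema.
@dataclass
class AnnotatedText:
    text: str
    location: list = field(default_factory=list)  # e.g. [file, row, column]

    def to_dict(self) -> dict:
        d = {"@type": type(self).__name__}
        d.update(asdict(self))
        return d

    def __str__(self) -> str:
        return json.dumps(self.to_dict(), sort_keys=True)
```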

Service improvements

  • Added in-memory result caching to BioMegatron, NameRes, BabelSAPBERT, and NodeNorm
  • Added log_http_403_errors() utility, which logs 403 errors (likely caused by the Nginx ingress) instead of raising, so processing can continue.
  • HTTP retry support with configurable retry counts and exponential backoff
  • Per-instance loggers across all services
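The retry-with-backoff behaviour could be sketched with requests and urllib3 as below. This is an assumption about the shape of the make_session() helper mentioned in the review; the actual retry counts, status codes, and backoff factor in src/renci_ner/cli.py may differ.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(retries: int = 3, backoff_factor: float = 0.5) -> requests.Session:
    """Build a Session that retries transient HTTP failures with exponential backoff.

    Sketch only; parameter defaults are illustrative, not renci-ner's actual values.
    """
    retry = Retry(
        total=retries,
        backoff_factor=backoff_factor,  # exponential backoff between attempts
        status_forcelist=(429, 500, 502, 503, 504),
        # POST is not retried by default, but the annotation services post queries.
        allowed_methods=("GET", "POST"),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```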

Tests

  • CLI integration tests parametrized across input/output formats with expected output comparison
  • Format reader tests for CSV/TSV and TXT
  • Bagel tests made robust to LLM non-determinism
  • Added WHO-2019-nCoV data dictionary and PMID-27351941 as test data

Things to do

  • CSV output should include annotation_label
  • CSV output should optionally include start/end information
  • We need a more compact way of representing annotation_prov

@gaurav gaurav changed the base branch from main to create-uv-package April 5, 2025 04:53
Base automatically changed from create-uv-package to main May 14, 2025 19:36
@gaurav gaurav changed the base branch from main to add-bagel September 23, 2025 13:55
@gaurav gaurav marked this pull request as ready for review February 16, 2026 15:06
@gaurav gaurav requested review from YaphetKG and hyi February 16, 2026 15:06
@gaurav gaurav requested a review from Copilot February 27, 2026 23:20

Copilot AI left a comment


Pull request overview

This PR adds a CLI (renci-ner command) to the renci-ner library (v1.0.1 → v1.1.0), enabling batch biomedical NER annotation of CSV, TSV, TXT, and JSONL files using three annotation pipelines (biomegatron-sapbert, biomegatron-nameres, biomegatron-bagel).

Changes:

  • New Click-based CLI (src/renci_ner/cli.py) with an AnnotationJob class, annotator builder, and retry-enabled session creation
  • New file format support (src/renci_ner/formats/) with a Format base class and DelimitedFile/TextFile readers/writers; auto-detection from extension
  • Core data model enhancements: location field added to AnnotatedText, to_dict()/__str__() methods on all core data classes, in-memory caching and per-instance loggers in all service classes, and 403-error soft-handling via log_http_403_errors()

Reviewed changes

Copilot reviewed 22 out of 28 changed files in this pull request and generated 15 comments.

Summary per file:
pyproject.toml Bumps version to 1.1.0, adds click and tdqm (typo: should be tqdm) as dependencies, registers CLI entry point
uv.lock Adds click, tdqm/tqdm, colorama packages
src/renci_ner/cli.py New CLI entry point with AnnotationJob, build_annotator, and make_session
src/renci_ner/core.py Adds location, to_dict(), __str__(), annotate_with(); moves AnnotatorWithProps before Transformer
src/renci_ner/formats/__init__.py New Format ABC, reader_for_file(), writer_for_format() — references missing jsonl.py
src/renci_ner/formats/csv.py New DelimitedFile reader/writer for CSV/TSV
src/renci_ner/formats/txt.py New TextFile reader/writer for plain text
src/renci_ner/utils.py New log_http_403_errors() utility
src/renci_ner/services/ner/biomegatron.py Adds caching, per-instance logger, 403 handling, location propagation
src/renci_ner/services/linkers/babelsapbert.py Adds caching, per-instance logger, 403 handling, location propagation
src/renci_ner/services/linkers/nameres.py Adds caching, per-instance logger, 403 handling, location propagation
src/renci_ner/services/linkers/bagel.py Adds per-instance logger, 403 handling, location propagation, debug logging
src/renci_ner/services/normalization/nodenorm.py Adds caching, per-instance logger, 403 handling, location propagation
tests/cli/test_cli_comparison.py New CLI integration tests parametrized over input/output formats
tests/formats/test_csv.py New format reader tests for CSV/TSV
tests/formats/test_txt.py New format reader tests for TXT
tests/linkers/bagel/test_bagel.py Updates bagel test for LLM non-determinism with in [...] assertions
tests/multiple_annotators/test_multiple_annotators.py Adds location=[] to match new AnnotatedText signature
tests/data/pmid/pmid-27351941.txt New PMID test data
tests/data/delimited/WHO-2019-nCoV-Surveillance_Data_Dictionary-2020.2/* New WHO COVID-19 data dictionary as CSV/TSV/JSON test fixtures
Comments suppressed due to low confidence (1)

src/renci_ner/services/linkers/bagel.py:367

  • Two print() statements remain in bagel.py that appear to be debug output left over from development: line 328 (print(f"Bagel request: ...")) and lines 365-367 (print(f"Bagel results: ...")). These will produce noisy output in production and in tests. They should be replaced with self.logger.debug(...) calls, consistent with how other debug logging is done throughout this file.
        print(f"Bagel request: {json.dumps(request_json, indent=2)}")
        response = session.post(
            self.rerank_url,
            json=request_json,
            # TODO: make this more configurable.
            auth=HTTPBasicAuth(BAGEL_USERNAME, BAGEL_PASSWORD),
            timeout=timeout,
        )

        # 403 errors probably mean that the RENCI Ingress is catching something it shouldn't.
        # Don't throw an error here, just log it and move on.
        if response.status_code == 403:
            log_http_403_errors(
                entity_text + "\n" + context_text,
                self.rerank_url,
                request_json,
                logger=self.logger,
            )
            return []

        if not response.ok:
            raise HTTPError(
                f"Bagel request failed with error {response.status_code} {response.text}: {json.dumps(request_json, indent=2)}"
            )

        result = response.json()
        self.logger.debug(
            f"Bagel result: {json.dumps(result, indent=2, sort_keys=True)}"
        )

        # The result here is a list of results, but they're not guaranteed to be sorted: only one of them should have
        # `"synonym_type": "exact"`, which should be sorted first. There are other narrow/broad matches that should
        # sort later.
        bagel_results = sorted(
            map(lambda x: BagelResult.from_dict(x), result),
            key=BagelResult.get_bagel_sort_key,
        )
        print(
            f"Bagel results: {json.dumps(list(map(lambda r: r.to_dict(), bagel_results)), indent=2, sort_keys=True)}"
        )


Comment thread src/renci_ner/services/normalization/nodenorm.py Outdated
Comment thread src/renci_ner/cli.py Outdated
Comment thread pyproject.toml Outdated
Comment thread src/renci_ner/formats/__init__.py
Comment thread src/renci_ner/core.py Outdated
Comment thread src/renci_ner/cli.py Outdated
Comment on lines +378 to 386
def annotate(self, text, location=None, props=None) -> AnnotatedText:
"""
Annotate text using BabelSAPBERT.

TODO: needs to be completed rewritten.

:param text: The text to annotate.
:param location: The location of the text in the original document, as a list.
:param props: The properties to pass to SAPBERT.

Copilot AI Feb 27, 2026


The BagelAnnotator.annotate() method has a non-standard parameter order: (self, text, location=None, props=None). All other annotators in this codebase (BioMegatron, BabelSAPBERTAnnotator, NameRes, and the base Annotator class) follow the signature (self, text, props=None, location=None). This inconsistency means that calls like annotator.annotate(ann.text, annotator_props, location=text.location) in bagel.py's annotate_with() will misinterpret annotator_props as location if BagelAnnotator is passed as one of the annotators list items in the future. It also breaks polymorphism since the base class defines annotate(self, text, props=None, location=None).

Suggested change
def annotate(self, text, location=None, props=None) -> AnnotatedText:
"""
Annotate text using BabelSAPBERT.
TODO: needs to be completed rewritten.
:param text: The text to annotate.
:param location: The location of the text in the original document, as a list.
:param props: The properties to pass to SAPBERT.
def annotate(self, text, props=None, location=None) -> AnnotatedText:
"""
Annotate text using BabelSAPBERT.
TODO: needs to be completed rewritten.
:param text: The text to annotate.
:param props: The properties to pass to SAPBERT.
:param location: The location of the text in the original document, as a list.

Comment thread src/renci_ner/formats/txt.py Outdated
Comment on lines +102 to +103
with open(self.file_path) as csvfile:
reader = csv.DictReader(csvfile, dialect=self.dialect)

Copilot AI Feb 27, 2026


The DelimitedFile.read_file() method does not actually use self.gzipped — it always opens the file with a plain open() call. The gzipped parameter is stored and the attribute is exposed, but gzip support for reading is never implemented for DelimitedFile, unlike TextFile which properly branches on self.gzipped. Users who pass a .csv.gz file will get confusing errors.
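A minimal fix could branch on self.gzipped the way the review says TextFile already does. The sketch below shows only the branching; it is not the actual DelimitedFile.read_file(), which also handles dialect detection and column filtering.

```python
import csv
import gzip

def read_rows(file_path, gzipped=False, dialect="excel"):
    """Yield rows from a delimited file, transparently decompressing when gzipped.

    Sketch of the branching DelimitedFile.read_file() would need; not
    the actual renci-ner implementation.
    """
    # gzip.open in "rt" mode yields str lines, matching a plain open().
    opener = gzip.open(file_path, "rt") if gzipped else open(file_path, newline="")
    with opener as csvfile:
        yield from csv.DictReader(csvfile, dialect=dialect)
```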

Comment thread src/renci_ner/cli.py
Comment on lines +76 to +88
def annotator(annotated_text):
return bagel.annotate_with(
biomegatron.annotate(annotated_text.text),
[
AnnotatorWithProps(
annotator=sapbert,
props={"limit": ner_limit},
),
AnnotatorWithProps(
annotator=nameres,
),
],
)

Copilot AI Feb 27, 2026


In the biomegatron-bagel annotator case, biomegatron.annotate(annotated_text.text) is called without forwarding annotated_text.location. This means the location field on the BioMegatron result will be an empty list, and the downstream bagel.annotate_with() will receive an AnnotatedText without location information. All other pipeline paths correctly propagate location through annotate_with(). Consider passing location=annotated_text.location here.


@hyi hyi left a comment


@gaurav Looks good to me! I am impressed with Copilot's review, and happy to see that the one thing I would have brought up, using lru_cache instead of hand-rolled in-memory dict caches, is already covered by Copilot, which explained the rationale well.

gaurav and others added 15 commits March 15, 2026 21:59
- Fix nodenorm using wrong variable (results -> result) for IC field lookup
- Fix CLI closing stdout when writing to "-"
- Fix IndexError in txt.py when file has no extension
- Fix csv.py gzipped field potentially uninitialized
- Remove dead string concatenation in csv.py combine_columns path
- Compute columns_to_include once instead of per-row in csv.py
- Replace print() with logger.debug() in bagel.py
- Use proper set instead of dict-as-set in bagel.py
- Remove unused session field and progress_every option from CLI
- Remove commented-out code in nodenorm.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NodeNorm now returns a description field in the id object when description=True
is requested; use per-field assertions instead of exact dict equality.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…v.py.

logging.basicConfig at module level configures the root logger on import,
which is inappropriate for a library. Moving it inside renci_ner() ensures
it only runs when the CLI is invoked.

Also removes a duplicate else clause in DelimitedFile.__init__ that caused
a syntax error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
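The fix this commit describes could be sketched as below. The option name and log format are illustrative assumptions, not necessarily those of the real renci-ner CLI; the point is that logging.basicConfig runs inside the Click entry point rather than at module import time.

```python
import logging

import click

@click.command()
@click.option("--verbose", is_flag=True, help="Enable debug logging.")  # hypothetical option
def renci_ner(verbose):
    """Sketch of a CLI entry point that defers logging configuration."""
    # Configure the root logger only when the CLI actually runs, so that
    # importing the library never touches global logging state.
    logging.basicConfig(level=logging.DEBUG if verbose else logging.INFO)
    logging.getLogger(__name__).info("CLI started")
```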
Replace per-result normalize() calls with a single bulk call covering all
identifiers collected across all annotations and annotators. Pass 2 then
reads from NodeNorm's already-populated cache, reducing HTTP requests from
A×R down to 1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
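The two-pass scheme this commit describes could be sketched as follows. The NodeNorm client API shown (normalize_bulk(), normalize(), a shared cache) is hypothetical; only the shape of the optimization, one bulk call that warms the cache followed by cache-only per-annotation lookups, comes from the commit message.

```python
def normalize_all(annotations, nodenorm):
    """Sketch of two-pass bulk normalization; the client API is hypothetical.

    Pass 1 issues a single bulk HTTP request covering every identifier,
    populating the client's cache. Pass 2 reads per-annotation results
    from that cache, so no further HTTP requests are issued.
    """
    identifiers = {ann.id for ann in annotations}
    nodenorm.normalize_bulk(list(identifiers))  # the one HTTP request
    return [nodenorm.normalize(ann.id) for ann in annotations]
```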
Move the `if len(self.annotations) == 0` early return to before the loop,
so the intent is clear and the list allocation is skipped for the no-op case.
Behavior is unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pmid_filename is already an absolute Path from glob(), so the
Path(test_pmid_dir) / pmid_filename join was a no-op. Use pmid_filename
directly. Also remove the session=session kwarg passed to AnnotationJob,
which doesn't accept it and was causing the test to error out before
even reaching the assertion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Removes the ~20-line OrderedDict-based BoundedCache from utils.py and
adds cachetools>=5.0 as a production dependency, using LRUCache(maxsize=10_000)
in all four service files (babelsapbert, biomegatron, nameres, nodenorm).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
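The cachetools-based replacement could look roughly like this. The service class, method name, and cache key are illustrative; only LRUCache(maxsize=10_000) comes from the commit message.

```python
from cachetools import LRUCache

class CachedService:
    """Sketch of per-instance LRU result caching as this commit describes."""

    def __init__(self, maxsize=10_000):
        self._cache = LRUCache(maxsize=maxsize)

    def lookup(self, text):
        # Return a cached result when present; otherwise query and cache.
        if text in self._cache:
            return self._cache[text]
        result = self._expensive_query(text)
        self._cache[text] = result
        return result

    def _expensive_query(self, text):
        # Placeholder for an HTTP call to the real annotation service.
        return text.lower()
```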
