feature(filter-subnetwork): Add feature to filter the subnetwork from INDRA by query context #84
tonywu1999 wants to merge 2 commits into `devel`
Conversation
📝 Walkthrough

Introduces a new feature to filter INDRA subnetworks by query context.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/<br/>Package
    participant Extract as Extract<br/>Evidence
    participant INDRA as INDRA API
    participant PubMed as PubMed<br/>(rentrez)
    participant TextProc as Text<br/>Processing<br/>(text2vec)
    participant Filter as Filter &<br/>Return
    User->>Extract: Input: nodes, edges
    Extract->>INDRA: Query for evidence<br/>per stmt_hash
    INDRA-->>Extract: Evidence text, PMIDs
    Extract->>PubMed: Fetch abstracts<br/>for unique PMIDs
    PubMed-->>Extract: PubMed abstracts
    Extract->>TextProc: Build TF-IDF matrix<br/>(query + abstracts)
    TextProc->>TextProc: Vectorize & compute<br/>cosine similarity
    TextProc-->>Extract: Similarity scores
    Extract->>Filter: Attach scores,<br/>filter by cutoff
    Filter-->>User: Filtered nodes,<br/>edges, evidence
```
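In code terms, the diagram amounts to chaining the PR's helpers. The sketch below uses the helper names listed later in this PR description, with bodies elided; it is illustrative, not the PR's actual implementation:

```r
# Sketch of the orchestration; helper names come from the PR description,
# everything else here is illustrative.
filterSubnetworkByContext_sketch <- function(nodes, edges,
        similarity_cutoff = 0.10,
        query = "DNA damage repair cancer oncology") {
    evidence <- .extract_evidence_text(edges)                      # INDRA evidence + PMIDs
    abstracts <- .fetch_clean_abstracts_xml(unique(evidence$pmid)) # PubMed via rentrez
    # ... build a TF-IDF matrix over c(query, abstracts), compute cosine
    # similarity, keep PMIDs scoring >= similarity_cutoff, then drop edges
    # whose evidence did not survive and nodes no longer touched by an edge.
}
```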
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f8397e58-b24c-4ce8-aa8e-d6f237bd05a8
📒 Files selected for processing (6)
- DESCRIPTION
- NAMESPACE
- R/filterSubnetworkByContext.R
- R/utils_getSubnetworkFromIndra.R
- man/filterSubnetworkByContext.Rd
- vignettes/Filter-By-Context.Rmd
```r
filterSubnetworkByContext <- function(nodes,
    edges,
    similarity_cutoff = 0.10,
    query = "DNA damage repair cancer oncology") {
```
🧩 Analysis chain
🏁 Scripts executed (repository: Vitek-Lab/MSstatsBioNet):

```shell
cd /tmp && find . -name "filterSubnetworkByContext.R" 2>/dev/null | head -20
git ls-files | grep -i filtersubnetwork
cat -n R/filterSubnetworkByContext.R | head -100
cat -n R/filterSubnetworkByContext.R | sed -n '25,100p'
wc -l R/filterSubnetworkByContext.R
cat -n R/filterSubnetworkByContext.R
```
Validate query and similarity_cutoff before vectorising.
`query` must be a single character string. If it has length > 1, `all_texts` at line 54 combines multiple query strings with the abstracts, creating a DTM with extra rows. The `scores` at lines 76–78 then have length = length(query) + nrow(abstracts_df), which fails to assign to `abstracts_df$similarity` at line 80 (length mismatch). A non-scalar `similarity_cutoff` will also recycle silently in the comparison at line 83.
Proposed fix

```diff
 filterSubnetworkByContext <- function(nodes,
     edges,
     similarity_cutoff = 0.10,
     query = "DNA damage repair cancer oncology") {
+    if (!is.character(query) || length(query) != 1L || is.na(query) || !nzchar(query)) {
+        stop("`query` must be a single non-empty character string.")
+    }
+    if (!is.numeric(similarity_cutoff) || length(similarity_cutoff) != 1L ||
+        is.na(similarity_cutoff) || !is.finite(similarity_cutoff) ||
+        similarity_cutoff < -1 || similarity_cutoff > 1) {
+        stop("`similarity_cutoff` must be a single numeric value in [-1, 1].")
+    }
     # ── 1. Extract evidence text from edges ───────────────────────────────────
     evidence <- .extract_evidence_text(edges)
```

```r
evidence_scored <- merge(
    evidence,
    abstracts_df[, c("pmid", "similarity")],
    by = "pmid",
    all.x = TRUE
)
evidence_scored$similarity[is.na(evidence_scored$similarity)] <- 0

# Keep only evidence rows whose PMID passed
evidence_filtered <- evidence_scored[
    evidence_scored$pmid %in% passing_pmids,
    c("source", "target", "interaction", "site",
      "evidenceLink", "stmt_hash", "text", "pmid", "similarity")
]
```
🧩 Analysis chain
🏁 Scripts executed (repository: Vitek-Lab/MSstatsBioNet):

```shell
# Find and read the file
find . -name "filterSubnetworkByContext.R" -type f
# Get context around the relevant lines
head -120 R/filterSubnetworkByContext.R | tail -50
# Read the full function to understand the complete context
cat -n R/filterSubnetworkByContext.R | head -110
# Find the fetch function to understand when abstracts might be NA
rg "\.fetch_clean_abstracts_xml" -A 5 R/
# Look for the function definition
rg "fetch_clean_abstracts" --type r -B 2 -A 15 R/
# Get the complete .fetch_clean_abstracts_xml function
sed -n '/^\.fetch_clean_abstracts_xml/,/^}/p' R/filterSubnetworkByContext.R | head -50
# Check how abstracts_df handles missing abstracts (empty string vs NA)
rg "abstract_list|abstracts_df" R/filterSubnetworkByContext.R -A 2 -B 2
# Check for tests or examples
fd "test.*filter" --type f
fd "filterSubnetwork" --type f
```
Don't collapse unscored abstracts into similarity 0.
Line 99 converts NA to 0, which happens when abstracts fail to fetch or aren't returned by PubMed. This conflates transient fetch failures with genuinely low-scoring abstracts, causing valid evidence to be removed from the network at line 103 (filtered by passing_pmids with cutoff ≥ 0.10). Instead, preserve evidence with unknown similarity scores.
Proposed fix

```diff
 evidence_scored <- merge(
     evidence,
     abstracts_df[, c("pmid", "similarity")],
     by = "pmid",
     all.x = TRUE
 )
-evidence_scored$similarity[is.na(evidence_scored$similarity)] <- 0
+if (all(is.na(evidence_scored$similarity))) {
+    warning("No abstracts could be scored — returning unfiltered inputs.")
+    return(list(nodes = nodes, edges = edges, evidence = evidence_scored))
+}
 # Keep only evidence rows whose PMID passed
 evidence_filtered <- evidence_scored[
-    evidence_scored$pmid %in% passing_pmids,
+    !is.na(evidence_scored$similarity) &
+        evidence_scored$similarity >= similarity_cutoff,
     c("source", "target", "interaction", "site",
       "evidenceLink", "stmt_hash", "text", "pmid", "similarity")
 ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@R/filterSubnetworkByContext.R` around lines 93 - 106, Remove the forced
conversion of missing similarity values to 0 (the line setting
evidence_scored$similarity[is.na(...)] <- 0) so unknown/failed-fetch abstracts
remain NA, and update the logic that computes passing_pmids (the filter that
uses similarity >= cutoff) to treat NA as passing by using a condition like
is.na(similarity) OR similarity >= cutoff; this ensures evidence_scored,
evidence_filtered, abstracts_df, similarity and passing_pmids retain and
preserve unknown scores rather than conflating them with zero.
```r
# Keep nodes present in surviving edges
surviving_nodes <- union(edges_filtered$source, edges_filtered$target)
nodes_filtered <- nodes[nodes[[1]] %in% surviving_nodes, ] # assumes first col is node ID
```
🧩 Analysis chain
🏁 Scripts executed (repository: Vitek-Lab/MSstatsBioNet):

```shell
# Locate and examine the file
find . -name "filterSubnetworkByContext.R" -type f
# Read the specific lines to verify the current code
cat -n R/filterSubnetworkByContext.R | sed -n '100,120p'
# Read the entire function for context
cat -n R/filterSubnetworkByContext.R | head -150
# Look for calls to this function to understand how nodes is passed
rg "filterSubnetworkByContext" --type r -A 3 -B 3
# See how getSubnetworkFromIndra creates the nodes object
rg "getSubnetworkFromIndra" --type r -A 20
grep -n "\.constructNodesDataFrame\|\..*nodes" R/getSubnetworkFromIndra.R | head -20
# Search for how the nodes dataframe is built
cat -n R/getSubnetworkFromIndra.R | tail -150
find . -name "*.R" -type f -exec grep -l "constructNodesDataFrame\|nodes.*data\.frame" {} \;
cat -n R/utils_getSubnetworkFromIndra.R | grep -A 50 "nodes"
rg "nodes\s*<-" R/ -A 5 -B 2
```
Use nodes$id instead of nodes[[1]] and validate the column exists.
The function is exported and can be called with nodes dataframes from external sources. If callers reorder columns before invoking this function, nodes[[1]] silently becomes incorrect. While getSubnetworkFromIndra() materialises an id column (line 346 of utils_getSubnetworkFromIndra.R), this function should not rely on column order. Add validation to fail fast when the column is missing, and use explicit column reference for consistency with the rest of the codebase (which uses nodes$id elsewhere).
Proposed fix

```diff
 # Keep nodes present in surviving edges
 surviving_nodes <- union(edges_filtered$source, edges_filtered$target)
-nodes_filtered <- nodes[nodes[[1]] %in% surviving_nodes, ] # assumes first col is node ID
+if (!"id" %in% names(nodes)) {
+    stop("`nodes` must contain an `id` column.")
+}
+nodes_filtered <- nodes[nodes$id %in% surviving_nodes, ]
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```r
# Keep nodes present in surviving edges
surviving_nodes <- union(edges_filtered$source, edges_filtered$target)
if (!"id" %in% names(nodes)) {
    stop("`nodes` must contain an `id` column.")
}
nodes_filtered <- nodes[nodes$id %in% surviving_nodes, ]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@R/filterSubnetworkByContext.R` around lines 112 - 114, Replace the positional
column access nodes[[1]] with an explicit nodes$id reference in the
nodes_filtered assignment, and add a pre-check that the nodes data.frame
contains an "id" column (e.g., stop with a clear error if !"id" %in%
colnames(nodes)); update the nodes_filtered creation that currently uses
surviving_nodes and edges_filtered to filter by nodes$id to avoid silent
breakage when columns are reordered; this keeps behavior consistent with other
functions like getSubnetworkFromIndra() and fails fast when the required column
is missing.
```r
abstract_node <- xml_find_first(doc, ".//AbstractText")

if (!is.na(abstract_node)) {
    results[[pmid]] <- xml_text(abstract_node)
```
🧩 Analysis chain
🏁 Scripts executed (repository: Vitek-Lab/MSstatsBioNet):

```shell
# Find and examine the file
fd filterSubnetworkByContext.R
wc -l R/filterSubnetworkByContext.R
# Get the context around lines 217-220
sed -n '200,235p' R/filterSubnetworkByContext.R
# Search for other uses of AbstractText in the codebase
rg "AbstractText" -n
# Check if xml_find_first is used elsewhere for abstracts
rg "xml_find_first" -n -A 2 -B 2
```
🌐 Web query: PubMed XML structured abstracts multiple AbstractText sections

💡 Result: In PubMed/MEDLINE XML, a structured abstract is represented as multiple `<AbstractText>` elements inside a single `<Abstract>` (and likewise inside `<OtherAbstract>` for additional-language abstracts). Each `<AbstractText>` corresponds to one section (e.g., OBJECTIVE, METHODS, RESULTS) and should be read in document order. [1]

Typical patterns:

- Publisher-supplied PubMed XML (data provider format): sections are encoded with the `Label` attribute, e.g. `<Abstract><AbstractText Label="OBJECTIVE">…</AbstractText><AbstractText Label="METHODS">…</AbstractText>…</Abstract>` [1]
- MEDLINE/PubMed distributed/efetch-style records: sections may also include `NlmCategory` (a normalized category such as BACKGROUND/OBJECTIVE/METHODS/RESULTS/CONCLUSIONS, plus UNASSIGNED for unmapped labels). This is what lets you reliably pick, say, only the RESULTS section by filtering on `AbstractText@NlmCategory="RESULTS"`. [2][3]

So if you see multiple `<AbstractText>` nodes, that's expected: treat them as the abstract's section blocks (optionally keyed by `@Label` and/or `@NlmCategory`), and concatenate or selectively extract as needed. [1][3]

Sources: [1] PubMed Data Provider XML Help (NCBI Bookshelf) • [2] NLM Technical Bulletin on structured abstracts & NlmCategory (incl. UNASSIGNED) • [3] NLM EDirect sample showing filtering on AbstractText@NlmCategory
🌐 Web query: PubMed efetch XML format AbstractText elements MEDLINE

💡 Result: In PubMed EFetch XML (`db=pubmed&retmode=xml`), the abstract text is in one or more `<AbstractText>` elements under the MEDLINE citation:

- Typical path: `PubmedArticleSet/PubmedArticle/MedlineCitation/Article/Abstract/AbstractText`
- Unstructured abstract: one `<AbstractText>` containing the full abstract
- Structured abstract: multiple `<AbstractText>` sections, each optionally carrying `Label` (the section heading as printed, e.g., "METHODS") and `NlmCategory` (controlled values like `BACKGROUND|OBJECTIVE|METHODS|RESULTS|CONCLUSIONS|UNASSIGNED`)

Also note:

- Abstract-like text can appear in `<OtherAbstract>` (which has had a `Language` attribute added)
- Whether a record is indexed for MEDLINE is indicated by `MedlineCitation@Status` (e.g., `Status="MEDLINE"`)
- `retmode` controls the format (XML vs text); for PubMed E-utilities, XML is also the default when `retmode` is omitted in the updated API
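Based on the efetch layout described above, a section-aware extraction might look like the following sketch. It assumes `doc` is an `xml2` document for a single PubMed record; the Label-prefixing behaviour is illustrative and not part of the PR:

```r
library(xml2)

# Collect every section of a (possibly structured) abstract and join them,
# prefixing each section with its Label attribute when one is present.
abstract_nodes <- xml_find_all(doc, ".//Abstract/AbstractText")
labels <- xml_attr(abstract_nodes, "Label")   # NA where no Label attribute
sections <- trimws(xml_text(abstract_nodes))
combined <- paste(
    ifelse(is.na(labels), sections, paste0(labels, ": ", sections)),
    collapse = " "
)
```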
🏁 Scripts executed (repository: Vitek-Lab/MSstatsBioNet):

```shell
# Check which xml2 functions are currently used
grep -n "xml_find" R/filterSubnetworkByContext.R
# Check what's imported from xml2 in the NAMESPACE
grep "xml2" NAMESPACE
```
🌐 Web query: xml2 package R xml_find_all function documentation

💡 Result: `xml2::xml_find_all()` finds nodes that match an XPath expression.

Usage: `xml_find_all(x, xpath, ns = xml_ns(x), ...)`; the nodeset method adds `flatten = TRUE`.

Key arguments: `x` is an `xml_document`, `xml_node`, or `xml_nodeset`; `xpath` is an XPath 1.0 expression string; `ns` is a namespace prefix-to-URI mapping (often from `xml_ns(x)`; default namespaces must be given an explicit prefix).

Value/behavior: matching nodes are returned as an `xml_nodeset` (empty if no matches). When `x` is a nodeset, the result can be a nodeset or a list of nodesets; `flatten` controls whether results are combined into one nodeset. Related: `xml_find_first()` returns only the first match (or an `xml_missing` node if none).
🏁 Scripts executed (repository: Vitek-Lab/MSstatsBioNet):

```shell
# trimws and paste are base R functions (no imports needed)
# Check current xml2 imports in NAMESPACE
grep "importFrom(xml2" NAMESPACE
# Would need to add: importFrom(xml2,xml_find_all)
# Check for any other abstract fetching or XML parsing usages
rg "xml_find" -A 3 -B 3
```
Concatenate all abstract sections instead of truncating to the first.
PubMed structured abstracts contain multiple <AbstractText> nodes—one for each section (OBJECTIVE, METHODS, RESULTS, CONCLUSIONS, etc.). Using xml_find_first() retrieves only the first section, causing relevance scoring to miss matches in other sections.
Proposed fix

```diff
-abstract_node <- xml_find_first(doc, ".//AbstractText")
-
-if (!is.na(abstract_node)) {
-    results[[pmid]] <- xml_text(abstract_node)
+abstract_nodes <- xml_find_all(doc, ".//AbstractText")
+
+if (length(abstract_nodes) > 0) {
+    results[[pmid]] <- paste(trimws(xml_text(abstract_nodes)), collapse = " ")
 }
```

Also update the `@importFrom` tag and the NAMESPACE to include `xml_find_all` from the xml2 package.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@R/filterSubnetworkByContext.R` around lines 217 - 220, replace the single-section extraction using xml_find_first with a concatenation of all <AbstractText> nodes: use xml_find_all(doc, ".//AbstractText") to retrieve every section, combine their xml_text() values (e.g., collapse with a space) and assign the combined string to results[[pmid]] instead of the single abstract_node; additionally update the package metadata by adding xml_find_all to the `@importFrom` xml2 line so the corresponding importFrom() entry appears in the NAMESPACE and the function is available.
Motivation and Context
This PR introduces a new feature to filter protein interaction subnetworks by contextual relevance of supporting literature. The feature addresses the common problem of subnetworks containing many edges supported by literature from diverse biological contexts. It leverages TF-IDF cosine similarity to score PubMed abstracts against a user-provided text query, allowing researchers to focus on interactions relevant to specific research questions (e.g., "DNA damage repair cancer oncology"). The solution integrates with the existing INDRA database integration to fetch evidence text and abstracts, then filters the network based on similarity cutoffs.
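A hypothetical end-to-end call, based on the signature shown in this PR; the input to `getSubnetworkFromIndra()` and the list structure of both return values are assumptions, not verified against the code:

```r
library(MSstatsBioNet)

# Build a subnetwork first (the `annotated_df` input object is hypothetical),
# then filter it by literature context.
subnetwork <- getSubnetworkFromIndra(annotated_df)
filtered <- filterSubnetworkByContext(
    nodes = subnetwork$nodes,
    edges = subnetwork$edges,
    similarity_cutoff = 0.10,
    query = "DNA damage repair cancer oncology"
)
```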
Changes
- **DESCRIPTION**: Extended the `Imports` section to include the `text2vec`, `stopwords`, `xml2`, and `rentrez` packages (lines +5/-1)
- **NAMESPACE**: Exported the `filterSubnetworkByContext` function and added `importFrom` declarations for:
  - `httr`: `content_type_json`
  - `rentrez`: `entrez_fetch`
  - `stopwords`: `stopwords`
  - `text2vec`: `TfIdf`, `create_dtm`, `create_vocabulary`, `fit_transform`, `itoken`, `prune_vocabulary`, `vocab_vectorizer`, `word_tokenizer`
  - `xml2`: `read_xml`, `xml_find_first`, `xml_text`
- **R/filterSubnetworkByContext.R** (new file, 266 lines): Implements `filterSubnetworkByContext(nodes, edges, similarity_cutoff = 0.10, query = "DNA damage repair cancer oncology")`, which orchestrates the filtering workflow, plus internal helpers:
  - `.extract_evidence_text(df)`: Queries the INDRA API for evidence text associated with each statement hash and constructs a dataframe with `source`, `target`, `interaction`, `site`, `evidenceLink`, `stmt_hash`, `text`, and `pmid` columns
  - `.fetch_clean_abstracts_xml(pmids)`: Fetches PubMed abstracts via `rentrez` with XML parsing, including rate limiting (0.34 s delay per NCBI guidelines) and progress tracking
  - `.query_indra_evidence(stmt_hash)`: Queries the INDRA API endpoint to retrieve evidence objects for a given statement hash
- **man/filterSubnetworkByContext.Rd** (new file, 37 lines): Roxygen documentation for the public function, including parameter descriptions, return value structure, and usage details
- **vignettes/Filter-By-Context.Rmd** (new file, 264 lines): Comprehensive R Markdown vignette demonstrating the end-to-end workflow for filtering a subnetwork by literature context (using `annotateProteinInfoFromIndra`, `getSubnetworkFromIndra`)
- **R/utils_getSubnetworkFromIndra.R**: Added 2 trailing blank lines (minor formatting change)
Testing
No unit tests were added or modified to verify the new `filterSubnetworkByContext` function. The existing test suite in `./tests/testthat/` does not include any tests for this new functionality.

Coding Guidelines
The PR template requires running `styler::style_pkg(transformers = styler::tidyverse_style(indent_by = 4))` before requesting a review to ensure code style compliance with the project's tidyverse conventions (4-space indentation). There is no indication in the PR that this step was completed.