feature(filter-subnetwork): Add feature to filter the subnetwork from INDRA by query context #84
base: devel
`R/filterSubnetworkByContext.R` (new file, `@@ -0,0 +1,267 @@`):

```r
#' Filter a subnetwork by contextual relevance using TF-IDF cosine similarity
#'
#' Fetches PubMed abstracts for evidence PMIDs, scores each abstract against a
#' user-supplied text query, and returns only the nodes, edges, and evidence
#' rows whose abstracts meet the similarity cutoff.
#'
#' @param nodes A dataframe of network nodes.
#' @param edges A dataframe of network edges with columns: source, target,
#'   interaction, site, evidenceLink, stmt_hash.
#' @param similarity_cutoff Numeric in [-1, 1]. Only evidence whose abstract
#'   scores >= this value is retained. Default 0.10.
#' @param query Character string. The text query to compare against abstracts.
#'   Expand with synonyms / related terms for better recall.
#'
#' @return A named list with three elements:
#'   \item{nodes}{Filtered nodes dataframe (only nodes present in kept edges)}
#'   \item{edges}{Filtered edges dataframe}
#'   \item{evidence}{Dataframe with columns: source, target, interaction, site,
#'     evidenceLink, stmt_hash, text, pmid, similarity}
#'
#' @importFrom text2vec itoken word_tokenizer create_vocabulary prune_vocabulary
#'   vocab_vectorizer create_dtm TfIdf fit_transform
#' @importFrom stopwords stopwords
#' @export
filterSubnetworkByContext <- function(nodes,
                                      edges,
                                      similarity_cutoff = 0.10,
                                      query = "DNA damage repair cancer oncology") {

  # ── 1. Extract evidence text from edges ───────────────────────────────────
  evidence <- .extract_evidence_text(edges)

  if (nrow(evidence) == 0) {
    warning("No evidence text found — returning unfiltered inputs.")
    return(list(nodes = nodes, edges = edges, evidence = evidence))
  }

  # ── 2. Fetch PubMed abstracts for unique PMIDs ────────────────────────────
  pmids <- unique(evidence$pmid[nchar(evidence$pmid) > 0])

  if (length(pmids) == 0) {
    warning("No PMIDs found in evidence — returning unfiltered inputs.")
    return(list(nodes = nodes, edges = edges, evidence = evidence))
  }

  abstract_list <- .fetch_clean_abstracts_xml(pmids)
  abstracts_df <- data.frame(
    pmid = names(abstract_list),
    abstract = unlist(abstract_list, use.names = FALSE),
    stringsAsFactors = FALSE
  )

  # ── 3. TF-IDF vectorisation (query + all abstracts) ───────────────────────
  all_texts <- c(query, abstracts_df$abstract)

  tokens <- itoken(all_texts,
                   preprocessor = tolower,
                   tokenizer = word_tokenizer)
  vocab <- create_vocabulary(tokens, stopwords = stopwords("en"))
  vocab <- prune_vocabulary(vocab, term_count_min = 1)
  vectorizer <- vocab_vectorizer(vocab)
  dtm <- create_dtm(tokens, vectorizer)
  tfidf <- TfIdf$new()
  dtm_tfidf <- fit_transform(dtm, tfidf)

  # ── 4. Cosine similarity: query (row 1) vs each abstract ─────────────────
  .cos_sim <- function(a, b) {
    a <- as.numeric(a)
    b <- as.numeric(b)
    sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
  }

  query_vec <- dtm_tfidf[1, , drop = FALSE]
  abstract_vecs <- dtm_tfidf[-1, , drop = FALSE]

  scores <- sapply(seq_len(nrow(abstract_vecs)), function(i) {
    .cos_sim(query_vec, abstract_vecs[i, , drop = FALSE])
  })

  abstracts_df$similarity <- round(scores, 4)

  # ── 5. Filter abstracts by similarity cutoff ──────────────────────────────
  passing_pmids <- abstracts_df$pmid[abstracts_df$similarity >= similarity_cutoff]

  cat(sprintf(
    "\n%d / %d abstracts passed similarity cutoff (>= %.2f)\n",
    length(passing_pmids), nrow(abstracts_df), similarity_cutoff
  ))

  # ── 6. Filter evidence, edges, nodes ─────────────────────────────────────

  # Join similarity score onto evidence; drop abstract text
  evidence_scored <- merge(
    evidence,
    abstracts_df[, c("pmid", "similarity")],
    by = "pmid",
    all.x = TRUE
  )
  evidence_scored$similarity[is.na(evidence_scored$similarity)] <- 0

  # Keep only evidence rows whose PMID passed
  evidence_filtered <- evidence_scored[
    evidence_scored$pmid %in% passing_pmids,
    c("source", "target", "interaction", "site",
      "evidenceLink", "stmt_hash", "text", "pmid", "similarity")
  ]
```
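The filtering above hinges on one number per abstract: the cosine similarity between the query's TF-IDF vector and the abstract's. A toy base-R sketch of the same `.cos_sim` computation, using hypothetical hand-built term counts rather than the PR's text2vec pipeline:

```r
# Cosine similarity, as in the .cos_sim helper above
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical term-count vectors over a shared vocabulary
query     <- c(dna = 1, damage = 1, repair = 1, kinase = 0)
on_topic  <- c(dna = 2, damage = 1, repair = 3, kinase = 0)
off_topic <- c(dna = 0, damage = 0, repair = 0, kinase = 5)

cos_sim(query, on_topic)   # ~0.93: shares the query's terms, passes a 0.10 cutoff
cos_sim(query, off_topic)  # 0: no overlapping terms, always filtered out
```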
**Review comment on lines +93 to +106**

**Don't collapse unscored abstracts into similarity 0.** Line 99 converts `NA` similarities to 0 after the `all.x = TRUE` merge, so evidence whose abstract could not be fetched or scored becomes indistinguishable from evidence that genuinely scored as irrelevant, and is silently dropped by any positive cutoff.

Proposed fix:

```diff
 evidence_scored <- merge(
   evidence,
   abstracts_df[, c("pmid", "similarity")],
   by = "pmid",
   all.x = TRUE
 )
-evidence_scored$similarity[is.na(evidence_scored$similarity)] <- 0
+if (all(is.na(evidence_scored$similarity))) {
+  warning("No abstracts could be scored — returning unfiltered inputs.")
+  return(list(nodes = nodes, edges = edges, evidence = evidence_scored))
+}

 # Keep only evidence rows whose PMID passed
 evidence_filtered <- evidence_scored[
-  evidence_scored$pmid %in% passing_pmids,
+  !is.na(evidence_scored$similarity) &
+    evidence_scored$similarity >= similarity_cutoff,
   c("source", "target", "interaction", "site",
     "evidenceLink", "stmt_hash", "text", "pmid", "similarity")
 ]
```
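The pitfall this comment describes can be reproduced in a few lines of base R (toy data, not from the PR):

```r
evidence <- data.frame(pmid = c("111", "222"), text = c("a", "b"))
scored   <- data.frame(pmid = "111", similarity = 0.42)  # "222" never fetched

merged <- merge(evidence, scored, by = "pmid", all.x = TRUE)
merged$similarity   # 0.42 NA: the NA flags an unscored abstract

merged$similarity[is.na(merged$similarity)] <- 0
merged$similarity   # 0.42 0.00: "could not score" now looks like "irrelevant"
```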
```r
  # Keep edges that have at least one surviving evidence row
  surviving_hashes <- unique(evidence_filtered$stmt_hash)
  edges_filtered <- edges[edges$stmt_hash %in% surviving_hashes, ]

  # Keep nodes present in surviving edges
  surviving_nodes <- union(edges_filtered$source, edges_filtered$target)
  nodes_filtered <- nodes[nodes[[1]] %in% surviving_nodes, ]  # assumes first col is node ID
```
**Review comment on lines +112 to +114**

**Use a named node-ID column instead of positional indexing.** The function is exported and can be called with nodes dataframes from external sources. If callers reorder columns before invoking this function, `nodes[[1]]` silently filters on the wrong column.

Proposed fix:

```diff
 # Keep nodes present in surviving edges
 surviving_nodes <- union(edges_filtered$source, edges_filtered$target)
-nodes_filtered <- nodes[nodes[[1]] %in% surviving_nodes, ]  # assumes first col is node ID
+if (!"id" %in% names(nodes)) {
+  stop("`nodes` must contain an `id` column.")
+}
+nodes_filtered <- nodes[nodes$id %in% surviving_nodes, ]
```
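The column-order fragility is easy to demonstrate with a hypothetical nodes dataframe whose ID is not in the first position:

```r
nodes <- data.frame(label = c("gene A", "gene B"), id = c("A", "B"))
surviving <- "A"

nodes[nodes[[1]] %in% surviving, ]  # empty: first column is `label`, not the ID
nodes[nodes$id %in% surviving, ]    # one row, robust to column order
```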
```r
  cat(sprintf(
    "Retained: %d edges (of %d), %d nodes (of %d), %d evidence rows (of %d)\n",
    nrow(edges_filtered), nrow(edges),
    nrow(nodes_filtered), nrow(nodes),
    nrow(evidence_filtered), nrow(evidence_scored)
  ))

  return(list(
    nodes = nodes_filtered,
    edges = edges_filtered,
    evidence = evidence_filtered
  ))
}


# ── Internal helpers ──────────────────────────────────────────────────────────

#' Extract evidence text from edges dataframe via INDRA API
#' @param df Edges dataframe with columns: source, target, interaction, site,
#'   evidenceLink, stmt_hash
#' @return Dataframe with additional columns: text, pmid
#' @keywords internal
#' @noRd
.extract_evidence_text <- function(df) {

  required_cols <- c("source", "target", "interaction", "site", "evidenceLink", "stmt_hash")
  missing_cols <- setdiff(required_cols, names(df))
  if (length(missing_cols) > 0) {
    stop(sprintf("Missing required columns: %s", paste(missing_cols, collapse = ", ")))
  }

  results_list <- list()
  result_count <- 0
  unique_hashes <- unique(df$stmt_hash)
  n_hashes <- length(unique_hashes)

  cat(sprintf("Processing %d unique statement hashes...\n", n_hashes))

  for (i in seq_along(unique_hashes)) {
    stmt_hash <- unique_hashes[i]

    if (i %% 10 == 0) cat(sprintf("Progress: %d/%d\n", i, n_hashes))

    evidence_list <- .query_indra_evidence(stmt_hash)
    if (is.null(evidence_list) || length(evidence_list) == 0) next

    matching_indices <- which(df$stmt_hash == stmt_hash)

    for (evidence in evidence_list) {
      if (!is.null(evidence[["text"]]) && nchar(evidence[["text"]]) > 0) {
        for (idx in matching_indices) {
          result_count <- result_count + 1
          results_list[[result_count]] <- data.frame(
            source = df$source[idx],
            target = df$target[idx],
            interaction = df$interaction[idx],
            site = df$site[idx],
            evidenceLink = df$evidenceLink[idx],
            stmt_hash = df$stmt_hash[idx],
            text = evidence[["text"]],
            pmid = if (is.null(evidence[["pmid"]])) "" else evidence[["pmid"]],
            stringsAsFactors = FALSE
          )
        }
      }
    }
  }

  if (result_count == 0) {
    warning("No evidence text found for any statement hash")
    return(data.frame(
      source = character(), target = character(), interaction = character(),
      site = character(), evidenceLink = character(), stmt_hash = character(),
      text = character(), pmid = character(), stringsAsFactors = FALSE
    ))
  }

  results_df <- do.call(rbind, results_list)
  cat(sprintf("\nComplete! Found %d evidence text entries.\n", nrow(results_df)))
  return(results_df)
}


#' Fetch and clean PubMed abstracts via rentrez
#' @param pmids Character vector of PubMed IDs
#' @return Named list: pmid -> abstract text
#' @keywords internal
#' @importFrom rentrez entrez_fetch
#' @importFrom xml2 read_xml xml_find_first xml_text
#' @noRd
.fetch_clean_abstracts_xml <- function(pmids) {
  results <- list()
  total <- length(pmids)

  cat(sprintf("Fetching %d abstracts...\n", total))

  for (i in seq_along(pmids)) {
    pmid <- pmids[i]
    tryCatch({
      record <- entrez_fetch(db = "pubmed", id = pmid, rettype = "xml")
      doc <- read_xml(record)
      abstract_node <- xml_find_first(doc, ".//AbstractText")

      if (!is.na(abstract_node)) {
        results[[pmid]] <- xml_text(abstract_node)
```
**Review comment on lines +217 to +220**

**Concatenate all abstract sections instead of truncating to the first.** PubMed structured abstracts contain multiple `<AbstractText>` elements (e.g. Background, Methods, Results, Conclusions), so `xml_find_first` keeps only the first section and silently discards the rest of the abstract.

Proposed fix:

```diff
-      abstract_node <- xml_find_first(doc, ".//AbstractText")
-
-      if (!is.na(abstract_node)) {
-        results[[pmid]] <- xml_text(abstract_node)
+      abstract_nodes <- xml_find_all(doc, ".//AbstractText")
+
+      if (length(abstract_nodes) > 0) {
+        results[[pmid]] <- paste(trimws(xml_text(abstract_nodes)), collapse = " ")
       }
```

Also update the `@importFrom xml2` directive (and NAMESPACE) to include `xml_find_all`.
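The proposed `paste(trimws(...), collapse = " ")` step can be seen on plain strings (hypothetical section texts standing in for the `<AbstractText>` nodes):

```r
sections <- c("BACKGROUND: DNA repair is impaired ...",
              "  METHODS: We profiled ...",
              "CONCLUSIONS: Repair capacity predicts response.")

paste(trimws(sections), collapse = " ")
# One string containing every section, so the TF-IDF step sees the whole abstract
```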
```r
      }

      if (i %% 10 == 0 || i == total) {
        cat(sprintf("Progress: %d/%d (%.1f%%)\n", i, total, (i / total) * 100))
      }

      Sys.sleep(0.34)  # respect NCBI rate limit
    }, error = function(e) {
      results[[pmid]] <- ""
      cat(sprintf("Error fetching PMID %s at %d/%d: %s\n", pmid, i, total, e$message))
    })
  }

  cat("Done fetching abstracts!\n")
  return(results)
}

#' Query INDRA API for evidence text
#'
#' @param stmt_hash A statement hash string
#' @importFrom httr POST status_code content content_type_json
#' @importFrom jsonlite fromJSON
#' @noRd
#' @return A list of evidence objects from the API, or NULL if error
.query_indra_evidence <- function(stmt_hash) {
  url <- "https://discovery.indra.bio/api/get_evidences_for_stmt_hash"

  tryCatch({
    response <- POST(
      url,
      body = list(stmt_hash = stmt_hash),
      encode = "json",
      content_type_json()
    )

    if (status_code(response) != 200) {
      warning(sprintf("API returned status %d for stmt_hash: %s",
                      status_code(response), stmt_hash))
      return(NULL)
    }

    content(response, as = "parsed")
  }, error = function(e) {
    warning(sprintf("Error querying stmt_hash %s: %s", stmt_hash, e$message))
    return(NULL)
  })
}
```
A second changed file (not named in this view) appends two blank lines after its final function (`@@ -426,3 +426,5 @@`):

```r
  correlations <- cor(wide_data, use = "pairwise.complete.obs")
  return(correlations)
}
```
**Review comment**

**Validate `query` and `similarity_cutoff` before vectorising.** `query` must be a single character string. If it has length > 1, `all_texts` at line 54 combines multiple query strings with the abstracts, creating a DTM with extra rows; the `scores` vector computed at lines 76–78 then no longer matches `nrow(abstracts_df)`, and the assignment to `abstracts_df$similarity` at line 80 fails with a length mismatch. A non-scalar `similarity_cutoff` will also recycle silently in the comparison at line 83.

Proposed fix:

```diff
 filterSubnetworkByContext <- function(nodes,
                                       edges,
                                       similarity_cutoff = 0.10,
                                       query = "DNA damage repair cancer oncology") {
+  if (!is.character(query) || length(query) != 1L || is.na(query) || !nzchar(query)) {
+    stop("`query` must be a single non-empty character string.")
+  }
+  if (!is.numeric(similarity_cutoff) || length(similarity_cutoff) != 1L ||
+      is.na(similarity_cutoff) || !is.finite(similarity_cutoff) ||
+      similarity_cutoff < -1 || similarity_cutoff > 1) {
+    stop("`similarity_cutoff` must be a single numeric value in [-1, 1].")
+  }

   # ── 1. Extract evidence text from edges ───────────────────────────────────
   evidence <- .extract_evidence_text(edges)
```
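The length-mismatch mechanism is visible without text2vec (toy vectors, not the PR's data):

```r
query <- c("DNA damage", "repair")                  # should be length 1
abstracts <- c("abstract 1", "abstract 2", "abstract 3")

all_texts <- c(query, abstracts)
length(all_texts) - 1   # 4 rows remain after dropping row 1, but only 3 abstracts exist
```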
filterSubnetworkByContext <- function(nodes, edges, similarity_cutoff = 0.10, query = "DNA damage repair cancer oncology") { + if (!is.character(query) || length(query) != 1L || is.na(query) || !nzchar(query)) { + stop("`query` must be a single non-empty character string.") + } + if (!is.numeric(similarity_cutoff) || length(similarity_cutoff) != 1L || + is.na(similarity_cutoff) || !is.finite(similarity_cutoff) || + similarity_cutoff < -1 || similarity_cutoff > 1) { + stop("`similarity_cutoff` must be a single numeric value in [-1, 1].") + } # ── 1. Extract evidence text from edges ─────────────────────────────────── evidence <- .extract_evidence_text(edges)