GeneKnow

GeneKnow is a command-line tool for discovering and inspecting gene functions in specific biological contexts. Given a list of genes and a context (e.g., a cell type or disease), it searches the biomedical literature, retrieves relevant evidence passages using BM25 scoring, and uses LLM-powered pipelines to summarize, verify, and synthesize findings.
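
To make the BM25 step concrete, here is a minimal sketch of Okapi BM25 scoring over tokenized passages. It is not GeneKnow's implementation: the function name bm25_scores, the defaults k1=1.5 and b=0.75, the always-positive IDF variant, the whitespace tokenization, and the toy passages are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, passages, k1=1.5, b=0.75):
    """Score each tokenized passage against the query with Okapi BM25.

    Assumes `passages` is a non-empty list of token lists.
    """
    n = len(passages)
    avg_len = sum(len(p) for p in passages) / n
    # Document frequency: number of passages containing each term.
    df = Counter(term for p in passages for term in set(p))
    scores = []
    for p in passages:
        tf = Counter(p)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # IDF variant with "+1" inside the log, so it is never negative.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with passage-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(p) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy passages, lowercased and split on whitespace (illustration only).
passages = [
    "foxa1 regulates androgen receptor binding in prostate cancer".split(),
    "hoxb13 variants are associated with prostate cancer risk".split(),
    "the weather was pleasant during the conference".split(),
]
scores = bm25_scores(["foxa1", "prostate", "cancer"], passages)
ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
```

Here ranked lists passage indices from best to worst match. The passage mentioning both the gene and the context ranks first, and the unrelated passage scores zero.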

Installation

GeneKnow requires Python >= 3.13.7.

git clone https://github.com/zang-lab/GeneKnow.git
cd GeneKnow
pip install .

This installs the geneknow CLI entry point (also available as GeneKnow).


API Keys

GeneKnow requires an OpenAI API key to function. This key is used for all LLM operations, including passage summarization, article summary verification, and final synopsis generation.

An Elsevier API key is optional. It is only required if you want to sort search results by citation count (--sort-cited) or if you choose to use Scopus as your search engine (--search-engine scopus). For standard relevance-based searches via PubMed or Europe PMC, you do not need to provide this key.

Setting environment variables

Linux / macOS (Bash/Zsh)

export OPENAI_API_KEY="sk-..."
export ELSEVIER_API_KEY="..."

To make them persistent across terminal sessions, add the above lines to your shell profile file (e.g., ~/.bashrc, ~/.zshrc, or ~/.bash_profile), then reload:

source ~/.bashrc   # or ~/.zshrc

Windows (Command Prompt)

set OPENAI_API_KEY=sk-...
set ELSEVIER_API_KEY=...

Windows (PowerShell)

$env:OPENAI_API_KEY="sk-..."
$env:ELSEVIER_API_KEY="..."
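
Before a long run it can help to confirm that the keys are actually visible to the process. The sketch below encodes the rules stated above (OpenAI key always, Elsevier key only for Scopus or citation sorting). The helper name missing_api_keys and its parameters are hypothetical and not part of GeneKnow.

```python
import os


def missing_api_keys(search_engine="pubmed", sort_cited=False, env=None):
    """Return the names of required API keys that are unset or empty."""
    env = os.environ if env is None else env
    required = ["OPENAI_API_KEY"]  # needed for every LLM step
    # Per the README, the Elsevier key is only needed for Scopus searches
    # or for sorting results by citation count.
    if search_engine == "scopus" or sort_cited:
        required.append("ELSEVIER_API_KEY")
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_api_keys(search_engine="scopus")
    if missing:
        print("Missing environment variables:", ", ".join(missing))
```

Passing env explicitly (for example, env={"OPENAI_API_KEY": "sk-test"}) makes the check easy to exercise without touching the real environment.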

Usage

GeneKnow provides two subcommands: discover and inspect.

Discover mode

Automatically search the literature and synthesize a synopsis per gene.

Example

Discover the roles that three genes of interest play in prostate cancer.

geneknow discover \
  -g FOXA1 HOXB13 BRCA1 \
  -c "prostate cancer" PCa \
  -n PCa \
  --max-papers 5
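
When GeneKnow is driven from a script, the same command can be assembled as an argument list, which avoids shell-quoting problems with multi-word context aliases. This is a sketch that uses only the flags shown in the example above. The helper build_discover_command is hypothetical.

```python
import shlex


def build_discover_command(genes, context, name, max_papers=None):
    """Assemble the argument list for `geneknow discover`."""
    cmd = ["geneknow", "discover", "-g", *genes, "-c", *context, "-n", name]
    if max_papers is not None:
        cmd += ["--max-papers", str(max_papers)]
    return cmd


cmd = build_discover_command(
    genes=["FOXA1", "HOXB13", "BRCA1"],
    context=["prostate cancer", "PCa"],
    name="PCa",
    max_papers=5,
)
# shlex.join renders the list as a copy-pasteable shell command.
print(shlex.join(cmd))
```

The list can be passed directly to subprocess.run(cmd, check=True) once geneknow is installed and the API keys are set.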

Key arguments

| Argument | Description |
| --- | --- |
| -g, --genes | Space-separated gene symbols (e.g., -g FOXA1 HOXB13). Overrides -G if both are provided. |
| -G, --genes-file | Path to a file with one gene symbol per line. Ignored if -g/--genes is provided. |
| -c, --context | Required. Space-separated context aliases (e.g., -c "prostate cancer" PCa). |
| -n, --name | Required. Project name (alphanumeric characters, hyphens, and underscores only). Output goes to outdir/name/. |
| -o, --outdir | Output directory. Defaults to the current directory. |
| -s, --species | Species for gene-alias lookup. Default: human. |
| --max-papers | Maximum number of papers to review per gene. Default: 3. |
| --max-passages | Maximum number of evidence passages to review per paper. Default: 3. |
| --search-engine | pubmed (default), europepmc, or scopus. Scopus requires ELSEVIER_API_KEY. |
| --search-limit | Maximum number of papers to fetch from the search engine per gene. Default: 25. |
| --sort-cited | By default, geneknow fetches the --search-limit most relevant papers and reviews the top --max-papers of them. With this flag, the fetched papers are first sorted by citation count, so the most cited papers among them are reviewed instead. Requires ELSEVIER_API_KEY. |
| --auto-alias | Enable automatic gene alias matching via NCBI Gene. |
| -N, --suffix-not-allowed | Disable suffix matching on context terms (e.g., plurals). Only exact context terms will be used. This does not affect gene-name suffix handling. |
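
The interaction between --search-limit, --max-papers, and --sort-cited can be sketched as follows. This only mirrors the behavior described for those arguments and is not GeneKnow's code. The function select_papers, the dictionary keys, and the placeholder IDs and citation counts are made up for illustration.

```python
def select_papers(papers, max_papers=3, sort_cited=False):
    """Pick which fetched papers to review.

    `papers` is assumed to be ordered by search relevance (best first)
    and already truncated to --search-limit entries.
    """
    if sort_cited:
        # sorted() is stable, so ties in citation count keep relevance order.
        papers = sorted(papers, key=lambda p: p["citations"], reverse=True)
    return papers[:max_papers]


# Placeholder records in relevance order (not real papers).
fetched = [
    {"pmid": "A", "citations": 4},
    {"pmid": "B", "citations": 120},
    {"pmid": "C", "citations": 4},
    {"pmid": "D", "citations": 37},
]
by_relevance = [p["pmid"] for p in select_papers(fetched, max_papers=2)]
by_citations = [
    p["pmid"] for p in select_papers(fetched, max_papers=2, sort_cited=True)
]
```

With these placeholder records, by_relevance keeps the first two fetched papers, while by_citations keeps the two with the highest citation counts. In both cases the candidate pool is the same relevance-ranked set.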

Inspect mode

Deep-dive into a single paper specified by PMID or PMCID.

Example

Summarize the functional role of FOXA1 in prostate cancer based on a specific paper (PMID: 40570057).

geneknow inspect \
  -g FOXA1 \
  -c "prostate cancer" PCa \
  --pmid 40570057 \
  -n foxa1_PCa

Key arguments

| Argument | Description |
| --- | --- |
| -g, --genes | Space-separated gene symbols (e.g., -g FOXA1 HOXB13). Overrides -G if both are provided. |
| -G, --genes-file | Path to a file with one gene symbol per line. Ignored if -g/--genes is provided. |
| -c, --context | Required. Space-separated context aliases (e.g., -c "prostate cancer" PCa). |
| -n, --name | Required. Project name (alphanumeric characters, hyphens, and underscores only). Output goes to outdir/name/. |
| -o, --outdir | Output directory. Defaults to the current directory. |
| -s, --species | Species for gene-alias lookup. Default: human. |
| --max-passages | Maximum number of evidence passages to review per paper. Default: 3. |
| --pmid | PubMed ID of the target paper. |
| --pmcid | PubMed Central ID of the target paper. |
| --auto-alias | Enable automatic gene alias matching via NCBI Gene. |
| -N, --suffix-not-allowed | Disable suffix matching on context terms (e.g., plurals). Only exact context terms will be used. This does not affect gene-name suffix handling. |
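
The effect of -N/--suffix-not-allowed can be illustrated with a small regular-expression sketch. The exact suffix rules GeneKnow applies are not documented here, so this example assumes that "suffix matching" means an optional plural ending (s or es) and that matching is case-insensitive. The function context_pattern and the sample sentence are hypothetical.

```python
import re


def context_pattern(terms, allow_suffix=True):
    """Compile a case-insensitive regex matching any context term."""
    # Assumption: suffix matching is modeled as an optional plural ending.
    suffix = r"(?:s|es)?" if allow_suffix else ""
    # Longest terms first so multi-word aliases win over shorter ones.
    alternatives = "|".join(
        re.escape(t) for t in sorted(terms, key=len, reverse=True)
    )
    return re.compile(rf"\b(?:{alternatives}){suffix}\b", re.IGNORECASE)


text = "Several prostate cancers were profiled."
loose = context_pattern(["prostate cancer", "PCa"])
strict = context_pattern(["prostate cancer", "PCa"], allow_suffix=False)

loose_hit = loose.search(text)    # matches the plural form
strict_hit = strict.search(text)  # None: only exact terms are accepted
```

Under these assumptions, the default behavior finds "prostate cancers" while the strict pattern, analogous to passing -N, does not.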

Output

Results are saved under outdir/name/:

  • GeneKnow_report.csv — Summary report per gene
  • GeneKnow_report.html — HTML report (discover mode)
  • token_usage.csv — LLM token usage per gene
  • synopses/ — Gene-level synthesized synopses (discover mode)
  • article_summaries/ — Per-paper summaries (discover mode)
  • evidence_passages/ — Retrieved evidence passages with BM25 scores
  • html/ — Paper-level HTML reports (discover mode)
  • error_genes.txt — List of genes that encountered errors (if any)
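
After a run, a quick check of the project folder shows which of the files listed above were produced and whether any genes failed. This is a sketch: summarize_outputs is a hypothetical helper, and it assumes error_genes.txt holds whitespace- or newline-separated gene symbols, which is not confirmed here.

```python
from pathlib import Path

# File and folder names copied from the list above.
EXPECTED_OUTPUTS = [
    "GeneKnow_report.csv",
    "GeneKnow_report.html",
    "token_usage.csv",
    "synopses",
    "article_summaries",
    "evidence_passages",
    "html",
]


def summarize_outputs(outdir, name):
    """Report which documented outputs exist under outdir/name/."""
    project = Path(outdir) / name
    present = {item: (project / item).exists() for item in EXPECTED_OUTPUTS}
    # error_genes.txt is only expected when some genes hit errors.
    errors_file = project / "error_genes.txt"
    failed = errors_file.read_text().split() if errors_file.exists() else []
    return present, failed
```

For example, present, failed = summarize_outputs(".", "PCa") inspects ./PCa/. Items marked "(discover mode)" in the list are expected to be absent after an inspect run.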

License statement

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
