Literature Explorer

Discover research papers you may have missed, starting from a .bib file or folder of PDFs. Built on Semantic Scholar and OpenAlex — both free, no API key required.

Structure

core/
  explorer.py   — resolve corpus, build co-citation graph, score & cluster candidates
  search.py     — search and filter the cache interactively
  enrich.py     — fill in missing abstracts via OpenAlex / CrossRef
  viz.py        — generate interactive visualizations
run.py          — one-command pipeline (explore → enrich → visualize)
pipeline.ipynb  — step-by-step Jupyter notebook
requirements.txt

Installation

Python 3.9+ required.

pip install -r requirements.txt

# Only needed if using --pdf-dir
pip install pypdf

Quick start

# Run the full pipeline (explore → enrich abstracts → visualize)
python run.py --bib references.bib

# With keyword focus and clustering
python run.py --bib references.bib --keywords "open data" "data sharing" --cluster 0

# Use a scoring preset
python run.py --bib references.bib --preset recent

Outputs go to outputs/{project_name}/. The cache is saved next to the .bib file as {name}_cache.json.

How it works

Parse your corpus from a .bib file or PDF folder
Resolve each paper via Semantic Scholar (OpenAlex fallback), fetching metadata + full reference/citation lists in one API call per paper
Build a co-citation graph — papers that recur across your corpus's reference and citation lists are strong candidates
Score candidates on co-citation frequency, citation count, recency, and field relevance
Cluster candidates into topic groups using TF-IDF + KMeans (optional)
Output a Markdown report split into Core and Periphery, a .bib file, and interactive visualizations

`run.py` — full pipeline

python run.py --bib references.bib [options]

Option	Description
`--bib FILE`	Path to `.bib` file
`--pdf-dir DIR`	Path to folder of PDFs (alternative to `--bib`)
`--out DIR`	Output root directory (default: `outputs/`)
`--keywords KW ...`	Focus keywords to re-weight scoring
`--preset`	Scoring preset: `balanced` (default), `highly-cited`, `recent`, `interdisciplinary`
`--cluster N`	Cluster candidates into N topic groups; `0` = auto-detect best k
`--top N`	Top N candidates in visualizations (default: 50)
`--no-viz`	Skip visualization step
`--dry-run`	Skip abstract enrichment API calls

`core/explorer.py` — explore

Resolves your corpus and produces ranked recommendations.

python -m core.explorer --bib references.bib
python -m core.explorer --bib references.bib --keywords "replication" --preset recent
python -m core.explorer --bib references.bib --cluster 5

Scoring presets

Preset	co-cite	citations	recency	field
`balanced` (default)	35%	30%	20%	15%
`highly-cited`	20%	60%	10%	10%
`recent`	25%	15%	50%	10%
`interdisciplinary`	35%	25%	15%	25%

Clustering (`--cluster`)

Groups the top candidate papers into topic clusters using TF-IDF + KMeans on their abstracts. Each cluster is described by its top 5 terms and top 3 papers by score.

--cluster 0 — auto-detect the best number of clusters (via silhouette score, k in 2–8)
--cluster 5 — explicitly request 5 clusters

When clustering is active, the co-citation network colours nodes by cluster instead of score quartile, and a Topic Clusters section is appended to the Markdown report.

Options

--bib FILE          Path to .bib file
--pdf-dir DIR       Folder of PDFs (alternative to --bib)
--out DIR           Output directory (default: outputs/)
--cache FILE        Cache file path (auto-derived from bib name by default)
--keywords KW ...   Focus keywords to re-weight scoring
--preset NAME       Scoring weight preset (balanced / highly-cited / recent / interdisciplinary)
--cluster N         Cluster candidates (0 = auto k, N > 1 = explicit k, omit = skip)

`core/search.py` — search the cache

Search and filter the full paper pool (corpus + all referenced/citing papers) without re-running the pipeline.

python -m core.search --cache references_cache.json "data availability"
python -m core.search --cache references_cache.json "open science" --sort trending
python -m core.search --cache references_cache.json --type meta-analysis --field psychology
python -m core.search --cache references_cache.json --semantic "open science reform"
python -m core.search --cache references_cache.json --stats

Filters

Option	Description
`keywords`	Keyword search (positional). Quote phrases: `'"data availability"'`
`--mode any\|all`	Match any keyword (default) or require all
`--search-in all\|title\|abstract`	Which field to search
`--regex`	Treat keywords as regular expressions
`--field FIELD`	Filter by field of study (substring)
`--author NAME`	Filter by author name (substring)
`--year-min / --year-max`	Publication year range
`--min-citations / --max-citations`	Citation count range
`--venue VENUE`	Filter by journal/venue name (substring)
`--type TYPE`	Heuristic paper type: `review`, `meta-analysis`, `empirical`, `methods`, `theoretical`, `replication`
`--new-only`	Exclude your original corpus papers — only show discovered candidates
`--exclude TERM ...`	Drop papers containing any of these terms

Sorting

--sort relevance   # keyword match score (default)
--sort citations   # total citation count
--sort year        # newest first
--sort trending    # citation velocity = citations / years since publication

Semantic search

# Find papers conceptually related to a query (TF-IDF cosine similarity)
--semantic "open science reform practices"

# Find papers similar to a specific paper already in the cache
--semantic-like "The Role of Preregistration in Psychology"

Export

--export results.bib                          # BibTeX (default)
--export results.csv --export-format csv      # CSV with title/authors/doi/abstract
--export results.ris --export-format ris      # RIS (Endnote/Mendeley/Zotero)
--export notes/ --export-format obsidian      # Markdown notes, one file per paper

`core/enrich.py` — fill missing abstracts

Fetches missing abstracts for papers in the cache via Semantic Scholar batch API, then OpenAlex, then CrossRef.

python -m core.enrich --cache references_cache.json
python -m core.enrich --cache references_cache.json --dry-run   # preview only
python -m core.enrich --cache references_cache.json --limit 50  # cap API calls

`core/viz.py` — visualizations

Generates four interactive outputs from the cache:

python -m core.viz --cache references_cache.json
python -m core.viz --cache references_cache.json --top 80 --no-table

Output	Format	Description
`{proj}_network_{date}.html`	HTML	Force-directed co-citation network (pyvis). Corpus = blue dots, candidates = coloured squares. Colours by cluster if clustering was run, otherwise by score quartile.
`{proj}_timeline_{date}.html`	HTML	Interactive year × citation scatter (Plotly). Hover for abstract snippet; click to open DOI.
`{proj}_fields_{date}.png`	PNG	Field distribution bar chart. Green = overlap with your corpus.
`{proj}_scores_{date}.html`	HTML	Score breakdown table (Plotly). Columns: rank, title (clickable), year, score, co-citations, citations, cluster. Cells colour-coded by value. Includes CSV export button.

Options:

--top N          Top N candidates (default: 50)
--no-network     Skip network
--no-timeline    Skip timeline
--no-fields      Skip field chart
--no-table       Skip score breakdown table

`pipeline.ipynb` — Jupyter notebook

Step-by-step notebook. Set BIB_FILE, KEYWORDS, PRESET, CLUSTER in the config cell and run each step individually.

New cells for advanced search: type filter, semantic search, trending sort, venue filter, new-only mode, exclude terms.

Caching

Each project gets its own cache file next to the .bib file: {name}_cache.json. On subsequent runs, cached papers are loaded instantly. If you interrupt a run, resolved papers are already saved — the next run skips them.

Running with a warm cache still re-runs graph building, scoring, and clustering (fast, in-memory). Only the Semantic Scholar API calls are skipped.

Output layout

outputs/
  {project_name}/
    {project_name}_report_{date}.md
    {project_name}_recommendations_{date}.bib
    {project_name}_network_{date}.html
    {project_name}_timeline_{date}.html
    {project_name}_fields_{date}.png
    {project_name}_scores_{date}.html

API notes

Semantic Scholar: ~100 requests/minute without a key. The script sleeps 1.1s between calls and backs off on rate limits. With a warm cache, subsequent runs make zero calls.

OpenAlex: Used as fallback when SS fails, and for batch abstract enrichment. No key needed; polite pool is used automatically.

CrossRef: Last-resort abstract fallback in enrich.py.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
core		core
.gitignore		.gitignore
README.md		README.md
cache_search.py		cache_search.py
literature_explorer.py		literature_explorer.py
requirements.txt		requirements.txt
run.py		run.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Literature Explorer

Structure

Installation

Quick start

How it works

`run.py` — full pipeline

`core/explorer.py` — explore

Scoring presets

Clustering (`--cluster`)

Options

`core/search.py` — search the cache

Filters

Sorting

Semantic search

Export

`core/enrich.py` — fill missing abstracts

`core/viz.py` — visualizations

`pipeline.ipynb` — Jupyter notebook

Caching

Output layout

API notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

levibaruch/refexplorer

Folders and files

Latest commit

History

Repository files navigation

Literature Explorer

Structure

Installation

Quick start

How it works

run.py — full pipeline

core/explorer.py — explore

Scoring presets

Clustering (--cluster)

Options

core/search.py — search the cache

Filters

Sorting

Semantic search

Export

core/enrich.py — fill missing abstracts

core/viz.py — visualizations

pipeline.ipynb — Jupyter notebook

Caching

Output layout

API notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`run.py` — full pipeline

`core/explorer.py` — explore

Clustering (`--cluster`)

`core/search.py` — search the cache

`core/enrich.py` — fill missing abstracts

`core/viz.py` — visualizations

`pipeline.ipynb` — Jupyter notebook

Packages