A small ingestion pipeline that builds a resource dataset from web links, summarizes each page with Gemini, and stores searchable text in Upstash Vector.
This project runs in three stages for testing with a local `sources.json`:

- Scrape each resource URL from `sources.json` and store the raw HTML plus dataset metadata under `dataset/`.
- Generate a summary for each page and upsert it into Upstash Vector.
- Query the vector index for relevant resources.
Install dependencies:

```shell
pnpm install
```

Create a `.env` file in the project root:

```shell
# Gemini
GEMINI_API_KEY=your_gemini_api_key

# Upstash Vector
UPSTASH_VECTOR_REST_URL=your_upstash_vector_rest_url
UPSTASH_VECTOR_REST_TOKEN=your_upstash_vector_rest_token
```

Run the commands in this order:
```shell
# Make sure Chromium is installed for Puppeteer before scraping
pnpx puppeteer browsers install chrome

# 1) Scrape pages from sources.json and build dataset/index.json + dataset/pageContent/*.html
pnpm generate-dataset

# 2) Summarize page content and upsert into Upstash Vector
pnpm generate-summary

# 3) Query vectors (pass your search text)
pnpm query-vectors "react authentication guide"
```

Each command maps to a TypeScript entry point:

- `pnpm generate-dataset` → runs `tsx generateDataset.ts`
- `pnpm generate-summary` → runs `tsx generateSummary.ts`
- `pnpm query-vectors "<your query>"` → runs `tsx queryVectors.ts`
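All three scripts depend on the environment variables configured above, so it can help to fail fast when one is missing. A minimal sketch (the helper name and its use are assumptions, not part of the actual scripts):

```typescript
// Hypothetical helper: read a required env var or fail fast with a clear error.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// The three variables this pipeline needs (from the .env file above).
const REQUIRED_VARS = [
  "GEMINI_API_KEY",
  "UPSTASH_VECTOR_REST_URL",
  "UPSTASH_VECTOR_REST_TOKEN",
];
```

Each script could call `requireEnv` for the variables it uses before doing any work, so a missing key fails immediately rather than mid-run.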
- `dataset/index.json` – dataset entries with metadata plus a `pageContent` path (or `null`)
- `dataset/pageContent/*.html` – scraped HTML pages

`dataset/` is ignored by git and should be regenerated when needed.
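Based on the description above, one `dataset/index.json` entry might look like the sketch below; the field names are assumptions, not the actual schema:

```typescript
// Hypothetical shape of a single dataset/index.json entry (field names assumed).
interface DatasetEntry {
  url: string;                 // the source link from sources.json
  title?: string;              // page metadata, when available
  pageContent: string | null;  // path to the scraped HTML, or null if scraping failed
}

// An entry for a successfully scraped page:
const scraped: DatasetEntry = {
  url: "https://example.com/article",
  title: "Example article",
  pageContent: "dataset/pageContent/example-article.html",
};

// A metadata-only fallback entry for a URL that failed to scrape:
const fallback: DatasetEntry = {
  url: "https://example.com/unreachable",
  pageContent: null,
};
```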
- For this testing flow, keep your links in `sources.json`; `generate-sources` is not required.
- Some URLs may fail to scrape; those entries are still kept with a metadata-only fallback.
- Summarization handles Gemini HTTP 429 rate-limit errors by waiting and retrying.
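The retry behavior for 429s could look like the following sketch; the function name, backoff schedule, and error shape are assumptions, not the actual implementation in `generateSummary.ts`:

```typescript
// Sketch: retry an async call when it fails with an HTTP 429,
// waiting with exponential backoff between attempts (details assumed).
async function withRateLimitRetry<T>(
  call: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const status = (err as { status?: number }).status;
      if (status !== 429 || attempt >= maxRetries) throw err;
      // Wait 1s, 2s, 4s, ... before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
}
```

Any other error is rethrown immediately, so only rate-limit responses trigger the wait-and-retry loop.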
- Optional: set `TOP_K` in `.env` to control how many query matches are returned (default `5`).
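Reading `TOP_K` could be as simple as the sketch below; the parsing and fallback behavior are assumptions, not the actual logic in `queryVectors.ts`:

```typescript
// Sketch: resolve TOP_K from the environment, falling back to 5
// when it is unset or not a positive integer (exact behavior assumed).
function resolveTopK(env: Record<string, string | undefined> = process.env): number {
  const parsed = Number(env.TOP_K);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : 5;
}
```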