A small ingestion pipeline that builds a resource dataset from web links, summarizes each page with Gemini, and stores searchable text in Upstash Vector.
This project runs in three stages for testing with a local `sources.json`:

- Scrape each resource URL from `sources.json` and store the raw HTML plus dataset metadata under `dataset/`.
- Generate a summary for each page and upsert it into Upstash Vector.
- Query the vector index for relevant resources.
Install dependencies:

```shell
pnpm install
```

Create a `.env` file in the project root:

```shell
# Gemini
GEMINI_API_KEY=your_gemini_api_key

# Upstash Vector
UPSTASH_VECTOR_REST_URL=your_upstash_vector_rest_url
UPSTASH_VECTOR_REST_TOKEN=your_upstash_vector_rest_token
```

Run the commands in this order:
```shell
# Make sure Chromium is installed for Puppeteer before scraping
pnpx puppeteer browsers install chrome

# 1) Scrape pages from sources.json and build dataset/index.json + dataset/pageContent/*.html
pnpm generate-dataset

# 2) Summarize page content and upsert into Upstash Vector
pnpm generate-summary

# 3) Query vectors (pass your search text)
pnpm query-vectors "react authentication guide"
```

Each command maps to a TypeScript entry point:

- `pnpm generate-dataset` → runs `tsx generateDataset.ts`
- `pnpm generate-summary` → runs `tsx generateSummary.ts`
- `pnpm query-vectors "<your query>"` → runs `tsx queryVectors.ts`
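All three scripts depend on the environment variables configured above, so it can help to fail fast when one is missing. A minimal sketch (the helper name and its use are assumptions, not part of the actual scripts):

```typescript
// Hypothetical helper: read a required env var or fail fast with a clear error.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// The three variables this pipeline needs (from the .env file above).
const REQUIRED_VARS = [
  "GEMINI_API_KEY",
  "UPSTASH_VECTOR_REST_URL",
  "UPSTASH_VECTOR_REST_TOKEN",
];
```

Each script could call `requireEnv` for the variables it uses before doing any work, so a missing key fails immediately rather than mid-run.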
- `dataset/index.json` – dataset entries with metadata plus a `pageContent` path (or `null`)
- `dataset/pageContent/*.html` – scraped HTML pages

`dataset/` is ignored by git and should be regenerated when needed.
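Based on the description above, one `dataset/index.json` entry might look like the sketch below; the field names are assumptions, not the actual schema:

```typescript
// Hypothetical shape of a single dataset/index.json entry (field names assumed).
interface DatasetEntry {
  url: string;                 // the source link from sources.json
  title?: string;              // page metadata, when available
  pageContent: string | null;  // path to the scraped HTML, or null if scraping failed
}

// An entry for a successfully scraped page:
const scraped: DatasetEntry = {
  url: "https://example.com/article",
  title: "Example article",
  pageContent: "dataset/pageContent/example-article.html",
};

// A metadata-only fallback entry for a URL that failed to scrape:
const fallback: DatasetEntry = {
  url: "https://example.com/unreachable",
  pageContent: null,
};
```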
- For this testing flow, keep your links in `sources.json`; `generate-sources` is not required.
- Some URLs may fail to scrape; those entries are still kept with a metadata-only fallback.
- Summarization handles Gemini HTTP 429 rate-limit errors by waiting and retrying.
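The retry behavior for 429s could look like the following sketch; the function name, backoff schedule, and error shape are assumptions, not the actual implementation in `generateSummary.ts`:

```typescript
// Sketch: retry an async call when it fails with an HTTP 429,
// waiting with exponential backoff between attempts (details assumed).
async function withRateLimitRetry<T>(
  call: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const status = (err as { status?: number }).status;
      if (status !== 429 || attempt >= maxRetries) throw err;
      // Wait 1s, 2s, 4s, ... before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
}
```

Any other error is rethrown immediately, so only rate-limit responses trigger the wait-and-retry loop.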
- Optional: set `TOP_K` in `.env` to control how many query matches are returned (default `5`).
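Reading `TOP_K` could be as simple as the sketch below; the parsing and fallback behavior are assumptions, not the actual logic in `queryVectors.ts`:

```typescript
// Sketch: resolve TOP_K from the environment, falling back to 5
// when it is unset or not a positive integer (exact behavior assumed).
function resolveTopK(env: Record<string, string | undefined> = process.env): number {
  const parsed = Number(env.TOP_K);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : 5;
}
```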