
# Project Gutenberg Embedding

A web scraper that parses all ebooks on Project Gutenberg and embeds each book's text into a vector store on MongoDB Atlas using Langchain.

## How it works

```mermaid
graph TD;
    A[Scrapy]-->B[MongoDB Atlas];
    A-->C[Langchain];
    B-->D[MongoDB Atlas];
    C-->D;
```

First, the scraper visits a category page and collects all of its ebooks along with their metadata (such as their IDs). Each category page returns 25 ebooks at a time, so to collect all of them the scraper paginates by incrementing the `start_index` query parameter, e.g. https://www.gutenberg.org/ebooks/bookshelf/57?start_index=26.
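The pagination scheme can be sketched as follows (the helper name and total-book count are illustrative assumptions, not code from this repository):

```python
# Sketch of the pagination scheme: Project Gutenberg bookshelf pages
# return 25 ebooks at a time, indexed by the start_index query
# parameter (1, 26, 51, ...). Illustrative only, not the repo's code.

def bookshelf_page_urls(bookshelf_id, total_books, page_size=25):
    """Build the paginated URLs needed to cover total_books ebooks."""
    base = "https://www.gutenberg.org/ebooks/bookshelf/{}?start_index={}"
    return [
        base.format(bookshelf_id, start)
        for start in range(1, total_books + 1, page_size)
    ]

# For example, 100 ebooks on bookshelf 57 span four pages,
# with start_index values 1, 26, 51, and 76.
urls = bookshelf_page_urls(57, 100)
```

In Scrapy, these follow-up page URLs would typically be yielded as new requests from the spider's parse callback.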

The Scrapy pipeline then receives each item and uses Langchain, with OpenAI embeddings, to store it in MongoDB Atlas. Each book's text content is split into 1,000-character chunks before embedding, since the body text is the most important part when analyzing a book.
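The 1,000-character split can be sketched as a plain fixed-size slicer. Langchain's own text splitters add extras such as chunk overlap, so this minimal version is an assumption about the approach, not the pipeline's actual code:

```python
# Minimal sketch of the fixed-size split applied before embedding.
# Langchain provides richer splitters (e.g. with chunk overlap);
# this plain slicer is illustrative only, not the repo's code.

def split_into_chunks(text, chunk_size=1000):
    """Split a book's text into consecutive chunk_size-character pieces."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# A 2,500-character book yields chunks of 1000, 1000, and 500 characters;
# each chunk would then be sent to OpenAI for embedding and the resulting
# vector stored in MongoDB Atlas.
chunks = split_into_chunks("x" * 2500)
```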