
# Project Gutenberg Embedding

A web scraper that parses all ebooks on Project Gutenberg and embeds each book's text into a vector store on MongoDB Atlas using Langchain.

## How it works

```mermaid
graph TD;
    A[Scrapy]-->B[MongoDB Atlas];
    A-->C[Langchain];
    B-->D[MongoDB Atlas];
    C-->D;
```

First, the scraper visits a category page and collects all of its ebooks along with their metadata (such as their IDs). Each category page returns 25 ebooks at a time, so to collect all of them the scraper paginates by incrementing the `start_index` query parameter, e.g. https://www.gutenberg.org/ebooks/bookshelf/57?start_index=26.
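The pagination scheme can be sketched as follows (the helper name and total-book count are illustrative assumptions, not code from this repository):

```python
# Sketch of the pagination scheme: Project Gutenberg bookshelf pages
# return 25 ebooks at a time, indexed by the start_index query
# parameter (1, 26, 51, ...). Illustrative only, not the repo's code.

def bookshelf_page_urls(bookshelf_id, total_books, page_size=25):
    """Build the paginated URLs needed to cover total_books ebooks."""
    base = "https://www.gutenberg.org/ebooks/bookshelf/{}?start_index={}"
    return [
        base.format(bookshelf_id, start)
        for start in range(1, total_books + 1, page_size)
    ]

# For example, 100 ebooks on bookshelf 57 span four pages,
# with start_index values 1, 26, 51, and 76.
urls = bookshelf_page_urls(57, 100)
```

In Scrapy, these follow-up page URLs would typically be yielded as new requests from the spider's parse callback.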

The Scrapy pipeline then receives each item and uses Langchain, with OpenAI embeddings, to store it in MongoDB Atlas. Each book's text content is split into 1,000-character chunks before embedding, since the body text is the most important part when analyzing a book.
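The 1,000-character split can be sketched as a plain fixed-size slicer. Langchain's own text splitters add extras such as chunk overlap, so this minimal version is an assumption about the approach, not the pipeline's actual code:

```python
# Minimal sketch of the fixed-size split applied before embedding.
# Langchain provides richer splitters (e.g. with chunk overlap);
# this plain slicer is illustrative only, not the repo's code.

def split_into_chunks(text, chunk_size=1000):
    """Split a book's text into consecutive chunk_size-character pieces."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# A 2,500-character book yields chunks of 1000, 1000, and 500 characters;
# each chunk would then be sent to OpenAI for embedding and the resulting
# vector stored in MongoDB Atlas.
chunks = split_into_chunks("x" * 2500)
```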