Bootcamp context
9-week Data Analytics Bootcamp · Focus: Web scraping, APIs, data cleaning, unsupervised learning, and a simple content-based recommender using real-world book data.
This project builds a small Book Recommendation dataset by combining:
- ~600 books from Goodreads – Best Books Ever (web scraping).
- ~600 books from Open Library – Trending (Search API + work pages).
From these sources we:
- Collect titles, authors, URLs, ratings, votes, genres, subjects, and years.
- Clean and normalize the data into a single dataset.
- Engineer features like genres, keyword representations, and TF-IDF-based keywords.
- Perform Exploratory Data Analysis (EDA).
- Prepare the ground for a content-based recommendation system.
Web Scraping (Goodreads)
Using requests + BeautifulSoup to:
- Scrape list pages from: https://www.goodreads.com/list/show/1.Best_Books_Ever
- Extract (if available): rank, title, author, author URL, average rating, number of ratings, score, votes, book URL, genres, and first published year.
API Integration (Open Library)
Using the Open Library Search API:
- Query: trending_score_hourly_sum:[1 TO *] language:eng -subject:"content_warning:cover"
- Extract from API: title, authors, ratings, rating count, trending scores, work keys.
- Enrich with book page scraping to pull Subjects as genres.
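As a rough sketch of this API step (the response field names such as ratings_average and the sort=trending parameter reflect the Open Library search schema at the time of writing; docs_to_rows is an illustrative helper, not the notebook's exact code):

```python
import requests

SEARCH_URL = "https://openlibrary.org/search.json"
QUERY = 'trending_score_hourly_sum:[1 TO *] language:eng -subject:"content_warning:cover"'

def docs_to_rows(docs):
    """Flatten the API's `docs` entries into the columns we keep."""
    return [
        {
            "title": d.get("title"),
            "authors": ", ".join(d.get("author_name", [])),
            "rating": d.get("ratings_average"),
            "rating_count": d.get("ratings_count"),
            # Work key (e.g. "/works/OL123W") → full book URL.
            "work_url": "https://openlibrary.org" + d.get("key", ""),
        }
        for d in docs
    ]

def fetch_trending(limit=100, page=1):
    """Query the Search API for one page of trending English books."""
    params = {"q": QUERY, "sort": "trending", "limit": limit, "page": page}
    resp = requests.get(SEARCH_URL, params=params, timeout=30)
    resp.raise_for_status()
    return docs_to_rows(resp.json().get("docs", []))
```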
Data Cleaning & Preprocessing
- Remove duplicates and inconsistent records.
- Normalize text (titles, authors, genres).
- Clean noisy genres (symbols, codes, brackets, etc.).
- Standardize ratings to a 0–5 scale.
- Prepare a combined dataset for modeling and recommendations.
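A minimal version of this cleaning pass, assuming the merged frame has title, author, and rating columns (the column names are an assumption about the final schema):

```python
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Minimal cleaning pass: normalize text, dedupe, clamp ratings to 0-5."""
    out = df.copy()
    # Normalize text columns before deduplicating on them.
    for col in ("title", "author"):
        out[col] = out[col].fillna("").str.strip().str.lower()
    out = out.drop_duplicates(subset=["title", "author"])
    # Coerce ratings to numeric and keep them on a 0-5 scale.
    out["rating"] = pd.to_numeric(out["rating"], errors="coerce").clip(0, 5)
    return out
```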
Keyword & Feature Engineering
- Build a simple text field from titles and genres.
- Use TF-IDF to generate top keywords per book.
EDA
- Top-rated books.
- Most frequent authors.
- Genre distribution.
- Rating vs number of ratings.
- Simple correlations.
- Optional word cloud of titles + genres.
Goodreads — Best Books Ever
- Public list pages.
- Scraped with polite delays and User-Agent header.
Open Library — Search & Work Pages
- https://openlibrary.org/search.json, sorted by trending.
- Additional genres/subjects scraped from each work page under Subjects.
All scraping/API usage follows the educational intent of the bootcamp project.
project_book_recommend/
├── data/
│ ├── goodreads_best_books_600.csv
│ ├── openlibrary_trending_600.csv
│ └── books_merged_cleaned.csv
├── figures/
│ ├── book_clusters_k_means_genres_text_visualized_pca.png
│ ├── eda_rating_distribution.png
│ ├── eda_top_authors.png
│ ├── eda_top_genres.png
│ ├── eda_rating_vs_num_ratings.png
│ ├── eda_wordcloud_titles_genres.png
│ └── hierarchical_clustering_dendrogram.png
├── my_stremlit_app/
│ └── app.py
├── notebooks/
│ ├── books_data_cleaning_eda.ipynb
│ ├── content_based_recommender.ipynb
│ ├── scrape_goodreads_best_books_ever.ipynb
│ └── scrape_openLibrary_trending_api.ipynb
├── slides/
├── .gitattributes
└── README.md

Recommended packages:
- pandas
- numpy
- requests
- beautifulsoup4
- matplotlib
- scikit-learn
- wordcloud (optional, for EDA word cloud)
Install (example):
pip install pandas numpy requests beautifulsoup4 matplotlib scikit-learn wordcloud

Then run the notebooks in order:
1. scrape_goodreads_best_books_ever.ipynb
2. scrape_openLibrary_trending_api.ipynb
3. books_data_cleaning_eda.ipynb
4. content_based_recommender.ipynb
- Loop over the Best Books Ever list pages.
- Extract:
  - Book title & URL
  - Author name & URL
  - Average rating
  - Number of ratings
  - Score & votes
  - List rank
- For each book page:
  - Extract Genres (new layout tags).
  - Extract First published year from book details.
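The list-page parsing can be sketched as below. The CSS classes (bookTitle, authorName, minirating) match the Goodreads list layout at the time of scraping and may change; a real run should also add a polite delay between pages:

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (bootcamp project; educational use)"}

def parse_list_page(html: str) -> list[dict]:
    """Extract one row per book from a Best Books Ever list page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for title_a in soup.select("a.bookTitle"):
        # Each book sits in its own table row; author and rating live nearby.
        tr = title_a.find_parent("tr")
        author_a = tr.select_one("a.authorName") if tr else None
        rating = tr.select_one("span.minirating") if tr else None
        rows.append({
            "title": title_a.get_text(strip=True),
            "book_url": "https://www.goodreads.com" + title_a.get("href", ""),
            "author": author_a.get_text(strip=True) if author_a else None,
            "minirating": rating.get_text(strip=True) if rating else None,
        })
    return rows

def fetch_list_page(page: int) -> list[dict]:
    """Download one list page and parse it."""
    url = f"https://www.goodreads.com/list/show/1.Best_Books_Ever?page={page}"
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return parse_list_page(resp.text)
```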
- Call search.json with the trending query.
- Extract:
  - Title, author(s)
  - Ratings average & count (if available)
  - trending_score_hourly_sum
  - Work key → build book URL.
- For each work URL:
  - Parse the Subjects block as genres.
- Add a source column (goodreads/openlibrary).
- Align columns between both datasets.
- Concatenate into a single DataFrame.
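The merge step can be sketched like this (the shared column list is an assumption about the final schema):

```python
import pandas as pd

# Assumed shared schema between the two scraped datasets.
COMMON_COLS = ["title", "author", "rating", "num_ratings", "genres"]

def merge_sources(goodreads: pd.DataFrame, openlibrary: pd.DataFrame) -> pd.DataFrame:
    """Tag each frame with its source, align columns, and stack them."""
    gr = goodreads.copy()
    ol = openlibrary.copy()
    gr["source"] = "goodreads"
    ol["source"] = "openlibrary"
    # reindex() fills any missing columns with NaN so both frames align.
    cols = COMMON_COLS + ["source"]
    return pd.concat(
        [gr.reindex(columns=cols), ol.reindex(columns=cols)],
        ignore_index=True,
    )
```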
What problem are we solving?
After scraping, the genres column is messy:
- Some rows are empty.
- Some look like: "Fantasy, Young Adult, Adventure".
- Others include subjects, noise, codes (nyt:..., [fic], pz7.1...), or weird punctuation.
- We also saw strange list-like strings with brackets and repeated values.
That makes it hard to count genres or use them for recommendations.
What do we do?
For each row:
- Make sure genres exists and convert to lowercase text.
- Remove brackets [], quotes ', and other list-like artifacts.
- Replace separators like /, &, =, --, - with spaces so they don't break words.
- Split mainly by commas to get candidate genres/subjects.
- Clean each candidate:
  - Keep only letters and spaces.
  - Remove very short or junk tokens.
  - Drop obvious technical codes like tags starting with nyt, collectionid, pz, etc.
- Remove duplicates inside each book’s genre list.
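The steps above roughly translate to a helper like this (the junk-prefix list and the minimum-length threshold are illustrative choices):

```python
import re

# Prefixes of known technical codes to drop (nyt:..., collectionID..., pz7.1...).
JUNK_PREFIXES = ("nyt", "collectionid", "pz")

def clean_genres(raw) -> list[str]:
    """Turn a raw genres string into a deduplicated list of clean genres."""
    if not isinstance(raw, str):
        return []
    text = raw.lower()
    # Strip list-like artifacts: brackets and quotes.
    text = re.sub(r"[\[\]'\"]", "", text)
    # Turn word-breaking separators into spaces.
    text = re.sub(r"[/&=]|--|-", " ", text)
    genres, seen = [], set()
    for candidate in text.split(","):
        # Keep only letters and spaces, then collapse whitespace.
        g = " ".join(re.sub(r"[^a-z ]", " ", candidate).split())
        if len(g) < 3 or g.startswith(JUNK_PREFIXES):
            continue
        if g not in seen:
            seen.add(g)
            genres.append(g)
    return genres
```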
End result
Each book gets a clean list of genres, for example:
"Fantasy, Young Adult, Adventure"
→ ["fantasy", "young adult", "adventure"]
This makes it much easier to:
- Count how many books belong to each genre.
- Plot genre distributions.
- Use genres as features in a recommender.
What problem are we solving?
We want to extract “keywords” that describe each book, but:
- We don’t always have rich descriptions.
- We do have titles and genres/subjects.
So we build a simple combined text field to represent each book.
What do we do?
For each book:
- Take the title (or "" if missing).
- Add a space.
- Add the genres string.
- Convert everything to lowercase.
Example:
title: "The Hunger Games"
genres: "young adult, dystopia, fiction"
→ "the hunger games young adult, dystopia, fiction"
We store this in a column like text_for_keywords.
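This step can be written in a few lines, assuming title and genres columns:

```python
import pandas as pd

def build_text_for_keywords(df: pd.DataFrame) -> pd.Series:
    """Combine title + genres into one lowercase text field per book."""
    title = df["title"].fillna("")
    genres = df["genres"].fillna("")
    return (title + " " + genres).str.lower().str.strip()
```

For the example above, this yields "the hunger games young adult, dystopia, fiction".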
Why this is useful
- This gives us one compact “summary text” per book.
- We can feed this into TF-IDF to find meaningful words.
Now we use TF-IDF to turn that text into keywords.
What is TF-IDF (simple version)?
- TF (Term Frequency): Words that appear more often in a book’s text are more important for that book.
- IDF (Inverse Document Frequency): Words that appear in many books (like “book”, “novel”) are less special.
- TF-IDF is high when:
- A word is frequent in one book’s text.
- But not so common across all books.
So TF-IDF helps us find specific, meaningful words for each book.
What does the code do?
- Use TfidfVectorizer on text_for_keywords:
  - Limit to a small number of features (e.g. 50) to keep it simple.
  - Use English stopwords to ignore very common words.
- For each book:
  - Look at its TF-IDF scores.
  - Sort words from highest to lowest score.
  - Pick the top few words (e.g. 5) with TF-IDF > 0.
- Save them as top_keywords, for example:
  "dystopia, survival, rebellion, future, young"
Why this is useful
- These top_keywords act like a mini "fingerprint" of each book.
- We can later:
  - Compare books by overlapping keywords.
  - Build a simple content-based recommender: "If you liked this book, here are others with similar genres/keywords."
Using the cleaned dataset, we:
- Plot rating distribution (0–5).
- Show Top 10 most frequent authors.
- Plot Top 15 genres (after genre normalization).
- Visualize Rating vs Number of Ratings (with log-scale on counts).
- Compute basic correlations between ratings, votes, and trending scores.
- Generate a word cloud from text_for_keywords.
All plots are also saved into the figures/ directory for easy use in the presentation or report.
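As one example, the top-authors plot can be written as a small function (the author column name and the headless Agg backend are assumptions; in the notebooks the figure is saved under figures/):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script also works headless
import matplotlib.pyplot as plt

def plot_top_authors(df: pd.DataFrame, n=10, path=None):
    """Bar chart of the most frequent authors; returns the counts used."""
    counts = df["author"].value_counts().head(n)
    ax = counts.plot(kind="barh")
    ax.set_xlabel("number of books")
    ax.set_title(f"Top {n} most frequent authors")
    plt.tight_layout()
    if path:
        plt.savefig(path)  # e.g. "figures/eda_top_authors.png"
    plt.close()
    return counts
```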
- Implement a Content-Based Recommender using:
  - Cleaned genres.
  - top_keywords from TF-IDF.
  - Similarity measures (e.g. cosine similarity).
- Add simple search & filter:
  - By genre, rating, popularity, etc.
- (Optional) Wrap the recommender in:
  - A Jupyter demo notebook, or
  - A small Streamlit app.
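A minimal recommender along these lines, as a sketch rather than the final implementation (it assumes the cleaned dataset has title and text_for_keywords columns):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(df: pd.DataFrame, title: str, n=5) -> pd.DataFrame:
    """Return the n books whose keyword text is most similar to `title`'s."""
    df = df.reset_index(drop=True)
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(df["text_for_keywords"])
    sim = cosine_similarity(tfidf)
    matches = df.index[df["title"].str.lower() == title.lower()]
    if len(matches) == 0:
        return df.iloc[[]]  # unknown title → empty result
    idx = matches[0]
    # Highest-similarity rows first, excluding the query book itself.
    order = [i for i in sim[idx].argsort()[::-1] if i != idx][:n]
    return df.iloc[order]
```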
- Luis Pablo Aiello — Data Analytics Bootcamp Student
This project is intended for educational use within the bootcamp cohort. The dataset is built from publicly available book information (Goodreads and Open Library) and is used solely for learning, exploration, and demonstration purposes.