MPCS 53014 Final Project

Available here!

Data

The dataset used in this project is the goodbooks-10k dataset, which contains six million Goodreads ratings for the ten thousand most popular books on Goodreads, as of 2017. The total size of the tables used in this project was around 100 MB.

Batch Layer

For the batch layer, I used PySpark to train an alternating least squares (ALS) model on the ratings. As it turned out, predicting for new users wasn't a trivial task—the PySpark ALS model doesn't come with built-in functionality for "folding in" users and making predictions for them, so I had to implement this by hand using the product features. The ALS model produces a small set of latent factors that essentially describe products and users, and which can be used to predict products for users, or vice versa. These product features are only 8 MB in size, and can be stored in-memory on our web server. I serialized and saved them in an S3 bucket.

Serving Layer

For the serving layer, I used Flask. When a user inputs a Goodreads ID for a user, it retrieves all of that user's ratings using the Goodreads API. It then calculates a best guess latent representation for the new user by multiplying the user's rating vector with the product features, then using that to get predicted ratings for all the items and taking the top n. Lastly, it displays the results, fetching the relevant information for each book using the Goodreads API. The steps involving this API are all rather slow, and so results can take several seconds to load. I thought about putting some of the book info (like title, description, and image url, which are shown on the frontend) in Hive, but the description data wasn't available as part of the original dataset I used and Goodreads has measures in place against scraping.

Homepage:

Sample results:

Notes and addenda

The results I got subjectively aren't that good, and there are some quirks. Some books seem to show up far more often than they should, likely because they are overrepresented in the ratings. Some books are just bizarre picks. There may be an issue with my understanding of the ALS model or my linear algebra knowledge, which is extremely shaky at best. I found a comment from a user who questioned whether the ALS model actually produces singular vectors, a critical assumption that I had made, in which case the method I used to find latent representations for new users doesn't actually produce meaningful results. I was searching for explanations in the data, code, and model hyperparameters for the past week, and I only noticed this post at the last minute. If possible, I ask that you take this into consideration when grading. Thank you for reading!

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
.idea		.idea
assets		assets
bin		bin
src		src
README.md		README.md
appspec.yml		appspec.yml
batch.py		batch.py
update.sh		update.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MPCS 53014 Final Project

Available here!

Data

Batch Layer

Serving Layer

Notes and addenda

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

MPCS 53014 Final Project

Available here!

Data

Batch Layer

Serving Layer

Notes and addenda

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages