The rendered version of the entire project is divided into three parts (some graphs are visible only there).
The goal of the first part of this project was to scrape the announcements from the immobiliare.it website and cluster them. In the second part, the aim was to define a hash function that associates a value with each string (passwords read from a given file) and to check whether there are any duplicate strings.
Steps of the clustering part:
1. Scraping the data from immobiliare.it
2. Cleaning the data
3. Building the description data set (tf-idf values based on the announcements' descriptions) and the information data set
(values of: price (prezzo), rooms (locali), number of bathrooms (bagno), surface (superficie), floor (piano))
4. Using k-means++ with the elbow method to determine the optimal k (number of clusters) for each data set
5. Comparing the clusters and finding the 3 most similar ones using Jaccard similarity
6. Making word clouds for those 3 clusters
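The cluster comparison in step 5 treats each cluster as a set of announcements and scores overlap with the Jaccard similarity (size of the intersection over size of the union). A minimal sketch, using hypothetical announcement IDs rather than the project's actual cluster contents:

```python
def jaccard_similarity(cluster_a, cluster_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two clusters,
    each given as a collection of announcement IDs."""
    a, b = set(cluster_a), set(cluster_b)
    if not a and not b:
        return 0.0  # convention: two empty clusters have similarity 0
    return len(a & b) / len(a | b)

# Hypothetical clusters from the description and information data sets
desc_cluster = {101, 102, 103, 104}
info_cluster = {102, 103, 105}
print(jaccard_similarity(desc_cluster, info_cluster))  # 2 shared of 5 total -> 0.4
```

Computing this score for every pair of clusters across the two data sets and keeping the highest-scoring pairs yields the most similar clusters used for the word clouds.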
Steps of the hashing part:
1. Convert each password string from the file to a (potentially large) number
2. Use a hash function to map the number to a large range
3. Find the number of collisions and duplicates
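The three steps above can be sketched as follows. The base-256 string-to-number conversion and the choice of modulus are illustrative assumptions, not necessarily the exact scheme used in Hashing.ipynb; the key distinction is that a duplicate is the same string occurring more than once, while a collision is two different strings mapping to the same hash value:

```python
from collections import Counter

M = 2**32 - 5  # a large prime modulus (assumption, for illustration)

def string_to_number(s):
    # Step 1: interpret the string as a base-256 number
    n = 0
    for ch in s:
        n = n * 256 + ord(ch)
    return n

def hash_value(s):
    # Step 2: map the (potentially large) number into a fixed range
    return string_to_number(s) % M

def count_collisions_and_duplicates(passwords):
    # Step 3: group strings by hash value, then count
    buckets = {}  # hash value -> set of distinct strings that produced it
    for p in passwords:
        buckets.setdefault(hash_value(p), set()).add(p)
    # duplicates: repeated occurrences of identical strings
    duplicates = len(passwords) - len(set(passwords))
    # collisions: distinct strings sharing one hash value
    collisions = sum(len(s) - 1 for s in buckets.values() if len(s) > 1)
    return collisions, duplicates

print(count_collisions_and_duplicates(["abc", "abc", "xyz"]))
```

With a sufficiently large modulus, most repeated hash values come from genuine duplicates rather than collisions, which is what makes the duplicate check reliable.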
The repository consists of the following files:
- Clustering part.ipynb: a Jupyter notebook which provides the code of the clustering part of the project.
- Hashing.ipynb: a Jupyter notebook which provides the code of the hashing part of the project.
- Scraper.ipynb: a Jupyter notebook which provides the code for scraping the data used in the clustering implemented in the Clustering part.ipynb notebook.
- libscrap.py: a Python script which provides all the functions used in the Scraper.ipynb notebook.
