The rendered version of the entire project is divided into three parts (some graphs are visible only there).
The goal of the first part of this project was to scrape the announcements from the immobiliare.it website and cluster them. In the second part, the aim was to define a hash function that associates a value with each string (passwords read from a given file) and to check whether there are any duplicate strings.
Steps of the clustering part:
1. Scraping the data from immobiliare.it
2. Cleaning the data
3. Building the description data set (tf-idf values based on the announcements' descriptions) and the information data set
(values of: price (prezzo), rooms (locali), number of bathrooms (bagno), surface (superficie), floor (piano))
4. Using k-means++ with the elbow method to determine the optimal k (number of clusters) for each data set
5. Comparing the clusters and finding the 3 most similar ones using Jaccard similarity
6. Making word clouds for those 3 clusters
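The cluster comparison in step 5 treats each cluster as a set of announcements and scores overlap with the Jaccard similarity (size of the intersection over size of the union). A minimal sketch, using hypothetical announcement IDs rather than the project's actual cluster contents:

```python
def jaccard_similarity(cluster_a, cluster_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two clusters,
    each given as a collection of announcement IDs."""
    a, b = set(cluster_a), set(cluster_b)
    if not a and not b:
        return 0.0  # convention: two empty clusters have similarity 0
    return len(a & b) / len(a | b)

# Hypothetical clusters from the description and information data sets
desc_cluster = {101, 102, 103, 104}
info_cluster = {102, 103, 105}
print(jaccard_similarity(desc_cluster, info_cluster))  # 2 shared of 5 total -> 0.4
```

Computing this score for every pair of clusters across the two data sets and keeping the highest-scoring pairs yields the most similar clusters used for the word clouds.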
Steps of the hashing part:
1. Convert each password string from the file to a (potentially large) number
2. Use a hash function to map the number to a large range
3. Find the number of collisions and duplicates
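The three steps above can be sketched as follows. The base-256 string-to-number conversion and the choice of modulus are illustrative assumptions, not necessarily the exact scheme used in Hashing.ipynb; the key distinction is that a duplicate is the same string occurring more than once, while a collision is two different strings mapping to the same hash value:

```python
from collections import Counter

M = 2**32 - 5  # a large prime modulus (assumption, for illustration)

def string_to_number(s):
    # Step 1: interpret the string as a base-256 number
    n = 0
    for ch in s:
        n = n * 256 + ord(ch)
    return n

def hash_value(s):
    # Step 2: map the (potentially large) number into a fixed range
    return string_to_number(s) % M

def count_collisions_and_duplicates(passwords):
    # Step 3: group strings by hash value, then count
    buckets = {}  # hash value -> set of distinct strings that produced it
    for p in passwords:
        buckets.setdefault(hash_value(p), set()).add(p)
    # duplicates: repeated occurrences of identical strings
    duplicates = len(passwords) - len(set(passwords))
    # collisions: distinct strings sharing one hash value
    collisions = sum(len(s) - 1 for s in buckets.values() if len(s) > 1)
    return collisions, duplicates

print(count_collisions_and_duplicates(["abc", "abc", "xyz"]))
```

With a sufficiently large modulus, most repeated hash values come from genuine duplicates rather than collisions, which is what makes the duplicate check reliable.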
The repository consists of the following files:
- Clustering part.ipynb: a Jupyter notebook which provides the code of the clustering part of the project.
- Hashing.ipynb: a Jupyter notebook which provides the code of the hashing part of the project.
- Scraper.ipynb: a Jupyter notebook which provides the code for scraping the data used in the clustering implemented in the Clustering part.ipynb notebook.
- libscrap.py: a Python script which provides all the functions used in the Scraper.ipynb notebook.
