dusicastepic/ADMFourthHomework

Homework 4

The rendered version of the entire project is divided into three parts (some graphs are visible only in the rendered version).

The first part of this project scrapes the announcements from the immobiliare.it website and clusters them. The second part defines a hash function that associates a value with each string (from a given file of passwords) and checks whether the file contains duplicate strings.

Clustering

Steps of project work:

1. Scraping the data from immobiliare.it
2. Cleaning the data
3. Building two data sets: a description data set (tf-idf values based on the announcements' descriptions) and an information data set
   (values of: price (prezzo), rooms (locali), number of bathrooms (bagno), surface (superficie), floor (piano))
4. Using k-means++ with the elbow method to determine the optimal k (number of clusters) for each data set
5. Comparing the clusters and finding the 3 most similar ones using Jaccard similarity
6. Making word clouds for those 3 clusters
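
Step 5 can be illustrated with a minimal sketch. It assumes each clustering is given as a list of labels, one per announcement, in the same order for both data sets. The function names (`jaccard`, `most_similar_clusters`) and the toy labels are hypothetical and are not taken from the notebooks.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_by_label(labels):
    """Map each cluster label to the set of announcement indices in it."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, set()).add(index)
    return groups


def most_similar_clusters(labels_a, labels_b, top=3):
    """Return the `top` (score, cluster_a, cluster_b) triples, best first."""
    groups_a = group_by_label(labels_a)
    groups_b = group_by_label(labels_b)
    scores = [
        (jaccard(groups_a[i], groups_b[j]), i, j)
        for i, j in product(groups_a, groups_b)
    ]
    scores.sort(key=lambda triple: triple[0], reverse=True)
    return scores[:top]


# Toy labels (assumed): one clustering per data set, six announcements.
info_labels = [0, 0, 1, 1, 2, 2]
desc_labels = [1, 1, 0, 2, 2, 2]
best_pairs = most_similar_clusters(info_labels, desc_labels)
```

Each returned triple pairs a cluster from the first data set with a cluster from the second, so the word clouds in step 6 could be built from the announcements shared by each pair.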

Hash function

Steps:

1. Convert each password string from the file to a (potentially large) number
2. Use a hash function to map the number to a large range
3. Find the number of collisions and duplicates
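
The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the base-256 encoding, the modulus `2**61 - 1`, and the function names are hypothetical choices, not the hash function defined in Hashing.ipynb. Here a duplicate means the same string appearing again, and a collision means two different strings sharing a hash value.

```python
def string_to_number(s, base=256):
    """Step 1: read the UTF-8 bytes of s as digits of one large integer."""
    number = 0
    for byte in s.encode("utf-8"):
        number = number * base + byte
    return number


def hash_value(number, modulus=2**61 - 1):
    """Step 2: map the number into the range [0, modulus).

    The modulus is an assumed choice (a Mersenne prime).
    """
    return number % modulus


def count_duplicates_and_collisions(passwords, modulus=2**61 - 1):
    """Step 3: count repeated strings and distinct strings sharing a hash."""
    buckets = {}  # hash value -> set of distinct strings seen with it
    duplicates = 0
    for password in passwords:
        h = hash_value(string_to_number(password), modulus)
        bucket = buckets.setdefault(h, set())
        if password in bucket:
            duplicates += 1  # same string seen before
        else:
            bucket.add(password)
    # Every extra distinct string in a bucket is one collision.
    collisions = sum(len(bucket) - 1 for bucket in buckets.values())
    return duplicates, collisions


# Toy input (assumed), standing in for the passwords file.
duplicates, collisions = count_duplicates_and_collisions(["abc", "abc", "xyz"])
```

Keeping the distinct strings per hash value is what lets the sketch tell a true duplicate from a collision. For a very large file this costs memory, so a real run might store only the hash values and accept that the two cases become indistinguishable.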

The repository consists of the following files:

  1. Clustering part.ipynb:

    A Jupyter notebook which provides the code of the Clustering part of the project.

  2. Hashing.ipynb:

    A Jupyter notebook which provides the code of the Hashing part of the project.

  3. Scraper.ipynb:

    A Jupyter notebook which provides the code for scraping the data used for the clustering implemented in the Clustering part.ipynb notebook.

  4. libscrap.py:

    A Python script which provides all the functions used in the Scraper.ipynb notebook.

About

Fourth homework for Algorithmic Methods of Data Mining - Group #25