diff --git a/README.md b/README.md
index 26cca50..0f894ce 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,97 @@
 # MuDBSCAN
-A fast, exact, and scalable algorithm for DBSCAN clustering. This repository contains a sequential as well as a distributed memory implementation for the same.
+A fast, exact, and scalable algorithm for DBSCAN clustering.
+This repository contains the implementation for the distributed spatial clustering algorithm proposed in the paper `μDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality`
+Link to paper - https://adityaas.github.io/documents/MuDBSCAN_CLUSTER19.pdf
 
-# Code Execution
+We propose an extremely efficient way to compute neighbourhood queries that not only improves the average time complexity but also exhibits super-linear speedup on large astronomical datasets. Using the distributed variant of our algorithm, we were **able to cluster 1 billion 3D points in under 42 minutes**.
+
+To cite our work, please use
+```
+A. Sarma et al., "μDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality," 2019 IEEE International Conference on Cluster Computing (CLUSTER), Albuquerque, NM, USA, 2019, pp. 1-11, doi: 10.1109/CLUSTER.2019.8891020.
+```
+
+## Setup
+1. Clone the repository
+2. Install dependencies (gcc/g++, [open-mpi](https://www.open-mpi.org/))
+3. To run the distributed variant of the algorithm, an MPI cluster has to be [set up](https://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/)
+
+## Running the algorithm
+1. Re-format your input according to the template below and store the file in a folder `datasets`
+
+```
+
+
+ ...
+.
+.
+.
+...
+```
+
+For example
+```
+10
+2
+1 20
+2 20
+2 19
+8 15
+8 14
+7 15
+9 14
+9 17
+12 17
+11 18
+```
+
+2. Running the sequential algorithm
 ```shell
 ./runs.sh
+
+where
+- is the name of the file formatted according to 1.
+- represents the neighbourhood parameter (anything within epsilon distance from the given point is considered a neighbour)
+- represents the density parameter. It defines the minimum number of neighbours required for a point to be classified as `dense`
+- and hyper-parameters and correspond to the minimum and maximum degree of the custom defined μC-RTree
+```
+
+3. Running the distributed algorithm
+```shell
+./rund.sh
+
+where , , , , are same as before and
+- number of nodes to use within the cluster
+- list of nodes configured in the server (hostnames)
+
 ```
+
+## Overview
+### Overview of the Algorithm
+![overview](images/overview.png)
+
+### Time complexity
+![complexity](images/table1.png)
+
+## Results
+1. **Proposed sequential algorithm compared with existing sequential clustering algorithms**
+![sequential](images/table2.png)
+
+2. **Proposed algorithm compared with existing clustering algorithms on 32 nodes**
+![sequential](images/table5.png)
+
+3. **Run-time split up across various steps in μDBSCAN**
+- **Sequential algorithm split up**
+![seq](images/table3.png)
+
+- **Distributed algorithm split up**
+![dist](images/table7.png)
+
+4. **Speed up across various steps in μDBSCAN**
+![speedup](images/table8.png)
+
+5. **Peak memory consumption of μDBSCAN**
+![memory](images/table4.png)
+
+6. **Scalability of μDBSCAN**
+![scalability](images/fig6_7.png)
+
diff --git a/images/fig5.png b/images/fig5.png
new file mode 100644
index 0000000..4747208
Binary files /dev/null and b/images/fig5.png differ
diff --git a/images/fig6_7.png b/images/fig6_7.png
new file mode 100644
index 0000000..941850a
Binary files /dev/null and b/images/fig6_7.png differ
diff --git a/images/overview.png b/images/overview.png
new file mode 100644
index 0000000..0e72250
Binary files /dev/null and b/images/overview.png differ
diff --git a/images/table1.png b/images/table1.png
new file mode 100644
index 0000000..b6e8423
Binary files /dev/null and b/images/table1.png differ
diff --git a/images/table2.png b/images/table2.png
new file mode 100644
index 0000000..823616a
Binary files /dev/null and b/images/table2.png differ
diff --git a/images/table3.png b/images/table3.png
new file mode 100644
index 0000000..6ced006
Binary files /dev/null and b/images/table3.png differ
diff --git a/images/table4.png b/images/table4.png
new file mode 100644
index 0000000..b3b4bc3
Binary files /dev/null and b/images/table4.png differ
diff --git a/images/table5.png b/images/table5.png
new file mode 100644
index 0000000..bfb5603
Binary files /dev/null and b/images/table5.png differ
diff --git a/images/table7.png b/images/table7.png
new file mode 100644
index 0000000..6bc40a7
Binary files /dev/null and b/images/table7.png differ
diff --git a/images/table8.png b/images/table8.png
new file mode 100644
index 0000000..d09a02a
Binary files /dev/null and b/images/table8.png differ
diff --git a/rund.sh b/rund.sh
new file mode 100644
index 0000000..930a7b3
--- /dev/null
+++ b/rund.sh
@@ -0,0 +1,16 @@
+path=../datasets/
+
+input=$1
+eps=$2
+minpts=$3
+
+nodes=$4
+hostfile=$5
+m=$6
+M=$7
+
+output=output_$1\_EPS=$eps\_Minpts=$minpts\_nodes=$nodes\_m=$m\_M=$M.txt
+make clean
+make
+
+mpirun -np $nodes --map-by node --hostfile $hostfile ./output $path$input $eps $minpts $m $M $output
diff --git a/runs.sh b/runs.sh
index 744fde1..e62592a 100644
--- a/runs.sh
+++ b/runs.sh
@@ -13,4 +13,4 @@ output=output_$1\_EPS=$eps\_Minpts=$minpts\_m=$m\_M=$M.txt
 debug=debug_$1\_EPS=$eps\_Minpts=$minpts\_m=$m\_M=$M.txt
 neighbour=neighbour_$1\_EPS=$eps\_Minpts=$minpts\_m=$m\_M=$M.txt
 
-./output $path$input $eps $minpts $m $M $output
\ No newline at end of file
+./output $path$input $eps $minpts $m $M $output
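
Note on the dataset layout introduced in the README hunk: a malformed input file is easy to produce by hand, so it may help to check a file before passing it to `runs.sh` or `rund.sh`. The following is a minimal sketch and is not part of this change. `check_dataset` is a hypothetical helper name, and the layout it enforces is an assumption read off the README example (first line: number of points; second line: number of dimensions; then one whitespace-separated point per line).

```shell
#!/bin/sh
# Hypothetical helper, not part of the repository.
# Assumed layout (inferred from the README example):
#   line 1: number of points
#   line 2: number of dimensions
#   remaining lines: one point per line, coordinates separated by whitespace
check_dataset() {
    awk '
        NR == 1 { n = $1; next }          # declared number of points
        NR == 2 { d = $1; next }          # declared number of dimensions
        NF != d {                         # every point needs exactly d coordinates
            printf "line %d: expected %d coordinates, found %d\n", NR, d, NF
            bad = 1
        }
        END {
            if (NR - 2 != n) {            # point count must match line 1
                printf "expected %d points, found %d\n", n, NR - 2
                bad = 1
            }
            if (!bad) print "ok"
            exit bad + 0                  # non-zero exit status on any mismatch
        }
    ' "$1"
}
```

The function prints `ok` and returns 0 when the file matches the assumed layout; otherwise it prints one message per mismatch and returns a non-zero status, so it can gate a run from a wrapper script.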