Code repository for the thesis project Density-based clustering application to substructures identification in cosmological simulations, Francesco Tomba, 2023 @ University of Trieste.
dadac is a C port and optimization of the ADP implementation (Laio et al. 2021) found in the Python package dadaPy.
At the moment dadac implements:
- TWO-NN Intrinsic Dimension estimator
- k-NN search using a kd-tree (or vp-tree with custom callable metric)
- k*-NN density estimator
- ADP Heuristics
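As an illustration of the first item, here is a minimal pure-Python sketch of a TWO-NN style estimate. It uses the maximum-likelihood form d = N / sum(log(r2 / r1)), where r1 and r2 are the distances to the first and second nearest neighbor of each point. This is an assumption made for illustration: the function name `two_nn_id` is hypothetical, the neighbor search is brute force, and the estimator in dadac may differ in detail (for instance in how the ratios are fitted).

```python
import math
import random

def two_nn_id(points):
    """Maximum-likelihood TWO-NN style estimate: d = N / sum(log(r2 / r1))."""
    log_mu_sum = 0.0
    used = 0
    for i, p in enumerate(points):
        # Brute-force search for the two smallest distances from p
        r1 = r2 = math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            r = math.dist(p, q)
            if r < r1:
                r1, r2 = r, r1
            elif r < r2:
                r2 = r
        if r1 > 0.0:  # skip duplicated points, where the ratio is undefined
            log_mu_sum += math.log(r2 / r1)
            used += 1
    return used / log_mu_sum

random.seed(0)
# 500 points on a straight line embedded in 3D: the intrinsic dimension is 1
line = [(t, 2.0 * t, -t) for t in (random.random() for _ in range(500))]
estimate = two_nn_id(line)
print(f"estimated intrinsic dimension: {estimate:.2f}")
```

The estimate only depends on the ratio of the two neighbor distances, which is why a k-NN search is the only heavy step it needs.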
On the same input, dadac matches the dadaPy results one to one, with up to a factor 40 speedup on the whole procedure w.r.t. the Python/Cython implementation.
The procedure is validated against the implementation in the original package, both with and without the computation of the halo. Some of the results are reported below; the complete ones are in the directory benchmarks.
On AMD Ryzen 7 7735HS @ 4.80GHz (my laptop) (8 cores - 16 Threads, 16GB RAM)
from dadapy examples
Fig1.dat: N = 20.0k, D = 2
| Method | part | time |
|---|---|---|
| py | ngbh and density | 1.53s |
| py | ADP | 0.13s |
| C | ngbh and density | 0.26s |
| C | ADP | 0.04s |
from dadapy examples
Fig2.dat: N = 38.4k, D = 2
| Method | part | time |
|---|---|---|
| py | ngbh and density | 2.76s |
| py | ADP | 0.94s |
| C | ngbh and density | 0.22s |
| C | ADP | 0.03s |
CosmoSim (sub)Set1: N = 100.0k, D = 5 (~500 Clusters)
| Method | part | time |
|---|---|---|
| py | ngbh and density | 21.93s |
| py | ADP | 4.08s |
| C | ngbh and density | 1.37s |
| C | ADP | 0.07s |
On Intel Xeon Gold 5118 CPU @ 2.30GHz (4 sockets x 12 cores - 48 Threads, 512GB RAM)
CosmoSim Set1: N = 1.8M, D = 5 (~2000 Clusters)
| Method | part | time |
|---|---|---|
| py | ngbh and density | 414.83s |
| py | ADP | 3282.80s |
| C | ngbh and density | 47.52s |
| C | ADP | 7.46s |
MNIST: N = 70.0k, D = 784
| Method | part | time |
|---|---|---|
| py | ngbh and density | 11.62s |
| py | ADP | 2.21s |
| C | ngbh and density | 12.07s |
| C | ADP | 0.18s |
dadac comes with an example driver file driver.c. Data is expected to be a matrix of floats of type FLOAT_TYPE (defined at compile time) of dimensions N x d.
The driver writes a text file which reports, for each point:
- k*: number of neighbors used for computing the density value
- cluster_idx: cluster assignment of the point
- rho: value of the density
- is_center: flag for identifying cluster centers
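To prepare an input file for the driver, the N x d matrix can be dumped as raw binary. The sketch below is only an assumption about the layout: it writes the values row by row, with no header, as float32 or float64. The helper name `write_matrix` is hypothetical, and the exact layout expected by driver.c should be checked before relying on it.

```python
import os
import struct
import tempfile

def write_matrix(path, rows, use_float32=True):
    """Write an N x d matrix as raw row-major binary with no header (assumed layout)."""
    fmt = "f" if use_float32 else "d"  # 4-byte float or 8-byte double
    d = len(rows[0])
    with open(path, "wb") as fh:
        for row in rows:
            if len(row) != d:
                raise ValueError("all rows must have the same length")
            # "=" selects native byte order with standard sizes and no padding
            fh.write(struct.pack(f"={d}{fmt}", *row))
    return len(rows), d

rows = [[0.0, 1.0], [2.5, -3.0], [4.0, 4.0]]  # N = 3 points, d = 2
path = os.path.join(tempfile.mkdtemp(), "points.bin")
n, d = write_matrix(path, rows, use_float32=True)
size = os.path.getsize(path)
print(n, d, size)
```

The file size is N * d * 4 bytes for float32 and N * d * 8 bytes for double, which gives a quick sanity check on a generated file.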
Once parameters are set dadac can be launched using:
./driver i=[INPUT_FILE] o=[OUTPUT_FILE] d=[d] z=[Z] h=[HALO] k=[k] s=[s] t=[t]
- INPUT_FILE: input file, file path
- OUTPUT_FILE: output file, file path
- d: Length of the data vectors (number of columns of the data matrix)
- Z: Z value, float
- HALO: Assign halo, y/n [yes/no]
- k: Number of neighbors to use, int (>0)
- s: Use sparse borders implementation, y/n [sparse/dense]
- t: Input binary is in Float32, y/n [float/double]
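When the driver is launched from a script, the key=value arguments listed above can be assembled programmatically. The helper below is a hypothetical convenience, not part of dadac; the file names and the values chosen for Z and k are purely illustrative.

```python
def driver_command(input_file, output_file, d, z, halo, k, sparse, float32,
                   driver="./driver"):
    """Build the argument list for the example driver (hypothetical helper)."""
    if k <= 0:
        raise ValueError("k must be a positive integer")

    def yn(flag):
        # The driver options use y/n for boolean switches
        return "y" if flag else "n"

    return [
        driver,
        f"i={input_file}",
        f"o={output_file}",
        f"d={d}",
        f"z={z}",
        f"h={yn(halo)}",
        f"k={k}",
        f"s={yn(sparse)}",
        f"t={yn(float32)}",
    ]

# Illustrative values only: 5 columns, float32 input, no halo, sparse borders
cmd = driver_command("points.bin", "out.txt", d=5, z=2.0, halo=False,
                     k=300, sparse=True, float32=True)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the driver has been compiled.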
Neighborhood search relies on kd-tree or vantage-point tree algorithms. The V2 versions are the optimized ones and should be preferred. The k-NN search is validated against scipy.spatial.KDTree; query time is comparable with state-of-the-art libraries, except for the brute-force search used when the dimensionality D > 15.
dadac also comes with a Python interface built with ctypes, which leverages the capabilities of the C-compiled library. To use it, build the package and then import the dadac module from Python.
Installation requires the make build system and a C compiler (on Linux systems this works fine).
dadac can be installed with pip on linux systems using:
pip install git+https://github.com/lykos98/dadaC
On macOS the required shared library cannot be compiled if gcc and make are not installed; I am still working on that.
If you want to use it on Windows, use it under WSL.
dadac comes with a Makefile which compiles the executable driver and the shared library bin/libdadac.so, against which programs using the ADP methods can be linked.
dadac supports compilation to use float or double to store data and kNN distances, and uint32 or uint64 to store indexes.
Add -DUSE_FLOAT32 and/or -DUSE_INT32 to compile with support for 32-bit types; by default 64-bit ones are used. This feature is important for big datasets, at the cost of some rounding error when processing data.
With 64-bit types, the results are binary identical to those of dadaPy.
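To get a feeling for the trade-off between 32-bit and 64-bit types, the sketch below estimates the memory taken by the k-NN results and shows the rounding introduced by float32. It assumes one distance and one index are stored per neighbor, which is an assumption about the internal layout; N is taken from the CosmoSim Set1 benchmark, while k = 300 is an illustrative value.

```python
import struct

def knn_storage_bytes(n_points, k, float_bytes, index_bytes):
    """Bytes needed to keep one distance and one index per stored neighbor."""
    return n_points * k * (float_bytes + index_bytes)

n_points, k = 1_800_000, 300  # N from the CosmoSim Set1 benchmark; k is illustrative
bytes_64 = knn_storage_bytes(n_points, k, 8, 8)  # double + uint64
bytes_32 = knn_storage_bytes(n_points, k, 4, 4)  # float + uint32
print(f"64-bit: {bytes_64 / 1e9:.2f} GB, 32-bit: {bytes_32 / 1e9:.2f} GB")

# Round-trip a double through float32 to see the rounding of 32-bit storage
x = 0.1
x32 = struct.unpack("f", struct.pack("f", x))[0]
rel_err = abs(x32 - x) / x
print(f"relative rounding error of float32(0.1): {rel_err:.2e}")
```

Under these assumptions the 32-bit build halves the storage of the neighbor lists, while each stored value carries a relative error of roughly 1e-8 to 1e-7.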
MUCH MORE
- Complete porting of all density estimation methods in dadaPy (i.e. pAK)
- Adaptive strategies on k-NN search and density estimation