PrivEval: a tool for interactive evaluation of privacy metrics in synthetic data generation

PrivEval is a tool that assists users in evaluating the privacy properties of a synthetic dataset. With it, the user can explore how privacy metrics estimate privacy, how applicable each metric is to a specific scenario, and what the implications are. PrivEval is a first step towards bridging the gap between privacy experts and the general public by making privacy estimation more transparent. For more information about PrivEval, please read the associated paper.

How to run the demo on your own machine

Make sure that .devcontainer/devcontainer.json specifies a supported Python version (Python 3.8 to Python 3.12).
Install the requirements
```
$ pip install -r requirements.txt
```
Run the app
```
$ streamlit run demo.py
```
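
Before installing, it can help to confirm that the local interpreter falls in the supported range. The following is a minimal sketch that is not part of the repository; the function name `python_version_supported` is hypothetical, and the bounds are taken from the version range stated in the steps above.

```python
import sys

# Supported range as stated in the setup steps (inclusive on both ends).
MIN_VERSION = (3, 8)
MAX_VERSION = (3, 12)


def python_version_supported(version=None):
    """Return True if the (major, minor) version lies in the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    found = ".".join(map(str, sys.version_info[:3]))
    if python_version_supported():
        print(f"Python {found} is within the supported range.")
    else:
        print(f"Python {found} is outside the supported range 3.8 to 3.12.")
```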

Tables of available privacy metrics for each attack

The full definition of each privacy metric can be found in the Technical Report. Shorter descriptions are given in the tables below:

Reconstruction risk

| Metric | Description |
| --- | --- |
| Attribute Inference Risk | AIR measures the risk of inference attacks by assessing how easily an attacker, using public real data and synthetic data, can infer sensitive values. It quantifies this with a weighted F1-score. |
| GeneralizedCAP | GCAP measures the risk of inference attacks by assessing how easily an attacker, using public real data and synthetic data, can infer sensitive values. It quantifies this with the Correct Attribution Probability (CAP) algorithm. |
| ZeroCAP | ZCAP measures the risk of inference attacks by assessing how easily an attacker, using public real data and synthetic data, can infer sensitive values. It quantifies this with the Correct Attribution Probability (CAP) algorithm. |
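
To illustrate the idea behind the CAP-based metrics, here is a simplified sketch. It is not the implementation in the Metrics folder: the function name `cap_risk` is hypothetical, all key attributes are assumed to be categorical and matched exactly, and a real record with no matching synthetic record is assumed to contribute 0.

```python
from collections import Counter, defaultdict


def cap_risk(real_rows, synthetic_rows, key_fields, sensitive_field):
    """Average, over real rows, of the share of key-matching synthetic rows
    that carry the same sensitive value (0 when no synthetic row matches)."""
    if not real_rows:
        raise ValueError("real_rows must not be empty")

    # Count the synthetic sensitive values for each key-attribute combination.
    sensitive_by_key = defaultdict(Counter)
    for row in synthetic_rows:
        key = tuple(row[f] for f in key_fields)
        sensitive_by_key[key][row[sensitive_field]] += 1

    scores = []
    for row in real_rows:
        key = tuple(row[f] for f in key_fields)
        counts = sensitive_by_key.get(key)
        if not counts:
            scores.append(0.0)  # assumption: no match means no correct attribution
            continue
        # Share of matching synthetic rows that reveal the true sensitive value.
        scores.append(counts[row[sensitive_field]] / sum(counts.values()))
    return sum(scores) / len(scores)


real = [
    {"age": "30-39", "zip": "9000", "disease": "flu"},
    {"age": "40-49", "zip": "9200", "disease": "asthma"},
]
synthetic = [
    {"age": "30-39", "zip": "9000", "disease": "flu"},
    {"age": "30-39", "zip": "9000", "disease": "cold"},
    {"age": "50-59", "zip": "9400", "disease": "asthma"},
]
print(cap_risk(real, synthetic, ["age", "zip"], "disease"))
```

In this toy data, the first real record matches two synthetic records on the key attributes, one of which has the correct sensitive value, and the second real record has no match.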

Re-identification risk

| Metric | Description |
| --- | --- |
| Hidden Rate | Hidden Rate estimates the risk of identifying whether an individual contributed their data to the real dataset while only having access to the synthetic data. |
| Hitting Rate | Hitting Rate measures the risk of identifying whether an individual contributed their data to the real dataset while having access to the synthetic data. To do so, the metric tries to find individuals with matching categorical and similar continuous attribute values. |
| Membership Inference Risk | MIR estimates the risk of identifying whether an individual contributed their data to the real dataset while having access to the synthetic data and a subset of the real data. A classifier is trained to determine whether individuals are synthetic or real, and privacy is estimated from its performance. |
| Nearest Neighbour Adversarial Accuracy | NNAA estimates the risk of identifying whether an individual contributed their data to the real dataset while only having access to the synthetic data. This is done by mapping the datasets to 2D and estimating the risk from the distances to nearest neighbours. |
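
The matching rule described for Hitting Rate can be sketched as follows. This is only an illustration under stated assumptions: the function names, the per-attribute `tolerance` parameter, and the aggregation (the share of real rows hit by at least one synthetic row) are hypothetical choices, and the implementation in the Metrics folder may define similarity and aggregation differently.

```python
def is_hit(real_row, synthetic_row, categorical, continuous, tolerance):
    """True if all categorical values match exactly and every continuous
    value lies within its (assumed) tolerance."""
    if any(real_row[c] != synthetic_row[c] for c in categorical):
        return False
    return all(
        abs(real_row[c] - synthetic_row[c]) <= tolerance[c] for c in continuous
    )


def hitting_rate(real_rows, synthetic_rows, categorical, continuous, tolerance):
    """Share of real rows that are hit by at least one synthetic row."""
    hits = sum(
        any(is_hit(r, s, categorical, continuous, tolerance) for s in synthetic_rows)
        for r in real_rows
    )
    return hits / len(real_rows)


real = [{"sex": "F", "age": 34.0}, {"sex": "M", "age": 51.0}]
synthetic = [{"sex": "F", "age": 35.0}, {"sex": "F", "age": 50.0}]
print(hitting_rate(real, synthetic, ["sex"], ["age"], {"age": 2.0}))
```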

Membership inference / Tracing risk

| Metric | Description |
| --- | --- |
| Authenticity | Auth measures the risk of re-identification by assessing how easily an attacker, using the synthetic data, can infer the individual from which it was generated. The Auth risk is measured as the probability that a synthetic nearest neighbour is closer than a real nearest neighbour over the real dataset. |
| Close Value Probability | CVP measures the risk of re-identification by assessing how easily an attacker, using the synthetic data, can infer the individual from which it was generated. This is measured as a probabilistic likelihood of synthetic individuals being 'too close'. |
| Common Rows Proportion | CRP measures the risk of re-identification as the probability of a real individual's row being a row in the synthetic data. |
| Distance to Closest Record | DCR measures the risk of re-identification by assessing how easily an attacker, using the synthetic data, can infer the individual from which it was generated. This is done by measuring the distance to the nearest neighbour in data transformed to 2 dimensions. |
| DetectionMLP | D-MLP measures the risk of re-identification by assessing how easily an attacker, using the synthetic data, can infer the individual from which it was generated, while having access to a subset of the real data. An MLP classifier is trained on real individuals and tested on synthetic individuals. |
| Distant Value Probability | DVP measures the risk of re-identification by assessing how easily an attacker, using the synthetic data, can infer the individual from which it was generated. This is measured as a reverse probabilistic likelihood of synthetic individuals being 'too far away'. |
| Identifiability Score | IdScore estimates the risk of re-identifying any real individual while only having access to the synthetic data. It estimates this as the probability that the distance to the closest synthetic individual is smaller than the distance to the closest real individual in weighted versions of the real and synthetic datasets. |
| Median Distance to Closest Record | MDCR measures the risk of re-identification by assessing how easily an attacker, using the synthetic data, can infer the individual from which it was generated. This is measured as the median distance between the real and synthetic data points. |
| Nearest Synthetic Neighbor Distance | SND measures the risk of re-identification by assessing how easily an attacker, using the synthetic data, can infer the individual from which it was generated through a distance measure. The score is calculated as the mean min-max reduced distance to the nearest synthetic neighbour. |
| Nearest Neighbor Distance Ratio | NNDR measures the risk of re-identification by assessing how easily an attacker, using the synthetic data, can infer the individual from which it was generated. This is measured as the distance ratio between real and synthetic data. |
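
Two of the simpler ideas above, exact row overlap and nearest-record distance, can be sketched in a few lines. The function names are hypothetical and the sketch makes simplifying assumptions: rows are tuples, distances are plain Euclidean distances on already-numeric data measured from each synthetic row to the real data, and the 2-dimensional transformation mentioned for DCR is omitted. It does not reproduce the implementations in the Metrics folder.

```python
import math
from statistics import median


def common_rows_proportion(real_rows, synthetic_rows):
    """Share of real rows that also appear verbatim in the synthetic data."""
    synthetic_set = set(map(tuple, synthetic_rows))
    return sum(tuple(r) in synthetic_set for r in real_rows) / len(real_rows)


def distances_to_closest_record(real_rows, synthetic_rows):
    """For each synthetic row, the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real_rows) for s in synthetic_rows]


real = [(0.0, 0.0), (3.0, 4.0)]
synthetic = [(0.0, 0.0), (6.0, 8.0)]

print(common_rows_proportion(real, synthetic))
distances = distances_to_closest_record(real, synthetic)
print(distances, median(distances))
```

A distance of 0 means a synthetic row coincides with a real row, which is the situation the overlap proportion counts directly.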

Requirements

For the metrics to be calculated correctly, the input datasets must be single-table CSV files in which no individual appears more than once. The input datasets must not contain missing values.
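
These requirements can be checked before loading a dataset into the tool. The sketch below is not part of the repository; `validate_dataset` is a hypothetical helper that assumes a missing value shows up as an empty field and that a repeated individual shows up as an identical row. Datasets that encode missing values differently (for example as "NA") or that identify individuals by an ID column would need a different check.

```python
import csv
import io


def validate_dataset(csv_text):
    """Return a list of problems found in a single-table CSV given as text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]

    header, records = rows[0], rows[1:]
    problems = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, record in enumerate(records, start=2):
        if len(record) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, got {len(record)}"
            )
        if any(value.strip() == "" for value in record):
            problems.append(f"line {line_no}: missing value")
        key = tuple(record)
        if key in seen:
            problems.append(f"line {line_no}: duplicate of an earlier row")
        seen.add(key)
    return problems


good = "age,sex\n34,F\n51,M\n"
bad = "age,sex\n34,F\n,M\n34,F\n"
print(validate_dataset(good))
print(validate_dataset(bad))
```

To check a file on disk, read its contents first, for example with `open(path, newline="").read()`, and pass the text to the helper.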

Other information

A Python notebook, gen_dataset.ipynb, is available for generating the real and synthetic datasets.

Some metrics have parameters such as thresholds and iteration counts. To change them, edit the individual metric implementations in the Metrics folder.

Citing

If you use this code, please cite the associated paper.

PrivEval:

@article{trudslev2025priveval,
      title={PrivEval: a tool for interactive evaluation of privacy metrics in synthetic data generation},
      author={Frederik Marinus Trudslev and Matteo Lissandrini and Juan Manuel Rodriguez and Martin Bøgsted and Daniele Dell'Aglio},
      journal={Proceedings of the {VLDB} Endowment},
      year={2025},
      volume={18},
      number={12},
      pages={5271--5274},
      doi={10.14778/3750601.3750649}
}

Technical Report:

@misc{trudslev2025reviewprivacymetricsprivacypreserving,
      title={A Review of Privacy Metrics for Privacy-Preserving Synthetic Data Generation}, 
      author={Frederik Marinus Trudslev and Matteo Lissandrini and Juan Manuel Rodriguez and Martin Bøgsted and Daniele Dell'Aglio},
      year={2025},
      eprint={2507.11324},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2507.11324}, 
}
