From 99a3db6ac7df890214c0a1dd2595af8752b33571 Mon Sep 17 00:00:00 2001
From: tianhao
Date: Wed, 17 Jun 2026 23:54:09 +0800
Subject: [PATCH] fix docs api correctness

---
 README.md                    | 12 ++++++------
 docs/data_and_terminology.md |  7 ++++++-
 docs/in_memory_api.md        | 13 ++++++-------
 3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/README.md b/README.md
index a43a8eb..1dc28ba 100644
--- a/README.md
+++ b/README.md
@@ -194,18 +194,18 @@ Both code paths support the following DP mechanisms:
 
 ## Further Documentation
 
-Detailed guides are available in the [`documentation/`](dpsynth/documentation/)
+Detailed guides are available in the [`docs/`](docs/)
 directory:
 
-*   **In-Memory DataFrame API Guide** (`documentation/in_memory_api.md`):
+*   **In-Memory DataFrame API Guide** (`docs/in_memory_api.md`):
     Detailed guide to using the Pandas-based API and local CLI.
-*   **Scalable Pipeline API Guide** (`documentation/scalable_beam_api.md`):
+*   **Scalable Pipeline API Guide** (`docs/scalable_beam_api.md`):
     Guide for distributed data generation.
-*   **Data Model & Terminology** (`documentation/data_and_terminology.md`):
+*   **Data Model & Terminology** (`docs/data_and_terminology.md`):
     Attributes, schema specifications, and `domain.yaml` format.
-*   **Processing Lifecycle** (`documentation/processing_lifecycle.md`):
+*   **Processing Lifecycle** (`docs/processing_lifecycle.md`):
     The 5-stage mathematical lifecycle shared by both code paths.
-*   **Contributor Guide** (`documentation/contributors_guide.md`):
+*   **Contributor Guide** (`docs/contributors_guide.md`):
     Architecture, PipelineBackend programming rules, and evaluation framework.
 
 *This is not an officially supported Google product. This project is
diff --git a/docs/data_and_terminology.md b/docs/data_and_terminology.md
index d4928ac..752b544 100644
--- a/docs/data_and_terminology.md
+++ b/docs/data_and_terminology.md
@@ -52,7 +52,12 @@ string categories.
 *   **Boolean (`BOOL`)**: True/False binary flags.
 *   **Enum
 ### 3. Record Independence (Differential Privacy Assumption)
-It is assumed that each **record** comes from different **privacy unit**.
+> [!IMPORTANT]
+> DPSynth provides record-level differential privacy: each **record** is assumed
+> to come from a different **privacy unit**. If one person or entity can
+> contribute multiple rows, callers must enforce the appropriate user-level
+> contribution bounds before running DPSynth; otherwise the guarantee is not
+> user-level DP.
 
 ## Supported Attribute Classifications
 
diff --git a/docs/in_memory_api.md b/docs/in_memory_api.md
index a9868a2..4204e78 100644
--- a/docs/in_memory_api.md
+++ b/docs/in_memory_api.md
@@ -31,7 +31,7 @@ synthetic_df = dpsynth.generate(
     epsilon: float,
     delta: float,
     *,
-    discrete_config: discrete_mechanisms.DiscreteMechanismConfig = discrete_mechanisms.MSTConfig(),
+    discrete_config: discrete_mechanisms.DiscreteMechanism = discrete_mechanisms.MSTMechanism(),
     numerical_bins: int = 32,
     one_way_marginal_budget_fraction: float = 0.1,
     cross_attribute_constraints: list = (),
@@ -63,8 +63,7 @@ synthetic_df = dpsynth.generate(
 ## End-to-End Python Example
 
 Here is a complete Python script demonstrating how to load data, parse a domain
-YAML file, configure the AIM mechanism with a fixed random seed, and generate
-synthetic records.
+YAML file, configure the AIM mechanism, and generate synthetic records.
 
 ```python
 import dpsynth
@@ -80,8 +79,7 @@ attribute_domains = domain.from_yaml_file("transaction_domain.yaml")
 
 # 3. Configure the synthesis mechanism (AIM)
 aim_config = discrete_mechanisms.AIMConfig(
-    seed=42,
-    rounds=50,
+    max_rounds=50,
     pgm_iters=1000,
 )
 
@@ -130,8 +128,9 @@ python3 bin/main.py \
 *   `--epsilon`, `--delta`: Total DP privacy budget.
 *   `--mechanism`: Supported options are `mst`, `aim`, `independent`, and
     `aim_gdp`.
-*   `--seed`: Integer seed for reproducible randomness across DP sampling and
-    PGM inference.
+*   `--seed`: Seeds NumPy's legacy global random state. The in-memory generator
+    also creates `np.random.default_rng()` internally, so identical CLI
+    invocations are not guaranteed to be bit-for-bit reproducible.
 *   `--output_path`: Destination filepath where the synthetic CSV will be
     written.
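
Note on the `--seed` wording in the last hunk: the sketch below, which assumes only NumPy, illustrates why seeding the legacy global state does not make an unseeded `np.random.default_rng()` reproducible. The helper names are hypothetical and not part of DPSynth. How DPSynth actually creates its generators is not shown in this patch.

```python
import numpy as np


def legacy_draw(seed: int) -> float:
    """Seed NumPy's legacy global state, then draw from that same state."""
    np.random.seed(seed)
    return float(np.random.random())


def fresh_generator_draw(seed: int) -> float:
    """Seed the legacy state, but draw from an unseeded Generator."""
    np.random.seed(seed)  # does not influence the Generator created below
    rng = np.random.default_rng()  # no seed given: initialized from OS entropy
    return float(rng.random())


def seeded_generator_draw(seed: int) -> float:
    """Pass the seed to the Generator explicitly."""
    rng = np.random.default_rng(seed)
    return float(rng.random())


if __name__ == "__main__":
    # Legacy state: reproducible after np.random.seed.
    print(legacy_draw(42) == legacy_draw(42))  # True
    # Unseeded Generator: np.random.seed has no effect on it.
    print(fresh_generator_draw(42) == fresh_generator_draw(42))  # almost surely False
    # Explicitly seeded Generator: reproducible.
    print(seeded_generator_draw(42) == seeded_generator_draw(42))  # True
```

A code path that needs bit-for-bit reproducibility would have to pass the seed into every `default_rng` call it makes, as `seeded_generator_draw` does. This is consistent with the hunk's decision to stop promising reproducibility for `--seed`.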