Diyago · Diyago · May 27, 2025 · May 27, 2025
diff --git a/README.md b/README.md
@@ -4,30 +4,78 @@
 
 # GANs and TimeGANs, Diffusions, LLM for tabular  data
 
-<img src="./images/tabular_gan.png" height="15%" width="15%">
+<img src="images/tabular_gan.png" height="15%" width="15%">
 
-Generative  Networks are well-known for their success in realistic image generation. However, they can also be applied to generate tabular data. We introduce major improvements for generating high-fidelity tabular data giving oppotunity to try GANS, TimeGANs, Diffusions and LLM for tabular data generations. 
+Generative Networks are well-known for their success in realistic image generation. However, they can also be applied to generate tabular data. This library introduces major improvements for generating high-fidelity tabular data by offering a diverse suite of cutting-edge models, including Generative Adversarial Networks (GANs), specialized TimeGANs for time-series data, Denoising Diffusion Probabilistic Models (DDPM), and Large Language Model (LLM) based approaches. These enhancements allow for robust data generation across various dataset complexities and distributions, giving an opportunity to try GANs, TimeGANs, Diffusions, and LLMs for tabular data generation.
 * Arxiv article: ["Tabular GANs for uneven distribution"](https://arxiv.org/abs/2010.00638)
-* Medium post: [GANs for tabular data](https://towardsdatascience.com/review-of-gans-for-tabular-data-a30a2199342)
+* Medium post: GANs for tabular data [link broken]
 
 ## How to use library
 
 * Installation: `pip install tabgan`
 * To generate new data to train by sampling and then filtering by adversarial training
-  call `GANGenerator().generate_data_pipe`:
+  call `GANGenerator().generate_data_pipe`.
 
 ### Data Format
 
-TabGAN accepts data as a ```numpy.ndarray``` or ```pandas.DataFrame``` with columns categorized as:
+TabGAN accepts data as a `numpy.ndarray` or `pandas.DataFrame` with columns categorized as:
 
-* **Continuous Columns**: Numerical columns with any possible value.
-* **Discrete Columns**: Columns with a limited set of values (e.g., categorical data).
+*   **Continuous Columns**: Numerical columns with any possible value.
+*   **Discrete Columns**: Columns with a limited set of values (e.g., categorical data).
 
 Note: TabGAN does not differentiate between floats and integers, so all values are treated as floats. For integer requirements, round the output outside of TabGAN.
 
-### Example code 
+### Sampler Parameters
+
+All samplers (`OriginalGenerator`, `GANGenerator`, `ForestDiffusionGenerator`, `LLMGenerator`) share the following input parameters:
+
+*   **gen_x_times**: `float` (default: `1.1`) - How much data to generate. The output might be less due to postprocessing and adversarial filtering.
+*   **cat_cols**: `list` (default: `None`) - A list of column names to be treated as categorical.
+*   **bot_filter_quantile**: `float` (default: `0.001`) - The bottom quantile for postprocess filtering. Values below this quantile will be filtered out.
+*   **top_filter_quantile**: `float` (default: `0.999`) - The top quantile for postprocess filtering. Values above this quantile will be filtered out.
+*   **is_post_process**: `bool` (default: `True`) - Whether to perform post-filtering. If `False`, `bot_filter_quantile` and `top_filter_quantile` are ignored.
+*   **adversarial_model_params**: `dict` (default: see below) - Parameters for the adversarial filtering model. Default values are optimized for binary classification tasks.
+    ```python
+    {
+        "metrics": "AUC", "max_depth": 2, "max_bin": 100,
+        "learning_rate": 0.02, "random_state": 42, "n_estimators": 100,
+    }
+    ```
+*   **pregeneration_frac**: `float` (default: `2`) - For the generation step, `gen_x_times * pregeneration_frac` amount of data will be generated. However, after postprocessing, the aim is to return an amount of data equivalent to `(1 + gen_x_times)` times the size of the original dataset (if `only_generated_data` is `False`, otherwise `gen_x_times` times the size of the original dataset).
+*   **only_generated_data**: `bool` (default: `False`) - If `True`, only the newly generated data is returned, without concatenating the input training dataframe.
+*   **gen_params**: `dict` (default: see below) - Parameters for the underlying generative model training. Specific to `GANGenerator` and `LLMGenerator`.
+    *   For `GANGenerator`:
+        ```python
+        {"batch_size": 500, "patience": 25, "epochs" : 500}
+        ```
+    *   For `LLMGenerator`:
+        ```python
+        {"batch_size": 32, "epochs": 4, "llm": "distilgpt2", "max_length": 500}
+        ```
+
+The available samplers are:
+1.  **`GANGenerator`**: Utilizes the Conditional Tabular GAN (CTGAN) architecture, known for effectively modeling tabular data distributions and handling mixed data types (continuous and discrete). It learns the data distribution and generates synthetic samples that mimic the original data.
+2.  **`ForestDiffusionGenerator`**: Implements a novel approach using diffusion models guided by tree-based methods (Forest Diffusion). This technique is capable of generating high-quality synthetic data, particularly for complex tabular structures, by gradually adding noise to data and then learning to reverse the process.
+3.  **`LLMGenerator`**: Leverages Large Language Models (LLMs) using the GReaT (Generative Realistic Tabular data) framework. It transforms tabular data into a text format, fine-tunes an LLM on this representation, and then uses the LLM to generate new tabular instances by sampling from it. This approach is particularly promising for capturing complex dependencies and can generate diverse synthetic data.
+4.  **`OriginalGenerator`**: Acts as a baseline sampler. It typically returns the original training data or a direct sample from it. This is useful for comparison purposes to evaluate the effectiveness of more complex generative models.
+
+
+### `generate_data_pipe` Method Parameters
+
+The `generate_data_pipe` method, available for all samplers, uses the following parameters:
+
+*   **train_df**: `pd.DataFrame` - The training dataframe (features only, without the target variable).
+*   **target**: `pd.DataFrame` - The input target variable for the training dataset.
+*   **test_df**: `pd.DataFrame` - The test dataframe. The newly generated training dataframe should be statistically similar to this.
+*   **deep_copy**: `bool` (default: `True`) - Whether to make a copy of the input dataframes. If `False`, input dataframes will be modified in place.
+*   **only_adversarial**: `bool` (default: `False`) - If `True`, only adversarial filtering will be performed on the training dataframe; no new data will be generated.
+*   **use_adversarial**: `bool` (default: `True`) - Whether to perform adversarial filtering.
+*   **@return**: `Tuple[pd.DataFrame, pd.DataFrame]` - A tuple containing the newly generated/processed training dataframe and the corresponding target.
+
+
+### Example Code
 
-``` python
+```python
 from tabgan.sampler import OriginalGenerator, GANGenerator, ForestDiffusionGenerator, LLMGenerator
 import pandas as pd
 import numpy as np
@@ -46,45 +94,27 @@ new_train4, new_target4 = LLMGenerator(gen_params={"batch_size": 32,
                                                           "epochs": 4, "llm": "distilgpt2", "max_length": 500}).generate_data_pipe(train, target, test, )
 
 # example with all params defined
-new_train4, new_target4 = GANGenerator(gen_x_times=1.1, cat_cols=None,
-           bot_filter_quantile=0.001, top_filter_quantile=0.999, is_post_process=True,
-           adversarial_model_params={
-               "metrics": "AUC", "max_depth": 2, "max_bin": 100, 
-               "learning_rate": 0.02, "random_state": 42, "n_estimators": 100,
-           }, pregeneration_frac=2, only_generated_data=False,
-           gen_params = {"batch_size": 500, "patience": 25, "epochs" : 500,}).generate_data_pipe(train, target,
-                                          test, deep_copy=True, only_adversarial=False, use_adversarial=True)
-```
+new_train_gan_all_params, new_target_gan_all_params = GANGenerator(
+    gen_x_times=1.1,
+    cat_cols=None,
+    bot_filter_quantile=0.001,
+    top_filter_quantile=0.999,
+    is_post_process=True,
+    adversarial_model_params={
+        "metrics": "AUC", "max_depth": 2, "max_bin": 100,
+        "learning_rate": 0.02, "random_state": 42, "n_estimators": 100,
+    },
+    pregeneration_frac=2,
+    only_generated_data=False,
+    gen_params={"batch_size": 500, "patience": 25, "epochs": 500}
+).generate_data_pipe(
+    train, target, test,
+    deep_copy=True,
+    only_adversarial=False,
+    use_adversarial=True
+)
 
-All samplers `OriginalGenerator`, `ForestDiffusionGenerator`, `LLMGenerator` and `GANGenerator` have same input parameters.
-
-1. **GANGenerator** based on **CTGAN**
-2. **ForestDiffusionGenerator** based on **Forest Diffusion (Tabular Diffusion and Flow-Matching)**
-2. **LLMGenerator** based on **Language Models are Realistic Tabular Data Generators (GReaT framework)**
-
-* **gen_x_times**: float = 1.1 - how much data to generate, output might be less because of postprocessing and
-  adversarial filtering
-* **cat_cols**: list = None - categorical columns
-* **bot_filter_quantile**: float = 0.001 - bottom quantile for postprocess filtering
-* **top_filter_quantile**: float = 0.999 - top quantile for postprocess filtering
-* **is_post_process**: bool = True - perform or not post-filtering, if false bot_filter_quantile and top_filter_quantile
-  ignored
-* **adversarial_model_params**: dict params for adversarial filtering model, default values for binary task
-* **pregeneration_frac**: float = 2 - for generation step gen_x_times * pregeneration_frac amount of data will
-  be generated. However, in postprocessing (1 + gen_x_times) % of original data will be returned
-* **gen_params**: dict params for GAN training
-
-For `generate_data_pipe` methods params:
-
-* **train_df**: pd.DataFrame Train dataframe which has separate target
-* **target**: pd.DataFrame Input target for the train dataset
-* **test_df**: pd.DataFrame Test dataframe - newly generated train dataframe should be close to it
-* **deep_copy**: bool = True - make copy of input files or not. If not input dataframes will be overridden
-* **only_adversarial**: bool = False - only adversarial filtering to train dataframe will be performed
-* **use_adversarial**: bool = True - perform or not adversarial filtering
-* **only_generated_data**: bool = False  - After generation get only newly generated, without 
-  concatenating input train dataframe.  
-* **@return**: -> Tuple[pd.DataFrame, pd.DataFrame] - Newly generated train dataframe and test data
+```
 
 Thus, you may use this library to improve your dataset quality:
 
@@ -105,12 +135,13 @@ new_train1, new_target1 = OriginalGenerator().generate_data_pipe(X_train, y_trai
 print("OriginalGenerator metric", fit_predict(clf, new_train1, new_target1, X_test, y_test))
 
 new_train1, new_target1 = GANGenerator().generate_data_pipe(X_train, y_train, X_test, )
-print("GANGenerator metric", fit_predict(clf, new_train1, new_target1, X_test, y_test))
+print("GANGenerator metric", fit_predict(clf, new_train2, new_target2, X_test, y_test)) # Corrected variable name
 ```
-## Timeseries GAN generation TimeGAN
 
-You can easily adjust code to generate multidimensional timeseries data.
-Basically it extracts days, months and year from _date_. Demo how to use in the example below:
+### Advanced Usage: Generating Time-Series Data with TimeGAN
+
+You can easily adjust the code to generate multidimensional time-series data. This approach primarily involves extracting day, month, and year components from a date column to be used as features in the generation process. Below is a demonstration:
+
 ```python
 import pandas as pd
 import numpy as np
@@ -126,7 +157,7 @@ min_date = pd.to_datetime('2019-01-01')
 max_date = pd.to_datetime('2021-12-31')
 d = (max_date - min_date).days + 1
 
-train['Date'] = min_date + pd.to_timedelta(pd.np.random.randint(d, size=train_size), unit='d')
+train['Date'] = min_date + pd.to_timedelta(np.random.randint(d, size=train_size), unit='d')
 train = get_year_mnth_dt_from_date(train, 'Date')
 
 new_train, new_target = GANGenerator(gen_x_times=1.1, cat_cols=['year'], bot_filter_quantile=0.001,
@@ -150,24 +181,23 @@ compare_dataframes(original_df, generated_df) # return between 0 and 1
 
 To run experiment follow these steps:
 
-1. Clone the repository. All required dataset are stored in `./Research/data` folder
-2. Install requirements `pip install -r requirements.txt`
-4. Run all experiments  `python ./Research/run_experiment.py`. Run all experiments  `python run_experiment.py`. You may
-   add more datasets, adjust validation type and categorical encoders.
-5. Observe metrics across all experiment in console or in `./Research/results/fit_predict_scores.txt`
+1. Clone the repository. All required datasets are stored in `./Research/data` folder.
+2. Install requirements: `pip install -r requirements.txt`
+3. Run experiments using `python ./Research/run_experiment.py`. You may
+   add more datasets, adjust validation type, and categorical encoders.
+4. Observe metrics across all experiments in the console or in `./Research/results/fit_predict_scores.txt`.
 
 
 **Experiment design**
 
-![Experiment design and workflow](./images/workflow.png?raw=true)
+![Experiment design and workflow](images/workflow.png)
 
 **Picture 1.1** Experiment design and workflow
 
 ## Results
-To determine the best sampling strategy, ROC AUC scores of each dataset were scaled (min-max scale) and then averaged
-among the dataset.
+The table below (Table 1.2) shows ROC AUC scores for different sampling strategies. To facilitate comparison across datasets with potentially different baseline AUC scores, the ROC AUC scores for each dataset were scaled using min-max normalization (where the maximum score achieved by any method on that dataset becomes 1, and the minimum becomes 0). These scaled scores were then averaged across all datasets for each sampling strategy. Therefore, a higher value in the table indicates better relative performance in generating data that is difficult for a classifier to distinguish from the original data, when compared to other methods on the same set of datasets.
 
-**Table 1.2** Different sampling results across the dataset, higher is better (100% - maximum per dataset ROC AUC)
+**Table 1.2** Averaged Min-Max Scaled ROC AUC scores for different sampling strategies across datasets. Higher is better (closer to 1 indicates performance similar to the best method on each dataset).
 
 | dataset_name  |   None |   gan |   sample_original |
 |:-----------------------|-------------------:|------------------:|------------------------------:|
@@ -180,7 +210,7 @@ among the dataset.
 
 ## Citation
 
-If you use **GAN-for-tabular-data** in a scientific publication, we would appreciate references to the following BibTex entry:
+If you use **tabgan** in a scientific publication, we would appreciate references to the following BibTex entry:
 arxiv publication:
 ```bibtex
 @misc{ashrapov2020tabular,
@@ -195,11 +225,10 @@ arxiv publication:
 
 ## References
 
-[1] Lei Xu LIDS, Kalyan Veeramachaneni. Synthesizing Tabular Data using Generative Adversarial Networks (2018). arXiv:
-1811.11264v1 [cs.LG]
+[1] Xu, L., & Veeramachaneni, K. (2018). *Synthesizing Tabular Data using Generative Adversarial Networks*. arXiv:1811.11264 [cs.LG].
 
-[2] Alexia Jolicoeur-Martineau and Kilian Fatras and Tal Kachman. Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees ((2023) https://github.com/SamsungSAILMontreal/ForestDiffusion [cs.LG]
+[2] Jolicoeur-Martineau, A., Fatras, K., & Kachman, T. (2023). *Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees*. Retrieved from https://github.com/SamsungSAILMontreal/ForestDiffusion.
 
-[3] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. NeurIPS, (2019)
+[3] Xu, L., Skoularidou, M., Cuesta-Infante, A., & Veeramachaneni, K. (2019). *Modeling Tabular data using Conditional GAN*. NeurIPS.
 
-[4] Vadim Borisov and Kathrin Sessler and Tobias Leemann and Martin Pawelczyk and Gjergji Kasneci. Language Models are Realistic Tabular Data Generators. ICLR, (2023)
+[4] Borisov, V., Sessler, K., Leemann, T., Pawelczyk, M., & Kasneci, G. (2023). *Language Models are Realistic Tabular Data Generators*. ICLR.