gooddata · janmatzek · Jul 4, 2025 · Jun 24, 2025 · Jun 27, 2025 · Jul 1, 2025
diff --git a/docs/CUSTOM_FIELDS.md b/docs/CUSTOM_FIELDS.md
@@ -0,0 +1,124 @@
+# Custom Field Management
+
+The `scripts/custom_fields.py` script lets you extend the Logical Data Model (LDM) of a child workspace by adding extra datasets that are not present in the parent workspace's LDM.
+
+## Environment setup
+
+The script relies on `GDC_HOSTNAME` and `GDC_AUTH_TOKEN` environment variables. You can export these by running this in your terminal:
+
+```shell
+export GDC_HOSTNAME=https://your-gooddata-cloud-domain.com
+export GDC_AUTH_TOKEN=your-personal-access-token
+```
+
+## Input files
+
+The script works with input from two CSV files. These files should contain (a) custom dataset definitions and (b) custom field definitions.
+
+The custom dataset defines the dataset entity, i.e., the box you would see in the GoodData Cloud UI. The custom fields, on the other hand, define the individual fields in that dataset. You can imagine it as first defining a table and then its columns.
+
+Multiple datasets and fields can be defined in the files. However, the files need to be consistent with each other: you cannot define fields for datasets that are not defined in the datasets file.
+
+### Custom dataset definitions
+
+The first file contains the definitions of the datasets you want to create. It should have the following structure:
+
+| workspace_id         | dataset_id        | dataset_name         | dataset_datasource_id | dataset_source_table | dataset_source_sql | parent_dataset_reference | parent_dataset_reference_attribute_id | dataset_reference_source_colum | wdf_id |
+| -------------------- | ----------------- | -------------------- | --------------------- | -------------------- | ------------------ | ------------------------ | ------------------------------------- | ------------------------------ | ------ |
+| child_workspace_id_1 | custom_dataset_id | Custom Dataset Title | datasource_id         | dataset_source_table |                    | parent_dataset_id        | parent_dataset.reference_field        | custom_dataset.reference_field | wdf_id |
+
+#### Validity constraints
+
+- The `dataset_source_table` and `dataset_source_sql` columns are mutually exclusive. Only one of them should be filled in; the other should be null (empty value). If both values are present, the script will raise an error.
+
+- `workspace_id` + `dataset_id` combination must be unique across all dataset definitions.
+
+#### JSON representation
+
+For readability, here is the data structure in JSON format with comments. However, note that the script will only work with CSV files!
+
+```json
+{
+  "workspace_id": "child_workspace_id_1", // child workspace id
+  "dataset_id": "custom_dataset_id", // custom dataset id
+  "dataset_name": "Custom Dataset Title", // custom dataset name
+  "dataset_datasource_id": "datasource_id", // data source id -> in the UI, you see it when you go to "manage files"
+  "dataset_source_table": "dataset_source_table", // the name of the table in the physical data model
+  "dataset_source_sql": null, // SQL query defining the dataset
+  "parent_dataset_reference": "products", // ID of the parent dataset to which the custom one will be connected
+  "parent_dataset_reference_attribute_id": "products.product_id", // parent dataset column name used for the "join"
+  "dataset_reference_source_colum": "product_id", // custom dataset column name used for the "join"
+  "wdf_id": "x__client_id" // workspace data filter id
+}
+```
+
+### Custom fields definition
+
+The individual fields of the custom dataset are defined as follows:
+
+| workspace_id         | dataset_id        | cf_id           | cf_name           | cf_type   | cf_source_column           | cf_source_column_data_type |
+| -------------------- | ----------------- | --------------- | ----------------- | --------- | -------------------------- | -------------------------- |
+| child_workspace_id_1 | custom_dataset_id | custom_field_id | Custom Field Name | attribute | custom_field_source_column | INT                        |
+
+#### Validity constraints
+
+The custom field definitions must comply with these criteria:
+
+- **attributes** and **facts**: unique `workspace_id` + `cf_id` combinations
+- **dates**: unique `dataset_id` + `cf_id` combinations
+
+#### JSON representation
+
+Again, here is a JSON definition with comments for readability:
+
+```json
+{
+  "workspace_id": "child_workspace_id_1", // child workspace ID
+  "dataset_id": "custom_dataset_id", // custom dataset ID
+  "cf_id": "custom_field_id", // custom field ID
+  "cf_name": "Custom Field Name", // custom field name
+  "cf_type": "attribute", // GoodData type of the field*
+  "cf_source_column": "custom_field_source_column", // name of the column in the physical data model
+  "cf_source_column_data_type": "INT" // data type of the field*
+}
+```
+
+\* Supported values of **_cf_type_** and **_cf_source_column_data_type_** are listed in the `CustomFieldType` and `ColumnDataType` enums in [models](../scripts/custom_fields/models/custom_data_object.py).
+
+## Usage
+
+Now that your environment and input files are set up, let's have a look at how to run the script 🚀.
+
+The script takes two positional arguments, which represent the paths to the input files we have discussed above.
+
+```shell
+python scripts/custom_fields.py custom_datasets.csv custom_fields.csv
+```
+
+There is also an optional flag: `--no-relations-check`. Its meaning is discussed in the next section.
+
+### Check valid relations
+
+Regardless of whether the flag is used or not, the script will always start by loading and validating the data from the provided files. The script will then iterate through workspaces.
+
+#### If unused
+
+If `--no-relations-check` is not used, the script will:
+
+1. Store current workspace layout (analytical objects and LDM).
+1. Check whether relations of metrics, visualizations and dashboards are valid. A set of current objects with invalid relations is created.
+1. Push the updated LDM to GoodData Cloud.
+1. Check object relations again. A new set of objects with invalid relations is created.
+1. The sets are compared.
+   - If there are more objects with invalid references in the new set, it means the objects were invalidated. Rollback is required.
+   - If the sets are not equal, rollback might be required.
+   - If there are fewer invalid references or the sets are equal, rollback is not required.
+1. If rollback is required, the initially stored workspace layout will be pushed to GoodData Cloud again, reverting the changes to the workspace.
+
+#### If used
+
+If you decide to use the `--no-relations-check` flag, the script will simply validate the data and push the LDM extension to GoodData Cloud without any additional checks or rollbacks.
+
+```shell
+python scripts/custom_fields.py custom_datasets.csv custom_fields.csv --no-relations-check
+```
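
The validity constraints for dataset definitions and the rollback rule described in `docs/CUSTOM_FIELDS.md` can be sketched in a few lines of Python. This is only an illustration, not the code from this PR: `validate_datasets` and `rollback_required` are hypothetical names, the rows are assumed to be plain dictionaries keyed by the CSV header, and the rollback rule follows one conservative reading of the doc (roll back whenever the push produces an invalid object that was not invalid before). The doc does not say whether leaving both source columns empty is an error, so the sketch only flags the case where both are filled in.

```python
from collections import Counter


def validate_datasets(rows: list[dict[str, str]]) -> list[str]:
    """Return human-readable constraint violations (empty list if valid)."""
    errors: list[str] = []

    for row in rows:
        has_table = bool(row.get("dataset_source_table"))
        has_sql = bool(row.get("dataset_source_sql"))
        # The two source columns are mutually exclusive.
        if has_table and has_sql:
            errors.append(
                f"{row['workspace_id']}/{row['dataset_id']}: "
                "dataset_source_table and dataset_source_sql are both set"
            )

    # The (workspace_id, dataset_id) pair must be unique across all rows.
    keys = Counter((row["workspace_id"], row["dataset_id"]) for row in rows)
    for (workspace_id, dataset_id), count in keys.items():
        if count > 1:
            errors.append(f"{workspace_id}/{dataset_id}: duplicate definition")

    return errors


def rollback_required(invalid_before: set[str], invalid_after: set[str]) -> bool:
    """Assumed rule: roll back if the push invalidated any new object."""
    return bool(invalid_after - invalid_before)


rows = [
    {
        "workspace_id": "ws1",
        "dataset_id": "d1",
        "dataset_source_table": "orders",
        "dataset_source_sql": "",
    },
    {
        "workspace_id": "ws1",
        "dataset_id": "d1",
        "dataset_source_table": "orders",
        "dataset_source_sql": "SELECT 1",
    },
]
errors = validate_datasets(rows)
```

With the two sample rows, `errors` holds one entry for the row that sets both source columns and one for the duplicated `ws1`/`d1` pair.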
diff --git a/requirements-test.txt b/requirements-test.txt
@@ -1,3 +1,3 @@
 pytest~=7.3.2
-moto~=4.1.11
+moto~=5.1.6
 pytest-mock==3.14.0
diff --git a/requirements.txt b/requirements.txt
@@ -1,4 +1,4 @@
-boto3==1.37.21
+boto3==1.38.45
 gooddata_sdk==1.39.0
 requests==2.32.0
 pydantic==2.11.3
diff --git a/scripts/backup.py b/scripts/backup.py
@@ -6,8 +6,9 @@
 import os
 import shutil
 import tempfile
+import threading
 import time
-from concurrent.futures import ThreadPoolExecutor
+from concurrent.futures import ThreadPoolExecutor, as_completed
 from pathlib import Path
 from typing import Any, Type
 
@@ -327,7 +328,7 @@ def get_workspace_export(
             logger.info(f"Stored export for {ws_id}")
             exported = True
         except Exception as e:
-            logger.error(f"Skipping {ws_id}. Error encountered: {e}")
+            logger.error(f"Skipping {ws_id}. {e.__class__.__name__} encountered: {e}")
 
     if not exported:
         raise RuntimeError(
@@ -414,13 +415,18 @@ def process_batch(
     org_id: str,
     storage: BackupStorage,
     batch: BackupBatch,
+    stop_event: threading.Event,
     retry_count: int = 0,
 ) -> None:
     """Processes a single batch of workspaces for backup.
     If the batch processing fails, the function will wait
     and retry with exponential backoff up to BackupSettings.MAX_RETRIES.
     The base wait time is defined by BackupSettings.RETRY_DELAY.
     """
+    if stop_event.is_set():
+        # If the stop_event flag is set, return early so this batch is skipped.
+        return
+
     try:
         with tempfile.TemporaryDirectory() as tmpdir:
             get_workspace_export(sdk, api, tmpdir, org_id, batch.list_of_ids)
@@ -430,17 +436,24 @@ def process_batch(
             storage.export(tmpdir, org_id)
 
     except Exception as e:
-        # Retry with exponential backoff until MAX_RETRIES, then raise the error
-        if retry_count < BackupSettings.MAX_RETRIES:
+        if stop_event.is_set():
+            return
+
+        elif retry_count < BackupSettings.MAX_RETRIES:
+            # Retry with exponential backoff until MAX_RETRIES.
             next_retry = retry_count + 1
+            wait_time = BackupSettings.RETRY_DELAY**next_retry
             logger.info(
-                f"Unexpected error while processing a batch. Retrying {next_retry}/{BackupSettings.MAX_RETRIES}..."
+                f"{e.__class__.__name__} encountered while processing a batch. "
+                + f"Retrying {next_retry}/{BackupSettings.MAX_RETRIES} in {wait_time} seconds..."
             )
-            time.sleep(BackupSettings.RETRY_DELAY**next_retry)
-            process_batch(sdk, api, org_id, storage, batch, next_retry)
+
+            time.sleep(wait_time)
+            process_batch(sdk, api, org_id, storage, batch, stop_event, next_retry)
         else:
-            logger.error(f"Error processing batch: {e}")
-            raise e
+            # If the batch fails after MAX_RETRIES, raise the error.
+            logger.error(f"Batch failed: {e.__class__.__name__}: {e}")
+            raise
 
 
 def process_batches_in_parallel(
@@ -450,15 +463,38 @@ def process_batches_in_parallel(
     storage: BackupStorage,
     batches: list[BackupBatch],
 ) -> None:
+    """
+    Processes batches in parallel using concurrent.futures. Will stop the processing
+    if any one of the batches fails.
+    """
+
+    # Create a threading flag to control the threads that have already been started
+    stop_event = threading.Event()
+
     with ThreadPoolExecutor(max_workers=BackupSettings.MAX_WORKERS) as executor:
+        # Set the futures tasks.
         futures = []
         for batch in batches:
             futures.append(
-                executor.submit(process_batch, sdk, api, org_id, storage, batch)
+                executor.submit(
+                    process_batch, sdk, api, org_id, storage, batch, stop_event
+                )
             )
 
-        for future in futures:
-            future.result()
+        # Process futures as they complete
+        for future in as_completed(futures):
+            try:
+                future.result()
+            except Exception:
+                # On failure, set the flag to True - signal running threads to stop.
+                stop_event.set()
+
+                # Cancel futures that have not started yet.
+                for f in futures:
+                    if not f.done():
+                        f.cancel()
+
+                raise
 
 
 def main(args: argparse.Namespace) -> None:
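
The stop-on-first-failure pattern introduced in `process_batches_in_parallel` can be tried in isolation with the sketch below. It is a simplified stand-in, not the backup code: `work` and `run_all` are hypothetical, item `0` simulates a failing batch, and the function returns a status string instead of re-raising so the outcome is easy to inspect.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(item: int, stop_event: threading.Event) -> int:
    """Hypothetical task: item 0 fails at once, the others wait for the stop flag."""
    if stop_event.is_set():
        return -1  # a stop was requested before this task began
    if item == 0:
        raise ValueError("simulated batch failure")
    # Event.wait returns as soon as the flag is set, so the task wakes up
    # early instead of blocking for the full timeout.
    stop_event.wait(timeout=2)
    return item


def run_all(items: list[int], max_workers: int = 2) -> str:
    stop_event = threading.Event()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(work, item, stop_event) for item in items]
        try:
            for future in as_completed(futures):
                future.result()
        except ValueError as error:
            # Signal running tasks to stop, then cancel the ones still queued.
            stop_event.set()
            for f in futures:
                f.cancel()
            return f"stopped: {error}"
    return "all done"


result = run_all([0, 1, 2, 3])
print(result)  # stopped: simulated batch failure
```

One design point the sketch makes visible: a task blocked in `time.sleep` cannot notice the flag until the sleep ends, whereas `stop_event.wait(wait_time)` returns immediately once the flag is set. Whether the retry delay in `process_batch` should be made interruptible in that way is a judgment call for the authors.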

diff --git a/scripts/custom_fields.py b/scripts/custom_fields.py
@@ -0,0 +1,79 @@
+# (C) 2025 GoodData Corporation
+"""Top level script to manage custom datasets and fields in GoodData Cloud.
+
+This script allows you to extend the Logical Data Model (LDM) of a child workspace.
+Documentation and usage instructions are located in `docs/CUSTOM_FIELDS.md` file.
+"""
+
+import argparse
+import os
+
+from custom_fields.custom_field_manager import (  # type: ignore[import]
+    CustomFieldManager,
+)
+from utils.utils import read_csv_file_to_dict  # type: ignore[import]
+
+
+def main(
+    path_to_custom_datasets_csv: str,
+    path_to_custom_fields_csv: str,
+    check_relations: bool,
+) -> None:
+    """Main function to run the custom fields script."""
+    # Get host and token from environment variables
+    # TODO: add option to load credentials from profile
+    # TODO: (refactor) credentials should be handled in one place for the project
+    host = os.environ.get("GDC_HOSTNAME")
+    token = os.environ.get("GDC_AUTH_TOKEN")
+
+    if not host:
+        raise ValueError("GDC_HOSTNAME environment variable is not set.")
+    if not token:
+        raise ValueError("GDC_AUTH_TOKEN environment variable is not set.")
+
+    # Load input data from csv files
+    custom_datasets: list[dict[str, str]] = read_csv_file_to_dict(
+        path_to_custom_datasets_csv
+    )
+    custom_fields: list[dict[str, str]] = read_csv_file_to_dict(
+        path_to_custom_fields_csv
+    )
+
+    # Create instance of CustomFieldManager with host and token
+    manager = CustomFieldManager(host, token)
+
+    # Process the custom datasets and fields
+    manager.process(custom_datasets, custom_fields, check_relations)
+
+
+def parse_args():
+    """Parse command line arguments."""
+    parser = argparse.ArgumentParser(description="Custom Fields Script")
+    parser.add_argument(
+        "path_to_custom_datasets_csv",
+        type=str,
+        help="Path to the CSV file containing custom datasets definitions.",
+    )
+    parser.add_argument(
+        "path_to_custom_fields_csv",
+        type=str,
+        help="Path to the CSV file containing custom fields definitions.",
+    )
+    parser.add_argument(
+        "--no-relations-check",
+        action="store_false",
+        dest="check_relations",
+        help="Skip the relations check after updating the LDM. "
+        + "By default, relations are checked and the update is rolled back "
+        + "if new invalid relations are found.",
+    )
+
+    return parser.parse_args()
+
+
+if __name__ == "__main__":
+    args: argparse.Namespace = parse_args()
+    path_to_custom_datasets_csv = args.path_to_custom_datasets_csv
+    path_to_custom_fields_csv = args.path_to_custom_fields_csv
+    check_relations: bool = args.check_relations
+    main(path_to_custom_datasets_csv, path_to_custom_fields_csv, check_relations)
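
Because the flag uses `action="store_false"` with `dest="check_relations"`, the parsed value is `True` when the flag is absent and `False` when it is passed. The standalone sketch below reproduces only the argument definitions to show this; `build_parser` is a hypothetical helper and the help text is shortened.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild just the arguments needed to demonstrate the flag."""
    parser = argparse.ArgumentParser(description="Custom Fields Script")
    parser.add_argument("path_to_custom_datasets_csv", type=str)
    parser.add_argument("path_to_custom_fields_csv", type=str)
    parser.add_argument(
        "--no-relations-check",
        action="store_false",  # passing the flag stores False
        dest="check_relations",  # value is exposed as args.check_relations
        help="Skip the relations check.",
    )
    return parser


default_args = build_parser().parse_args(["datasets.csv", "fields.csv"])
skipped_args = build_parser().parse_args(
    ["datasets.csv", "fields.csv", "--no-relations-check"]
)

print(default_args.check_relations)  # True
print(skipped_args.check_relations)  # False
```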
diff --git a/scripts/custom_fields/__init__.py b/scripts/custom_fields/__init__.py
@@ -0,0 +1 @@
+# (C) 2025 GoodData Corporation