Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
124 changes: 124 additions & 0 deletions docs/CUSTOM_FIELDS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,124 @@
# Custom Field Management

The `scripts/custom_fields.py` script will allow you to extend the Logical Data Model (LDM) of a child workspace by adding extra datasets which are not present in the parent workspaces' LDM.

## Environment setup

The script relies on `GDC_HOSTNAME` and `GDC_AUTH_TOKEN` environment variables. You can export these by running this in your terminal:

```shell
export GDC_HOSTNAME=https://your-gooddata-cloud-domain.com
export GDC_AUTH_TOKEN=your-personal-access-token
```

## Input files

The script works with input from two CSV files. These files should contain (a) custom dataset definitions and (b) custom field definitions.

The custom dataset defines the dataset entity, i.e., the box you would see in the GoodData Cloud UI. The custom fields, on the other hand, define the individual fields in that dataset. You can imagine it as first defining a table and then its columns.

Multiple datasets and fields can be defined in the files. However, the files need to be consistent with each other - you cannot define fields form datasets that are not defined in the datasets file.

### Custom dataset definitions

The first contains the definitions of the datasets you want to create. It should have following structure:

| workspace_id | dataset_id | dataset_name | dataset_datasource_id | dataset_source_table | dataset_source_sql | parent_dataset_reference | parent_dataset_reference_attribute_id | dataset_reference_source_colum | wdf_id |
| -------------------- | ----------------- | -------------------- | --------------------- | -------------------- | ------------------ | ------------------------ | ------------------------------------- | ------------------------------ | ------ |
| child_workspace_id_1 | custom_dataset_id | Custom Dataset Title | datasource_id | dataset_source_table | | parent_dataset_id | parent_dataset.reference_field | custom_dataset.reference_field | wdf_id |

#### Validity constraints

- The `dataset_source_table` and `dataset_source_sql` are mutually exclusive. Only one of those should be filled in, the other should be null (empty value). In case both values are present, the script will throw an error.

- `workspace_id` + `dataset_id` combination must be unique across all dataset definitions.

#### JSON representation

For readability, I include the data structure in JSON with comments. However, note that the script will only work with CSV files!

```json
{
"workspace_id": "child_workspace_id_1", // child workspace id
"dataset_id": "custom_dataset_id", // custom dataset id
"dataset_name": "Custom Dataset Title", // custom dataset name
"dataset_datasource_id": "datasource_id", // data source id -> in the UI, you see it when you go to "manage files"
"dataset_source_table": "dataset_source_table", // the name of the table in the physical data model
"dataset_source_sql": null, // SQL query defining the dataset
"parent_dataset_reference": "products", // ID of the parent dataset to which the custom one will be connected
"parent_dataset_reference_attribute_id": "products.product_id", // parent dataset column name used fot the "join"
"dataset_reference_source_colum": "product_id", // custom dataset column name used for the "join"
"wdf_id": "x__client_id" // workspace data filter id
}
```

### Custom fields definition

The individual files of the custom dataset are defined thusly:

| workspace_id | dataset_id | cf_id | cf_name | cf_type | cf_source_column | cf_source_column_data_type |
| -------------------- | ----------------- | --------------- | ----------------- | --------- | -------------------------- | -------------------------- |
| child_workspace_id_1 | custom_dataset_id | custom_field_id | Custom Field Name | attribute | custom_field_source_column | INT |

#### Validity constraints

The custom field definitions must comply with these criteria:

- **attributes** and **facts**: unique `workspace_id` + `cf_id` combinations
- **dates**: unique `dataset_id` and `cf_id` combinations

#### JSON representation

Again, here is a JSON definition with comments for readability:

```json
{
"workspace_id": "child_workspace_id_1", // child workspace ID
"dataset_id": "custom_dataset_id", // custom dataset ID
"cf_id": "custom_field_id", // custom field ID
"cf_name": "Custom Field Name", // custom field name
"cf_type": "attribute", // GoodData type of the field*
"cf_source_column": "custom_field_source_column", // name of the column in the physical data model
"cf_source_column_data_type": "INT" // data type of the field*
}
```

\* Supported values of **_cf_type_** and **_cf_source_column_data_type_** are listed in `CustomFieldType` and `ColumnDataType` enums in [models](../scripts/custom_fields/models/custom_data_object.py)

## Usage

Now that your environment and input files are set up, let's have a look at how to run the script 🚀.

The script takes two positional arguments, which represent the paths to the input files we have discussed above.

```shell
python scripts/custom_fields.py custom_datasets.csv custom_fields.csv
```

There is also an optional flag: `--no-relations-check`. It's meaning is discussed in the next section.

### Check valid relations

Regardless of whether the flag is used or not, the script will always start by loading and validating the data from the provided files. The script will then iterate through workspaces.

#### If unused

If `--no-relations-check` is not used, the script will:

1. Store current workspace layout (analytical objects and LDM).
1. Check whether relations of metrics, visualizations and dashboards are valid. A set of current objects with invalid relations is created.
1. Push the updated LDM to GoodData Cloud.
1. Check object relations again. New set of objects with invalid relations is created.
1. The sets are compared.
- If there is more objects with invalid references in the new set, it means the objects were invalidated. Rollback is required.
- If the sets are not equal, rollback might be required
- If there is fewer invalid references or the sets are equal, rollback is not required
1. In case rollback is required, the initally stored workspace layout will be pushed to GoodData Cloud again, reverting changes to the workspace.

#### If used

If you decide to use the `--no-relations-check` flag, the script will simply validate the data and push the LDM extension to GoodData Cloud without any additional checks or rollbacks.

```shell
python scripts/custom_fields.py custom_datasets.csv custom_fields.csv --no-relations-check
```
2 changes: 1 addition & 1 deletion requirements-test.txt
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
pytest~=7.3.2
moto~=4.1.11
moto~=5.1.6
pytest-mock==3.14.0
2 changes: 1 addition & 1 deletion requirements.txt
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
boto3==1.37.21
boto3==1.38.45
gooddata_sdk==1.39.0
requests==2.32.0
pydantic==2.11.3
79 changes: 79 additions & 0 deletions scripts/custom_fields.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
# (C) 2025 GoodData Corporation
"""Top level script to manage custom datasets and fields in GoodData Cloud.

This script allows you to extend the Logical Data Model (LDM) of a child workspace.
Documentation and usage instructions are located in `docs/CUSTOM_FIELDS.md` file.
"""

import argparse
import os

from custom_fields.custom_field_manager import ( # type: ignore[import]
CustomFieldManager,
)
from utils.utils import read_csv_file_to_dict # type: ignore[import]


def main(
path_to_custom_datasets_csv: str,
path_to_custom_fields_csv: str,
check_relations: bool,
) -> None:
"""Main function to run the custom fields script."""
# Get host and token from environment variables
# TODO: add option to load credentials from profile
# TODO: (refactor) credentials should be handled in one place for the project
host = os.environ.get("GDC_HOSTNAME")
token = os.environ.get("GDC_AUTH_TOKEN")

if not host:
raise ValueError("GDC_HOSTNAME environment variable is not set.")
if not token:
raise ValueError("GDC_AUTH_TOKEN environment variable is not set.")

# Load input data from csv files
custom_datasets: list[dict[str, str]] = read_csv_file_to_dict(
path_to_custom_datasets_csv
)
custom_fields: list[dict[str, str]] = read_csv_file_to_dict(
path_to_custom_fields_csv
)

# Create instance of CustomFieldManager with host and token
manager = CustomFieldManager(host, token)

# Process the custom datasets and fields
manager.process(custom_datasets, custom_fields, check_relations)


def parse_args():
"""Parse command line arguments."""
parser = argparse.ArgumentParser(description="Custom Fields Script")
parser.add_argument(
"path_to_custom_datasets_csv",
type=str,
help="Path to the CSV file containing custom datasets definitions.",
)
parser.add_argument(
"path_to_custom_fields_csv",
type=str,
help="Path to the CSV file containing custom fields definitions.",
)
parser.add_argument(
"--no-relations-check",
action="store_false",
dest="check_relations",
help="Check relations after updating LLM. "
+ "If new ivalid relations are found, the update is rolled back. "
+ "Boolean, defaults to True.",
)

return parser.parse_args()


if __name__ == "__main__":
args: argparse.Namespace = parse_args()
path_to_custom_datasets_csv = args.path_to_custom_datasets_csv
path_to_custom_fields_csv = args.path_to_custom_fields_csv
check_relations: bool = args.check_relations
main(path_to_custom_datasets_csv, path_to_custom_fields_csv, check_relations)
1 change: 1 addition & 0 deletions scripts/custom_fields/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
# (C) 2025 GoodData Corporation
Loading