diff --git a/docs/internal/design_docs/connect_design_document.md b/docs/internal/design_docs/connect_design_document.md new file mode 100644 index 0000000..03ef02c --- /dev/null +++ b/docs/internal/design_docs/connect_design_document.md @@ -0,0 +1,112 @@ +# Retrospective Design Document: Connect Environment Verification Tool + +## 1. Executive Summary & Scope + +The `connect` tool (`bin/connect`) is a post-installation user verification tool designed to validate that the Gluent Offload Engine (GOE) environment is correctly configured and healthy before running offload operations. It executes a suite of tests spanning configuration, database connectivity, and platform access, providing immediate visual feedback and structured exit codes. + +### Supported Topologies +* **Frontend**: Oracle Database. +* **Backend**: Google Cloud Platform (BigQuery). + +### Out of Scope Topologies +While the underlying codebase contains legacy verification routines for other systems, the following targets are officially **out of scope** and disabled for this environment deployment: +* **Backend**: Hadoop distributions (Cloudera, etc.), Snowflake, and Azure Synapse. +* **Data Transport**: Apache Livy and Spark Thrift Server. + +--- + +## 2. Architecture & High-Level Execution Flow + +`connect` is implemented in Python and triggered via a wrapper shell script. + +```mermaid +graph TD + A[bin/connect wrapper] --> B[check_config_path] + B --> C[goe.connect.connect.connect] + C --> D[Load Environment offload.env] + D --> E[Parse CLI Options] + E --> F{Option: --upgrade-environment-file?} + F -- Yes --> G[Run upgrade_environment_file] + F -- No --> H[Run check_environment] + H --> I[Configuration Section] + I --> J[Frontend Section] + J --> K[Backend Section] + K --> L[Data Transport Section] + L --> M[Local Section] + M --> N[Report Results & Exit] +``` + +### Execution Phases +1. **Initialization**: Checks for the existence of the `OFFLOAD_HOME` environment variable, loads the `offload.env` environment file, and parses CLI flags. +2. **Configuration Validation**: Assesses the integrity and permissions of configuration files. +3. **Frontend Verification**: Validates database connectivity, driver modes, and schema compatibility on the source database. +4. **Backend Verification**: Validates access permissions, service APIs, and encryption assets on the target cloud platform. +5. **Data Transport Verification**: Runs loopback tests to ensure the transport engine (Spark) can initiate bidirectional connectivity. +6. **Local Verification**: Evaluates host operating system details and local helper scripts. + +--- + +## 3. Detailed Verification Specifications + +### 3.1. Configuration Checks +* **Template Delta Check**: Compares keys defined in the active `$OFFLOAD_HOME/conf/offload.env` against the relevant template (e.g., `oracle-bigquery-offload.env.template`). + * If keys exist in `offload.env` but not in the template (potentially deprecated configuration), a **Warning** is generated. + * If keys exist in the template but are missing in `offload.env`, a **Warning** is generated, prompting the user to upgrade. +* **File Permissions Check**: Inspects `$OFFLOAD_HOME/conf/offload.env`. Expects permissions to be exactly `640` (owner read/write, group read, others no access). Deviations result in a **Warning**. + +### 3.2. Frontend Checks (Oracle) +* **Connectivity Checks**: + * Attempts to connect as the application user (`rdbms_app_user`) and the administrative user (`ora_adm_user`). + * Supports connection via standard credentials or secure Oracle Wallet configurations. +* **Driver Mode Check**: Validates the Python `oracledb` driver operational mode (Thin mode vs. Thick mode based on configuration). +* **Database Version Check**: Asserts that the Oracle Database version is at or above the minimum supported version (`10.2.0.1`). +* **Component Version Matching**: Compares the version of the installed database components (compiled schema packages) against the client binary version to ensure they are synchronized. +* **Metadata & Parameters (Informational)**: + * Queries session attributes (`DB_UNIQUE_NAME`, `SESSION_USER`) and character sets (`NLS_CHARACTERSET`, `NLS_NCHAR_CHARACTERSET`). + * Queries database parameters: `processes`, `sessions`, `query_rewrite_enabled`, and `_optimizer_cartesian_enabled`. + * > [!NOTE] + * > The database parameters are collected for diagnostic and informational purposes only. No threshold assertions are applied. This section is slated for removal in a future release per **GitHub Issue #265**. + +### 3.3. Backend Checks (BigQuery) +* **API Access & Connection**: Instantiates the BigQuery backend API client and queries the backend user identity to confirm that the BigQuery API is enabled and accessible. +* **KMS Key Verification**: If a Cloud KMS key is configured for data encryption, `connect` queries the key status and metadata: + * Verifies the key state is `ENABLED`. + * Verifies the key purpose is `ENCRYPT_DECRYPT`. + +### 3.4. Data Transport Checks (Spark) +Data transport validation tests that the data processing cluster can fetch data from the source Oracle database. +* **GCP Dataproc / Dataproc Batches**: + * Validates availability of the `gcloud` command-line executable. + * Checks integration with Google Dataproc (clusters) or Google Dataproc Batches (serverless Spark). +* **Generic Spark Submit**: Validates the availability of `spark-submit` executable if configured. +* **RDBMS Connection Callback (Loopback Ping)**: + * For the active transport method, `connect` triggers a remote Spark task that attempts to establish a JDBC connection back to the source Oracle Database (`ping_source_rdbms()`). + * If the remote job cannot connect back (e.g., due to firewall blockages between the Dataproc network and the Oracle database host), the test reports a **Failure**. + +### 3.5. Local Checks +* **OS Distribution & Kernel Check**: Reads `/etc/redhat-release`, `/etc/SuSE-release`, or `/etc/os-release` and executes `uname -r`. Fails if the operating system distribution is unrecognized. +* **GEL Listener & Redis Cache (Disabled)**: + * > [!IMPORTANT] + * > The Gluent Event Listener (GEL) and Redis cache status checks are currently hard-disabled in the codebase pending **GitHub Issue #109**. + +--- + +## 4. Diagnostics & Exit Strategy + +`connect` communicates test outcomes using console log severity colors (Passed in green, Warning in yellow, Failed in red) and reports execution health to callers via standard OS exit codes: + +| Exit Code | Severity | Description | +| :--- | :--- | :--- | +| **`0`** | Success | All executed checks completed successfully. | +| **`1`** | Fatal | A critical/uncaught exception occurred (e.g., completely unable to connect to Oracle RDBMS during startup). | +| **`2`** | Failure | One or more active validation checks failed (e.g., Dataproc callback failed, KMS key disabled). | +| **`3`** | Warning | No failures occurred, but one or more warnings were detected (e.g., incorrect `offload.env` permissions). | + +--- + +## 5. Operational Utilities + +### Configuration Upgrade (`--upgrade-environment-file`) +To simplify transitions between Gluent software releases, `connect` provides an automated upgrade routine: +* When executed with `--upgrade-environment-file`, it compares the user's active `offload.env` against the reference template. +* It appends any newly introduced configuration parameters (along with default values and template documentation comments) to the end of the user's `offload.env`. diff --git a/src/goe/offload/spark/dataproc_offload_transport.py b/src/goe/offload/spark/dataproc_offload_transport.py index 62cdc20..d53cd0b 100644 --- a/src/goe/offload/spark/dataproc_offload_transport.py +++ b/src/goe/offload/spark/dataproc_offload_transport.py @@ -47,6 +47,7 @@ GCLOUD_PROPERTY_SEPARATOR = ",GSEP," GCLOUD_BATCHES_STATE_CANCELLED = "CANCELLED" GCLOUD_BATCHES_STATE_FAILED = "FAILED" +GCLOUD_BATCHES_STATE_MESSAGE_TASK_NOT_ACQUIRED = "Task was not acquired" class OffloadTransportSparkBatchesGcloud(OffloadTransportSpark): @@ -362,13 +363,19 @@ def _verify_batch_describe_response(self, describe_output: str) -> bool: self.log(f"Exception describing Dataproc batch: {str(exc)}", detail=VERBOSE) return False state = reponse_dict.get("state") + state_message = reponse_dict.get("stateMessage", "") if state and state in [ GCLOUD_BATCHES_STATE_CANCELLED, GCLOUD_BATCHES_STATE_FAILED, ]: if state == GCLOUD_BATCHES_STATE_CANCELLED: raise OffloadTransportException( - "Dataproc Batch is incomplete due to TTL, increase GOOGLE_DATAPROC_BATCHES_TTL" + "Dataproc batch is incomplete due to TTL, increase GOOGLE_DATAPROC_BATCHES_TTL" + ) + elif GCLOUD_BATCHES_STATE_MESSAGE_TASK_NOT_ACQUIRED in state_message: + raise OffloadTransportException( + f"Dataproc batch failed with stateMessage containing '{GCLOUD_BATCHES_STATE_MESSAGE_TASK_NOT_ACQUIRED}'. " + "The likely cause is missing VPC network/firewall prerequisites for Dataproc Serverless" ) else: raise OffloadTransportException( diff --git a/tests/integration/offload/test_backend_api.py b/tests/integration/offload/test_backend_api.py index 2c9fb0e..67f8148 100644 --- a/tests/integration/offload/test_backend_api.py +++ b/tests/integration/offload/test_backend_api.py @@ -37,7 +37,6 @@ from tests.testlib.test_framework.test_functions import get_test_messages from tests.unit.offload.test_backend_api import TestBackendApi - DIM_NAME = "INTEG_BACKEND_API_DIM" FACT_NAME = "INTEG_BACKEND_API_FACT" @@ -89,17 +88,14 @@ def _create_test_tables(self): messages, frontend_api.standard_dimension_frontend_ddl(self.schema, DIM_NAME), ) - # Ignore return status, if the table has already been offloaded previously then we'll re-use it. - try: - run_offload( - { - "owner_table": self.schema + "." + self.table, - "create_backend_db": True, - "execute": True, - } - ) - except OffloadException: - # If this one fails then we let the exception bubble up. + # If the table has already been offloaded previously then we'll re-use it. + if not run_offload( + { + "owner_table": self.schema + "." + self.table, + "create_backend_db": True, + "execute": True, + } + ): run_offload( { "owner_table": self.schema + "." + self.table, @@ -118,16 +114,12 @@ def _create_test_tables(self): self.schema, self.part_table, simple_partition_names=True ), ) - # Ignore return status, if the table has already been offloaded previously then we'll re-use it. - try: - run_offload( - { - "owner_table": self.schema + "." + self.part_table, - "execute": True, - }, - ) - except OffloadException: - # If this one fails then we let the exception bubble up. + if not run_offload( + { + "owner_table": self.schema + "." + self.part_table, + "execute": True, + }, + ): run_offload( { "owner_table": self.schema + "." + self.part_table, diff --git a/tests/integration/offload/test_backend_table.py b/tests/integration/offload/test_backend_table.py index b297b05..2ec2b89 100644 --- a/tests/integration/offload/test_backend_table.py +++ b/tests/integration/offload/test_backend_table.py @@ -12,13 +12,14 @@ # See the License for the specific language governing permissions and # limitations under the License. -""" TestBackendTable: Unit test library to test table level API for the configured backend. - Because there are so few table level methods that do not need a database we do not - skim all backends like we do for BackendApi testing. - A good number of methods are not unit tested because they need detailed inputs, such - as RDBMS columns, cast information, staging file details. For these we continue to - rely on integration tests. +"""TestBackendTable: Unit test library to test table level API for the configured backend. +Because there are so few table level methods that do not need a database we do not +skim all backends like we do for BackendApi testing. +A good number of methods are not unit tested because they need detailed inputs, such +as RDBMS columns, cast information, staging file details. For these we continue to +rely on integration tests. """ + from datetime import datetime import decimal import logging @@ -57,7 +58,6 @@ frontend_testing_api_factory, ) - FACT_NAME = "INTEG_BACKEND_TABLE_FACT" @@ -142,17 +142,14 @@ def _create_test_table(self): self.schema, FACT_NAME, simple_partition_names=True ), ) - # Ignore return status, if the table has already been offloaded previously then we'll re-use it. - try: - run_offload( - { - "owner_table": self.schema + "." + FACT_NAME, - "create_backend_db": True, - "execute": True, - } - ) - except OffloadException: - # If this one fails then we let the exception bubble up. + # If the table has already been offloaded previously then we'll re-use it. + if not run_offload( + { + "owner_table": self.schema + "." + FACT_NAME, + "create_backend_db": True, + "execute": True, + } + ): run_offload( { "owner_table": self.schema + "." + FACT_NAME, diff --git a/tests/integration/offload/test_predicate_offload.py b/tests/integration/offload/test_predicate_offload.py index eb335b8..0ddfc95 100644 --- a/tests/integration/offload/test_predicate_offload.py +++ b/tests/integration/offload/test_predicate_offload.py @@ -15,7 +15,7 @@ # limitations under the License. """ - Offload predicate test code. +Offload predicate test code. """ from copy import copy @@ -43,7 +43,6 @@ get_test_messages, ) - DIM_NAME = "INTEG_PBO_DIM" FACT_NAME = "INTEG_PBO_FACT" @@ -74,17 +73,14 @@ def create_and_offload_dim_table(config, frontend_api, messages, schema): messages, frontend_api.standard_dimension_frontend_ddl(schema, DIM_NAME), ) - # Ignore return status, if the table has already been offloaded previously then we'll re-use it. - try: - run_offload( - { - "owner_table": schema + "." + DIM_NAME, - "create_backend_db": True, - "execute": True, - } - ) - except OffloadException: - # If this one fails then we let the exception bubble up. + # If the table has already been offloaded previously then we'll re-use it. + if not run_offload( + { + "owner_table": schema + "." + DIM_NAME, + "create_backend_db": True, + "execute": True, + } + ): run_offload( { "owner_table": schema + "." + DIM_NAME, diff --git a/tests/unit/offload/test_offload_transport.py b/tests/unit/offload/test_offload_transport.py index 78068f1..0cf18ed 100644 --- a/tests/unit/offload/test_offload_transport.py +++ b/tests/unit/offload/test_offload_transport.py @@ -43,7 +43,6 @@ FAKE_ORACLE_BQ_ENV, ) - FRONTEND_COLUMNS = [ OracleColumn( "COL_VARCHAR2", @@ -356,6 +355,18 @@ def test_dataproc_batches_describe_cmd(config, messages, oracle_table, fake_oper "stateMessage": "Job failed with message [SyntaxError: invalid syntax]. Additional details can be found at:\\nhttps://console.cloud.google.com/dataproc/batches/west1/goe-batch-20240426080958?project=p\\ngcloud dataproc batches wait 'goe-batch-20240426080958' --region 'west1' --project 'p'\\nhttps://console.cloud.google.com/storage/browser/dataproc-staging-west1-123-l/batch-3347f/\\ngs://dataproc-staging-west1-123-l/google-cloud-dataproc-metainfo/2ad11/jobs/srvls-batch-3347f/driveroutput.*", "stateTime": "2024-04-26T10:27:48.237750Z", "uuid": "12cea" +}""", + True, + ), + # Describe output for a job failing with 'Task was not acquired'. + ( + """{ + "createTime": "2026-05-13T08:05:33Z", + "creator": "sa@p.iam.gserviceaccount.com", + "name": "projects/p/locations/west1/batches/goe-batch-canary", + "state": "FAILED", + "stateMessage": "Task was not acquired.", + "uuid": "12cea" }""", True, ),