gluent · nj1973 · Jun 4, 2026 · Jun 3, 2026 · Jun 3, 2026 · Jun 3, 2026
diff --git a/docs/internal/design_docs/connect_design_document.md b/docs/internal/design_docs/connect_design_document.md
@@ -0,0 +1,112 @@
+# Retrospective Design Document: Connect Environment Verification Tool
+
+## 1. Executive Summary & Scope
+
+The `connect` tool (`bin/connect`) is a post-installation user verification tool designed to validate that the Gluent Offload Engine (GOE) environment is correctly configured and healthy before running offload operations. It executes a suite of tests spanning configuration, database connectivity, and platform access, providing immediate visual feedback and structured exit codes.
+
+### Supported Topologies
+* **Frontend**: Oracle Database.
+* **Backend**: Google Cloud Platform (BigQuery).
+
+### Out of Scope Topologies
+While the underlying codebase contains legacy verification routines for other systems, the following targets are officially **out of scope** and disabled for this environment deployment:
+* **Backend**: Hadoop distributions (Cloudera, etc.), Snowflake, and Azure Synapse.
+* **Data Transport**: Apache Livy and Spark Thrift Server.
+
+---
+
+## 2. Architecture & High-Level Execution Flow
+
+`connect` is implemented in Python and triggered via a wrapper shell script.
+
+```mermaid
+graph TD
+    A[bin/connect wrapper] --> B[check_config_path]
+    B --> C[goe.connect.connect.connect]
+    C --> D[Load Environment offload.env]
+    D --> E[Parse CLI Options]
+    E --> F{Option: --upgrade-environment-file?}
+    F -- Yes --> G[Run upgrade_environment_file]
+    F -- No --> H[Run check_environment]
+    H --> I[Configuration Section]
+    I --> J[Frontend Section]
+    J --> K[Backend Section]
+    K --> L[Data Transport Section]
+    L --> M[Local Section]
+    M --> N[Report Results & Exit]
+```
+
+### Execution Phases
+1. **Initialization**: Checks for the existence of the `OFFLOAD_HOME` environment variable, loads the `offload.env` environment file, and parses CLI flags.
+2. **Configuration Validation**: Assesses the integrity and permissions of configuration files.
+3. **Frontend Verification**: Validates database connectivity, driver modes, and schema compatibility on the source database.
+4. **Backend Verification**: Validates access permissions, service APIs, and encryption assets on the target cloud platform.
+5. **Data Transport Verification**: Runs loopback tests to ensure the transport engine (Spark) can initiate bidirectional connectivity.
+6. **Local Verification**: Evaluates host operating system details and local helper scripts.
+
+---
+
+## 3. Detailed Verification Specifications
+
+### 3.1. Configuration Checks
+* **Template Delta Check**: Compares keys defined in the active `$OFFLOAD_HOME/conf/offload.env` against the relevant template (e.g., `oracle-bigquery-offload.env.template`).
+  * If keys exist in `offload.env` but not in the template (potentially deprecated configuration), a **Warning** is generated.
+  * If keys exist in the template but are missing in `offload.env`, a **Warning** is generated, prompting the user to upgrade.
+* **File Permissions Check**: Inspects `$OFFLOAD_HOME/conf/offload.env`. Expects permissions to be exactly `640` (owner read/write, group read, others no access). Deviations result in a **Warning**.
+
+### 3.2. Frontend Checks (Oracle)
+* **Connectivity Checks**:
+  * Attempts to connect as the application user (`rdbms_app_user`) and the administrative user (`ora_adm_user`).
+  * Supports connection via standard credentials or secure Oracle Wallet configurations.
+* **Driver Mode Check**: Validates the Python `oracledb` driver operational mode (Thin mode vs. Thick mode based on configuration).
+* **Database Version Check**: Asserts that the Oracle Database version is at or above the minimum supported version (`10.2.0.1`).
+* **Component Version Matching**: Compares the version of the installed database components (compiled schema packages) against the client binary version to ensure they are synchronized.
+* **Metadata & Parameters (Informational)**:
+  * Queries session attributes (`DB_UNIQUE_NAME`, `SESSION_USER`) and character sets (`NLS_CHARACTERSET`, `NLS_NCHAR_CHARACTERSET`).
+  * Queries database parameters: `processes`, `sessions`, `query_rewrite_enabled`, and `_optimizer_cartesian_enabled`.
+  * > [!NOTE]
+  * > The database parameters are collected for diagnostic and informational purposes only. No threshold assertions are applied. This section is slated for removal in a future release per **GitHub Issue #265**.
+
+### 3.3. Backend Checks (BigQuery)
+* **API Access & Connection**: Instantiates the BigQuery backend API client and queries the backend user identity to confirm that the BigQuery API is enabled and accessible.
+* **KMS Key Verification**: If a Cloud KMS key is configured for data encryption, `connect` queries the key status and metadata:
+  * Verifies the key state is `ENABLED`.
+  * Verifies the key purpose is `ENCRYPT_DECRYPT`.
+
+### 3.4. Data Transport Checks (Spark)
+Data transport validation tests that the data processing cluster can fetch data from the source Oracle database.
+* **GCP Dataproc / Dataproc Batches**:
+  * Validates availability of the `gcloud` command-line executable.
+  * Checks integration with Google Dataproc (clusters) or Google Dataproc Batches (serverless Spark).
+* **Generic Spark Submit**: Validates the availability of `spark-submit` executable if configured.
+* **RDBMS Connection Callback (Loopback Ping)**:
+  * For the active transport method, `connect` triggers a remote Spark task that attempts to establish a JDBC connection back to the source Oracle Database (`ping_source_rdbms()`).
+  * If the remote job cannot connect back (e.g., due to firewall blockages between the Dataproc network and the Oracle database host), the test reports a **Failure**.
+
+### 3.5. Local Checks
+* **OS Distribution & Kernel Check**: Reads `/etc/redhat-release`, `/etc/SuSE-release`, or `/etc/os-release` and executes `uname -r`. Fails if the operating system distribution is unrecognized.
+* **GEL Listener & Redis Cache (Disabled)**:
+  * > [!IMPORTANT]
+  * > The Gluent Event Listener (GEL) and Redis cache status checks are currently hard-disabled in the codebase pending **GitHub Issue #109**.
+
+---
+
+## 4. Diagnostics & Exit Strategy
+
+`connect` communicates test outcomes using console log severity colors (Passed in green, Warning in yellow, Failed in red) and reports execution health to callers via standard OS exit codes:
+
+| Exit Code | Severity | Description |
+| :--- | :--- | :--- |
+| **`0`** | Success | All executed checks completed successfully. |
+| **`1`** | Fatal | A critical/uncaught exception occurred (e.g., completely unable to connect to Oracle RDBMS during startup). |
+| **`2`** | Failure | One or more active validation checks failed (e.g., Dataproc callback failed, KMS key disabled). |
+| **`3`** | Warning | No failures occurred, but one or more warnings were detected (e.g., incorrect `offload.env` permissions). |
+
+---
+
+## 5. Operational Utilities
+
+### Configuration Upgrade (`--upgrade-environment-file`)
+To simplify transitions between Gluent software releases, `connect` provides an automated upgrade routine:
+* When executed with `--upgrade-environment-file`, it compares the user's active `offload.env` against the reference template.
+* It appends any newly introduced configuration parameters (along with default values and template documentation comments) to the end of the user's `offload.env`.
diff --git a/src/goe/offload/spark/dataproc_offload_transport.py b/src/goe/offload/spark/dataproc_offload_transport.py
@@ -47,6 +47,7 @@
 GCLOUD_PROPERTY_SEPARATOR = ",GSEP,"
 GCLOUD_BATCHES_STATE_CANCELLED = "CANCELLED"
 GCLOUD_BATCHES_STATE_FAILED = "FAILED"
+GCLOUD_BATCHES_STATE_MESSAGE_TASK_NOT_ACQUIRED = "Task was not acquired"
 
 
 class OffloadTransportSparkBatchesGcloud(OffloadTransportSpark):
@@ -362,13 +363,19 @@ def _verify_batch_describe_response(self, describe_output: str) -> bool:
             self.log(f"Exception describing Dataproc batch: {str(exc)}", detail=VERBOSE)
             return False
         state = reponse_dict.get("state")
+        state_message = reponse_dict.get("stateMessage", "")
         if state and state in [
             GCLOUD_BATCHES_STATE_CANCELLED,
             GCLOUD_BATCHES_STATE_FAILED,
         ]:
             if state == GCLOUD_BATCHES_STATE_CANCELLED:
                 raise OffloadTransportException(
-                    "Dataproc Batch is incomplete due to TTL, increase GOOGLE_DATAPROC_BATCHES_TTL"
+                    "Dataproc batch is incomplete due to TTL, increase GOOGLE_DATAPROC_BATCHES_TTL"
+                )
+            elif GCLOUD_BATCHES_STATE_MESSAGE_TASK_NOT_ACQUIRED in state_message:
+                raise OffloadTransportException(
+                    f"Dataproc batch failed with stateMessage containing '{GCLOUD_BATCHES_STATE_MESSAGE_TASK_NOT_ACQUIRED}'. "
+                    "The likely cause is missing VPC network/firewall prerequisites for Dataproc Serverless"
                 )
             else:
                 raise OffloadTransportException(

diff --git a/tests/integration/offload/test_backend_api.py b/tests/integration/offload/test_backend_api.py
@@ -37,7 +37,6 @@
 from tests.testlib.test_framework.test_functions import get_test_messages
 from tests.unit.offload.test_backend_api import TestBackendApi
 
-
 DIM_NAME = "INTEG_BACKEND_API_DIM"
 FACT_NAME = "INTEG_BACKEND_API_FACT"
 
@@ -89,17 +88,14 @@ def _create_test_tables(self):
             messages,
             frontend_api.standard_dimension_frontend_ddl(self.schema, DIM_NAME),
         )
-        # Ignore return status, if the table has already been offloaded previously then we'll re-use it.
-        try:
-            run_offload(
-                {
-                    "owner_table": self.schema + "." + self.table,
-                    "create_backend_db": True,
-                    "execute": True,
-                }
-            )
-        except OffloadException:
-            # If this one fails then we let the exception bubble up.
+        # If the table has already been offloaded previously then we'll re-use it.
+        if not run_offload(
+            {
+                "owner_table": self.schema + "." + self.table,
+                "create_backend_db": True,
+                "execute": True,
+            }
+        ):
             run_offload(
                 {
                     "owner_table": self.schema + "." + self.table,
@@ -118,16 +114,12 @@ def _create_test_tables(self):
                 self.schema, self.part_table, simple_partition_names=True
             ),
         )
-        # Ignore return status, if the table has already been offloaded previously then we'll re-use it.
-        try:
-            run_offload(
-                {
-                    "owner_table": self.schema + "." + self.part_table,
-                    "execute": True,
-                },
-            )
-        except OffloadException:
-            # If this one fails then we let the exception bubble up.
+        if not run_offload(
+            {
+                "owner_table": self.schema + "." + self.part_table,
+                "execute": True,
+            },
+        ):
             run_offload(
                 {
                     "owner_table": self.schema + "." + self.part_table,

diff --git a/tests/integration/offload/test_backend_table.py b/tests/integration/offload/test_backend_table.py
@@ -12,13 +12,14 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-""" TestBackendTable: Unit test library to test table level API for the configured backend.
-    Because there are so few table level methods that do not need a database we do not
-    skim all backends like we do for BackendApi testing.
-    A good number of methods are not unit tested because they need detailed inputs, such
-    as RDBMS columns, cast information, staging file details. For these we continue to
-    rely on integration tests.
+"""TestBackendTable: Unit test library to test table level API for the configured backend.
+Because there are so few table level methods that do not need a database we do not
+skim all backends like we do for BackendApi testing.
+A good number of methods are not unit tested because they need detailed inputs, such
+as RDBMS columns, cast information, staging file details. For these we continue to
+rely on integration tests.
 """
+
 from datetime import datetime
 import decimal
 import logging
@@ -57,7 +58,6 @@
     frontend_testing_api_factory,
 )
 
-
 FACT_NAME = "INTEG_BACKEND_TABLE_FACT"
 
 
@@ -142,17 +142,14 @@ def _create_test_table(self):
                 self.schema, FACT_NAME, simple_partition_names=True
             ),
         )
-        # Ignore return status, if the table has already been offloaded previously then we'll re-use it.
-        try:
-            run_offload(
-                {
-                    "owner_table": self.schema + "." + FACT_NAME,
-                    "create_backend_db": True,
-                    "execute": True,
-                }
-            )
-        except OffloadException:
-            # If this one fails then we let the exception bubble up.
+        # If the table has already been offloaded previously then we'll re-use it.
+        if not run_offload(
+            {
+                "owner_table": self.schema + "." + FACT_NAME,
+                "create_backend_db": True,
+                "execute": True,
+            }
+        ):
             run_offload(
                 {
                     "owner_table": self.schema + "." + FACT_NAME,

diff --git a/tests/integration/offload/test_predicate_offload.py b/tests/integration/offload/test_predicate_offload.py
@@ -15,7 +15,7 @@
 # limitations under the License.
 
 """
-    Offload predicate test code.
+Offload predicate test code.
 """
 
 from copy import copy
@@ -43,7 +43,6 @@
     get_test_messages,
 )
 
-
 DIM_NAME = "INTEG_PBO_DIM"
 FACT_NAME = "INTEG_PBO_FACT"
 
@@ -74,17 +73,14 @@ def create_and_offload_dim_table(config, frontend_api, messages, schema):
         messages,
         frontend_api.standard_dimension_frontend_ddl(schema, DIM_NAME),
     )
-    # Ignore return status, if the table has already been offloaded previously then we'll re-use it.
-    try:
-        run_offload(
-            {
-                "owner_table": schema + "." + DIM_NAME,
-                "create_backend_db": True,
-                "execute": True,
-            }
-        )
-    except OffloadException:
-        # If this one fails then we let the exception bubble up.
+    # If the table has already been offloaded previously then we'll re-use it.
+    if not run_offload(
+        {
+            "owner_table": schema + "." + DIM_NAME,
+            "create_backend_db": True,
+            "execute": True,
+        }
+    ):
         run_offload(
             {
                 "owner_table": schema + "." + DIM_NAME,

diff --git a/tests/unit/offload/test_offload_transport.py b/tests/unit/offload/test_offload_transport.py
@@ -43,7 +43,6 @@
     FAKE_ORACLE_BQ_ENV,
 )
 
-
 FRONTEND_COLUMNS = [
     OracleColumn(
         "COL_VARCHAR2",
@@ -356,6 +355,18 @@ def test_dataproc_batches_describe_cmd(config, messages, oracle_table, fake_oper
   "stateMessage": "Job failed with message [SyntaxError: invalid syntax]. Additional details can be found at:\\nhttps://console.cloud.google.com/dataproc/batches/west1/goe-batch-20240426080958?project=p\\ngcloud dataproc batches wait 'goe-batch-20240426080958' --region 'west1' --project 'p'\\nhttps://console.cloud.google.com/storage/browser/dataproc-staging-west1-123-l/batch-3347f/\\ngs://dataproc-staging-west1-123-l/google-cloud-dataproc-metainfo/2ad11/jobs/srvls-batch-3347f/driveroutput.*",
   "stateTime": "2024-04-26T10:27:48.237750Z",
   "uuid": "12cea"
+}""",
+            True,
+        ),
+        # Describe output for a job failing with 'Task was not acquired'.
+        (
+            """{
+    "createTime": "2026-05-13T08:05:33Z",
+    "creator": "sa@p.iam.gserviceaccount.com",
+    "name": "projects/p/locations/west1/batches/goe-batch-canary",
+    "state": "FAILED",
+    "stateMessage": "Task was not acquired.",
+    "uuid": "12cea"
 }""",
             True,
         ),