Pull from fivetran/fivetran_connector_sdk
Pull request overview
This PR introduces a new Bright Data Web Scraper connector that syncs web scraping data from Bright Data's API to Fivetran destinations. The connector implements job triggering, snapshot polling, dynamic schema discovery, and data flattening capabilities.
Key changes:
- Implements a full-featured connector for Bright Data's Web Scraper REST API with polling-based asynchronous job processing
- Provides flexible URL input parsing supporting multiple formats (single URL, comma-separated, newline-separated, JSON array)
- Includes helper modules for validation, API interaction, data processing, and schema management
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 21 comments.
| File | Description |
|---|---|
| connectors/bright_data_scrape/connector.py | Main connector implementation with schema definition, update function, URL parsing, and result processing logic |
| connectors/bright_data_scrape/configuration.json | Configuration file with placeholder values for API token, dataset ID, and scrape URL |
| connectors/bright_data_scrape/README.md | Comprehensive documentation covering connector overview, configuration, authentication, data handling, error handling, and tables created |
| connectors/bright_data_scrape/requirements.txt | Python dependencies for dotenv and YAML support |
| connectors/bright_data_scrape/helpers/validation.py | Configuration validation logic for required parameters |
| connectors/bright_data_scrape/helpers/scrape.py | Core API interaction logic, including job triggering and snapshot polling with retry logic and exponential backoff |
| connectors/bright_data_scrape/helpers/schema_management.py | Dynamic schema discovery and fields.yaml file management |
| connectors/bright_data_scrape/helpers/data_processing.py | Data flattening utilities for nested JSON structures |
| connectors/bright_data_scrape/helpers/common.py | Shared constants, error parsing, and response handling utilities |
| connectors/bright_data_scrape/helpers/__init__.py | Helper module exports for easy importing |
| connectors/bright_data_scrape/.gitignore | Git ignore rules for cache files, virtual environments, and generated files |
…c query parameters and improve error handling
- Removed `_fivetran_synced` from primary keys in schema.
- Added logging for JSON parsing errors in `parse_scrape_urls`.
- Implemented dataset-specific query parameters in `sync_scrape_urls` for better API interaction.
- Updated `process_and_upsert_results` to log primary key validation issues after processing.
- Improved documentation in README.md regarding scrape_url configuration and additional query parameters.
- Removed unused dependency `python-dotenv` from requirements.txt.
- Enhanced helper functions for better error handling and response parsing.
Hi team — thanks for taking a look at this PR. Could you provide a rough timeline for when it might be reviewed or merged? It would help me plan the next steps for this feature. Happy to make any changes needed to move it forward! @fivetran-pranavtotala @fivetran-bhargavpatel
…ing checks into a loop
- Replaced individual validation checks for 'api_token', 'dataset_id', and 'scrape_url' with a single loop to enhance readability and maintainability.
- Improved error messaging for missing configuration values.
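The looped validation described in this commit could look something like the sketch below. The function name and the exact error message are assumptions; only the three key names come from the commit message.

```python
def validate_configuration(configuration: dict):
    """Check all required configuration keys in one loop instead of per-key ifs.

    Key names are taken from the commit message; the function name is assumed.
    """
    required_keys = ["api_token", "dataset_id", "scrape_url"]
    for key in required_keys:
        # Treat missing keys and empty strings the same way.
        if not configuration.get(key):
            raise ValueError(f"Missing required configuration value: {key}")
```

Adding a new required parameter then only means extending `required_keys`.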
…nagement functions
- Replaced hardcoded 'tables' key with a constant for improved maintainability.
- Updated logging messages to use the warning level for consistency.
- Introduced constants for maximum response size and chunk size to improve readability and maintainability.
- Updated the `_parse_large_json_array_streaming` function to use the new constants.
- Added a `max_attempts` parameter to the `_poll_snapshot` function to limit polling attempts and prevent infinite loops.
- Improved error handling by raising a RuntimeError if polling exceeds the maximum attempts.
fivetran-pranavtotala
left a comment
LGTM. Add testing details and screenshots to the description, and address the comment.
fivetran-pranavtotala
left a comment
Can you please add a Fivetran debug sync screenshot if you cannot get a deployed sync run? The rest looks good.
Co-authored-by: Dejan Tucakov <dejan.tucakov@fivetran.com>
    import json

    # Helper functions for data processing, validation, and schema management
    from helpers import (
The import path is broken. The helpers package should be part of connectors/bright_data/bright_data_scrape/ folder or move the connector.py and other files to connectors/bright_data/.
    if isinstance(scrape_url_input, str):
        # Try parsing as JSON first
        parsed = json.loads(scrape_url_input)
This would fail if the input is just a simple string. Should we add a try/catch to handle that case?
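A try/except around the JSON attempt, falling back to delimiter-based parsing, might look like this. It is a sketch covering the input formats listed in the PR overview (single URL, comma-separated, newline-separated, JSON array); the exact fallback order inside the real connector may differ.

```python
import json


def parse_scrape_urls(scrape_url_input):
    """Parse URLs from a JSON array, comma/newline-separated string, or single URL."""
    if isinstance(scrape_url_input, list):
        return [str(u).strip() for u in scrape_url_input if str(u).strip()]
    text = str(scrape_url_input).strip()
    try:
        parsed = json.loads(text)
        if isinstance(parsed, list):
            return [str(u).strip() for u in parsed if str(u).strip()]
    except json.JSONDecodeError:
        pass  # plain string input is expected; fall through to delimiter parsing
    delimiter = "\n" if "\n" in text else ","
    return [part.strip() for part in text.split(delimiter) if part.strip()]
```

With this shape, a bare URL like `https://example.com` no longer raises, because `json.JSONDecodeError` is caught and the string is split instead.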
    value = result.get(field)
    # Explicitly ensure result_index is an integer before upsert
    if field == "result_index":
        if isinstance(value, str):
We already explicitly convert result_index to int in the previous loop. Why do we need to do it again?
    return [
        {
            "table": __SCRAPE_TABLE,
            "primary_key": [
Let's also explicitly mention the data types for primary keys.
Refer this: https://github.com/fivetran/fivetran_connector_sdk/blob/main/connectors/apache_hbase/connector.py#L113
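Following the linked apache_hbase example, the schema entry could declare explicit types for the primary-key columns. The column names below are taken from the snippet context and are assumptions; the uppercase type strings follow the Connector SDK's convention.

```python
SCRAPE_TABLE = "scrape_result"  # stand-in for the connector's __SCRAPE_TABLE constant


def schema(configuration: dict):
    """Declare explicit data types for the primary-key columns.

    Column names are illustrative; the review only asks that pk types be explicit.
    """
    return [
        {
            "table": SCRAPE_TABLE,
            "primary_key": ["url", "result_index"],
            "columns": {
                "url": "STRING",
                "result_index": "INT",
            },
        }
    ]
```

Columns not listed are still inferred by the destination; only the keys are pinned.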
    urls = parse_scrape_urls(scrape_url_input)

    if not urls:
        log.warning("No URLs provided in configuration")
Let's change the log level to severe and also log the scrape_url_input used to get the URLs. It would be helpful for debugging.
    # Collect all fields and update schema documentation
    all_fields = collect_all_fields(processed_results)
    update_fields_yaml(all_fields, __SCRAPE_TABLE)
Why do we need to write the fields to a YAML file? You could just store them in a global variable if required. I think we can remove the schema_management.py file and the related pyyaml dependency.
    keys_to_remove = [
        k
        for k in flattened.keys()
        if k.endswith(f"_{pk_field}") or k.startswith(f"{pk_field}_")
This logic removes any field ending or starting with primary_key_fields (e.g., _url or url_). What about legitimate fields like image_url?
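One way to avoid clobbering fields like image_url is to compare whole path segments rather than raw prefixes/suffixes. This sketch assumes the flattener joins nested keys with a dedicated separator such as `__`, which is an assumption about the flattening scheme, not something stated in the PR.

```python
def duplicate_pk_keys(flattened: dict, pk_field: str, sep: str = "__") -> list:
    """Find flattened keys whose last path segment duplicates the primary key.

    Splitting on a dedicated separator keeps single-segment names like
    'image_url' intact, unlike endswith('_url') / startswith('url_') matching.
    """
    return [k for k in flattened if k != pk_field and k.split(sep)[-1] == pk_field]
```

Here `meta__url` is flagged as a nested duplicate of `url`, while `image_url` survives because it is a single segment.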
        f"Failed to trigger Bright Data scrape after {retries} retries: {str(exc)}"
    ) from exc

    raise RuntimeError("Failed to trigger Bright Data scrape after retries")
This is unreachable.
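A pattern where a post-loop raise is actually reachable swallows the exception inside the loop and raises only after it, so there is a single failure path. This is a sketch; `do_request` is a hypothetical stand-in for the trigger call, not the connector's real helper.

```python
def trigger_scrape(do_request, retries: int = 3):
    """Retry wrapper where the raise after the loop is the only failure path."""
    last_exc = None
    for _ in range(retries):
        try:
            return do_request()
        except Exception as exc:  # sketch: narrow this to request errors in practice
            last_exc = exc
    # Reachable exactly when every attempt failed.
    raise RuntimeError(
        f"Failed to trigger Bright Data scrape after {retries} retries"
    ) from last_exc
```

If instead the last attempt re-raises inside the loop (as the diff above does), any raise after the loop is dead code and can simply be deleted.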
    snapshot_id: str,
    poll_interval: int,
    timeout: int,
    max_attempts: int = 1000,
The default value of max_attempts is very large. If it fails 1000 times with a 30-second wait after each iteration, the connector would be stuck for 8+ hours.
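Deriving the attempt cap from the configured timeout avoids picking a large fixed default. In this sketch, `fetch_status` is a hypothetical stand-in for the Bright Data snapshot-status call; the real `_poll_snapshot` signature differs.

```python
import time


def poll_snapshot(fetch_status, poll_interval: int, timeout: int):
    """Poll a snapshot until ready, bounded by the configured timeout.

    The cap is timeout // poll_interval, so total wait never greatly
    exceeds `timeout` seconds regardless of how slow the job is.
    """
    max_attempts = max(1, timeout // max(1, poll_interval))
    for attempt in range(max_attempts):
        if fetch_status() == "ready":
            return attempt + 1  # number of status checks made
        time.sleep(poll_interval)
    raise RuntimeError(f"Snapshot not ready after {max_attempts} attempts (~{timeout}s)")
```

With poll_interval=30 and timeout=900, this gives 30 attempts instead of 1000.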
    )
    return results

    return None
If the response is [], we would return None from here, which would cause the connector to sleep (line no. 397). An empty response is a valid case, and the connector could get stuck.
We could add an `is None` check and handle the empty response correctly.
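The distinction the review asks for could be sketched like this, where `None` means "no data yet, keep polling" and `[]` means "finished with zero rows". The function and parameter names are illustrative.

```python
def extract_results(response_json):
    """Distinguish 'no data yet' (None) from a valid empty result ([]).

    `response_json` stands in for the parsed snapshot body.
    """
    if response_json is None:
        return None  # nothing returned yet; caller should keep polling
    if isinstance(response_json, list):
        return response_json  # an empty list is a valid, finished result
    return [response_json]  # normalize a single object to a one-row list
```

The caller then checks `if results is None` rather than `if not results`, so an empty snapshot ends the sync cleanly instead of sleeping forever.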
fivetran-dejantucakov
left a comment
Supporting documentation LGTM
Jira ticket
Closes <ADD TICKET LINK HERE, EACH PR MUST BE LINKED TO A JIRA TICKET>

Description of Change
This PR introduces a new Bright Data Web Scraper connector that syncs web scraping data from Bright Data's API to Fivetran destinations. The connector implements job triggering, snapshot polling, dynamic schema discovery, and data flattening capabilities.
Key changes:
- Implements a full-featured connector for Bright Data's Web Scraper REST API with polling-based asynchronous job processing
- Provides flexible URL input parsing supporting multiple formats (single URL, comma-separated, newline-separated, JSON array)
- Includes helper modules for validation, API interaction, data processing, and schema management
Testing
Checklist
Some tips and links to help validate your PR:
- `fivetran debug` command.