
Feat/bright data scrape #471

Open
adomhamza wants to merge 31 commits into fivetran:main from ReallyGreatTech:feat/bright_data_scrape

Conversation


@adomhamza adomhamza commented Nov 21, 2025

Jira ticket

Closes <ADD TICKET LINK HERE, EACH PR MUST BE LINKED TO A JIRA TICKET>

Description of Change

This PR introduces a new Bright Data Web Scraper connector that syncs web scraping data from Bright Data's API to Fivetran destinations. The connector implements job triggering, snapshot polling, dynamic schema discovery, and data flattening capabilities.

Key changes:

Implements a full-featured connector for Bright Data's Web Scraper REST API with polling-based asynchronous job processing
Provides flexible URL input parsing supporting multiple formats (single URL, comma-separated, newline-separated, JSON array)
Includes helper modules for validation, API interaction, data processing, and schema management
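As a rough illustration of the trigger-and-poll flow described above, here is a minimal sketch. The callables and their return shapes are assumptions for illustration, not Bright Data's documented API or the PR's actual code:

```python
import time

def run_scrape(trigger_job, get_status, get_results, poll_interval=5, timeout=60):
    """Trigger an asynchronous scrape job, then poll until the snapshot is ready.

    trigger_job/get_status/get_results are hypothetical callables standing in
    for the real API helpers.
    """
    snapshot_id = trigger_job()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_status(snapshot_id) == "ready":
            return get_results(snapshot_id)
        time.sleep(poll_interval)  # wait before the next status check
    raise TimeoutError(f"Snapshot {snapshot_id} not ready within {timeout}s")
```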

Testing

[Three screenshots attached]

Checklist

Some tips and links to help validate your PR:

  • Tested the connector with the `fivetran debug` command.
  • Added/updated the example-specific README.md file; refer here for the template.
  • Followed the Python Coding Standards; refer here.

Contributor

Copilot AI left a comment


Pull request overview

This PR introduces a new Bright Data Web Scraper connector that syncs web scraping data from Bright Data's API to Fivetran destinations. The connector implements job triggering, snapshot polling, dynamic schema discovery, and data flattening capabilities.

Key changes:

  • Implements a full-featured connector for Bright Data's Web Scraper REST API with polling-based asynchronous job processing
  • Provides flexible URL input parsing supporting multiple formats (single URL, comma-separated, newline-separated, JSON array)
  • Includes helper modules for validation, API interaction, data processing, and schema management

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 21 comments.

Show a summary per file
File Description
connectors/bright_data_scrape/connector.py Main connector implementation with schema definition, update function, URL parsing, and result processing logic
connectors/bright_data_scrape/configuration.json Configuration file with placeholder values for API token, dataset ID, and scrape URL
connectors/bright_data_scrape/README.md Comprehensive documentation covering connector overview, configuration, authentication, data handling, error handling, and tables created
connectors/bright_data_scrape/requirements.txt Python dependencies for dotenv and YAML support
connectors/bright_data_scrape/helpers/validation.py Configuration validation logic for required parameters
connectors/bright_data_scrape/helpers/scrape.py Core API interaction logic including job triggering, snapshot polling with retry logic and exponential backoff
connectors/bright_data_scrape/helpers/schema_management.py Dynamic schema discovery and fields.yaml file management
connectors/bright_data_scrape/helpers/data_processing.py Data flattening utilities for nested JSON structures
connectors/bright_data_scrape/helpers/common.py Shared constants, error parsing, and response handling utilities
connectors/bright_data_scrape/helpers/__init__.py Helper module exports for easy importing
connectors/bright_data_scrape/.gitignore Git ignore rules for cache files, virtual environments, and generated files

…c query parameters and improve error handling

- Removed `_fivetran_synced` from primary keys in schema.
- Added logging for JSON parsing errors in `parse_scrape_urls`.
- Implemented dataset-specific query parameters in `sync_scrape_urls` for better API interaction.
- Updated `process_and_upsert_results` to log primary key validation issues after processing.
- Improved documentation in README.md regarding scrape_url configuration and additional query parameters.
- Removed unused dependency `python-dotenv` from requirements.txt.
- Enhanced helper functions for better error handling and response parsing.
@adomhamza
Author

Hi team — thanks for taking a look at this PR. Could you provide a rough timeline for when it might be reviewed or merged? It would help me plan the next steps for this feature. Happy to make any changes needed to move it forward! @fivetran-pranavtotala @fivetran-bhargavpatel

…ing checks into a loop

- Replaced individual validation checks for 'api_token', 'dataset_id', and 'scrape_url' with a single loop to enhance readability and maintainability.
- Improved error messaging for missing configuration values.
…nagement functions

- Replaced hardcoded 'tables' key with a constant for improved maintainability.
- Updated logging messages to use the warning level for consistency.
- Introduced constants for maximum response size and chunk size to improve readability and maintainability.
- Updated the `_parse_large_json_array_streaming` function to use the new constants.
- Added a `max_attempts` parameter to the `_poll_snapshot` function to limit polling attempts and prevent infinite loops.
- Improved error handling by raising a RuntimeError if polling exceeds the maximum attempts.
Contributor

@fivetran-pranavtotala fivetran-pranavtotala left a comment


LGTM. Please add testing details and screenshots to the description, and address the comment.

Contributor

@fivetran-pranavtotala fivetran-pranavtotala left a comment


Can you please add a `fivetran debug` sync screenshot if you cannot get a deployed sync run? The rest looks good.

adomhamza and others added 11 commits March 2, 2026 12:14
Co-authored-by: Dejan Tucakov <dejan.tucakov@fivetran.com>
import json

# Helper functions for data processing, validation, and schema management
from helpers import (
Collaborator


The import path is broken. The helpers package should live inside the connectors/bright_data/bright_data_scrape/ folder, or connector.py and the other files should be moved to connectors/bright_data/.


if isinstance(scrape_url_input, str):
# Try parsing as JSON first
parsed = json.loads(scrape_url_input)
Collaborator


This would fail if the input is just a plain string rather than JSON. Should we add a try/except to handle that case?
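A hedged sketch of the suggested fix, falling back to delimiter-based parsing when the input is not valid JSON. The function body is illustrative only, not the PR's actual `parse_scrape_urls` implementation:

```python
import json

def parse_scrape_urls(scrape_url_input):
    """Parse URLs given as a JSON array, comma-separated, newline-separated,
    or single-URL string (hypothetical sketch)."""
    if isinstance(scrape_url_input, list):
        return [u.strip() for u in scrape_url_input if u.strip()]
    try:
        parsed = json.loads(scrape_url_input)
        if isinstance(parsed, list):
            return [str(u).strip() for u in parsed if str(u).strip()]
    except (json.JSONDecodeError, TypeError):
        pass  # not JSON; fall through to delimiter-based parsing
    # Treat newlines as commas, then split and trim
    parts = scrape_url_input.replace("\n", ",").split(",")
    return [p.strip() for p in parts if p.strip()]
```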

value = result.get(field)
# Explicitly ensure result_index is an integer before upsert
if field == "result_index":
if isinstance(value, str):
Collaborator


We already explicitly convert the data type to int for result_index in the previous loop. Why do we need to do it again?
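One way to avoid the double conversion would be a single up-front normalization step, sketched here as a hypothetical helper (not code from the PR):

```python
def normalize_result_index(result: dict) -> dict:
    """Coerce result_index to int once, up front, so later code can
    assume the type (hypothetical sketch)."""
    value = result.get("result_index")
    if value is not None and not isinstance(value, int):
        result["result_index"] = int(value)
    return result
```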

return [
{
"table": __SCRAPE_TABLE,
"primary_key": [
Collaborator


Let's also explicitly declare the data types for the primary keys.
Refer to this: https://github.com/fivetran/fivetran_connector_sdk/blob/main/connectors/apache_hbase/connector.py#L113
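A minimal sketch of what the suggestion might look like, assuming the table is named scrape_result with url and result_index as primary keys (the table and column names here are assumptions, not the PR's actual schema):

```python
def schema(configuration: dict):
    """Schema definition with explicit data types for the primary-key
    columns (hypothetical sketch)."""
    return [
        {
            "table": "scrape_result",
            "primary_key": ["url", "result_index"],
            "columns": {
                "url": "STRING",
                "result_index": "INT",
            },
        }
    ]
```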

urls = parse_scrape_urls(scrape_url_input)

if not urls:
log.warning("No URLs provided in configuration")
Collaborator


Let's change the log level to severe and also log the scrape_url_input used to derive the URLs. It would be helpful for debugging.


# Collect all fields and update schema documentation
all_fields = collect_all_fields(processed_results)
update_fields_yaml(all_fields, __SCRAPE_TABLE)
Collaborator


Why do we need to write the fields to a YAML file? You could store them in a global variable if required. I think we can remove the schema_management.py file and the related pyyaml dependency.
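The reviewer's alternative could look roughly like this: a module-level set instead of a fields.yaml file. This is a sketch only; the real helper signatures in the PR may differ:

```python
# Module-level accumulator replacing the fields.yaml file (sketch).
_discovered_fields = set()

def collect_all_fields(processed_results):
    """Accumulate every key seen across processed result rows and
    return them in sorted order (hypothetical sketch)."""
    for row in processed_results:
        _discovered_fields.update(row.keys())
    return sorted(_discovered_fields)
```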

keys_to_remove = [
k
for k in flattened.keys()
if k.endswith(f"_{pk_field}") or k.startswith(f"{pk_field}_")
Collaborator


This logic removes any field ending or starting with a primary-key field name (e.g., _url or url_). What about legitimate fields like image_url?
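One possible tightening, sketched under the assumption that the duplicates produced by flattening carry the same value as the primary key itself (hypothetical helper, not the PR's code):

```python
def remove_duplicate_pk_fields(flattened: dict, pk_field: str) -> dict:
    """Drop only flattened duplicates whose value equals the primary-key
    value, so legitimate fields like image_url survive (sketch)."""
    pk_value = flattened.get(pk_field)
    return {
        k: v
        for k, v in flattened.items()
        if k == pk_field
        or not ((k.endswith(f"_{pk_field}") or k.startswith(f"{pk_field}_"))
                and v == pk_value)
    }
```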

f"Failed to trigger Bright Data scrape after {retries} retries: {str(exc)}"
) from exc

raise RuntimeError("Failed to trigger Bright Data scrape after retries")
Collaborator


This is unreachable.
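For contrast, a retry-loop shape where the final raise is reachable only when every attempt has failed (an illustrative pattern, not the PR's actual code):

```python
import time

def trigger_with_retries(trigger_once, retries=3, backoff=1.0):
    """Retry trigger_once up to `retries` times with exponential backoff;
    the trailing raise fires only after all attempts fail (sketch)."""
    last_exc = None
    for attempt in range(retries):
        try:
            return trigger_once()
        except Exception as exc:
            last_exc = exc
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise RuntimeError(
        f"Failed to trigger Bright Data scrape after {retries} retries"
    ) from last_exc
```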

snapshot_id: str,
poll_interval: int,
timeout: int,
max_attempts: int = 1000,
Collaborator


The default value of max_attempts is very large. If it fails 1,000 times with a 30-second wait after each iteration, the connector would be stuck for 8+ hours.
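One way to bound polling: derive max_attempts from the configured timeout and poll interval instead of a fixed large default. This is a hypothetical helper; only the parameter names follow the PR:

```python
def compute_max_attempts(timeout: int, poll_interval: int) -> int:
    """Number of polls that fit in the overall timeout, at least one
    (hypothetical sketch)."""
    if poll_interval <= 0:
        raise ValueError("poll_interval must be positive")
    # Ceiling division so a trailing partial interval still counts
    return max(1, -(-timeout // poll_interval))
```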

)
return results

return None
Collaborator


If the response is [], we would return None from here, which would cause the connector to sleep (line no. 397). An empty response is a valid case, and the connector could get stuck.
We could check is None instead and handle an empty response correctly.
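A small sketch of the suggested distinction between "not ready yet" and "ready but empty" (hypothetical shape; the PR's actual polling code differs):

```python
def poll_once(fetch_results):
    """Return (results, done). None means the snapshot is not ready and
    polling should continue; an empty list means ready with no rows
    (hypothetical sketch; fetch_results is an assumed callable)."""
    results = fetch_results()
    if results is None:
        return None, False   # not ready yet; keep polling
    return results, True     # ready, even if the list is empty
```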

Contributor

@fivetran-dejantucakov fivetran-dejantucakov left a comment


Supporting documentation LGTM

@fivetran-dejantucakov fivetran-dejantucakov self-assigned this Mar 3, 2026

Labels

size/XXL PR size: extra extra large

7 participants