Pull from fivetran/fivetran_connector_sdk
Pull request overview
This PR introduces a new Bright Data Web Scraper connector that syncs web scraping data from Bright Data's API to Fivetran destinations. The connector implements job triggering, snapshot polling, dynamic schema discovery, and data flattening capabilities.
Key changes:
- Implements a full-featured connector for Bright Data's Web Scraper REST API with polling-based asynchronous job processing
- Provides flexible URL input parsing supporting multiple formats (single URL, comma-separated, newline-separated, JSON array)
- Includes helper modules for validation, API interaction, data processing, and schema management
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 21 comments.
| File | Description |
|---|---|
| connectors/bright_data_scrape/connector.py | Main connector implementation with schema definition, update function, URL parsing, and result processing logic |
| connectors/bright_data_scrape/configuration.json | Configuration file with placeholder values for API token, dataset ID, and scrape URL |
| connectors/bright_data_scrape/README.md | Comprehensive documentation covering connector overview, configuration, authentication, data handling, error handling, and tables created |
| connectors/bright_data_scrape/requirements.txt | Python dependencies for dotenv and YAML support |
| connectors/bright_data_scrape/helpers/validation.py | Configuration validation logic for required parameters |
| connectors/bright_data_scrape/helpers/scrape.py | Core API interaction logic, including job triggering and snapshot polling with retry logic and exponential backoff |
| connectors/bright_data_scrape/helpers/schema_management.py | Dynamic schema discovery and fields.yaml file management |
| connectors/bright_data_scrape/helpers/data_processing.py | Data flattening utilities for nested JSON structures |
| connectors/bright_data_scrape/helpers/common.py | Shared constants, error parsing, and response handling utilities |
| connectors/bright_data_scrape/helpers/__init__.py | Helper module exports for easy importing |
| connectors/bright_data_scrape/.gitignore | Git ignore rules for cache files, virtual environments, and generated files |
…c query parameters and improve error handling
- Removed `_fivetran_synced` from primary keys in schema.
- Added logging for JSON parsing errors in `parse_scrape_urls`.
- Implemented dataset-specific query parameters in `sync_scrape_urls` for better API interaction.
- Updated `process_and_upsert_results` to log primary key validation issues after processing.
- Improved documentation in README.md regarding scrape_url configuration and additional query parameters.
- Removed unused dependency `python-dotenv` from requirements.txt.
- Enhanced helper functions for better error handling and response parsing.
Hi team — thanks for taking a look at this PR. Could you provide a rough timeline for when it might be reviewed or merged? It would help me plan the next steps for this feature. Happy to make any changes needed to move it forward! @fivetran-pranavtotala @fivetran-bhargavpatel
…ing checks into a loop
- Replaced individual validation checks for 'api_token', 'dataset_id', and 'scrape_url' with a single loop to enhance readability and maintainability.
- Improved error messaging for missing configuration values.
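The looped validation described in this commit could look something like the sketch below. The function name and the exact error message are assumptions; only the three key names come from the commit message.

```python
def validate_configuration(configuration: dict):
    """Check all required configuration keys in one loop instead of per-key ifs.

    Key names are taken from the commit message; the function name is assumed.
    """
    required_keys = ["api_token", "dataset_id", "scrape_url"]
    for key in required_keys:
        # Treat missing keys and empty strings the same way.
        if not configuration.get(key):
            raise ValueError(f"Missing required configuration value: {key}")
```

Adding a new required parameter then only means extending `required_keys`.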
…nagement functions
- Replaced hardcoded 'tables' key with a constant for improved maintainability.
- Updated logging messages to use the warning level for consistency.
- Introduced constants for maximum response size and chunk size to improve readability and maintainability.
- Updated the `_parse_large_json_array_streaming` function to use the new constants.
- Added a `max_attempts` parameter to the `_poll_snapshot` function to limit polling attempts and prevent infinite loops.
- Improved error handling by raising a RuntimeError if polling exceeds the maximum attempts.
fivetran-pranavtotala
left a comment
LGTM. Add testing details and screenshots to the description, and address the comment.
fivetran-pranavtotala
left a comment
Can you please add a Fivetran debug sync screenshot if you cannot get a deployed sync run? The rest looks good.
Co-authored-by: Dejan Tucakov <dejan.tucakov@fivetran.com>
    import json

    # Helper functions for data processing, validation, and schema management
    from helpers import (
The import path is broken. The helpers package should be part of connectors/bright_data/bright_data_scrape/ folder or move the connector.py and other files to connectors/bright_data/.
    if isinstance(scrape_url_input, str):
        # Try parsing as JSON first
        parsed = json.loads(scrape_url_input)
This would fail if the input is just a simple string. Should we add a try/catch to handle that case?
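A try/except around the JSON attempt, falling back to delimiter-based parsing, might look like this. It is a sketch covering the input formats listed in the PR overview (single URL, comma-separated, newline-separated, JSON array); the exact fallback order inside the real connector may differ.

```python
import json


def parse_scrape_urls(scrape_url_input):
    """Parse URLs from a JSON array, comma/newline-separated string, or single URL."""
    if isinstance(scrape_url_input, list):
        return [str(u).strip() for u in scrape_url_input if str(u).strip()]
    text = str(scrape_url_input).strip()
    try:
        parsed = json.loads(text)
        if isinstance(parsed, list):
            return [str(u).strip() for u in parsed if str(u).strip()]
    except json.JSONDecodeError:
        pass  # plain string input is expected; fall through to delimiter parsing
    delimiter = "\n" if "\n" in text else ","
    return [part.strip() for part in text.split(delimiter) if part.strip()]
```

With this shape, a bare URL like `https://example.com` no longer raises, because `json.JSONDecodeError` is caught and the string is split instead.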
    value = result.get(field)
    # Explicitly ensure result_index is an integer before upsert
    if field == "result_index":
        if isinstance(value, str):
We already explicitly convert result_index to int in the previous loop. Why do we need to do it again?
    return [
        {
            "table": __SCRAPE_TABLE,
            "primary_key": [
Let's also explicitly mention the data types for primary keys.
Refer this: https://github.com/fivetran/fivetran_connector_sdk/blob/main/connectors/apache_hbase/connector.py#L113
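Following the linked apache_hbase example, the schema entry could declare explicit types for the primary-key columns. The column names below are taken from the snippet context and are assumptions; the uppercase type strings follow the Connector SDK's convention.

```python
SCRAPE_TABLE = "scrape_result"  # stand-in for the connector's __SCRAPE_TABLE constant


def schema(configuration: dict):
    """Declare explicit data types for the primary-key columns.

    Column names are illustrative; the review only asks that pk types be explicit.
    """
    return [
        {
            "table": SCRAPE_TABLE,
            "primary_key": ["url", "result_index"],
            "columns": {
                "url": "STRING",
                "result_index": "INT",
            },
        }
    ]
```

Columns not listed are still inferred by the destination; only the keys are pinned.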
    urls = parse_scrape_urls(scrape_url_input)

    if not urls:
        log.warning("No URLs provided in configuration")
Let's change the log level to severe and also log the scrape_url_input used to get the URLs. It would be helpful for debugging.
    # Collect all fields and update schema documentation
    all_fields = collect_all_fields(processed_results)
    update_fields_yaml(all_fields, __SCRAPE_TABLE)
Why do we need to write the fields to a YAML file? You could just store them in a global variable if required. I think we can remove the schema_management.py file and the related pyyaml dependency.
    keys_to_remove = [
        k
        for k in flattened.keys()
        if k.endswith(f"_{pk_field}") or k.startswith(f"{pk_field}_")
This logic removes any field ending or starting with primary_key_fields (e.g., _url or url_). What about legitimate fields like image_url?
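One way to avoid clobbering fields like image_url is to compare whole path segments rather than raw prefixes/suffixes. This sketch assumes the flattener joins nested keys with a dedicated separator such as `__`, which is an assumption about the flattening scheme, not something stated in the PR.

```python
def duplicate_pk_keys(flattened: dict, pk_field: str, sep: str = "__") -> list:
    """Find flattened keys whose last path segment duplicates the primary key.

    Splitting on a dedicated separator keeps single-segment names like
    'image_url' intact, unlike endswith('_url') / startswith('url_') matching.
    """
    return [k for k in flattened if k != pk_field and k.split(sep)[-1] == pk_field]
```

Here `meta__url` is flagged as a nested duplicate of `url`, while `image_url` survives because it is a single segment.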
        f"Failed to trigger Bright Data scrape after {retries} retries: {str(exc)}"
    ) from exc

    raise RuntimeError("Failed to trigger Bright Data scrape after retries")
This is unreachable.
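A pattern where a post-loop raise is actually reachable swallows the exception inside the loop and raises only after it, so there is a single failure path. This is a sketch; `do_request` is a hypothetical stand-in for the trigger call, not the connector's real helper.

```python
def trigger_scrape(do_request, retries: int = 3):
    """Retry wrapper where the raise after the loop is the only failure path."""
    last_exc = None
    for _ in range(retries):
        try:
            return do_request()
        except Exception as exc:  # sketch: narrow this to request errors in practice
            last_exc = exc
    # Reachable exactly when every attempt failed.
    raise RuntimeError(
        f"Failed to trigger Bright Data scrape after {retries} retries"
    ) from last_exc
```

If instead the last attempt re-raises inside the loop (as the diff above does), any raise after the loop is dead code and can simply be deleted.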
    snapshot_id: str,
    poll_interval: int,
    timeout: int,
    max_attempts: int = 1000,
The default value of max_attempts is very large. If it fails 1000 times with a 30-second wait after each iteration, the connector would be stuck for 8+ hours.
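Deriving the attempt cap from the configured timeout avoids picking a large fixed default. In this sketch, `fetch_status` is a hypothetical stand-in for the Bright Data snapshot-status call; the real `_poll_snapshot` signature differs.

```python
import time


def poll_snapshot(fetch_status, poll_interval: int, timeout: int):
    """Poll a snapshot until ready, bounded by the configured timeout.

    The cap is timeout // poll_interval, so total wait never greatly
    exceeds `timeout` seconds regardless of how slow the job is.
    """
    max_attempts = max(1, timeout // max(1, poll_interval))
    for attempt in range(max_attempts):
        if fetch_status() == "ready":
            return attempt + 1  # number of status checks made
        time.sleep(poll_interval)
    raise RuntimeError(f"Snapshot not ready after {max_attempts} attempts (~{timeout}s)")
```

With poll_interval=30 and timeout=900, this gives 30 attempts instead of 1000.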
    )
    return results

    return None
If the response is [], we would return None from here, which would cause the connector to sleep (line no. 397). An empty response is a valid case, and the connector could get stuck.
We could add an `is None` check and handle the empty response correctly.
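The distinction the review asks for could be sketched like this, where `None` means "no data yet, keep polling" and `[]` means "finished with zero rows". The function and parameter names are illustrative.

```python
def extract_results(response_json):
    """Distinguish 'no data yet' (None) from a valid empty result ([]).

    `response_json` stands in for the parsed snapshot body.
    """
    if response_json is None:
        return None  # nothing returned yet; caller should keep polling
    if isinstance(response_json, list):
        return response_json  # an empty list is a valid, finished result
    return [response_json]  # normalize a single object to a one-row list
```

The caller then checks `if results is None` rather than `if not results`, so an empty snapshot ends the sync cleanly instead of sleeping forever.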
fivetran-dejantucakov
left a comment
Supporting documentation LGTM
Jira ticket
Closes <ADD TICKET LINK HERE, EACH PR MUST BE LINKED TO A JIRA TICKET>

Description of Change
This PR introduces a new Bright Data Web Scraper connector that syncs web scraping data from Bright Data's API to Fivetran destinations. The connector implements job triggering, snapshot polling, dynamic schema discovery, and data flattening capabilities.
Key changes:
- Implements a full-featured connector for Bright Data's Web Scraper REST API with polling-based asynchronous job processing
- Provides flexible URL input parsing supporting multiple formats (single URL, comma-separated, newline-separated, JSON array)
- Includes helper modules for validation, API interaction, data processing, and schema management
Testing
Checklist
Some tips and links to help validate your PR:
- `fivetran debug` command.