This guide will help you get the pipeline running in 5 minutes.
First, deploy the required Azure resources (Storage Account and SQL Database):
```bash
cd /home/kemack/github-projects/bcp-investigation
./infra/deploy.sh
```

This will:
- Create Azure Storage Account with container
- Create Azure SQL Database with table
- Generate secure passwords
- Update configuration files automatically
Then run the environment setup script:

```bash
./scripts/setup_environment.sh
```

If you deployed using the infrastructure script, your `config/azure_config.json` is already configured!
For manual configuration, edit the file with your values:
```json
{
  "storage_account": {
    "name": "YOUR_STORAGE_ACCOUNT_NAME",
    "container": "data-pipeline",
    "connection_string": "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
  },
  "sql_server": {
    "server": "your-server.database.windows.net",
    "database": "your_database",
    "username": "your_username",
    "password": "your_password",
    "table": "sales_data"
  }
}
```
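If you want to sanity-check the configuration before running anything, a short script along these lines can load it and flag placeholder values (a minimal sketch: the field names come from the JSON above, but the validation rules and the script itself are illustrative and not part of the repository):

```python
import json
from pathlib import Path

CONFIG_PATH = Path("config/azure_config.json")


def load_config(path: Path = CONFIG_PATH) -> dict:
    """Load the pipeline configuration and flag obvious placeholder values."""
    config = json.loads(path.read_text())

    storage = config["storage_account"]
    sql = config["sql_server"]

    # Fail fast if the template values were never replaced.
    if storage["name"].startswith("YOUR_"):
        raise ValueError("storage_account.name still contains the placeholder value")
    if "..." in storage["connection_string"]:
        raise ValueError("storage_account.connection_string is incomplete")
    if not sql["server"].endswith(".database.windows.net"):
        raise ValueError("sql_server.server does not look like an Azure SQL endpoint")

    return config


if __name__ == "__main__":
    cfg = load_config()
    print(f"Config OK: container={cfg['storage_account']['container']}, "
          f"table={cfg['sql_server']['table']}")
```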
Run the pipeline with the wrapper script:

```bash
# Run complete pipeline
./scripts/run_pipeline.sh

# Or run individual stages
./scripts/run_pipeline.sh spark-only
./scripts/run_pipeline.sh bcp-only
```
The pipeline runs in three stages (a short Spark sketch follows this list):

- **Spark Processing** (`spark/data_processor.py`):
  - Loads sample sales data
  - Generates additional test data (30 days' worth)
  - Aggregates data by category, region, and date
  - Prepares data in a BCP-compatible format
- **Azure Blob Export** (`spark/spark_to_blob.py`):
  - Exports processed data to Azure Blob Storage
  - Uses a pipe-delimited format that is optimal for BCP
  - Creates BCP format files
  - Includes comprehensive logging
- **BCP Import** (`bcp/bcp_import.sh`):
  - Downloads data from Azure Blob Storage
  - Uses the BCP utility to bulk import into SQL Server
  - Includes error handling and validation
  - Supports both direct-URL and local-file approaches
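To make the Spark stage concrete, here is a minimal sketch of the kind of aggregation and pipe-delimited, single-file output described above (column names, the input path, and the output path are assumptions for illustration; the actual logic lives in `spark/data_processor.py` and `spark/spark_to_blob.py`):

```python
from pyspark.sql import SparkSession, functions as F

# Sketch only: column names and paths are assumptions, not taken from the repo.
spark = (
    SparkSession.builder
    .appName("bcp-pipeline-sketch")
    .config("spark.sql.adaptive.enabled", "true")  # adaptive query execution
    .getOrCreate()
)

# Load the sample input and aggregate by category, region, and date.
sales = spark.read.csv("data/sample_data.csv", header=True, inferSchema=True)
aggregated = (
    sales.groupBy("category", "region", "sale_date")
         .agg(F.sum("amount").alias("total_amount"),
              F.count("*").alias("transaction_count"))
)

# Coalesce to a single partition and write pipe-delimited output for BCP.
(aggregated.coalesce(1)
    .write.mode("overwrite")
    .option("sep", "|")
    .option("header", "false")
    .csv("output/bcp_ready"))
```

Coalescing to one partition keeps the BCP input to a single file, at the cost of write parallelism for very large outputs.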
Key configuration options:

**Spark settings:**
- Executor memory: 2GB (configurable)
- Driver memory: 1GB (configurable)
- Adaptive query execution enabled
- Coalesces output to a single file for BCP

**BCP settings** (see the sketch after this list):
- Batch size: 10,000 records (configurable)
- Field terminator: `|` (pipe)
- Character encoding: UTF-8
- Error logging enabled

**Azure Blob settings:**
- Supports both storage account key and Azure AD authentication
- Configurable container and blob paths
- SAS token support for secure access
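For reference, a BCP invocation that applies these settings might look like the following (a sketch only: the table, file, server, and credential values are placeholders, and the real invocation lives in `bcp/bcp_import.sh`):

```python
import subprocess

# Illustrative bulk import with the settings listed above: pipe delimiter,
# 10,000-row batches, and an error log for rejected rows.
cmd = [
    "bcp", "dbo.sales_data", "in", "output/bcp_ready/part-00000.csv",
    "-S", "your-server.database.windows.net",
    "-d", "your_database",
    "-U", "your_username",
    "-P", "your_password",
    "-c",                            # character (text) mode
    "-t", "|",                       # field terminator
    "-b", "10000",                   # commit in batches of 10,000 rows
    "-e", "bcp_import_errors.log",   # write rejected rows here
]

result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0:
    raise RuntimeError(f"bcp failed: {result.stderr}")
```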
Key files:

- `infra/main.bicep` - Azure infrastructure template
- `infra/deploy.sh` - Infrastructure deployment script
- `config/azure_config.json` - Main configuration (auto-generated)
- `data/sample_data.csv` - Sample input data
- `spark/data_processor.py` - Data processing logic
- `spark/spark_to_blob.py` - Spark-to-Blob export
- `sql/create_table.sql` - SQL Server table schema (auto-executed)
- `bcp/bcp_import.sh` - BCP import script
- `scripts/run_pipeline.sh` - Complete pipeline runner
If something fails, check these common issues first:

- **PySpark import errors**:
  - Ensure Java 11 is installed
  - Activate the virtual environment: `source venv/bin/activate`
- **BCP not found**:
  - Install SQL Server tools: `sudo apt-get install mssql-tools18`
  - Add to PATH: `export PATH="$PATH:/opt/mssql-tools18/bin"`
- **Azure Blob access issues**:
  - Verify storage account credentials
  - Check SAS token permissions and expiration
  - Ensure the container exists
- **SQL Server connection issues** (a connection test sketch follows this list):
  - Verify server name and credentials
  - Check firewall settings
  - Ensure the target database and table exist
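If the steps above don't isolate a SQL Server problem, a quick connectivity check like this can help (a sketch only: it assumes `pyodbc` and the Microsoft ODBC Driver 18 for SQL Server are installed, neither of which is required by the pipeline itself):

```python
import json
import pyodbc  # assumed to be installed separately: pip install pyodbc

# Quick connectivity check using the values from config/azure_config.json.
with open("config/azure_config.json") as f:
    sql_cfg = json.load(f)["sql_server"]

conn_str = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    f"SERVER={sql_cfg['server']};DATABASE={sql_cfg['database']};"
    f"UID={sql_cfg['username']};PWD={sql_cfg['password']};"
    "Encrypt=yes;"
)

try:
    conn = pyodbc.connect(conn_str, timeout=10)
    count = conn.execute(f"SELECT COUNT(*) FROM {sql_cfg['table']}").fetchone()[0]
    print(f"Connected; {sql_cfg['table']} currently has {count} rows")
    conn.close()
except pyodbc.Error as exc:
    print(f"Connection failed: {exc}")
```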
The pipeline writes timestamped logs:

- Pipeline logs: `pipeline_YYYYMMDD_HHMMSS.log`
- Spark logs: `logs/spark_YYYYMMDD_HHMMSS.log`
- BCP logs: `logs/bcp_YYYYMMDD_HHMMSS.log`
- Error logs: `bcp_import_errors_YYYYMMDD_HHMMSS.log`
For larger workloads, consider the following performance options:

- **Spark scaling**:
  - Increase executor memory and cores
  - Use cluster mode instead of local mode
  - Partition data appropriately
- **BCP optimization**:
  - Increase the batch size for larger datasets
  - Use multiple parallel BCP processes
  - Consider table partitioning in SQL Server
- **Azure Blob Storage**:
  - Use multiple containers for partitioning
  - Consider Azure Data Factory for orchestration
  - Implement retry logic for transient failures (a sketch follows this list)
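As one example of the retry point above, a simple exponential-backoff wrapper around a blob upload might look like this (a sketch: the function, its names, and the use of `azure-storage-blob` here are illustrative, not taken from `spark_to_blob.py`):

```python
import time

from azure.core.exceptions import AzureError
from azure.storage.blob import BlobServiceClient


def upload_with_retry(connection_string: str, container: str, blob_name: str,
                      local_path: str, attempts: int = 4) -> None:
    """Upload a local file, retrying transient failures with exponential backoff."""
    service = BlobServiceClient.from_connection_string(connection_string)
    blob = service.get_blob_client(container=container, blob=blob_name)

    for attempt in range(1, attempts + 1):
        try:
            with open(local_path, "rb") as data:
                blob.upload_blob(data, overwrite=True)
            return
        except AzureError as exc:
            if attempt == attempts:
                raise
            wait = 2 ** attempt  # 2s, 4s, 8s, ...
            print(f"Upload failed ({exc}); retrying in {wait}s")
            time.sleep(wait)
```

Note that the Azure SDK clients also ship with built-in retry policies, so an explicit loop like this is mainly useful around whole pipeline steps rather than individual SDK calls.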
Security recommendations:

- Use Azure AD authentication instead of storage account keys (a sketch follows this list)
- Store credentials in Azure Key Vault
- Use managed identities when running on Azure
- Implement least-privilege access for all resources
- Enable logging and monitoring for all components
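To illustrate key-less access, the blob client can be constructed with Azure AD credentials instead of an account key (a sketch assuming the `azure-identity` package; the account URL and container name mirror the configuration example above):

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# DefaultAzureCredential resolves to a managed identity when running on Azure,
# or to your local az-cli/environment credentials during development.
service = BlobServiceClient(
    account_url="https://YOUR_STORAGE_ACCOUNT_NAME.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)

container = service.get_container_client("data-pipeline")
for blob in container.list_blobs():
    print(blob.name)
```

The identity used needs an RBAC role such as Storage Blob Data Reader or Storage Blob Data Contributor on the storage account for this to work.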
After running the pipeline successfully:
- Monitor the imported data in SQL Server
- Validate infrastructure using `./infra/validate.sh`
- Schedule regular pipeline runs using cron or Azure Data Factory
- Implement data quality checks and validation rules (a sketch follows this list)
- Set up alerting for pipeline failures
- Scale the solution based on your data volume requirements
- Review security settings and consider Azure AD authentication for production
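As a starting point for the data quality item above, post-load checks can be as simple as a few queries against the imported table (a sketch: it assumes `pyodbc` and uses illustrative column names; adapt the queries to the schema in `sql/create_table.sql`):

```python
import json
import pyodbc  # assumed to be installed separately

# Illustrative post-load checks: row count plus a couple of sanity queries.
with open("config/azure_config.json") as f:
    cfg = json.load(f)["sql_server"]

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    f"SERVER={cfg['server']};DATABASE={cfg['database']};"
    f"UID={cfg['username']};PWD={cfg['password']};Encrypt=yes;"
)

checks = {
    "row_count": f"SELECT COUNT(*) FROM {cfg['table']}",
    "null_categories": f"SELECT COUNT(*) FROM {cfg['table']} WHERE category IS NULL",
    "negative_amounts": f"SELECT COUNT(*) FROM {cfg['table']} WHERE total_amount < 0",
}

for name, query in checks.items():
    print(name, conn.execute(query).fetchone()[0])

conn.close()
```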