
Quick Start Guide

🚀 Getting Started with the Spark to BCP Pipeline

This guide walks you through getting the pipeline running in about 5 minutes.

Step 1: Deploy Azure Infrastructure

First, deploy the required Azure resources (Storage Account and SQL Database):

cd /home/kemack/github-projects/bcp-investigation
./infra/deploy.sh

This will:

  • Create Azure Storage Account with container
  • Create Azure SQL Database with table
  • Generate secure passwords
  • Update configuration files automatically
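
To confirm the deployment succeeded, you can list the new resources with the Azure CLI. The resource group name below is a placeholder; substitute the one deploy.sh reports:

# List the deployed storage account and SQL server
az storage account list --resource-group <your-resource-group> --output table
az sql server list --resource-group <your-resource-group> --output table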

Step 2: Run Environment Setup

./scripts/setup_environment.sh

Step 3: Configure Your Azure Settings (Optional)

If you deployed with the infrastructure script, your config/azure_config.json has already been populated automatically!

For manual configuration, edit the file with your values:

{
  "storage_account": {
    "name": "YOUR_STORAGE_ACCOUNT_NAME",
    "container": "data-pipeline",
    "connection_string": "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
  },
  "sql_server": {
    "server": "your-server.database.windows.net",
    "database": "your_database",
    "username": "your_username",
    "password": "your_password",
    "table": "sales_data"
  }
}
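
If you need the connection string for manual configuration, one way to retrieve it is with the Azure CLI (the account and resource group names are placeholders):

# Print the full connection string for the storage account
az storage account show-connection-string \
  --name YOUR_STORAGE_ACCOUNT_NAME \
  --resource-group YOUR_RESOURCE_GROUP \
  --query connectionString --output tsv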

Step 4: Run the Pipeline

# Run complete pipeline
./scripts/run_pipeline.sh

# Or run individual stages
./scripts/run_pipeline.sh spark-only
./scripts/run_pipeline.sh bcp-only

📊 What the Pipeline Does

  1. Spark Processing (spark/data_processor.py):

    • Loads sample sales data
    • Generates additional test data (30 days' worth)
    • Aggregates data by category, region, and date
    • Prepares data in BCP-compatible format
  2. Azure Blob Export (spark/spark_to_blob.py):

    • Exports processed data to Azure Blob Storage
    • Uses a pipe-delimited format well suited to BCP
    • Creates BCP format files
    • Includes comprehensive logging
  3. BCP Import (bcp/bcp_import.sh):

    • Downloads data from Azure Blob Storage
    • Uses BCP utility to bulk import into SQL Server
    • Includes error handling and validation
    • Supports both direct URL and local file approaches
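
As a rough sketch of the local-file approach in stage 3 (not the script's exact contents; the blob name and local path are illustrative, and the connection string is assumed to be exported as an environment variable):

# Download the exported data file from Blob Storage before importing with BCP
az storage blob download \
  --container-name data-pipeline \
  --name sales_data.csv \
  --file /tmp/sales_data.csv \
  --connection-string "$AZURE_STORAGE_CONNECTION_STRING"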

🔧 Configuration Options

Spark Configuration

  • Executor memory: 2GB (configurable)
  • Driver memory: 1GB (configurable)
  • Adaptive query execution enabled
  • Coalesces output to a single file for BCP
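
These settings correspond roughly to spark-submit options like the following (a sketch only; the actual invocation lives in the project scripts):

# Local-mode run with the memory settings and adaptive execution described above
spark-submit \
  --master "local[*]" \
  --driver-memory 1g \
  --executor-memory 2g \
  --conf spark.sql.adaptive.enabled=true \
  spark/data_processor.py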

BCP Configuration

  • Batch size: 10,000 records (configurable)
  • Field terminator: | (pipe)
  • Character encoding: UTF-8
  • Error logging enabled
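
These options map onto BCP flags roughly as follows (server, database, and credentials are placeholders; -C 65001 selects the UTF-8 code page):

# Character-mode import: pipe-delimited, UTF-8, 10,000-row batches, errors logged
bcp dbo.sales_data in /tmp/sales_data.csv \
  -S your-server.database.windows.net -d your_database \
  -U your_username -P 'your_password' \
  -c -C 65001 -t '|' -b 10000 \
  -e bcp_import_errors.log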

Azure Configuration

  • Supports both storage account key and Azure AD authentication
  • Configurable container and blob paths
  • SAS token support for secure access
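
For SAS-based access, a read/list token for the container can be generated with the Azure CLI (the expiry date and permissions here are illustrative; the command authenticates with the account key from your environment or --account-key):

# Generate a SAS token granting read and list access until the given expiry
az storage container generate-sas \
  --account-name YOUR_STORAGE_ACCOUNT_NAME \
  --name data-pipeline \
  --permissions rl \
  --expiry 2026-01-01T00:00Z \
  --output tsv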

📁 Key Files

  • infra/main.bicep - Azure infrastructure template
  • infra/deploy.sh - Infrastructure deployment script
  • config/azure_config.json - Main configuration (auto-generated)
  • data/sample_data.csv - Sample input data
  • spark/data_processor.py - Data processing logic
  • spark/spark_to_blob.py - Spark to Blob export
  • sql/create_table.sql - SQL Server table schema (auto-executed)
  • bcp/bcp_import.sh - BCP import script
  • scripts/run_pipeline.sh - Complete pipeline runner

🐛 Troubleshooting

Common Issues

  1. PySpark Import Errors:

    • Ensure Java 11 is installed
    • Activate virtual environment: source venv/bin/activate
  2. BCP Not Found:

    • Install SQL Server tools: sudo apt-get install mssql-tools18
    • Add to PATH: export PATH="$PATH:/opt/mssql-tools18/bin"
  3. Azure Blob Access Issues:

    • Verify storage account credentials
    • Check SAS token permissions and expiration
    • Ensure container exists
  4. SQL Server Connection Issues:

    • Verify server name and credentials
    • Check firewall settings
    • Ensure target database and table exist
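
For SQL Server connection issues, a quick connectivity check with sqlcmd (installed alongside BCP in mssql-tools18; placeholders as before) can help isolate the problem:

# If this fails, fix connectivity/credentials before debugging BCP itself
sqlcmd -S your-server.database.windows.net -d your_database \
  -U your_username -P 'your_password' -Q "SELECT 1"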

Logs and Debugging

  • Pipeline logs: pipeline_YYYYMMDD_HHMMSS.log
  • Spark logs: logs/spark_YYYYMMDD_HHMMSS.log
  • BCP logs: logs/bcp_YYYYMMDD_HHMMSS.log
  • Error logs: bcp_import_errors_YYYYMMDD_HHMMSS.log
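
A quick way to scan all of these logs for failures:

# Case-insensitive search for error lines across pipeline and component logs
grep -iE "error|fail" pipeline_*.log logs/*.log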

📈 Scaling Considerations

For Large Datasets

  1. Spark Scaling:

    • Increase executor memory and cores
    • Use cluster mode instead of local mode
    • Partition data appropriately
  2. BCP Optimization:

    • Increase batch size for larger datasets
    • Use multiple parallel BCP processes (see the sketch after this list)
    • Consider table partitioning in SQL Server
  3. Azure Blob Storage:

    • Use multiple containers for partitioning
    • Consider Azure Data Factory for orchestration
    • Implement retry logic for transient failures
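
A minimal sketch of the parallel-BCP idea from item 2, assuming the export has already been downloaded locally and that connection details live in environment variables (all names are placeholders):

# Split the file into 1M-line chunks, then run one BCP import per chunk in parallel
split -l 1000000 /tmp/sales_data.csv /tmp/sales_chunk_
for f in /tmp/sales_chunk_*; do
  bcp dbo.sales_data in "$f" \
    -S "$SQL_SERVER" -d "$SQL_DATABASE" -U "$SQL_USER" -P "$SQL_PASSWORD" \
    -c -t '|' -b 10000 -e "${f}_errors.log" &
done
wait  # Block until every background import finishes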

🔒 Security Best Practices

  1. Use Azure AD Authentication instead of storage account keys
  2. Store credentials in Azure Key Vault (example after this list)
  3. Use managed identities when running on Azure
  4. Implement least privilege access for all resources
  5. Enable logging and monitoring for all components
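
As an example of item 2, the SQL password could be fetched from Key Vault at runtime instead of living in config/azure_config.json (the vault and secret names are hypothetical):

# Pull the password from Key Vault into an environment variable at runtime
SQL_PASSWORD=$(az keyvault secret show \
  --vault-name your-key-vault --name sql-password \
  --query value --output tsv)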

🎯 Next Steps

After running the pipeline successfully:

  1. Monitor the imported data in SQL Server
  2. Validate infrastructure using ./infra/validate.sh
  3. Schedule regular pipeline runs using cron or Azure Data Factory (sample crontab entry below)
  4. Implement data quality checks and validation rules
  5. Set up alerting for pipeline failures
  6. Scale the solution based on your data volume requirements
  7. Review security settings and consider Azure AD authentication for production
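
As a starting point for step 3, a crontab entry that runs the pipeline nightly at 2:00 AM might look like this (the log path is illustrative; adjust the repository path to your checkout):

# Run the full pipeline every night at 02:00 and append output to a log
0 2 * * * /home/kemack/github-projects/bcp-investigation/scripts/run_pipeline.sh >> /var/log/pipeline_cron.log 2>&1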