
Spark to Azure Blob to SQL Server BCP Pipeline

This project demonstrates a complete data pipeline that:

  1. Uses Apache Spark to process sample data
  2. Exports processed data to Azure Blob Storage in BCP-compatible format
  3. Uses BCP utility to bulk import data into SQL Server

Project Structure

bcp-investigation/
├── README.md
├── requirements.txt
├── config/
│   ├── spark_config.py
│   └── azure_config.json
├── data/
│   └── sample_data.csv
├── spark/
│   ├── data_processor.py
│   └── spark_to_blob.py
├── sql/
│   ├── create_table.sql
│   └── format_file.fmt
├── bcp/
│   ├── bcp_import.sh
│   └── bcp_import.ps1
├── infra/
│   ├── main.bicep
│   ├── main.parameters.json
│   ├── deploy.sh
│   └── README.md
└── scripts/
    ├── run_pipeline.sh
    └── setup_environment.sh

Prerequisites

  1. Azure Resources:

    • Azure Storage Account
    • Azure SQL Database or SQL Server instance
    • Service Principal (for authentication)
  2. Software Requirements:

    • Python 3.8+
    • Apache Spark 3.x
    • SQL Server BCP utility
    • Azure CLI
  3. Create Configuration and Parameter Files:

    cp ./infra/main.parameters.json.copy ./infra/main.parameters.json
    cp ./config/azure_config.json.copy ./config/azure_config.json

  4. Install Azure CLI:

    curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
  5. Install BCP:

    chmod +x ./scripts/install_bcp.sh
    bash ./scripts/install_bcp.sh
    
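To confirm the prerequisite tooling is installed and on the PATH before continuing, the standard version checks below can be used (nothing here is specific to this repository):

    python3 --version          # Python 3.8+
    az --version | head -n 1   # Azure CLI
    spark-submit --version     # Apache Spark 3.x
    bcp -v                     # SQL Server BCP utility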

Setup Instructions

For quick-start information and details on local configuration, follow the steps below.

  1. Deploy Azure Infrastructure: For more information, see infra/README.md.

    chmod +x infra/deploy.sh
    ./infra/deploy.sh
  2. Setup Local Environment:

    chmod +x scripts/setup_environment.sh
    ./scripts/setup_environment.sh

NOTE: The following updates must be made manually in the Azure Portal (a CLI sketch of these steps appears at the end of this section):

  • Update the SQL Server to enable Entra ID, adding your current user as the SQL admin.
  • Open the database and run the query in ./sql/create_table.sql to create the table the data will land in.
  • The user account used to run the scripts needs Azure permissions on the blob storage account, specifically the "Storage Blob Data Contributor" role.
  3. Run the Pipeline:

    chmod +x scripts/run_pipeline.sh
    ./scripts/run_pipeline.sh

NOTE: To clean up the log files generated during execution, run:

    bash ./scripts/clear_log_files.sh
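
The manual Azure Portal updates noted above can also be scripted with the Azure CLI and sqlcmd. This is only a sketch: the resource group, server, storage account, and user names below are placeholders, not values taken from this repository.

    # Set an Entra ID admin on the SQL server (placeholder names)
    az sql server ad-admin create \
      --resource-group <resource-group> \
      --server-name <sql-server-name> \
      --display-name "<your-user@domain.com>" \
      --object-id "<your-user-object-id>"

    # Grant the Storage Blob Data Contributor role on the storage account
    az role assignment create \
      --assignee "<your-user@domain.com>" \
      --role "Storage Blob Data Contributor" \
      --scope "$(az storage account show --name <storage-account> --query id -o tsv)"

    # Create the landing table using Entra ID authentication (-G)
    sqlcmd -S <sql-server-name>.database.windows.net -d <database> -G -i ./sql/create_table.sql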

Pipeline Flow

  1. Data Generation: Creates sample sales data
  2. Spark Processing: Transforms and aggregates data
  3. Blob Export: Writes data to Azure Blob Storage in pipe-delimited format
  4. BCP Import: Bulk imports data into SQL Server using BCP utility
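
As a rough illustration of steps 2-4, the commands below show the shape of the calls the pipeline makes. Server, database, table, container, and file names are placeholders; the real values come from config/azure_config.json and the scripts under spark/ and bcp/.

    # 2. Run the Spark job that processes the data and writes it to Blob Storage
    spark-submit spark/spark_to_blob.py

    # 3. Pull the pipe-delimited export back down for BCP (placeholder blob/file names)
    az storage blob download \
      --account-name <storage-account> \
      --container-name <container> \
      --name exports/sales_data.dat \
      --file ./sales_data.dat \
      --auth-mode login

    # 4. Bulk import into SQL Server: character mode (-c), pipe delimiter (-t),
    #    errors logged with -e; a format file (-f sql/format_file.fmt) can replace -c/-t
    bcp <database>.dbo.<table> in ./sales_data.dat \
      -S <sql-server-name>.database.windows.net \
      -G -c -t "|" -e bcp_errors.log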

Configuration

Edit config/azure_config.json to match your Azure environment:

  • Storage account details
  • SQL Server connection information
  • Authentication credentials
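
After editing the file, a quick sanity check that it is still valid JSON can save a failed pipeline run later:

    # Fails with a parse error if the edited config is not valid JSON
    python3 -m json.tool config/azure_config.json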

Monitoring and Logging

All operations include comprehensive logging:

  • Spark job logs
  • BCP operation logs
  • Error handling and retry logic

Performance Considerations

  • File sizes optimized for BCP performance
  • Partitioning strategy for large datasets
  • Connection pooling and timeout settings
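
On the BCP side, throughput is usually tuned with the batch size (-b), network packet size (-a), and bulk-load hints (-h). The values and file names below are illustrative only and are not taken from the repository scripts.

    # Commit every 100,000 rows, use 32 KB network packets, and take a table lock
    bcp <database>.dbo.<table> in ./sales_data.dat \
      -S <sql-server-name>.database.windows.net \
      -G -c -t "|" \
      -b 100000 -a 32768 -h "TABLOCK"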
