AWS Glue Script for Data Migration Explained
This script automates the process of extracting data from Google BigQuery tables and storing it in Amazon S3 for further analysis or storage.
```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
```

This section sets up the environment for the AWS Glue job: it resolves the job arguments and initializes the Spark context, Glue context, Spark session, and the Glue job itself.
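`getResolvedOptions` pulls named arguments (here, `JOB_NAME`) out of `sys.argv`. As a rough illustration only, not the actual awsglue implementation, its behavior for simple `--NAME value` pairs can be sketched in plain Python:

```python
def resolve_options(argv, option_names):
    """Illustrative sketch (not awsglue's real code): pull --NAME value
    pairs out of an argv-style list for each requested option name."""
    resolved = {}
    for name in option_names:
        flag = f"--{name}"
        if flag in argv:
            resolved[name] = argv[argv.index(flag) + 1]
    return resolved

# e.g. a Glue job launched with: --JOB_NAME bigquery-to-s3
print(resolve_options(["script.py", "--JOB_NAME", "bigquery-to-s3"],
                      ["JOB_NAME"]))
# → {'JOB_NAME': 'bigquery-to-s3'}
```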
```python
table_names = ["linksfinal", "postsfinal", "blocksfinal",
               "tagsfinal", "mentionsfinal", "followsfinal"]
```

A list of table names (`table_names`) is defined. These are the Google BigQuery tables that will be processed and migrated to Amazon S3.
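Since both the BigQuery source and the S3 destination are derived from the table name, the per-table mapping can be made explicit with a small pure-Python helper (`migration_targets` is a hypothetical name introduced here, not part of the script):

```python
def migration_targets(tables, dataset="bluesky_social",
                      bucket="arbiter.datasets"):
    """Hypothetical helper: return (bigquery_table, s3_path) pairs,
    mirroring the dataset and bucket names used in the script."""
    return [
        (f"{dataset}.{t}", f"s3://{bucket}/data/{dataset}/{t}/")
        for t in tables
    ]

for source, dest in migration_targets(["linksfinal", "postsfinal"]):
    print(source, "->", dest)
```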
A loop then iterates over each table name in the `table_names` list.

```python
for table_name in table_names:
    GoogleBigQuery_node1698466006405 = glueContext.create_dynamic_frame.from_options(
        connection_type="bigquery",
        connection_options={
            "connectionName": "Big Query Connection",
            "parentProject": "infinite-rope-363317",
            "sourceType": "table",
            "table": f"bluesky_social.{table_name}",
        },
        transformation_ctx="GoogleBigQuery_node1698466006405",
    )
```

For each table, a DynamicFrame is created from Google BigQuery using `create_dynamic_frame.from_options`. It specifies the connection type as BigQuery and provides connection options such as the connection name, parent project, source type, and table name.
```python
    AmazonS3_node1698466010194 = glueContext.write_dynamic_frame.from_options(
        frame=GoogleBigQuery_node1698466006405,
        connection_type="s3",
        format="glueparquet",
        connection_options={
            "path": f"s3://arbiter.datasets/data/bluesky_social/{table_name}/",
            "partitionKeys": [],
        },
        format_options={"compression": "snappy"},
        transformation_ctx="AmazonS3_node1698466010194",
    )
```

The DynamicFrame read from Google BigQuery is then written to Amazon S3 using `write_dynamic_frame.from_options`. It specifies the connection type as S3 and provides connection options such as the S3 path to write to, partition keys (none here), and format options (the Glue Parquet format with Snappy compression).
```python
job.commit()
```

Finally, once the loop has processed every table, the job is committed, signaling that all extraction and write tasks have completed.
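As written, an error on any one table (for example, a missing BigQuery table or an S3 permissions issue) raises out of the loop and fails the whole job before `job.commit()` runs. If partial progress is acceptable, failures can be isolated per table; in this sketch, `extract_table` and `write_table` are stand-in stubs for the Glue read and write calls above, not real Glue APIs:

```python
def migrate(tables, extract, write):
    """Run the per-table copy, collecting failures instead of aborting."""
    failed = []
    for table_name in tables:
        try:
            write(extract(table_name), table_name)
        except Exception as exc:
            # Log and move on so one bad table does not abort the rest.
            print(f"Migration failed for {table_name}: {exc}")
            failed.append(table_name)
    return failed

# Stand-ins for the Glue read/write calls shown above.
def extract_table(table_name):
    return {"table": table_name}

def write_table(frame, table_name):
    pass

print(migrate(["linksfinal", "postsfinal"], extract_table, write_table))
# → []
```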