This service processes PDF files stored in a Google Drive folder and demonstrates a simple, cloud-native processing pipeline.
Retrieve → Parse → Transform → Decide → Route
The current implementation extracts a simple title from the PDF text to generate a filename. This transformation step is intentionally minimal and serves only as an example.
The pipeline is designed so that transformation logic can be replaced via pluggable policies (strategy pattern) without changing the overall workflow.
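As a rough illustration of the pluggable-policy idea, the sketch below defines a small policy interface and one example policy. The names `FilenamePolicy`, `firstLinePolicy`, and `buildFilename` are hypothetical and not taken from the service's code; the actual interface may look different.

```typescript
// Hypothetical policy interface: the real service may name and shape this differently.
interface FilenamePolicy {
  /** Derives a filename (without extension) from the extracted PDF text. */
  deriveFilename(pdfText: string): string;
}

// Example policy: use the first non-empty line of the text as the title.
const firstLinePolicy: FilenamePolicy = {
  deriveFilename(pdfText: string): string {
    const firstLine = pdfText
      .split(/\r?\n/)
      .map((line) => line.trim())
      .find((line) => line.length > 0);
    // Replace characters that are awkward in filenames and cap the length.
    return (firstLine ?? "untitled").replace(/[\\/:*?"<>|]+/g, "-").slice(0, 80);
  },
};

// The workflow depends only on the interface, so policies can be swapped
// without touching the surrounding pipeline steps.
function buildFilename(pdfText: string, policy: FilenamePolicy): string {
  return `${policy.deriveFilename(pdfText)}.pdf`;
}
```

A different policy (for example, one that extracts an invoice number) would implement the same interface and be passed to `buildFilename` in place of `firstLinePolicy`.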
The system is designed to be:
- Deterministic — the same input produces the same outcome
- Idempotent — files can be safely retried without duplication
- Cloud-native — built around Cloud Run and Cloud Tasks
- Fully automated — no manual intervention required
- Extensible — transformation logic can evolve independently
At a high level the service performs the following steps:
- Retrieve PDF files from a “to process” Google Drive folder.
- Download each file.
- Extract a simple title from the PDF text.
- Check whether the document is a duplicate.
- Route the file to the appropriate folder:
  - ✅ Processed: renamed and stored
  - ♻️ Duplicates: duplicate content detected
  - ❌ Failed: processing error occurred
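The routing decision above can be sketched as a pure function, which also keeps it deterministic. This is a minimal sketch under assumptions: the README does not say how duplicates are detected, so the `contentHash` fingerprint and the `seenHashes` set are hypothetical stand-ins.

```typescript
type Route = "processed" | "duplicates" | "failed";

interface ProcessingOutcome {
  error?: string; // set when any earlier step failed
  contentHash?: string; // hypothetical fingerprint of the PDF content
}

// Pure function: the same outcome and the same set of known hashes
// always produce the same route.
function decideRoute(
  outcome: ProcessingOutcome,
  seenHashes: ReadonlySet<string>,
): Route {
  if (outcome.error !== undefined || outcome.contentHash === undefined) {
    return "failed";
  }
  return seenHashes.has(outcome.contentHash) ? "duplicates" : "processed";
}
```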
File discovery and file processing are intentionally decoupled. A dispatcher scans the Drive folder and creates one Cloud Task per file. Cloud Tasks then controls processing rate, retries, and delivery.
Google Drive folder
│
▼
Dispatcher scans files
│
▼
enqueueHttpTask({ fileId })
│
▼
Cloud Tasks queue
(rate limited)
│
▼
Cloud Run worker endpoint
│
▼
PDF processing pipeline
Each file is processed independently through the queue. Key principles:
- 1 task = 1 PDF
- Workers process exactly one file per task
- Failures and retries occur per file, not per batch
This approach isolates failures, enables controlled throughput, and prevents bursts of uploads from overwhelming downstream APIs.
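The "1 task = 1 PDF" fan-out might look roughly like the sketch below. `enqueueHttpTask` is the name used in the diagram above, but its signature here is an assumption, as are the `DriveFile` and `TaskPayload` types and the `planTasks` and `dispatch` helpers.

```typescript
interface DriveFile {
  id: string;
  name: string;
}

interface TaskPayload {
  fileId: string;
}

// Assumed signature for the enqueue helper named in the diagram.
type EnqueueHttpTask = (payload: TaskPayload) => Promise<void>;

// Builds exactly one payload per distinct file ID, so a file that is listed
// twice in one scan does not produce two tasks.
function planTasks(files: DriveFile[]): TaskPayload[] {
  const seen = new Set<string>();
  const payloads: TaskPayload[] = [];
  for (const file of files) {
    if (seen.has(file.id)) continue;
    seen.add(file.id);
    payloads.push({ fileId: file.id });
  }
  return payloads;
}

// Enqueues tasks one by one; a failure for one file does not stop the others.
async function dispatch(
  files: DriveFile[],
  enqueueHttpTask: EnqueueHttpTask,
): Promise<{ enqueued: string[]; failed: string[] }> {
  const enqueued: string[] = [];
  const failed: string[] = [];
  for (const payload of planTasks(files)) {
    try {
      await enqueueHttpTask(payload);
      enqueued.push(payload.fileId);
    } catch {
      failed.push(payload.fileId);
    }
  }
  return { enqueued, failed };
}
```

Files that fail to enqueue stay in the "to process" folder in this sketch, so the next dispatcher scan would pick them up again.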
- main: Production branch. This branch represents the stable, production-ready version of the code. It is used for deployments to the live environment.
- dev: Development branch. This is the default branch for ongoing development work, where new features and bug fixes are implemented and tested before being merged into main. It is used for deployments to the staging environment.
The service expects the following Google Drive folders:
- PDFS_TO_PROCESS_FOLDER_ID
- PDFS_PROCESSED_FOLDER_ID
- PDFS_DUPLICATES_FOLDER_ID
- PDFS_FAILED_FOLDER_ID
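One way to fail fast when a folder ID is missing is to validate these variables at startup. The sketch below is illustrative only: the variable names come from the list above, while `loadFolderConfig` and its types are hypothetical. The environment is passed in as a plain object (for example `process.env`) to keep the function easy to test.

```typescript
const FOLDER_ENV_VARS = [
  "PDFS_TO_PROCESS_FOLDER_ID",
  "PDFS_PROCESSED_FOLDER_ID",
  "PDFS_DUPLICATES_FOLDER_ID",
  "PDFS_FAILED_FOLDER_ID",
] as const;

type FolderEnvVar = (typeof FOLDER_ENV_VARS)[number];
type FolderConfig = Record<FolderEnvVar, string>;

// Reads the four folder IDs and reports every missing one in a single error.
function loadFolderConfig(
  env: Record<string, string | undefined>,
): FolderConfig {
  const missing = FOLDER_ENV_VARS.filter((name) => !env[name]?.trim());
  if (missing.length > 0) {
    throw new Error(
      `Missing required environment variables: ${missing.join(", ")}`,
    );
  }
  const config = {} as FolderConfig;
  for (const name of FOLDER_ENV_VARS) {
    config[name] = (env[name] as string).trim();
  }
  return config;
}
```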
- Enable the Google Drive API for your project.
- Create a service account in Google Cloud (do not assign any roles).
- Generate a JSON key, stringify it, and save it to the GOOGLE_SERVICE_ACCOUNT_JSON environment variable.
- Share the Google Drive folders with the service account: enter the service account email and assign the Editor or Viewer role, depending on your needs.
npm run build
docker build -t pdf-processing-service .
# With environment variables
docker run --env-file .env -p 8080:8080 pdf-processing-service
Note: Ensure required environment variables are defined in your local .env file.
Enable the Cloud Run and Secret Manager APIs:
gcloud services enable run.googleapis.com
gcloud services enable secretmanager.googleapis.com
# Command
gcloud iam service-accounts create <SERVICE_ACCOUNT_NAME> \
--display-name="<DISPLAY_NAME>"
# Example
gcloud iam service-accounts create pdf-processing-service-sa \
--display-name="PDF Processing Cloud Run Service"
Get the email:
gcloud iam service-accounts list --filter="email:pdf-processing-service-sa"
Grant permissions:
# Command
gcloud projects add-iam-policy-binding <PROJECT_ID> \
--member="serviceAccount:<SERVICE_ACCOUNT_NAME>@<PROJECT_ID>.iam.gserviceaccount.com" \
--role="roles/secretmanager.secretAccessor"
# Example
gcloud projects add-iam-policy-binding drive-pdf-processing-pipeline \
--member="serviceAccount:pdf-processing-service-sa@drive-pdf-processing-pipeline.iam.gserviceaccount.com" \
--role="roles/secretmanager.secretAccessor"
# Command
gcloud auth configure-docker <REGION>-docker.pkg.dev
# Example
gcloud auth configure-docker europe-west3-docker.pkg.dev
# Command
gcloud artifacts repositories create <REPOSITORY_NAME> \
--project=<PROJECT_ID> \
--repository-format=docker \
--location=<REGION> \
--description="Docker repository for <DESCRIPTION>"
# Example
gcloud artifacts repositories create pdf-processing-repo \
--project=drive-pdf-processing-pipeline \
--repository-format=docker \
--location=europe-west3 \
--description="Docker repository for PDF processing service"
Store only secrets here. Configuration values do not belong in Secret Manager.
gcloud secrets create GOOGLE_SERVICE_ACCOUNT_JSON
# First, compile TypeScript locally:
npm run build
# On an Apple Silicon (M1/M2) Mac, build for linux/amd64 when deploying to Cloud Run:
docker buildx build --platform linux/amd64 -t <LOCAL_IMAGE_NAME> <BUILD_CONTEXT>
# Example
docker buildx build --platform linux/amd64 -t pdf-processing-service .
# Command
docker tag <LOCAL_IMAGE_NAME> <REGION>-docker.pkg.dev/<PROJECT_ID>/<REPOSITORY_NAME>/<REMOTE_IMAGE_NAME>
# Example
docker tag pdf-processing-service europe-west3-docker.pkg.dev/drive-pdf-processing-pipeline/pdf-processing-repo/pdf-processing-service
# Command
docker push <REGION>-docker.pkg.dev/<PROJECT_ID>/<REPOSITORY_NAME>/<REMOTE_IMAGE_NAME>
# Example
docker push europe-west3-docker.pkg.dev/drive-pdf-processing-pipeline/pdf-processing-repo/pdf-processing-service
# Command
gcloud run deploy <SERVICE_NAME> \
--image <REGION>-docker.pkg.dev/<PROJECT_ID>/<REPOSITORY_NAME>/<IMAGE_NAME> \
--region <REGION> \
--service-account=<SERVICE_ACCOUNT_EMAIL> \
--concurrency=<CONCURRENCY> \
--max-instances=<MAX_INSTANCES> \
--set-env-vars "<CONFIG_VAR_1>=<VALUE_1>,<CONFIG_VAR_2>=<VALUE_2>" \
--update-secrets "<SECRET_ENV_VAR_1>=<SECRET_NAME_1>:latest"
# Example
gcloud run deploy pdf-processing-service \
--image europe-west3-docker.pkg.dev/drive-pdf-processing-pipeline/pdf-processing-repo/pdf-processing-service \
--region europe-west3 \
--service-account=pdf-processing-service-sa@drive-pdf-processing-pipeline.iam.gserviceaccount.com \
--concurrency=1 \
--max-instances=1 \
--set-env-vars "PDFS_TO_PROCESS_FOLDER_ID=1863uE4CLsfpogKtEt3kOrJlg05F54rOO,PDFS_PROCESSED_FOLDER_ID=1GFxPgOeoqQlT2vfPdYVzO4TM0XkmpiQC,PDFS_DUPLICATES_FOLDER_ID=1tA3xSjQ0nz68vWa_8SC14OBw-Cdhadn1,PDFS_FAILED_FOLDER_ID=1LAIY15MwdJnl8WRv_nkAfBfRB65GTwjQ" \
--update-secrets "GOOGLE_SERVICE_ACCOUNT_JSON=GOOGLE_SERVICE_ACCOUNT_JSON:latest"
gcloud run deploy pdf-processing-service \
--image europe-west3-docker.pkg.dev/drive-pdf-processing-pipeline/pdf-processing-repo/pdf-processing-service \
--region europe-west3
Go to the Log Explorer and run this query:
resource.type="cloud_run_revision"
resource.labels.service_name="<SERVICE_NAME>"
You can test private Cloud Run services using your own Google identity. Your identity is simply the Google account you authenticated with gcloud.
First, grant permissions:
gcloud run services add-iam-policy-binding <SERVICE_NAME> \
--region=europe-west3 \
--member="user:<EMAIL>" \
--role="roles/run.invoker"
Second, generate an identity token:
gcloud auth print-identity-token
Then use it as a Bearer token in the Authorization header.
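For a quick manual test from a script, the request options could be assembled as in the sketch below. The helper `buildAuthorizedRequest` and the JSON payload shape are assumptions for illustration; only the `Authorization: Bearer <token>` header format is taken from the step above.

```typescript
interface AuthorizedRequestInit {
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds request options carrying the identity token as a Bearer token.
function buildAuthorizedRequest(
  identityToken: string,
  payload: unknown,
): AuthorizedRequestInit {
  // Command output usually ends with a newline, so trim before use.
  const token = identityToken.trim();
  if (token.length === 0) {
    throw new Error("Identity token is empty");
  }
  return {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(payload),
  };
}
```

The result can be passed as the second argument to `fetch`, with the service URL plus endpoint path as the first.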
Before creating the queue, enable the required Google Cloud APIs.
gcloud services enable cloudtasks.googleapis.com
gcloud services enable iamcredentials.googleapis.com
Before running the service, create the Cloud Tasks queue used to dispatch PDF processing jobs.
This only needs to be done once per environment.
# Command
gcloud tasks queues create <QUEUE_NAME> \
--location=<REGION> \
--max-dispatches-per-second=<DISPATCH_RATE> \
--max-concurrent-dispatches=<MAX_CONCURRENCY>
# Example
gcloud tasks queues create pdf-processing-queue \
--location=europe-west3 \
--max-dispatches-per-second=1 \
--max-concurrent-dispatches=1
The queue throttles execution so that PDFs are processed in a controlled way.
- max-dispatches-per-second: 1 → prevents API spikes
- max-concurrent-dispatches: 1 → ensures only one PDF is processed at a time
If many PDFs are uploaded simultaneously, they will be queued and processed sequentially.
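To get a feel for what these limits mean for a burst of uploads, the following back-of-the-envelope estimate may help. It is a simplification, not a description of how Cloud Tasks schedules work internally: it assumes a constant average processing time per PDF and ignores retries and cold starts. The function name and parameters are hypothetical.

```typescript
// Rough estimate of how long a backlog takes to drain.
// Effective throughput is bounded by both the dispatch rate and by how many
// tasks can be in flight at once given the average task duration.
function estimateDrainSeconds(
  fileCount: number,
  maxDispatchesPerSecond: number,
  maxConcurrentDispatches: number,
  avgTaskSeconds: number,
): number {
  if (maxDispatchesPerSecond <= 0 || maxConcurrentDispatches <= 0 || avgTaskSeconds <= 0) {
    throw new Error("Rates, concurrency, and duration must be positive");
  }
  const rateLimit = maxDispatchesPerSecond; // tasks per second
  const concurrencyLimit = maxConcurrentDispatches / avgTaskSeconds; // tasks per second
  return fileCount / Math.min(rateLimit, concurrencyLimit);
}
```

With the example queue settings (rate 1, concurrency 1) and an assumed 4 seconds per PDF, 60 uploaded files would take about 240 seconds, because concurrency rather than the dispatch rate is the bottleneck.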
# Command
gcloud iam service-accounts create <SERVICE_ACCOUNT_NAME> \
--display-name="<DISPLAY_NAME>"
# Example
gcloud iam service-accounts create pdf-processing-task-invoker \
--display-name="PDF Processing Task Invoker"
Verify and get the email:
gcloud iam service-accounts list --filter="email:pdf-processing-task-invoker"
# Command
gcloud run services add-iam-policy-binding <SERVICE_NAME> \
--member="serviceAccount:<TASK_INVOKER_SERVICE_ACCOUNT>@<PROJECT_ID>.iam.gserviceaccount.com" \
--role="roles/run.invoker" \
--region=<REGION>
# Example
gcloud run services add-iam-policy-binding pdf-processing-service \
--member="serviceAccount:pdf-processing-task-invoker@drive-pdf-processing-pipeline.iam.gserviceaccount.com" \
--role="roles/run.invoker" \
--region=europe-west3
This allows Cloud Tasks to call the worker over HTTPS using an OIDC token from that service account.
# Command
gcloud projects add-iam-policy-binding <PROJECT_ID> \
--member="serviceAccount:<ENQUEUER_SERVICE_ACCOUNT>@<PROJECT_ID>.iam.gserviceaccount.com" \
--role="roles/cloudtasks.enqueuer"
# Example
gcloud projects add-iam-policy-binding drive-pdf-processing-pipeline \
--member="serviceAccount:pdf-processing-service-sa@drive-pdf-processing-pipeline.iam.gserviceaccount.com" \
--role="roles/cloudtasks.enqueuer"
This is needed because the Cloud Run service's runtime identity is what calls the Cloud Tasks API to create the task. The roles/cloudtasks.enqueuer role includes the cloudtasks.tasks.create permission.
# Command
gcloud iam service-accounts add-iam-policy-binding \
<TASK_INVOKER_SERVICE_ACCOUNT>@<PROJECT_ID>.iam.gserviceaccount.com \
--member="serviceAccount:<ENQUEUER_SERVICE_ACCOUNT>@<PROJECT_ID>.iam.gserviceaccount.com" \
--role="roles/iam.serviceAccountUser"
# Example
gcloud iam service-accounts add-iam-policy-binding \
pdf-processing-task-invoker@drive-pdf-processing-pipeline.iam.gserviceaccount.com \
--member="serviceAccount:pdf-processing-service-sa@drive-pdf-processing-pipeline.iam.gserviceaccount.com" \
--role="roles/iam.serviceAccountUser"
This is needed only if the service account creating the task is different from the service account specified in oidcToken.serviceAccountEmail.
Your application needs the queue and invoker identity in its runtime config.
Required env vars
GCP_PROJECT_ID
CLOUD_TASKS_LOCATION
CLOUD_TASKS_PDF_PROCESSING_QUEUE_NAME
CLOUD_TASKS_INVOKER_SERVICE_ACCOUNT_EMAIL
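The sketch below shows how these values could fit together when building a task. It is an assumption-laden illustration, not the service's actual code: `loadTasksConfig`, `queuePath`, and `buildTask` are hypothetical helpers, and `PDF_PROCESSING_WORKER_URL` is the worker URL variable set in a later step of this README. The queue path format and the `httpRequest` / `oidcToken` field names follow the Cloud Tasks API; the body is expected to be supplied already base64-encoded.

```typescript
const TASKS_ENV_VARS = [
  "GCP_PROJECT_ID",
  "CLOUD_TASKS_LOCATION",
  "CLOUD_TASKS_PDF_PROCESSING_QUEUE_NAME",
  "CLOUD_TASKS_INVOKER_SERVICE_ACCOUNT_EMAIL",
  "PDF_PROCESSING_WORKER_URL",
] as const;

type TasksConfig = Record<(typeof TASKS_ENV_VARS)[number], string>;

// Reads the Cloud Tasks settings from an env-like object (e.g. process.env).
function loadTasksConfig(env: Record<string, string | undefined>): TasksConfig {
  const config = {} as TasksConfig;
  for (const name of TASKS_ENV_VARS) {
    const value = env[name];
    if (!value) {
      throw new Error(`Missing required environment variable: ${name}`);
    }
    config[name] = value;
  }
  return config;
}

// Fully qualified queue name used as the parent when creating a task.
function queuePath(config: TasksConfig): string {
  return (
    `projects/${config.GCP_PROJECT_ID}` +
    `/locations/${config.CLOUD_TASKS_LOCATION}` +
    `/queues/${config.CLOUD_TASKS_PDF_PROCESSING_QUEUE_NAME}`
  );
}

// Task object for an HTTP target. Cloud Tasks attaches an OIDC token minted
// for the invoker service account when it calls the worker.
function buildTask(config: TasksConfig, bodyBase64: string) {
  return {
    httpRequest: {
      httpMethod: "POST" as const,
      url: config.PDF_PROCESSING_WORKER_URL,
      headers: { "Content-Type": "application/json" },
      body: bodyBase64,
      oidcToken: {
        serviceAccountEmail: config.CLOUD_TASKS_INVOKER_SERVICE_ACCOUNT_EMAIL,
      },
    },
  };
}
```

No explicit audience is set in this sketch. If the worker rejects the token, setting `oidcToken.audience` to the Cloud Run service URL is a common adjustment.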
If the Cloud Run service already exists, update it with:
# Command
gcloud run services update <SERVICE_NAME> \
--region=<REGION> \
--update-env-vars "<ENV_VAR_1>=<VALUE_1>,<ENV_VAR_2>=<VALUE_2>"
# Example
gcloud run services update pdf-processing-service \
--region=europe-west3 \
--update-env-vars "GCP_PROJECT_ID=drive-pdf-processing-pipeline,CLOUD_TASKS_LOCATION=europe-west3,CLOUD_TASKS_PDF_PROCESSING_QUEUE_NAME=pdf-processing-queue,CLOUD_TASKS_INVOKER_SERVICE_ACCOUNT_EMAIL=pdf-processing-task-invoker@drive-pdf-processing-pipeline.iam.gserviceaccount.com"
Get the worker URL:
# Command
gcloud run services describe <SERVICE_NAME> \
--region=<REGION> \
--format='value(status.url)'
# Example
gcloud run services describe pdf-processing-service \
--region=europe-west3 \
--format='value(status.url)'
Note: The worker URL must include the processing endpoint path, not just the base Cloud Run service URL.
https://<CLOUD_RUN_SERVICE_URL>/<PROCESSING_ENDPOINT>
Redeploy:
gcloud run services update pdf-processing-service \
--region=europe-west3 \
--update-env-vars "PDF_PROCESSING_WORKER_URL=https://pdf-processing-service-m6hxyu3hsa-ey.a.run.app/process-pdf"
Create a dedicated service account that Cloud Scheduler will use to generate the OIDC token when calling the Cloud Run endpoint.
gcloud iam service-accounts create pdf-processing-scheduler-sa \
--display-name="PDF Processing Scheduler Invoker"
Verify and retrieve the email:
gcloud iam service-accounts list \
--filter="email:pdf-processing-scheduler-sa"
Output: pdf-processing-scheduler-sa@drive-pdf-processing-pipeline.iam.gserviceaccount.com
Allow the Scheduler service account to invoke the Cloud Run service.
Run:
gcloud run services add-iam-policy-binding pdf-processing-service \
--region=europe-west3 \
--member="serviceAccount:pdf-processing-scheduler-sa@drive-pdf-processing-pipeline.iam.gserviceaccount.com" \
--role="roles/run.invoker"
Create a Scheduler job that calls the Cloud Run cron endpoint using an OIDC identity token.
The token will be generated for the Scheduler service account and verified by Cloud Run before the request reaches the application.
Required parameters:
- --oidc-service-account-email → service account used to generate the identity token
- --oidc-token-audience → the Cloud Run service URL the token is intended for
Example:
gcloud scheduler jobs create http process-pdfs-cron \
--location=europe-west3 \
--schedule="*/5 6-18 * * 1-5" \
--time-zone="Europe/Amsterdam" \
--uri="https://pdf-processing-service-m6hxyu3hsa-ey.a.run.app/cron/pdf-processing-tasks" \
--http-method=POST \
--oidc-service-account-email="pdf-processing-scheduler-sa@drive-pdf-processing-pipeline.iam.gserviceaccount.com" \
--oidc-token-audience="https://pdf-processing-service-m6hxyu3hsa-ey.a.run.app/"
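
The schedule `*/5 6-18 * * 1-5` in the example reads as: every 5 minutes, during hours 6 through 18, Monday through Friday, in the configured time zone. As a sanity check, the small sketch below expresses that rule directly. It is a hand-written matcher for this one expression, not a general cron parser, and the `LocalTime` type and function name are hypothetical.

```typescript
interface LocalTime {
  minute: number; // 0-59
  hour: number; // 0-23, in the job's time zone
  dayOfWeek: number; // 0 = Sunday ... 6 = Saturday
}

// Returns true when the given local time matches "*/5 6-18 * * 1-5".
function matchesExampleSchedule(t: LocalTime): boolean {
  const everyFiveMinutes = t.minute % 5 === 0; // */5
  const withinHours = t.hour >= 6 && t.hour <= 18; // 6-18, inclusive
  const isWeekday = t.dayOfWeek >= 1 && t.dayOfWeek <= 5; // 1-5 (Mon-Fri)
  return everyFiveMinutes && withinHours && isWeekday;
}
```

Because the hour range is inclusive, the last run on a weekday is at 18:55 rather than 18:00.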