GCP Serverless Data Pipeline Case Study

Background

A private retail company needed a data pipeline to better understand its consumers. The goal was to automate the ingestion and processing of incoming data files, a workflow that had previously required significant manual effort and cost the team multiple hours of work each day.

Architecture

  • Infrastructure provisioned and managed with Terraform (GCS, Cloud Functions, Cloud Run, BigQuery, IAM)
  • Files are uploaded to a Cloud Storage bucket
  • A Cloud Function is triggered on object finalization events and validates each incoming file
  • Validated data is forwarded to a Cloud Run service
  • Processed records are loaded into BigQuery
Figure: Serverless data pipeline architecture diagram

Why These Services?

Cloud Functions was the natural fit for the trigger layer. It responds directly to GCS object finalization events with minimal configuration, and its pay-per-use model kept costs aligned with actual file volume rather than paying for idle infrastructure.
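As a rough illustration of the trigger layer, the sketch below validates a finalize event and hands valid files to the next stage. The client code is private, so this is not the actual implementation: the accepted file type, the specific checks, and the `forward` callable are all assumptions. The `bucket`, `name`, and `size` keys are fields GCS includes in its object event payload.

```python
from typing import Callable

# Hypothetical: the accepted file types are an assumption for illustration.
ALLOWED_SUFFIXES = (".csv",)


def validate_event(event: dict) -> list[str]:
    """Return a list of validation problems; an empty list means the file looks usable."""
    problems = []
    name = event.get("name", "")
    if not event.get("bucket"):
        problems.append("missing bucket")
    if not name.lower().endswith(ALLOWED_SUFFIXES):
        problems.append(f"unexpected file type: {name!r}")
    # GCS event payloads report size as a string, so convert before comparing.
    if int(event.get("size", 0)) == 0:
        problems.append("empty file")
    return problems


def handle_finalize(event: dict, forward: Callable[[dict], None]) -> bool:
    """Validate a finalize event and pass valid files to the processing layer."""
    problems = validate_event(event)
    if problems:
        print(f"rejected {event.get('name')!r}: {', '.join(problems)}")
        return False
    # Forward only a reference to the object, not its contents.
    forward({"bucket": event["bucket"], "name": event["name"]})
    return True
```

In a deployed function, `handle_finalize` would sit behind the Cloud Functions entry point, and `forward` would typically be an authenticated HTTP request to the Cloud Run service. Both are left out here so the logic can be run and tested locally.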

Cloud Run handles the processing layer separately. Rather than doing everything inside the Cloud Function, separating validation from processing gave each service a single responsibility. That made each one easier to test, easier to reason about, and easier to scale independently if processing demands grew.
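To show what a single-responsibility processing step might look like, here is a minimal sketch that turns raw CSV text into typed records and collects per-row errors instead of failing the whole file. The column names `customer_id` and `amount` are hypothetical, since the real schema is not public.

```python
import csv
import io


def process_csv(text: str) -> tuple[list[dict], list[str]]:
    """Turn raw CSV text into typed records, collecting per-row errors."""
    records, errors = [], []
    reader = csv.DictReader(io.StringIO(text))
    # start=2 because line 1 of the file is the header row.
    for line_no, row in enumerate(reader, start=2):
        try:
            records.append({
                "customer_id": row["customer_id"].strip(),  # hypothetical column
                "amount": float(row["amount"]),             # hypothetical column
            })
        except (KeyError, ValueError, TypeError, AttributeError) as exc:
            # Missing columns, short rows, and non-numeric amounts all land here.
            errors.append(f"line {line_no}: {exc!r}")
    return records, errors
```

Because the function takes text and returns plain Python data, it can be unit-tested without any cloud resources, which is part of the appeal of keeping processing separate from the trigger.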

BigQuery was chosen as the destination because it's purpose-built for analytical queries at scale, which is exactly what consumer behavior analysis requires. It also eliminates the overhead of managing a database, keeping the overall system fully serverless.
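One possible shape for the load step is sketched below. It assumes streaming inserts through `insert_rows_json`, a method of the `google-cloud-bigquery` client. The original pipeline may well have used batch load jobs instead, and the batch size is an arbitrary choice.

```python
def load_records(client, table_id: str, records: list[dict], batch_size: int = 500) -> int:
    """Stream records into a BigQuery table in batches; return the number of rows written.

    `client` is expected to behave like google.cloud.bigquery.Client.
    """
    written = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        # insert_rows_json returns a list of per-row errors (empty on success).
        errors = client.insert_rows_json(table_id, batch)
        if errors:
            raise RuntimeError(f"BigQuery rejected rows: {errors}")
        written += len(batch)
    return written
```

With the real library, this would be called with a `bigquery.Client()` instance and a fully qualified table ID in the form `project.dataset.table`. Passing the client in as a parameter keeps the function testable with a stand-in object.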

IAM Challenges

IAM was one of the more time-intensive parts of the project. The main challenge was avoiding the trap of broad default roles. Instead, each service account was scoped to the minimum permissions actually needed. Getting those boundaries right required careful iteration: too narrow and things break silently; too broad and you've undermined the security model. Adding to the complexity, many permission gaps only surface at runtime rather than at deploy time, making the process inherently iterative.
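Since permission gaps tend to show up only at runtime, one way to shorten the loop is to compare the roles each service account actually holds against a written-down plan. The sketch below is illustrative only: the account names and the role assignments in `EXPECTED` are assumptions, not the project's real configuration. The role identifiers themselves are standard GCP roles.

```python
# GCP basic roles, which are usually broader than a single service needs.
BROAD_ROLES = {"roles/owner", "roles/editor", "roles/viewer"}

# Hypothetical least-privilege plan: account names and role choices are assumptions.
EXPECTED = {
    "trigger-fn": {"roles/storage.objectViewer", "roles/run.invoker"},
    "processor": {"roles/bigquery.dataEditor"},
}


def audit(actual: dict[str, set[str]]) -> list[str]:
    """Compare granted roles with the plan and describe every mismatch."""
    findings = []
    for account, roles in sorted(actual.items()):
        expected = EXPECTED.get(account, set())
        for role in sorted(roles & BROAD_ROLES):
            findings.append(f"{account}: broad role {role}")
        for role in sorted(expected - roles):
            findings.append(f"{account}: missing {role}")
        for role in sorted(roles - expected - BROAD_ROLES):
            findings.append(f"{account}: unexpected {role}")
    return findings
```

The `actual` mapping would have to be built from the project's IAM policy, for example from a policy export. That step is omitted here.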

IAM resources were managed as a mix of Terraform and manual configuration. In hindsight, moving more of it into Terraform would make the setup more reproducible, something worth refining in future projects.

Outcome

The pipeline reduced manual data processing by multiple hours per day, freeing the team to focus on analysis rather than data wrangling. The fully serverless architecture means it scales with demand and requires no infrastructure maintenance.

Note: The source code is private as this was built for a client.