Data Engineering on Google Cloud Platform Training
- Created by shambhvi
- Posted on September 8th, 2025
Description:
This official Google Cloud training provides a comprehensive introduction to data engineering on GCP, equipping participants with the skills to design, build, and manage scalable data pipelines for both batch and streaming workloads.
The training covers the entire data lifecycle on GCP — from ingestion and storage to processing, orchestration, governance, and analytics. Learners explore core services, including Cloud Pub/Sub, Dataflow, BigQuery, Dataproc, Cloud Storage, and Data Fusion, while also gaining exposure to advanced capabilities such as Dataflow streaming, BigQuery ML, Cloud Composer, and AI/ML integration.
Through a blend of demos, labs, and case studies, participants will:
- Build and optimize ETL/ELT workflows.
- Design data lakes and warehouses on GCP.
- Implement streaming pipelines for real-time analytics.
- Use orchestration tools (Cloud Composer, Data Fusion) to manage production pipelines.
- Apply governance and security best practices.
- Explore analytics and machine learning options directly within GCP.
Duration: 5 Days
Course Code: BDT 524
Learning Objectives:
After this training, participants will be able to:
- Ingest and process data with Pub/Sub and Dataflow
- Build warehouses with BigQuery
- Perform ML with BigQuery ML
- Secure and monitor pipelines
Audience:
- Cloud architects
- Systems engineers
- Developers working with GCP
Prerequisites:
- Knowledge of SQL
- Experience with data processing tools
- Basic understanding of cloud services
Course Outline:
Module 1: Introduction to Data Engineering on GCP
- Role of a data engineer
- Data engineering challenges
- Data lakes vs data warehouses
- GCP data ecosystem overview
- Case study: Real-world GCP customer pipeline
Module 2: Data Ingestion
- Introduction to Cloud Pub/Sub
- Batch vs streaming ingestion patterns
- Basics of Cloud Dataflow pipelines
- Lab: Publishing and processing streaming data with Pub/Sub & Dataflow
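The decoupling that Pub/Sub provides between producers and consumers can be sketched with the standard library. This is a toy stand-in for the pattern, not the google-cloud-pubsub client; all names and the message shape are illustrative:

```python
import queue
import threading

# A toy "topic": publishers push messages, a subscriber pulls them.
# Cloud Pub/Sub adds durability, fan-out to many subscriptions, and
# at-least-once delivery on top of this basic decoupling.
topic = queue.Queue()

def publisher(n_messages):
    for i in range(n_messages):
        topic.put({"event_id": i, "payload": f"reading-{i}"})
    topic.put(None)  # sentinel: no more messages

def subscriber(results):
    while True:
        msg = topic.get()
        if msg is None:
            break
        results.append(msg["payload"].upper())  # "process" the message

received = []
t_pub = threading.Thread(target=publisher, args=(3,))
t_sub = threading.Thread(target=subscriber, args=(received,))
t_pub.start(); t_sub.start()
t_pub.join(); t_sub.join()
print(received)  # ['READING-0', 'READING-1', 'READING-2']
```

The lab replaces the in-process queue with a real topic and a Dataflow pipeline as the subscriber.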
Module 3: Data Storage
- Cloud Storage (buckets, lifecycle, security)
- Relational storage: Cloud SQL, Spanner
- NoSQL storage: Datastore/Firestore, Bigtable
- Lab: Loading structured/unstructured data into GCP storage services
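Bucket lifecycle rules, covered above, are expressed as a small JSON policy. The sketch below builds one in Python; field names follow the Cloud Storage lifecycle configuration format, while the ages and storage class chosen are illustrative:

```python
import json

# Sketch of a GCS bucket lifecycle configuration: move objects to
# cheaper storage after 30 days, delete them after 365 days.
# Field names follow the Cloud Storage lifecycle JSON format;
# the thresholds are illustrative.
lifecycle = {
    "rule": [
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 30}},
        {"action": {"type": "Delete"},
         "condition": {"age": 365}},
    ]
}
print(json.dumps(lifecycle, indent=2))
```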
Module 4: Data Warehousing with BigQuery
- Introduction to BigQuery as a modern data warehouse
- Loading and querying data
- Partitioning, clustering, schema design
- Lab: Running federated queries on external datasets
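Partitioning and clustering are declared at table-creation time. The sketch below holds a BigQuery DDL statement as a string (dataset, table, and column names are illustrative); in practice it would be submitted through the BigQuery client library or the bq CLI:

```python
# Sketch of a BigQuery DDL statement combining daily partitioning with
# clustering; table and column names are illustrative.
ddl = """
CREATE TABLE mydataset.events (
  event_ts TIMESTAMP,
  customer_id STRING,
  amount NUMERIC
)
PARTITION BY DATE(event_ts)   -- prunes scanned data for date-filtered queries
CLUSTER BY customer_id;       -- co-locates rows sharing a customer_id
"""
print(ddl.strip())
```

Partitioning limits how much data a date-filtered query scans (and bills for); clustering then sorts within each partition so filters on the clustering column read fewer blocks.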
Module 5: Batch Data Processing
- ETL/ELT concepts on GCP
- Quality considerations and transformations
- Using Dataproc for Spark and Hadoop workloads
- Lab: Running Spark jobs on Dataproc
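The ETL shape this module describes, extract, quality-filter, transform, load, can be sketched in plain Python. Record fields and the cleaning rule are illustrative; on GCP the same shape runs distributed as a Spark job on Dataproc:

```python
# Minimal batch ETL sketch: extract raw records, drop records that fail
# a quality check, transform the survivors, load into a "warehouse"
# (here just a list). Field names and the rule are illustrative.
raw_rows = [
    {"id": "1", "temp_c": "21.5"},
    {"id": "2", "temp_c": ""},        # bad record: missing value
    {"id": "3", "temp_c": "19.0"},
]

def transform(row):
    return {"id": int(row["id"]), "temp_f": float(row["temp_c"]) * 9 / 5 + 32}

warehouse = [transform(r) for r in raw_rows if r["temp_c"]]  # drop bad rows
print(warehouse)
```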
Module 6: Streaming Data Processing
- Streaming concepts and challenges
- Dataflow streaming pipelines
- BigQuery streaming inserts
- Cloud Bigtable for high-throughput streaming
- Lab: Real-time analytics dashboard with Pub/Sub + Dataflow + BigQuery
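A core idea behind Dataflow streaming pipelines is windowing: assigning each event to a time window and aggregating per window. The sketch below shows fixed ("tumbling") one-minute windows with illustrative event data; real pipelines add watermarks and late-data handling on top:

```python
from collections import defaultdict

# Sketch of fixed one-minute windows: each event lands in the window
# containing its timestamp, then values are summed per window.
WINDOW_SECONDS = 60

events = [  # (event_time_seconds, value) -- illustrative data
    (5, 10), (42, 7), (61, 3), (119, 1), (130, 4),
]

windows = defaultdict(int)
for ts, value in events:
    window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
    windows[window_start] += value

print(dict(windows))  # {0: 17, 60: 4, 120: 4}
```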
Module 7: Data Orchestration and Automation
- Cloud Data Fusion for visual ETL design
- Cloud Composer (Apache Airflow) for orchestration
- Scheduling workflows and monitoring pipelines
- Infrastructure automation with Deployment Manager and Terraform (intro)
- Lab: Building a pipeline with Data Fusion and Composer
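At its core, an orchestrator like Cloud Composer (Apache Airflow) runs tasks in dependency order. The sketch below shows only that ordering step, using the standard library; task names are illustrative, and a real DAG would attach operators, schedules, and retries to each node:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Dependency graph: each task maps to the set of tasks it waits on.
deps = {
    "transform": {"extract_from_gcs"},
    "load_to_bigquery": {"transform"},
    "notify": {"load_to_bigquery"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # extract_from_gcs first, notify last
```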
Module 8: Monitoring, Security, and Governance
- Cloud Monitoring, Logging, Error Reporting, Tracing
- Data security and access control with IAM and DLP API
- Governance best practices (projects, quotas, billing)
- Lab: Detecting PII with Cloud DLP
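The DLP lab inspects text for PII. The toy scanner below mimics the shape of that inspection; the real DLP API uses managed infoType detectors rather than hand-written regexes, and the patterns here are simplified illustrations, not production-grade:

```python
import re

# Toy PII scanner in the spirit of Cloud DLP inspection. Each "detector"
# is a simplified regex; real DLP infoTypes are far more robust.
DETECTORS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def inspect(text):
    findings = []
    for info_type, pattern in DETECTORS.items():
        for match in pattern.finditer(text):
            findings.append({"infoType": info_type, "quote": match.group()})
    return findings

print(inspect("Contact jane@example.com, SSN 123-45-6789."))
```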
Module 9: Analytics and Machine Learning with BigQuery
- BigQuery ML: SQL-based ML models
- Common supported models
- Performance tuning for large queries
- Lab: Building a predictive model with BigQuery ML
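BigQuery ML models are created and queried with SQL alone. The sketch below holds two such statements as strings; dataset, table, and column names are illustrative, and the strings would be run as ordinary BigQuery queries:

```python
# Sketch of BigQuery ML: a model is trained with CREATE MODEL over a
# SELECT, then queried with ML.PREDICT. All names are illustrative.
create_model_sql = """
CREATE OR REPLACE MODEL mydataset.fare_model
OPTIONS (model_type = 'linear_reg', input_label_cols = ['fare']) AS
SELECT trip_miles, trip_minutes, fare
FROM mydataset.taxi_trips;
"""
predict_sql = """
SELECT * FROM ML.PREDICT(MODEL mydataset.fare_model,
  (SELECT 3.2 AS trip_miles, 14 AS trip_minutes));
"""
print(create_model_sql.strip())
print(predict_sql.strip())
```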
Module 10: AI/ML Services for Data Enrichment
- Pre-built ML APIs (Vision, NLP, Translation) for unstructured data
- Cloud AI Platform Notebooks for data exploration
- AutoML for custom ML models
- Kubeflow pipelines for production ML workflows
- Lab: Classifying text data with Cloud NLP API
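To make the classification lab concrete, the toy below stands in for what the Cloud Natural Language API's text classification does; the real API uses pretrained models rather than the keyword counts sketched here, and the categories and keywords are illustrative:

```python
# Toy text classifier: score each category by keyword overlap and pick
# the best. A stand-in for Cloud NLP classification, not a model.
CATEGORIES = {
    "/Sports": {"match", "goal", "team", "league"},
    "/Finance": {"stock", "market", "earnings", "revenue"},
}

def classify(text):
    tokens = set(text.lower().split())
    scores = {cat: len(tokens & kws) for cat, kws in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(classify("The team scored a late goal to win the match"))  # /Sports
```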
Module 11: Containers and Modern App Integration
- Using GCP with containerized workloads
- Google Kubernetes Engine (GKE) for scalable data apps
- Cloud Run for serverless data services
- Cloud Pub/Sub + Functions for event-driven data processing
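The event-driven pattern in the last bullet can be sketched as a dispatcher: the platform (Cloud Functions, or Cloud Run with Pub/Sub push) invokes a handler per message. Here a small dispatcher plays the platform's role; the event types and payload shape are illustrative:

```python
import json

# Sketch of event-driven processing: each incoming message carries an
# event type, and a registered handler processes it. In production the
# dispatch is done by the platform, not application code.
def handle_upload(event):
    return f"indexed {event['name']}"

def handle_delete(event):
    return f"removed {event['name']}"

HANDLERS = {"OBJECT_FINALIZE": handle_upload, "OBJECT_DELETE": handle_delete}

def dispatch(message_data):
    event = json.loads(message_data)
    return HANDLERS[event["type"]](event)

print(dispatch('{"type": "OBJECT_FINALIZE", "name": "sales.csv"}'))
```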
Training material provided: Yes (Digital format)