Intermediate Apache Spark 3.X
Description:
This intermediate course builds on Spark fundamentals to equip data engineers with advanced techniques for optimizing Spark 3.X applications. Participants will deepen their understanding of RDD and DataFrame internals, master performance tuning, explore advanced SQL operations, and implement complex data processing patterns. Through practical labs and real-world scenarios, attendees will gain expertise in building production-grade Spark solutions that scale efficiently across distributed clusters.
Duration: 2 Days
Course Code: BDT168
Learning Objectives:
After this course, you will be able to:
- Understand Spark 3.X architecture, Catalyst optimizer, and Tungsten memory management for performance gains
- Master advanced RDD operations, custom partitioning, and shuffle optimization techniques
- Leverage DataFrame/Dataset APIs with complex transformations, window functions, and performance best practices
- Implement advanced Spark SQL techniques, including columnar storage, query execution plans, and cost-based optimization
Audience:
- Data Engineers looking to optimize Spark applications for production
- Data Scientists and Analysts seeking advanced data processing and transformation skills
- Developers transitioning from traditional batch processing to distributed analytics
Prerequisites:
- Completion of BDT167 (Beginner Apache Spark) or equivalent hands-on Spark experience
- Working knowledge of Python, Jupyter notebooks, and pandas
- Familiarity with Hadoop/distributed computing concepts and SQL queries
Course Outline:
Day 1: Spark 3.X Architecture & Advanced RDDs
Module 1: Spark 3.X Architecture Deep Dive
- Spark 3.X enhancements: Performance improvements and new features
- Driver-Executor architecture and task scheduling mechanisms
- Catalyst query optimizer: Rule-based and cost-based optimization
- Tungsten project: Off-heap memory management and binary encoding
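To give a flavor of what "rule-based optimization" means in Module 1, here is a minimal sketch in plain Python of one classic rewrite rule, constant folding. It is a toy model, not Catalyst's code: the `Literal`, `Column`, and `Add` classes and the `fold_constants` function are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical expression nodes for a toy query plan (not Spark classes).
@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Column, Add]


def fold_constants(expr: Expr) -> Expr:
    """Rewrite bottom-up, replacing Add(Literal, Literal) with one Literal."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    return expr  # literals and column references are already minimal


# price + (2 + 3) is simplified once, before any row is processed.
plan = Add(Column("price"), Add(Literal(2), Literal(3)))
optimized = fold_constants(plan)
```

The point of the sketch is that a rule inspects the plan tree, not the data, so the saving applies to every row the query later touches.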
Module 2: RDD Internals & Advanced Operations
- RDD DAG (Directed Acyclic Graph) and lineage tracking for fault tolerance
- Partitioning strategies: Hash, range, and custom partitioners
- Shuffle operations: Understanding reduce-side joins, group-by-key, and coalesce
- RDD persistence, caching strategies, and storage levels
- Hands-on: Optimizing RDDs with repartitioning and custom partitions
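The difference between hash and range partitioning covered in Module 2 can be sketched in plain Python, without a cluster. This is an illustration under stated assumptions rather than Spark's implementation: `hash_partition` and `range_partition` are hypothetical helper names, and `zlib.crc32` stands in for whatever hash function the engine actually uses.

```python
import bisect
import zlib
from typing import Sequence


def hash_partition(key: str, num_partitions: int) -> int:
    """Assign a key to a partition by hashing, then taking a modulo."""
    # crc32 is a stand-in for a stable hash (an assumption of this sketch);
    # Python's built-in hash() for strings is randomized per process.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def range_partition(key: str, upper_bounds: Sequence[str]) -> int:
    """Assign a key to the first range whose upper bound is >= key."""
    # upper_bounds must be sorted; keys above the last bound land in the
    # final partition, so N bounds define N + 1 partitions.
    return bisect.bisect_left(upper_bounds, key)


keys = ["apple", "kiwi", "zebra", "apple"]
hash_ids = [hash_partition(k, 4) for k in keys]
range_ids = [range_partition(k, ["g", "p"]) for k in keys]
```

In both schemes equal keys always land in the same partition. Range partitioning additionally keeps partitions in key order, which helps sorted output, while hash partitioning tends to spread keys more evenly when their distribution is unknown.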
Day 2: DataFrames/Datasets & Advanced SQL
Module 3: DataFrame & Dataset APIs Mastery
- DataFrame vs RDD: Trade-offs and when to use each
- Dataset API: Type safety, encoders, and performance characteristics
- Complex transformations: Window functions, lateral views, and pivot operations
- Joins: Inner, outer, semi, anti, and cross joins, with optimization and broadcast patterns
- Aggregations: GroupBy, cube, rollup, and custom aggregation functions (UDAFs)
- Hands-on: Building multi-step ETL pipelines with DataFrames
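The broadcast join pattern listed in Module 3 rests on a simple idea that can be sketched in plain Python: hash the small table once, then stream the large table past it. The function name `broadcast_hash_join` and the sample `orders` and `countries` rows are hypothetical, chosen only for this illustration.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def broadcast_hash_join(
    large_rows: Iterable[Tuple[int, str]],
    small_rows: Iterable[Tuple[int, str]],
) -> Iterator[Tuple[int, str, str]]:
    """Inner-join two (key, value) relations by hashing the small side."""
    # Build phase: the small side becomes an in-memory lookup table. In a
    # cluster, this table is what would be copied to every worker.
    lookup: Dict[int, List[str]] = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)

    # Probe phase: stream the large side once; its rows never need to be
    # regrouped by key, which is the saving over a shuffle-based join.
    for key, value in large_rows:
        for match in lookup.get(key, []):
            yield (key, value, match)


# Illustrative data only.
orders = [(1, "order-a"), (2, "order-b"), (1, "order-c"), (9, "order-d")]
countries = [(1, "DE"), (2, "FR")]
joined = list(broadcast_hash_join(orders, countries))
```

The trade-off is memory: the pattern only pays off when the small side fits comfortably on each worker.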
Module 4: Advanced Spark SQL & Query Optimization
- Catalyst Query Optimizer: Physical plan generation and execution
- EXPLAIN plans: Analyzing query execution and identifying bottlenecks
- Advanced SQL: CTEs, recursive queries, subqueries, and analytical functions
- Columnar storage: Parquet, ORC formats and compression trade-offs
- Query hints and hint-based optimization techniques
- Hands-on: Query tuning, index strategies, and performance benchmarking




