Intermediate Apache Spark 3.X

Created By shambhvi
Posted on June 2nd, 2026

Description:

This intermediate course builds on Spark fundamentals to equip data engineers with advanced techniques for optimizing Spark 3.X applications. Participants will deepen their understanding of RDD and DataFrame internals, master performance tuning, explore advanced SQL operations, and implement complex data processing patterns. Through practical labs and real-world scenarios, attendees will gain expertise in building production-grade Spark solutions that scale efficiently across distributed clusters.

Duration:

2 Days

Course Code: BDT168

Learning Objectives:

After this course, you will be able to:

Understand Spark 3.X architecture, Catalyst optimizer, and Tungsten memory management for performance gains
Master advanced RDD operations, custom partitioning, and shuffle optimization techniques
Leverage DataFrame/Dataset APIs with complex transformations, window functions, and performance best practices
Apply advanced Spark SQL techniques, including columnar storage formats, query execution plan analysis, and cost-based optimization

Audience:

Data Engineers looking to optimize Spark applications for production
Data Scientists and Analysts seeking advanced data processing and transformation skills
Developers transitioning from traditional batch processing to distributed analytics

Prerequisites:

Completion of BDT167 (Beginner Apache Spark) or equivalent hands-on Spark experience
Working knowledge of Python, Jupyter notebooks, and pandas
Familiarity with Hadoop/distributed computing concepts and SQL queries

Course Outline:

Day 1: Spark 3.X Architecture & Advanced RDDs

Module 1: Spark 3.X Architecture Deep Dive

Spark 3.X enhancements: Performance improvements and new features
Driver-Executor architecture and task scheduling mechanisms
Catalyst query optimizer: Rule-based and cost-based optimization
Project Tungsten: Off-heap memory management and binary encoding

Module 2: RDD Internals & Advanced Operations

RDD DAG (Directed Acyclic Graph) and lineage tracking for fault tolerance
Partitioning strategies: Hash, range, and custom partitioners
Shuffle operations: Reduce-side joins, groupByKey, and repartition versus coalesce
RDD persistence, caching strategies, and storage levels
Hands-on: Optimizing RDDs with repartitioning and custom partitions
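
As a rough illustration of the custom-partitioner topic in this module, the sketch below models in plain Python how a partition function maps keys to partition indices. The names `region_partitioner`, `HOT_KEYS`, and `assign_partitions`, along with the sample records, are hypothetical and not taken from the course material. In PySpark, a function of this shape can be passed as the `partitionFunc` argument of `RDD.partitionBy(numPartitions, partitionFunc)`, and Spark applies the modulo by `numPartitions` itself.

```python
from collections import defaultdict
import zlib

NUM_PARTITIONS = 4
# Hypothetical keys assumed to be heavily skewed; each gets a dedicated partition.
HOT_KEYS = {"US": 0, "IN": 1}


def region_partitioner(key: str) -> int:
    """Return a partition index for a key (hypothetical custom partitioner)."""
    if key in HOT_KEYS:
        return HOT_KEYS[key]
    # Use a deterministic hash: the built-in hash() of a str can differ
    # between Python processes, which would scatter equal keys.
    spread = NUM_PARTITIONS - len(HOT_KEYS)
    return len(HOT_KEYS) + zlib.crc32(key.encode("utf-8")) % spread


def assign_partitions(records, partition_func, num_partitions):
    """Group (key, value) records by partition index, as a shuffle would."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition_func(key) % num_partitions].append((key, value))
    return dict(buckets)


records = [("US", 1), ("IN", 2), ("US", 3), ("DE", 4), ("FR", 5)]
partitions = assign_partitions(records, region_partitioner, NUM_PARTITIONS)
for index in sorted(partitions):
    print(index, partitions[index])
```

Reserving partitions for known hot keys is only one possible design; whether it helps depends on the actual key distribution, which should be measured before choosing a partitioner.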

Day 2: DataFrames/Datasets & Advanced SQL

Module 3: DataFrame & Dataset APIs Mastery

DataFrame vs RDD: Trade-offs and when to use each
Dataset API: Type safety, encoders, and performance characteristics
Complex transformations: Window functions, lateral views, and pivot operations
Joins: Inner, outer, semi, anti, and cross joins; optimization and broadcast patterns
Aggregations: GroupBy, cube, rollup, and custom aggregation functions (UDAF)
Hands-on: Building multi-step ETL pipelines with DataFrames
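
The broadcast join pattern listed in this module can be pictured with a small plain-Python model: the small table becomes an in-memory lookup that every task would hold, and the large table is streamed past it without being regrouped by key. This is a conceptual sketch only, not Spark code; the `broadcast_hash_join` helper and the sample rows are hypothetical. In PySpark, the corresponding hint is `pyspark.sql.functions.broadcast(small_df)` wrapped around the smaller side of a `join`.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def broadcast_hash_join(
    large_rows: Iterable[Tuple[int, float]],
    small_rows: Iterable[Tuple[int, str]],
) -> Iterator[Tuple[int, float, str]]:
    """Inner-join large_rows to small_rows on the first field of each row."""
    # Build phase: hash the small side once. In a cluster, this lookup
    # is the piece that would be copied ("broadcast") to every executor.
    lookup: Dict[int, List[str]] = {}
    for key, name in small_rows:
        lookup.setdefault(key, []).append(name)

    # Probe phase: stream the large side in its existing order.
    # No shuffle of the large side is needed because each row is
    # matched locally against the lookup.
    for key, amount in large_rows:
        for name in lookup.get(key, []):
            yield (key, amount, name)


# Hypothetical sample data: (customer_id, amount) and (customer_id, name).
orders = [(1, 20.0), (2, 35.5), (1, 12.25), (3, 8.0)]
customers = [(1, "Asha"), (2, "Ben")]

joined = list(broadcast_hash_join(orders, customers))
print(joined)
```

The trade-off is memory: the pattern only pays off when the small side fits comfortably on each executor, which is why Spark gates automatic broadcasting behind the `spark.sql.autoBroadcastJoinThreshold` setting.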

Module 4: Advanced Spark SQL & Query Optimization

Catalyst query optimizer: Physical plan generation and execution
EXPLAIN plans: Analyzing query execution and identifying bottlenecks
Advanced SQL: CTEs, recursive queries, subqueries, and analytical functions
Columnar storage: Parquet, ORC formats and compression trade-offs
Query hints and hint-based optimization techniques
Hands-on: Query tuning, index strategies, and performance benchmarking
