Description:
This comprehensive 3-day training focuses on Apache Iceberg, a high-performance table format for huge analytic datasets, designed to overcome the limitations of traditional data lakes. Participants will start by understanding the need for modern table formats and exploring Iceberg’s architecture, core features, and capabilities through lectures and hands-on labs.
The training covers schema and partition evolution, snapshot-based isolation, ACID transactions, catalog management, and advanced query optimization. Attendees will gain practical skills in creating, querying, and maintaining Iceberg tables using engines like Spark, Flink, and Trino, as well as integrating Iceberg into Lakehouse architectures.
By the end of the training, participants will be equipped to design and manage Iceberg-based data lakes, implement incremental data processing, and apply governance and security best practices for production workloads.
Duration: 3 Days
Course Code: BDT 513
Learning Objectives:
After this training, participants will be able to:
- Understand the architecture, core features, and advantages of Apache Iceberg over traditional table formats.
- Create, query, and manage Iceberg tables using Spark, Flink, and Trino.
- Implement schema and partition evolution, snapshot-based isolation, and ACID transactions.
- Optimize Iceberg tables for performance and integrate with Lakehouse architectures and BI tools.
This course is ideal for:
- Data engineers working with large-scale data lakes and lakehouse architectures
- Big data developers and architects building pipelines using Spark, Flink, and Trino
- Data platform engineers responsible for table format management and optimization
- Analytics engineers integrating Iceberg with BI tools and reporting platforms
Prerequisites:
- Experience with SQL and big data query engines (e.g., Spark SQL, Flink SQL)
- Basic understanding of data lakes and distributed file systems
- Familiarity with formats like Parquet, ORC, and Avro
- Basic programming experience in Python or Java for hands-on labs
Course Outline:
Module 1: Introduction to Table Formats
- What are table formats?
- Limitations of traditional data lakes (Hive tables, Parquet-only approach)
- Overview of Apache Iceberg and Delta Lake
- Lab: Compare Hive table vs Iceberg table using Spark SQL
Module 2: Apache Iceberg Fundamentals
- Architecture overview
- Table structure (metadata files, manifest lists, manifest files, data files)
- Iceberg table types: Partitioned, Unpartitioned
- Iceberg table spec and schema evolution
- Lab: Create and inspect Iceberg table metadata using Spark
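As a warm-up for the lab, the metadata tree can be modeled in a few lines of plain Python. This is a conceptual sketch only, not the real Iceberg library: table metadata points at snapshots, each snapshot points at a manifest list, which names manifest files, which in turn list data files.

```python
# Illustrative model of Iceberg's metadata tree (not the real library):
# table metadata -> snapshot -> manifest list -> manifests -> data files.

table_metadata = {
    "current-snapshot-id": 2,
    "snapshots": [
        {"snapshot-id": 1, "manifest-list": "snap-1.avro"},
        {"snapshot-id": 2, "manifest-list": "snap-2.avro"},
    ],
}

# Each manifest list names the manifest files of one snapshot;
# each manifest file lists data files (with per-file statistics in Iceberg).
manifest_lists = {
    "snap-1.avro": ["m-0.avro"],
    "snap-2.avro": ["m-0.avro", "m-1.avro"],
}
manifests = {
    "m-0.avro": ["data-0.parquet"],
    "m-1.avro": ["data-1.parquet", "data-2.parquet"],
}

def data_files(meta, snapshot_id):
    """Walk the tree from one snapshot down to its data files."""
    snap = next(s for s in meta["snapshots"] if s["snapshot-id"] == snapshot_id)
    files = []
    for m in manifest_lists[snap["manifest-list"]]:
        files.extend(manifests[m])
    return files

print(data_files(table_metadata, 2))  # all files reachable from snapshot 2
```

Note how snapshot 2 reuses manifest `m-0.avro` unchanged: new commits add metadata on top rather than rewriting what already exists, which is the key to cheap snapshots.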
Module 3: Partitioning and File Management
- Hidden partitioning
- Partition evolution
- Metadata management
- Lab: Demonstrate hidden partitioning and partition evolution with Iceberg
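The idea behind hidden partitioning can be sketched in plain Python (a conceptual model, not the Iceberg API): the table derives a partition value from a source column through a transform, so users insert and query the source column directly and never manage partition columns themselves. Iceberg's real `days()` and `months()` transforms are defined exactly this way in the table spec.

```python
from datetime import datetime, timezone

# Illustrative sketch of hidden partitioning (not the real Iceberg library).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def days_transform(ts):
    """Iceberg's days() transform: whole days since the Unix epoch."""
    return (ts - EPOCH).days

def months_transform(ts):
    """Iceberg's months() transform: whole months since the Unix epoch."""
    return (ts.year - 1970) * 12 + (ts.month - 1)

# Partition evolution: spec 0 (months) governs files written earlier, spec 1
# (days) governs new files; existing files are never rewritten on a spec change.
specs = {0: months_transform, 1: days_transform}

ts = datetime(2024, 5, 17, 13, 45, tzinfo=timezone.utc)
print(specs[0](ts), specs[1](ts))  # the month-granularity and day-granularity values
```

Because each data file records which spec it was written under, a single table can mix both granularities and the query planner applies the right transform per file.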
Module 4: Schema Evolution and Compatibility
- Schema updates (add/drop/rename columns)
- Backward and forward compatibility
- Lab: Perform schema evolution on an Iceberg table and validate read compatibility
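The reason Iceberg schema evolution is safe can be shown with a small sketch (illustrative only, not the real reader): every column carries a permanent field ID, and readers resolve columns by ID rather than by name or position, so renames and reordering never break old data files.

```python
# Illustrative sketch of field-ID-based column resolution (not the real library).

# A data file written under the original schema: values keyed by field ID.
old_data_file = [{1: 101, 2: "alice"}, {1: 102, 2: "bob"}]

# Current table schema after renaming "name" -> "full_name" and adding "email".
current_schema = [
    {"id": 1, "name": "id"},
    {"id": 2, "name": "full_name"},  # renamed, but the field ID is unchanged
    {"id": 3, "name": "email"},      # added; absent in old files, reads as null
]

def read(data_file, schema):
    """Project old rows through the current schema by field ID."""
    return [{col["name"]: row.get(col["id"]) for col in schema}
            for row in data_file]

print(read(old_data_file, current_schema))
```

Dropped columns work the same way in reverse: the field ID simply disappears from the schema, and old files that still contain it are read without that column.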
Module 5: Snapshot-based Isolation
- Snapshots and time travel
- Version rollback
- Lab: Use snapshot IDs and timestamps to time travel and rollback in Iceberg
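Time travel and rollback fall out of the snapshot list almost for free. A conceptual sketch (not the real API): an "as of" query picks the snapshot whose commit time is the latest at or before the requested timestamp, and rollback just moves the current pointer back, since old files are still on disk.

```python
# Illustrative model of snapshot-based time travel (not the real library).

snapshots = [
    {"snapshot-id": 10, "timestamp-ms": 1_000, "files": ["a.parquet"]},
    {"snapshot-id": 11, "timestamp-ms": 2_000, "files": ["a.parquet", "b.parquet"]},
    {"snapshot-id": 12, "timestamp-ms": 3_000, "files": ["c.parquet"]},
]

def snapshot_as_of(ts_ms):
    """Latest snapshot committed at or before ts_ms."""
    eligible = [s for s in snapshots if s["timestamp-ms"] <= ts_ms]
    return max(eligible, key=lambda s: s["timestamp-ms"])

def rollback_to(snapshot_id):
    """Rollback only moves the current pointer; no data is rewritten."""
    return next(s for s in snapshots if s["snapshot-id"] == snapshot_id)

print(snapshot_as_of(2_500)["snapshot-id"])  # the table as it stood at t=2500
```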
Module 6: Read and Write Engines
- Integrations with Spark, Flink, Trino, Presto, Hive
- Reading and writing Iceberg tables using Spark and Flink
- Lab: Read and write Iceberg tables using PySpark and Apache Flink
Module 7: ACID Transactions in Iceberg
- How Iceberg supports ACID semantics
- Snapshot isolation and atomic commits
- Lab: Simulate concurrent writes and validate isolation in Iceberg
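The atomic-commit mechanism behind Iceberg's ACID guarantees is optimistic concurrency, which can be sketched as a compare-and-swap on the table's current-snapshot pointer (a toy model, not the real commit protocol): a writer's commit succeeds only if the pointer still references the snapshot the writer based its work on.

```python
# Illustrative sketch of optimistic-concurrency commits (not the real library).

class Table:
    def __init__(self):
        self.current_snapshot_id = 0

    def commit(self, expected_id, new_id):
        """Compare-and-swap: succeed only if nobody committed in between."""
        if self.current_snapshot_id != expected_id:
            return False  # conflict: the writer must refresh and retry
        self.current_snapshot_id = new_id
        return True

t = Table()
base = t.current_snapshot_id               # both writers start from snapshot 0
assert t.commit(base, 1)                   # writer A wins
assert not t.commit(base, 2)               # writer B conflicts
assert t.commit(t.current_snapshot_id, 2)  # B retries on top of A's commit
```

In practice the swap happens in the catalog (e.g., a metastore lock or a REST conditional update), which is why the catalog is central to Iceberg's transactional guarantees.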
Module 8: Catalog Support
- Catalog types: HadoopCatalog, HiveCatalog, RESTCatalog, AWS Glue
- Setting up catalogs
- Lab: Setup and use HiveCatalog and RESTCatalog in Spark
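For orientation before the lab, a minimal Spark configuration registering Iceberg catalogs might look like the following sketch. The catalog names (`my_catalog`, `rest_cat`) and hostnames are placeholders to adapt to your environment.

```properties
# Register an Iceberg catalog named "my_catalog" backed by a Hive Metastore.
spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.type=hive
spark.sql.catalog.my_catalog.uri=thrift://metastore-host:9083

# A second catalog backed by a REST catalog service instead.
spark.sql.catalog.rest_cat=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.rest_cat.type=rest
spark.sql.catalog.rest_cat.uri=http://rest-catalog-host:8181
```

With this in place, tables are addressed as `my_catalog.db.table` in Spark SQL.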
Module 9: Table Maintenance Operations
- Expire snapshots
- Rewrite manifests
- Rewrite data files (compaction)
- Lab: Run expireSnapshots and rewriteDataFiles, and observe the resulting file changes
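What snapshot expiration actually does can be sketched in plain Python (a conceptual model, not the real `expireSnapshots` action, which also always protects the current snapshot): drop snapshots older than a retention cutoff, then any data file referenced only by expired snapshots becomes eligible for physical deletion.

```python
# Illustrative sketch of snapshot expiration (not the real Iceberg API).

snapshots = [
    {"id": 1, "timestamp-ms": 1_000, "files": {"a.parquet"}},
    {"id": 2, "timestamp-ms": 2_000, "files": {"a.parquet", "b.parquet"}},
    {"id": 3, "timestamp-ms": 3_000, "files": {"b.parquet", "c.parquet"}},
]

def expire_snapshots(snaps, older_than_ms):
    """Drop old snapshots and report files no live snapshot references."""
    kept = [s for s in snaps if s["timestamp-ms"] >= older_than_ms]
    live = set().union(*(s["files"] for s in kept))
    dead = set().union(*(s["files"] for s in snaps)) - live
    return kept, dead  # "dead" files can now be physically removed

kept, dead = expire_snapshots(snapshots, 2_500)
print([s["id"] for s in kept], sorted(dead))  # snapshot 3 survives; a.parquet is orphaned
```

This also shows why expiration limits time travel: once a snapshot is expired, you can no longer query the table as of that point.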
Module 10: Advanced Query Optimization
- Predicate pushdown
- File pruning and partition pruning
- Vectorized reading
- Lab: Analyze query plans and measure performance gains with pruning
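File pruning rests on the per-file column statistics Iceberg keeps in its manifests. A simplified sketch of the planner's decision (illustrative only, field names are made up for the example): given min/max bounds per file, a predicate like `x > 50` can skip any file whose maximum never reaches the threshold, before a single byte of data is read.

```python
# Illustrative sketch of stats-based file pruning (not the real query planner).

files = [
    {"path": "f1.parquet", "min_x": 0,  "max_x": 40},
    {"path": "f2.parquet", "min_x": 35, "max_x": 80},
    {"path": "f3.parquet", "min_x": 90, "max_x": 120},
]

def prune_greater_than(files, threshold):
    """Keep only files that could contain rows satisfying x > threshold."""
    return [f["path"] for f in files if f["max_x"] > threshold]

print(prune_greater_than(files, 50))  # f1 is skipped without being opened
```

Partition pruning works the same way one level up, using the partition values recorded per file, which is why well-chosen partitioning and pruning together dominate query performance.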
Module 11: Iceberg with Parquet, ORC, and Avro Formats
- Supported formats and file structure
- Compatibility matrix
- Lab: Read Iceberg tables in Parquet and ORC formats using Spark
Module 12: Integration with Lakehouse & BI Tools
- Iceberg in Lakehouse architectures
- Connecting BI and reporting tools to Iceberg tables
Module 13: Incremental Data Processing
- Using snapshot_id, timestamp, changed_data_files
- Lab: Implement incremental ETL with Apache Flink using Iceberg
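The core of incremental processing can be sketched as follows (a conceptual model, not the real incremental-scan API): an incremental read returns only the data files appended between a start snapshot (exclusive) and an end snapshot (inclusive), so each ETL run processes just the delta since its last checkpoint.

```python
# Illustrative sketch of an Iceberg incremental read (not the real API).

snapshots = [
    {"id": 1, "added-files": ["a.parquet"]},
    {"id": 2, "added-files": ["b.parquet"]},
    {"id": 3, "added-files": ["c.parquet", "d.parquet"]},
]

def incremental_files(start_id, end_id):
    """Files appended after start_id, up to and including end_id."""
    ids = [s["id"] for s in snapshots]
    lo, hi = ids.index(start_id), ids.index(end_id)
    files = []
    for s in snapshots[lo + 1 : hi + 1]:
        files.extend(s["added-files"])
    return files

print(incremental_files(1, 3))  # only what changed since snapshot 1
```

An incremental pipeline then just persists the last processed snapshot ID as its checkpoint and passes it as the next run's start.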
Module 14: Governance, Security, and Best Practices
- Data retention and GDPR compliance
- Table access controls
- Audit logs and lineage
- Iceberg best practices in production
Training material provided: Yes (Digital format)




