Description:
This comprehensive 3-day training focuses on Apache Iceberg, a high-performance table format for huge analytic datasets, designed to overcome the limitations of traditional data lakes. Participants will start by understanding the need for modern table formats and exploring Iceberg’s architecture, core features, and capabilities through lectures and hands-on labs.
The training covers schema and partition evolution, snapshot-based isolation, ACID transactions, catalog management, and advanced query optimization. Attendees will gain practical skills in creating, querying, and maintaining Iceberg tables using engines like Spark, Flink, and Trino, as well as integrating Iceberg into Lakehouse architectures.
By the end of the training, participants will be equipped to design and manage Iceberg-based data lakes, implement incremental data processing, and apply governance and security best practices for production workloads.
Duration: 3 Days
Course Code: BDT 513
Learning Objectives:
After this training, participants will be able to:
- Understand the architecture, core features, and advantages of Apache Iceberg over traditional table formats.
- Create, query, and manage Iceberg tables using Spark, Flink, and Trino.
- Implement schema and partition evolution, snapshot-based isolation, and ACID transactions.
- Optimize Iceberg tables for performance and integrate with Lakehouse architectures and BI tools.
This course is ideal for:
- Data engineers working with large-scale data lakes and lakehouse architectures
- Big data developers and architects building pipelines using Spark, Flink, and Trino
- Data platform engineers responsible for table format management and optimization
- Analytics engineers integrating Iceberg with BI tools and reporting platforms
Prerequisites:
- Experience with SQL and big data query engines (e.g., Spark SQL, Flink SQL)
- Basic understanding of data lakes and distributed file systems
- Familiarity with formats like Parquet, ORC, and Avro
- Basic programming experience in Python or Java for hands-on labs
Course Outline:
Module 1: Introduction to Table Formats
- What are table formats?
- Limitations of traditional data lakes (Hive tables, Parquet-only approach)
- Overview of Apache Iceberg and Delta Lake
- Lab: Compare Hive table vs Iceberg table using Spark SQL
Module 2: Apache Iceberg Fundamentals
- Architecture overview
- Table structure (metadata files, manifest lists, manifest files, data files)
- Iceberg table types: Partitioned, Unpartitioned
- Iceberg table spec and schema evolution
- Lab: Create and inspect Iceberg table metadata using Spark
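As a warm-up for the lab, the metadata tree can be modeled in a few lines of plain Python. This is a conceptual sketch only, not the real Iceberg library: table metadata points at snapshots, each snapshot points at a manifest list, which names manifest files, which in turn list data files.

```python
# Illustrative model of Iceberg's metadata tree (not the real library):
# table metadata -> snapshot -> manifest list -> manifests -> data files.

table_metadata = {
    "current-snapshot-id": 2,
    "snapshots": [
        {"snapshot-id": 1, "manifest-list": "snap-1.avro"},
        {"snapshot-id": 2, "manifest-list": "snap-2.avro"},
    ],
}

# Each manifest list names the manifest files of one snapshot;
# each manifest file lists data files (with per-file statistics in Iceberg).
manifest_lists = {
    "snap-1.avro": ["m-0.avro"],
    "snap-2.avro": ["m-0.avro", "m-1.avro"],
}
manifests = {
    "m-0.avro": ["data-0.parquet"],
    "m-1.avro": ["data-1.parquet", "data-2.parquet"],
}

def data_files(meta, snapshot_id):
    """Walk the tree from one snapshot down to its data files."""
    snap = next(s for s in meta["snapshots"] if s["snapshot-id"] == snapshot_id)
    files = []
    for m in manifest_lists[snap["manifest-list"]]:
        files.extend(manifests[m])
    return files

print(data_files(table_metadata, 2))  # all files reachable from snapshot 2
```

Note how snapshot 2 reuses manifest `m-0.avro` unchanged: new commits add metadata on top rather than rewriting what already exists, which is the key to cheap snapshots.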
Module 3: Partitioning and File Management
- Hidden partitioning
- Partition evolution
- Metadata management
- Lab: Demonstrate hidden partitioning and partition evolution with Iceberg
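The idea behind hidden partitioning can be sketched in plain Python (a conceptual model, not the Iceberg API): the table derives a partition value from a source column through a transform, so users insert and query the source column directly and never manage partition columns themselves. Iceberg's real `days()` and `months()` transforms are defined exactly this way in the table spec.

```python
from datetime import datetime, timezone

# Illustrative sketch of hidden partitioning (not the real Iceberg library).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def days_transform(ts):
    """Iceberg's days() transform: whole days since the Unix epoch."""
    return (ts - EPOCH).days

def months_transform(ts):
    """Iceberg's months() transform: whole months since the Unix epoch."""
    return (ts.year - 1970) * 12 + (ts.month - 1)

# Partition evolution: spec 0 (months) governs files written earlier, spec 1
# (days) governs new files; existing files are never rewritten on a spec change.
specs = {0: months_transform, 1: days_transform}

ts = datetime(2024, 5, 17, 13, 45, tzinfo=timezone.utc)
print(specs[0](ts), specs[1](ts))  # the month-granularity and day-granularity values
```

Because each data file records which spec it was written under, a single table can mix both granularities and the query planner applies the right transform per file.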
Module 4: Schema Evolution and Compatibility
- Schema updates (add/drop/rename columns)
- Backward and forward compatibility
- Lab: Perform schema evolution on an Iceberg table and validate read compatibility
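The reason Iceberg schema evolution is safe can be shown with a small sketch (illustrative only, not the real reader): every column carries a permanent field ID, and readers resolve columns by ID rather than by name or position, so renames and reordering never break old data files.

```python
# Illustrative sketch of field-ID-based column resolution (not the real library).

# A data file written under the original schema: values keyed by field ID.
old_data_file = [{1: 101, 2: "alice"}, {1: 102, 2: "bob"}]

# Current table schema after renaming "name" -> "full_name" and adding "email".
current_schema = [
    {"id": 1, "name": "id"},
    {"id": 2, "name": "full_name"},  # renamed, but the field ID is unchanged
    {"id": 3, "name": "email"},      # added; absent in old files, reads as null
]

def read(data_file, schema):
    """Project old rows through the current schema by field ID."""
    return [{col["name"]: row.get(col["id"]) for col in schema}
            for row in data_file]

print(read(old_data_file, current_schema))
```

Dropped columns work the same way in reverse: the field ID simply disappears from the schema, and old files that still contain it are read without that column.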
Module 5: Snapshot-based Isolation
- Snapshots and time travel
- Version rollback
- Lab: Use snapshot IDs and timestamps to time travel and rollback in Iceberg
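Time travel and rollback fall out of the snapshot list almost for free. A conceptual sketch (not the real API): an "as of" query picks the snapshot whose commit time is the latest at or before the requested timestamp, and rollback just moves the current pointer back, since old files are still on disk.

```python
# Illustrative model of snapshot-based time travel (not the real library).

snapshots = [
    {"snapshot-id": 10, "timestamp-ms": 1_000, "files": ["a.parquet"]},
    {"snapshot-id": 11, "timestamp-ms": 2_000, "files": ["a.parquet", "b.parquet"]},
    {"snapshot-id": 12, "timestamp-ms": 3_000, "files": ["c.parquet"]},
]

def snapshot_as_of(ts_ms):
    """Latest snapshot committed at or before ts_ms."""
    eligible = [s for s in snapshots if s["timestamp-ms"] <= ts_ms]
    return max(eligible, key=lambda s: s["timestamp-ms"])

def rollback_to(snapshot_id):
    """Rollback only moves the current pointer; no data is rewritten."""
    return next(s for s in snapshots if s["snapshot-id"] == snapshot_id)

print(snapshot_as_of(2_500)["snapshot-id"])  # the table as it stood at t=2500
```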
Module 6: Read and Write Engines
- Integrations with Spark, Flink, Trino, Presto, Hive
- Reading and writing Iceberg tables using Spark and Flink
- Lab: Read and write Iceberg tables using PySpark and Apache Flink
Module 7: ACID Transactions in Iceberg
- How Iceberg supports ACID semantics
- Snapshot isolation and atomic commits
- Lab: Simulate concurrent writes and validate isolation in Iceberg
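The atomic-commit mechanism behind Iceberg's ACID guarantees is optimistic concurrency, which can be sketched as a compare-and-swap on the table's current-snapshot pointer (a toy model, not the real commit protocol): a writer's commit succeeds only if the pointer still references the snapshot the writer based its work on.

```python
# Illustrative sketch of optimistic-concurrency commits (not the real library).

class Table:
    def __init__(self):
        self.current_snapshot_id = 0

    def commit(self, expected_id, new_id):
        """Compare-and-swap: succeed only if nobody committed in between."""
        if self.current_snapshot_id != expected_id:
            return False  # conflict: the writer must refresh and retry
        self.current_snapshot_id = new_id
        return True

t = Table()
base = t.current_snapshot_id               # both writers start from snapshot 0
assert t.commit(base, 1)                   # writer A wins
assert not t.commit(base, 2)               # writer B conflicts
assert t.commit(t.current_snapshot_id, 2)  # B retries on top of A's commit
```

In practice the swap happens in the catalog (e.g., a metastore lock or a REST conditional update), which is why the catalog is central to Iceberg's transactional guarantees.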
Module 8: Catalog Support
- Catalog types: HadoopCatalog, HiveCatalog, RESTCatalog, AWS Glue
- Setting up catalogs
- Lab: Setup and use HiveCatalog and RESTCatalog in Spark
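For orientation before the lab, a minimal Spark configuration registering Iceberg catalogs might look like the following sketch. The catalog names (`my_catalog`, `rest_cat`) and hostnames are placeholders to adapt to your environment.

```properties
# Register an Iceberg catalog named "my_catalog" backed by a Hive Metastore.
spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.type=hive
spark.sql.catalog.my_catalog.uri=thrift://metastore-host:9083

# A second catalog backed by a REST catalog service instead.
spark.sql.catalog.rest_cat=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.rest_cat.type=rest
spark.sql.catalog.rest_cat.uri=http://rest-catalog-host:8181
```

With this in place, tables are addressed as `my_catalog.db.table` in Spark SQL.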
Module 9: Table Maintenance Operations
- Expire snapshots
- Rewrite manifests
- Rewrite data files (compaction)
- Lab: Run expireSnapshots and rewriteDataFiles, and observe the resulting file changes
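What snapshot expiration actually does can be sketched in plain Python (a conceptual model, not the real `expireSnapshots` action, which also always protects the current snapshot): drop snapshots older than a retention cutoff, then any data file referenced only by expired snapshots becomes eligible for physical deletion.

```python
# Illustrative sketch of snapshot expiration (not the real Iceberg API).

snapshots = [
    {"id": 1, "timestamp-ms": 1_000, "files": {"a.parquet"}},
    {"id": 2, "timestamp-ms": 2_000, "files": {"a.parquet", "b.parquet"}},
    {"id": 3, "timestamp-ms": 3_000, "files": {"b.parquet", "c.parquet"}},
]

def expire_snapshots(snaps, older_than_ms):
    """Drop old snapshots and report files no live snapshot references."""
    kept = [s for s in snaps if s["timestamp-ms"] >= older_than_ms]
    live = set().union(*(s["files"] for s in kept))
    dead = set().union(*(s["files"] for s in snaps)) - live
    return kept, dead  # "dead" files can now be physically removed

kept, dead = expire_snapshots(snapshots, 2_500)
print([s["id"] for s in kept], sorted(dead))  # snapshot 3 survives; a.parquet is orphaned
```

This also shows why expiration limits time travel: once a snapshot is expired, you can no longer query the table as of that point.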
Module 10: Advanced Query Optimization
- Predicate pushdown
- File pruning and partition pruning
- Vectorized reading
- Lab: Analyze query plans and measure performance gains with pruning
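File pruning rests on the per-file column statistics Iceberg keeps in its manifests. A simplified sketch of the planner's decision (illustrative only, field names are made up for the example): given min/max bounds per file, a predicate like `x > 50` can skip any file whose maximum never reaches the threshold, before a single byte of data is read.

```python
# Illustrative sketch of stats-based file pruning (not the real query planner).

files = [
    {"path": "f1.parquet", "min_x": 0,  "max_x": 40},
    {"path": "f2.parquet", "min_x": 35, "max_x": 80},
    {"path": "f3.parquet", "min_x": 90, "max_x": 120},
]

def prune_greater_than(files, threshold):
    """Keep only files that could contain rows satisfying x > threshold."""
    return [f["path"] for f in files if f["max_x"] > threshold]

print(prune_greater_than(files, 50))  # f1 is skipped without being opened
```

Partition pruning works the same way one level up, using the partition values recorded per file, which is why well-chosen partitioning and pruning together dominate query performance.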
Module 11: Iceberg with Parquet, ORC, and Avro Formats
- Supported formats and file structure
- Compatibility matrix
- Lab: Read Iceberg tables in Parquet and ORC formats using Spark
Module 12: Integration with Lakehouse & BI Tools
- Iceberg in Lakehouse architectures
- Connecting BI and reporting tools to Iceberg tables
Module 13: Incremental Data Processing
- Using snapshot_id, timestamp, changed_data_files
- Lab: Implement incremental ETL with Apache Flink using Iceberg
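The core of incremental processing can be sketched as follows (a conceptual model, not the real incremental-scan API): an incremental read returns only the data files appended between a start snapshot (exclusive) and an end snapshot (inclusive), so each ETL run processes just the delta since its last checkpoint.

```python
# Illustrative sketch of an Iceberg incremental read (not the real API).

snapshots = [
    {"id": 1, "added-files": ["a.parquet"]},
    {"id": 2, "added-files": ["b.parquet"]},
    {"id": 3, "added-files": ["c.parquet", "d.parquet"]},
]

def incremental_files(start_id, end_id):
    """Files appended after start_id, up to and including end_id."""
    ids = [s["id"] for s in snapshots]
    lo, hi = ids.index(start_id), ids.index(end_id)
    files = []
    for s in snapshots[lo + 1 : hi + 1]:
        files.extend(s["added-files"])
    return files

print(incremental_files(1, 3))  # only what changed since snapshot 1
```

An incremental pipeline then just persists the last processed snapshot ID as its checkpoint and passes it as the next run's start.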
Module 14: Governance, Security, and Best Practices
- Data retention and GDPR compliance
- Table access controls
- Audit logs and lineage
- Iceberg best practices in production
Training material provided: Yes (Digital format)




