S25 Data Engineering Program

Course Instructor Sumit Mittal

Course Curriculum

20 Subjects

Welcome Session

1 Learning Material

Welcome Session

Welcome Session

Video
00:40:28

Week 1: Big Data - The Big Picture

1 Exercise • 10 Learning Materials

Big Data Fundamentals

Introduction to Big Data

Video
00:32:40

Hadoop Overview

Video
00:30:37

Cloud and its Advantages

Video
00:20:29

Understanding Apache Spark at a High-Level

Video
00:15:08

Database Vs Data Warehouse Vs Data Lake

Video
00:24:40

Big Data - The Big Picture

Video
00:39:45

Hadoop Distributed File System - HDFS Architecture

Video
00:17:34

Role of Data Engineers

Video
00:22:35

Running Notes & Summary Document

Week 1 Running Notes

PDF

Week 1 Summary Document

PDF

Week-1 Quiz

Big Data - The Big Picture

Exercise

Week 2: Distributed Storage and Data Lake

1 Exercise • 14 Learning Materials

Datalake Storage & Getting Started with the Labs

HDFS Overview

Video
00:38:15

About Practice Labs

Video
00:31:05

Linux Commands - cd & ls

Video
00:45:53

More Linux Commands

Video
00:46:27

HDFS Commands

Video
00:43:53

More About Practice Lab

Video
00:17:04

HDFS Vs Cloud Data Lake

Video
00:15:03

Getting Started with Distributed Processing

Video
00:42:55

Running Notes & Summary Document

Week 2 Running Notes

PDF

Week 2 Summary Document

PDF

Steps to SSH in Windows

PDF

FAQs

Week 2 : Frequently Asked Questions

PDF

Week 2: Quiz

Distributed Storage and Data Lake

Exercise

Week 2: Assignment

Weekly Assignment

PDF

Assignment Solution

PDF

Week 3: Distributed Processing Fundamentals

1 Exercise • 18 Learning Materials

Downloadable Resources

Distributed Processing - Program Jar Files (Executing MapReduce Jar )

PDF

Week 3 : Reference Jupyter Notebooks

ZIP

Downloadable : Practice Datasets

More About Distributed Processing

Distributed Processing Continuation

Video
00:18:13

Changing the Number of Reducers

Video
00:30:23

Use Case 1 - Sensor Data Example

Video
00:31:51

Real-time Industry Use Case of Distributed Processing

Video
00:09:58

Distributed Computing Demo

Video
00:37:24

Distributed Computing Demo Continuation

Video
00:34:48

Apache Spark

Getting to Know Apache Spark

Video
00:31:51

Apache Spark Vs Databricks

Video
00:23:58

Spark Execution Plan

Video
00:33:22

Word Count Example in Apache Spark

Video
00:40:05

Running Notes & Summary Document

Week 3 Running Notes

PDF

Summary Document

PDF

FAQs

Week 3 : Frequently Asked Questions

PDF

Week 3: Quiz

Distributed Processing Fundamentals

Exercise

Week 3: Assignment

Weekly Assignment

PDF

Assignment Solution

PDF

Week 4: Apache Spark Core APIs

1 Exercise • 17 Learning Materials

Downloadable Resources

Datasets

Week 4 : Reference Jupyter Notebooks

ZIP

Distributed Processing - Pyspark In-Depth

Python Basics

Video
00:25:44

Spark Usecase 1 - Orders Data

Video
00:45:50

Spark Core APIs - RDD

Video
00:26:51

More on Spark Parallelize

Video
00:27:15

More Spark Transformations

Video
00:25:22

Spark DAG Visualization | reduce Vs reduceByKey

Video
00:14:37

reduceByKey Vs groupByKey

Video
00:45:38

Spark Join

Video
00:24:09

Broadcast Joins

Video
00:25:09

Repartition Vs Coalesce

Video
00:21:26

Cache

Video
00:21:07

Running Notes & Summary Document

Running Notes

PDF

Summary Document

PDF

Week 4: Quiz

Apache Spark Core APIs

Exercise

Week 4: Assignment

Weekly Assignment

PDF

Assignment Solution

PDF

Week 5: Spark APIs - Dataframes & Spark SQL

1 Exercise • 15 Learning Materials

Downloadable Resources

Week 5 : Reference Jupyter Notebooks

ZIP

Higher Level APIs in Apache Spark

Higher Level APIs - Dataframes & Spark SQL

Video
00:22:43

Understanding Dataframes

Video
00:26:56

More about Dataframe Reader

Video
00:34:13

Database Creation : Setting Configuration Properties

PDF

Introducing Spark SQL

Video
00:26:33

Spark SQL - Managed vs External Tables

Video
00:24:02

Use Case - Dataframes & Spark SQL

Video
00:38:03

Getting Started with Spark Optimizations

Video
00:05:36

Spark Executors

Video
00:22:55

A Little More on Spark Executors

Video
00:18:25

Running Notes & Summary Document

Running Notes

PDF

Summary Document

PDF

Week 5: Quiz

Spark APIs - Dataframes & Spark SQL

Exercise

Week 5: Assignment

Weekly Assignment

PDF

Assignment Solution

PDF

Week 6: Spark Dataframe Transformations

1 Exercise • 15 Learning Materials

Downloadable Resources

Week 6 : Reference Jupyter Notebooks

ZIP

Spark Transformations

Recap of the Concepts Learnt

Video
00:15:26

Schema Enforcement

Video
00:32:31

How to Deal with Date Type

Video
00:31:19

Read Modes

Video
00:14:30

Different ways of Dataframe Creation

Video
00:40:25

Converting RDD to Dataframe

Video
00:18:28

Nested Schema

Video
00:19:56

Dataframe Transformations | select Vs selectExpr

Video
00:30:51

Removal of Duplicates from Dataframe

Video
00:11:50

Spark Session in Detail

Video
00:52:13

Running Notes & Summary Document

Week 6 Running Notes

PDF

Summary Document

PDF

Week 6: Quiz

Spark Dataframe Transformations

Exercise

Week 6: Assignment

Weekly Assignment

PDF

Assignment Solution

PDF

Week 7: Apache Spark Caching In-Depth

1 Exercise • 18 Learning Materials

Downloadable Resources

Accessing Spark UI

Week 7 : Reference Jupyter Notebooks

ZIP

Distributed Processing - Pyspark In-Depth

Spark UI

Video
00:16:48

Accessing Spark UI

Steps for Accessing Spark UI

PDF

Understanding Cache & Persist

Video
00:23:25

Cache Practicals

Video
00:57:17

More on Cache

Video
00:39:57

Parsed | Analyzed | Optimized Logical Plan

Video
00:09:12

Cache - InMemory Table Cache | Node Local & Process Local

Video
00:24:49

Caching Spark Table

Video
00:34:46

Spark Catalog, Managed & External Tables

Video
00:34:02

Cache Performance

Video
00:30:00

Understanding Persist

Video
00:24:58

Running Notes & Summary Document

Week 7 Running Notes

PDF

Summary Document

PDF

Week 7: Quiz

Apache Spark Caching In-Depth

Exercise

Week 7: Assignment

Weekly Assignment

PDF

Assignment Solution

PDF

Week 8: Apache Spark Architecture

1 Exercise • 16 Learning Materials

Downloadable Resources

Week 8 : Reference Jupyter Notebooks

ZIP

Spark Architecture | Aggregate & Window Functions

Spark On YARN Architecture

Video
00:37:54

More on Spark Architecture

Video
00:26:33

Ways of Accessing Columns in PySpark

Video
00:24:56

Simple Aggregate Functions

Video
00:26:04

Grouping Aggregations

Video
00:15:50

Windowing Aggregations

Video
00:16:06

Understanding Rank, Dense Rank & Row Number

Video
00:37:23

Understanding Lead and Lag Functions

Video
00:14:56

Analyzing Log Files

Video
00:21:40

Continuation of Analyzing Log Files

Video
00:18:58

Optimization - Pivot Table

Video
00:12:40

Running Notes & Summary Document

Week 8 Running Notes

PDF

Summary Document

PDF

Week 8: Quiz

Apache Spark Architecture

Exercise

Week 8: Assignment

Week 8 Assignment

PDF

Week 8 Assignment Solutions

Week 8 : Top Best Solutions

ZIP

Week 9: Apache Spark Internals

1 Exercise • 19 Learning Materials

Downloadable Resources

Week 9 : Reference Jupyter Notebooks

ZIP

Program File - Spark Submit

ZIP

Spark Internals & Dataframe Writer API

Dataframe Writer API

Video
00:40:15

PartitionBy Clause

Video
00:26:32

More on Partition's Performance Benefits

Video
00:37:22

Understanding Bucketing and its Performance Gains

Video
00:26:18

Accessing Spark UI using Databricks Community Edition

Video
00:24:54

Spark Internals

Video
00:30:04

Continuation of Spark Internals

Video
00:39:00

Disabling Dynamic Executor Allocation

Video
00:21:27

Spark-Submit at a High-Level

Video
00:42:04

Evaluating the Initial Partitions in a Dataframe

Video
00:23:41

Calculating the Initial Number of Partitions for a Single Non-Splittable File

Video
00:31:05

Calculating the Initial Number of Partitions for Multiple Files

Video
00:28:09

Running Notes & Summary Document

Week 9 Running Notes

PDF

Summary Document

PDF

Week 9: Quiz

Apache Spark Internals

Exercise

Week 9: Assignment

Weekly Assignment

PDF

Assignment Solution

PDF

Assignment Solution Program files

ZIP

Week 10: Apache Spark Optimizations 1

1 Exercise • 15 Learning Materials

Downloadable Resources

Week 10 : Reference Jupyter Notebooks

ZIP

PySpark Optimizations

Internals of groupBy

Video
00:25:37

Normal Join Vs Broadcast Join

Video
00:18:31

More on Broadcast Join

Video
00:16:25

Different types of Joins

Video
00:16:14

Partition Skew

Video
00:25:39

3 Use-cases & Better Optimizations

Video
00:19:07

Adaptive Query Execution (AQE)

Video
00:39:01

More on Join Types

Video
00:26:10

Join Strategies

Video
00:37:43

Optimizing Join of 2 Large Tables - Bucketing

Video
00:29:53

Running Notes & Summary Document

Week 10 Running Notes

PDF

Summary Document

PDF

Week 10: Quiz

Apache Spark Optimizations 1

Exercise

Week 10: Assignment

Weekly Assignment

PDF

Assignment Solution

PDF

Saturday Live Session Recordings

16 Learning Materials

Recordings of Saturday Live Sessions

8th Feb 2025 - Success Story Session

Video
00:49:01

15th Feb 2025 - Getting Started with AI for Data Engineers

Video
00:54:29

22nd Feb 2025 - Career Guidance Session

Video
00:54:28

Success Story of Nehal Jaiswal - 1st March 2025

Video
00:48:12

Live Session by Sumit Sir - Career Guidance - 15th March 2025

Video
01:23:13

Live Session by Sumit Sir - Career Guidance - 22nd March 2025

Video
01:28:02

How to Get More Interview Calls (in India and Outside India) - 29th March 2025

Video
00:57:49

Gen AI webinar 5th April

Video
01:03:16

Live Session by Sumit Sir - Career Guidance - 12th April 2025

Video
01:21:25

Saturday Live Session - LinkedIn Profile Building - 26th April 2025

Video
01:01:27

Saturday Live Session by Sumit Sir - Career Counseling - 3rd May 2025

Video
01:22:39

Gen AI for Data Engineers - Live session by Sumit Sir - 10th May 2025

Video
01:05:28

Live Session with Sumit Sir - Career Counseling - 24th May 2025

Video
01:08:47

Live Session - Success Story of Shubharee who Joined Microsoft - 31st May

Video
00:48:20

Live session by Sumit Sir - Career Counseling Session - 7th June 2025

Video
01:02:14

Live Session by Sumit Sir - Data Engineering + Gen AI (A Super Solid Combination) - 14th June 2025

Video
01:08:46

Week 11: Apache Spark Optimizations 2

1 Exercise • 16 Learning Materials

Downloadable Resources

Reference Jupyter Notebooks

ZIP

Continuation of PySpark Optimizations

Memory Management in Apache Spark

Video
01:08:15

Sort Aggregate Vs Hash Aggregate

Video
00:20:32

Continuation of Sort Vs Hash Aggregate

Video
00:20:27

Apache Spark Logical and Physical Plans

Video
00:32:01

Catalyst Optimizer

Video
00:10:44

File Formats and Compression Techniques

Row Based and Column Based File Formats

Video
00:22:56

Continuation of File Formats

Video
00:37:37

Specialized File Formats - Parquet | Avro | ORC

Video
00:26:36

Continuation of Specialized File Formats

Video
00:50:46

Schema Evolution

Video
00:27:19

Compression Techniques

Video
00:35:17

Running Notes & Summary Document

Running Notes

PDF

Summary Document

PDF

Week 11: Quiz

Apache Spark Optimizations 2

Exercise

Week 11: Assignment

Weekly Assignment

PDF

Assignment Solution

ZIP

Week 12: Apache Spark Project Part 1

1 Exercise • 15 Learning Materials

Downloadable Resources : Project Datasets and Code

Datasets for the Lending Club Project

Lending Club Project Code : Jupyter Notebooks

ZIP

Lending Club Data Dictionary

ZIP

Lending Club Project

Key Elements of a Project

Video
00:19:13

Example Problem Statements

Video
00:39:37

Agile Methodology

Video
00:28:32

More on Agile Methodology

Video
00:29:22

Lending Club Project Introduction

Video
00:43:32

Lending Club Project Continuation

Video
00:42:56

Lending Club Project: Data Cleaning Session 1

Video
00:59:29

Lending Club Project: Data Cleaning Session 2

Video
00:39:42

Lending Club Project: Data Cleaning Session 3

Video
00:35:13

Lending Club Project: Data Cleaning Session 4

Video
00:18:29

Running Notes & Summary Document

Week 12 Running Notes

PDF

Summary Document

PDF

Quiz

Apache Spark Project Part 1

Exercise

Week 13: Apache Spark Project Part 2

1 Exercise • 22 Learning Materials

Downloadable Resources

Datasets for the Lending Club Project

Lending Club Project Code : Jupyter Notebooks

ZIP

Practice Datasets - Project Structuring

ZIP

Project Structuring - Code

PDF

Pytest Code File

PDF

Log4j Code

ZIP

Lending Club Project Part 2

Understanding Loan Score Calculation Logic

Video
00:25:23

Processing - Permanent Table Creation on Cleaned Data

Video
00:24:39

Access Patterns - Quick Access (Old Data) & Slow Access (New Data)

Video
00:23:16

Criteria for Loan Score Calculation

Video
00:18:09

Identifying the Bad Data

Video
00:30:33

Segregating the Identified Bad Data from the Normal Data

Video
00:15:21

Processing and Storing the Final Loan Score

Video
00:39:49

Project Structuring

Virtual Environments | Python | Pip

Video
00:43:02

Project Structuring & Execution

Video
00:55:34

Virtual Environment Setup (Python)

PDF

Unit Testing

Identifying and Writing Unit Test Cases | Fixture | Teardown - Yield

Video
00:38:52

Fixture to Check if the Calculated Results match Expected Results | Markers

Video
00:28:07

Parameterized Generic Test Cases

Video
00:15:50

Logging Level in Apache Spark

Implementing Logging Level in Apache Spark

Video
00:27:49

Running Notes & Summary Document

Week 13 Running Notes

PDF

Summary Document

PDF

Quiz

Apache Spark Project Part 2

Exercise

Week 14: Git, GitHub & CICD

1 Exercise • 21 Learning Materials

Downloadable Resources

Retail Analysis Project Dataset

ZIP

Git & GitHub

Overview of Git & GitHub

Video
00:27:40

Git Installation | GitHub Account Creation | Visual Studio Installation

Video
00:19:17

Git & Visual Studio Code Installation

PDF

Important Git Commands | Scenario 1 - Project Creation through GitHub (Remote)

Video
00:33:59

Scenario 2 - Project Creation through Git (Local)

Video
00:23:07

Branches in Git

Video
00:26:43

Reverting back to the Previous Code Base

Video
00:21:04

Scenario 3 : Working on Existing Project (Fork Command)

Video
00:40:05

Git Stash Command

Video
00:19:09

Handling Merge Conflicts

Video
00:12:26

Continuous Integration & Continuous Deployment - CICD

Branching Strategy & Stages of CICD

Video
00:25:35

Deploying and Configuring Jenkins Server

Video
00:16:07

Branching Structure

Video
00:11:43

Jenkins Configurations

Video
00:16:03

Creating Sample Jenkins Pipeline

Video
00:16:54

Build | Test | Package & Deploy - Jenkins Pipeline for Project

Video
00:12:39

Continuation of Jenkins Pipeline for Project

Video
00:11:01

Document: Deploying & Configuring Jenkins Server

PDF

Running Notes & Summary Document

Week 14 Running Notes

PDF

Summary Document

PDF

Week 14: Quiz

Git, GitHub & CICD

Exercise

Week 15: Apache Hive

1 Exercise • 15 Learning Materials

Apache Hive

Introduction to Apache Hive

Video
00:27:36

Apache Hive Practical

Video
00:23:45

Apache Hive Tables - Managed & External

Video
00:36:04

Apache Hive External Table

Video
00:28:03

Hive Optimizations - Partitioning

Video
00:30:05

Hive Optimizations - Bucketing

Video
00:19:15

Hive Join Optimizations

Video
00:37:57

Hive Join Optimizations Continuation

Video
00:22:23

Hive Transactional Tables | ACID Properties

Video
00:17:31

Hive Transactional Table Practical

Video
00:28:27

Insert Only Transactional Table with ACID Properties

Video
00:20:43

Spark-Hive Integration

Video
00:13:02

Hive MSCK Repair

Video
00:17:14

Running Notes & Summary Document

Running Notes

PDF

Summary Document

PDF

Week 15: Quiz

Apache Hive

Exercise

Milestone 1 Interview Questions - Spark

9 Learning Materials

Pyspark Interview Questions

Estimating Cluster Resources

Video
01:06:24

Managerial Round Interview Questions

Video
00:28:33

Important Interview Questions on Spark Architecture

Video
00:06:33

Spark Project Related Interview Questions

Video
00:27:17

8 Important PySpark Coding Questions

Video
00:28:38

8 Important Pyspark Coding Questions Continuation

Video
00:31:42

20 Most Asked Interview Questions

Video
00:28:56

20 Most Asked Interview Questions Continuation

Video
00:28:37

Running Notes

Milestone 1: Running Notes

PDF

Resume, LinkedIn & Naukri Profile Building

8 Learning Materials

Downloadable Resources

Interview Preparation Downloadable Resources

ZIP

Resume Building

Resume Building - Session 1

Video
00:31:32

Resume Building - Session 2

Video
00:35:40

Resume Building - Session 3

Video
00:21:52

Sample Resumes

Sample Resume 1

PDF

Sample Resume 2

PDF

LinkedIn and Naukri Profile Building

Naukri Profile Building

Video
00:23:57

LinkedIn Profile Building

Video
00:39:17

Azure Cloud Fundamentals

1 Exercise • 15 Learning Materials

Azure Fundamentals

Important Note

Cloud Fundamentals

Video
00:22:30

Characteristics Of Cloud

Video
00:29:04

Categories of Cloud Services

Video
00:15:29

Cloud Deployment Models

Video
00:12:09

Steps for Azure Account Creation

Video
00:50:08

Azure Global Infrastructure : Datacenter | Region

Video
00:36:25

Azure Global Infrastructure : Availability Zones

Video
00:25:29

Azure Global Infrastructure : Region Pair

Video
00:25:31

Azure Virtual Machine

Video
00:28:39

Azure Virtual Machine Scale Set

Video
00:17:58

Azure Availability Set Vs Availability Zone

Video
00:15:53

Running Notes & Summary Document

Running Notes

PDF

Summary Document

PDF

Quiz

Azure Cloud Fundamentals

Exercise

Assignment

Weekly Assignment

Course Instructor

Sumit Mittal

128 Courses   •   378205 Students