Big Data Engineering

Categories: Data Science

About Course

This is the last module in the Data Science Track.

This course will teach you the core concepts, processes, and tools of data engineering.
You will learn about the modern data ecosystem and the roles of data engineers, data scientists, and data analysts.
The data engineering ecosystem includes data pipelines, data repositories, and data integration platforms.
You will learn about each of these components and about Big Data and Big Data processing tools.

Here is a breakdown of what you will cover in this course:

Week 1: Big Data Introduction
Week 2: Hadoop, HDFS, and MapReduce Fundamentals
Week 3: Apache Spark and PySpark
Week 4: Hive and Kafka
Week 5: Capstone and Conclusion

Acknowledgements and Attribution

This course draws on (1) IBM’s Introduction to Data Engineering, taught by Rav Ahuja, and (2) the Spark and PySpark Udemy course by Jose Portilla. We have added videos to help make the harder concepts easier to understand. Notes by Chris Aloo and the Zindua technical team are also shared on Slack or in the course resources.

Course Content

1.0 Introduction to Big Data
Here you will learn the following concepts:
- Data engineer role, technologies, and responsibility
- Evolution of Big Data, Examples, Characteristics, Challenges
- Big Data Characteristics, Sources, OLTP and OLAP, Operational vs Analytical Big Data
- Scaling
- Types of Databases: RDBMS, Data Lakes, Data Warehouse

Foundations of Big Data (05:22)

Roles in Data Engineering (05:36)

Skills in Data Engineering (08:20)

The Modern Data Ecosystem (04:51)

1.1 Storing Big Data – Data Formats

1.2 Databases

1.3 Big Data Characteristics

1.4 Week 1 ETL Project

2.0 Moving Big Data – Data Pipelines

2.1 Data Streaming – Apache Kafka

Introduction to Streaming Data (09:37)

Kafka 101 – Introduction (05:09)

Kafka 101 – Your First Kafka Application (09:41)

Kafka 101 – Topics (05:53)

Kafka 101 – Partitioning (04:23)

Kafka 101 – Partitioning (Hands On) (03:05)

Kafka 101 – Brokers (01:45)

Kafka 101 – Replication (02:23)

Kafka 101 – Producers (03:09)

Kafka 101 – Consumers (06:51)

Kafka 101 – Consumers (Hands On) (03:50)

2.2 Workflow Orchestration – Apache Airflow

2.3 Week 2 Project

3.0 Processing Big Data 1 – Introduction to the Hadoop Ecosystem

3.1 HDFS Architecture and Features

3.2 MapReduce

3.3 Week 3 Project

4.0 Processing Big Data 2 – Fundamentals of Spark and PySpark

4.1 Entry Points, RDDs and DataFrames

4.2 SparkSQL

4.3 PySpark – Data Transformations

4.4 Optimising Spark

4.5 Test Your Understanding

Course Content

1.1 Storing Big Data – Data Formats

Data, Data Formats, and Metadata

Data Sources

1.2 Databases

Databases

Relational Vs Non-Relational Databases

Data Warehouses

Data Lake

Extract, Transform, and Load (ETL) Process

Bringing It All Together: Data Marts, Data Lakes, ETL, and Data Pipelines

1.3 Big Data Characteristics

Transactional vs. Analytical Workloads

OLAP vs OLTP

1.4 Week 1 ETL Project

ETL – A simple Extract, Transform & Load Pipeline

Submit Your Project

2.0 Moving Big Data – Data Pipelines

Data Pipelines Explained

Batch Vs Streaming Pipelines

Lambda Vs Kappa Architectural Patterns

2.1 Data Streaming – Apache Kafka

2.2 Workflow Orchestration – Apache Airflow

Deep Dive into Apache Airflow

Build An Airflow Data Pipeline To Download Podcasts [Beginner Data Engineer Tutorial]

2.3 Week 2 Project

Apache Airflow Pipeline Project Submission

Submit Your Project

3.0 Processing Big Data 1 – Introduction to the Hadoop Ecosystem

What is Hadoop

Lesson Objectives

Understand the problem Hadoop solves

Understand the Hadoop approach

Understand the Hadoop Project

Hadoop Related Open Source Projects

Learning objectives | Running Hadoop on Desktop/Laptop

Install Hortonworks HDP Sandbox

3.1 HDFS Architecture and Features

Learning objectives | The Hadoop Distributed File System

Understand HDFS basics

Use HDFS tools

HDFS administration

3.2 MapReduce

Learning objectives | Hadoop MapReduce

Understand the MapReduce paradigm

Understand how MapReduce works

3.3 Week 3 Project

Hadoop Project

Submit Your Project

4.0 Processing Big Data 2 – Fundamentals of Spark and PySpark

Introduction

The Spark Architecture

Installation using Anaconda

Running PySpark on Google Colab

4.1 Entry Points, RDDs and DataFrames

Spark Entry Points – SparkSession vs SparkContext

RDDs – Resilient Distributed Datasets

PySpark DataFrame from RDD

Create PySpark DataFrame

PySpark DataFrame Functions

4.2 SparkSQL

PySpark – createOrReplaceTempView

PySpark – createGlobalTempView

PySpark SQL