Apache Spark for Java Developers

English | MP4 | AVC 1280×720 | AAC 44KHz 2ch | 21.5 Hours | 11.4 GB

Process Big Data using RDDs, DataFrames, SparkSQL and Machine Learning – and real-time streaming with Kafka!

Get started with the amazing Apache Spark parallel computing framework – this course is designed especially for Java Developers.

If you’re new to Data Science and want to find out how massive datasets are processed in parallel, then the Java API for Spark is a great way to get started, fast.

All of the fundamentals you need to understand the main operations you can perform in Spark Core, SparkSQL and DataFrames are covered in detail, with easy-to-follow examples. You’ll be able to follow along with all of the examples and run them on your own local development computer.
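
To give a flavour of the style, here is a minimal Spark Core word count in Java – a sketch of the kind of job you’ll build early in the course. The class name and the in-memory sample data are illustrative, and the local[*] master setting is what lets it run on a single development machine:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;

public class WordCount {
    public static void main(String[] args) {
        // local[*] runs Spark inside this JVM, using all available cores
        SparkConf conf = new SparkConf().setAppName("wordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<String> input = Arrays.asList("the quick brown fox", "the lazy dog");

            // split each line into words, pair each word with 1, then reduce by key
            JavaRDD<String> words = sc.parallelize(input)
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator());
            JavaPairRDD<String, Long> counts = words
                    .mapToPair(word -> new Tuple2<>(word, 1L))
                    .reduceByKey(Long::sum);

            counts.collect().forEach(System.out::println);
        }
    }
}
```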

Included with the course is a module covering SparkML, an exciting addition to Spark that allows you to apply Machine Learning models to your Big Data! No mathematical experience is necessary!
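
As an indication of what the SparkML API looks like, here is a small linear regression sketch. The CSV path and column names are assumptions for illustration rather than the course’s exact dataset – any file of numeric features with a price column would fit the same shape:

```java
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HousePriceModel {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("linearRegression").master("local[*]").getOrCreate();

        // hypothetical CSV and column names - substitute your own dataset
        Dataset<Row> csv = spark.read()
                .option("header", true).option("inferSchema", true)
                .csv("src/main/resources/house_prices.csv");

        // SparkML expects all input features gathered into a single vector column
        Dataset<Row> input = new VectorAssembler()
                .setInputCols(new String[]{"sqft_living", "bedrooms"})
                .setOutputCol("features")
                .transform(csv)
                .withColumnRenamed("price", "label");

        // hold back 20% of the rows to test the model against unseen data
        Dataset<Row>[] splits = input.randomSplit(new double[]{0.8, 0.2});
        LinearRegressionModel model = new LinearRegression().fit(splits[0]);

        System.out.println("R2 on test data: " + model.evaluate(splits[1]).r2());
        spark.close();
    }
}
```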

And finally, there’s a full 3 hour module covering Spark Streaming, where you will get hands-on experience of integrating Spark with Apache Kafka to handle real-time big data streams. We use both the DStream and the Structured Streaming APIs.
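
To show the shape of the Structured Streaming half of that module, here is a minimal sketch that reads from a Kafka topic and maintains a running count of messages per key. The broker address and topic name are assumptions, and the spark-sql-kafka connector needs to be on the classpath:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQueryException;

import java.util.concurrent.TimeoutException;

import static org.apache.spark.sql.functions.col;

public class KafkaViewCounts {
    public static void main(String[] args) throws TimeoutException, StreamingQueryException {
        SparkSession spark = SparkSession.builder()
                .appName("structuredStreaming").master("local[*]").getOrCreate();

        // broker address and topic name are assumptions for this sketch
        Dataset<Row> messages = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "viewrecords")
                .load();

        // maintain a running count of messages per key as events arrive
        Dataset<Row> counts = messages
                .select(col("key").cast("string"))
                .groupBy(col("key"))
                .count();

        counts.writeStream()
                .format("console")
                .outputMode(OutputMode.Update())
                .start()
                .awaitTermination();
    }
}
```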

Optionally, if you have an AWS account, you’ll see how to deploy your work to a live EMR (Elastic Map Reduce) hardware cluster. If you’re not familiar with AWS you can skip these videos, although they’re still worth watching even if you don’t follow along with the coding.

You’ll go deep into the internals of Spark and find out how it optimizes your execution plans. We’ll compare the performance of RDDs vs SparkSQL, and you’ll learn about the major performance pitfalls – avoiding them could save a lot of money on live projects.
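
A quick preview of that kind of analysis: Spark’s explain() prints the physical execution plan the optimizer has chosen, which is how you can see, for example, whether an aggregation runs as a HashAggregate or a SortAggregate. The CSV path and column name below are placeholders:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class ExplainDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("explainDemo").master("local[*]").getOrCreate();

        // any CSV with a "level" column will do - this path is a placeholder
        Dataset<Row> logs = spark.read()
                .option("header", true)
                .csv("src/main/resources/logging.csv");

        Dataset<Row> counts = logs.groupBy(col("level")).count();

        // prints the physical plan chosen by the optimizer -
        // e.g. whether the aggregation is a HashAggregate or a SortAggregate
        counts.explain();
        counts.show();

        spark.close();
    }
}
```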

Throughout the course, you’ll get plenty of practice with Java 8 lambdas – a great way to learn functional-style Java if you’re new to it.
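
If you haven’t used lambdas before, the difference is easy to see with Spark’s own function interfaces – the two filters below are equivalent, and the lambda form is the style used throughout the course (the sample data is made up):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

import java.util.Arrays;

public class LambdaStyles {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("lambdas").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.parallelize(Arrays.asList(
                    "ERROR: disk full", "INFO: started", "ERROR: timeout"));

            // pre-Java 8 style: an anonymous inner class implementing Spark's Function
            JavaRDD<String> errorsVerbose = lines.filter(new Function<String, Boolean>() {
                @Override
                public Boolean call(String line) {
                    return line.startsWith("ERROR");
                }
            });

            // Java 8 style: the same filter as a lambda
            JavaRDD<String> errors = lines.filter(line -> line.startsWith("ERROR"));

            System.out.println(errorsVerbose.count() + " = " + errors.count()); // 2 = 2
        }
    }
}
```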

NOTE: Java 8 is required for the course. Spark does not currently support Java 9+ (we will update the course when this changes), and Java 8 is needed for the lambda syntax.

What you’ll learn

  • Use functional style Java to define complex data processing jobs
  • Learn the differences between the RDD and DataFrame APIs
  • Use an SQL style syntax to produce reports against Big Data sets (see the sketch after this list)
  • Use Machine Learning Algorithms with Big Data and SparkML
  • Connect Spark to Apache Kafka to process Streams of Big Data
  • See how Structured Streaming can be used to build pipelines with Kafka
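
As a sketch of that SQL-style syntax, registering a DataFrame as a temporary view lets you run full SQL against it – the file name and columns here are placeholders:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TempViewReport {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("sqlReport").master("local[*]").getOrCreate();

        // placeholder file - any CSV with "subject" and "score" columns works
        Dataset<Row> students = spark.read()
                .option("header", true).option("inferSchema", true)
                .csv("src/main/resources/students.csv");

        // a temporary view makes the data queryable with full SQL syntax
        students.createOrReplaceTempView("students");

        Dataset<Row> report = spark.sql(
                "select subject, max(score) as max_score from students group by subject");
        report.show();

        spark.close();
    }
}
```
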
Table of Contents

Introduction
1 Welcome
2 Downloading the Code
3 Module 1 – Introduction
4 Spark Architecture and RDDs

Getting Started
5 Warning – Java 9/10/11 is not supported by Spark
6 Installing Spark

Reduces on RDDs
7 Reduces on RDDs

Mapping and Outputting
8 Mapping Operations
9 Outputting Results to the Console
10 Counting Big Data Items
11 If you’ve had a NotSerializableException in Spark

Tuples
12 RDDs of Objects
13 Tuples and RDDs

PairRDDs
14 Overview of PairRDDs
15 Building a PairRDD
16 Coding a ReduceByKey
17 Using the Fluent API
18 Grouping By Key

FlatMaps and Filters
19 FlatMaps
20 Filters

Reading from Disk
21 Reading from Disk

Keyword Ranking Practical
22 Practical Requirements
23 Worked Solution
24 Worked Solution (continued) with Sorting

Sorts and Coalesce
25 Why do sorts not work with foreach in Spark
26 Why Coalesce is the Wrong Solution
27 What is Coalesce used for in Spark

Deploying to AWS EMR (Optional)
28 How to start an EMR Spark Cluster
29 Packing a Spark Jar for EMR
30 Running a Spark Job on EMR
31 Understanding the Job Progress Output
32 Calculating EMR costs and Terminating the cluster

Joins
33 Inner Joins
34 Left Outer Joins and Optionals
35 Right Outer Joins
36 Full Joins and Cartesians

Big Data Big Exercise
37 Introducing the Requirements
38 Warmup
39 Main Exercise Requirements
40 Walkthrough – Step 2
41 Walkthrough – Step 3
42 Walkthrough – Step 4
43 Walkthrough – Step 5
44 Walkthrough – Step 6
45 Walkthrough – Step 7
46 Walkthrough – Step 8
47 Walkthrough – Step 9, adding titles and using the Big Data file

RDD Performance
48 Transformations and Actions
49 The DAG and SparkUI
50 Narrow vs Wide Transformations
51 Shuffles
52 Dealing with Key Skews
53 Avoiding groupByKey and using map-side-reduces instead
54 Caching and Persistence

Module 2 – Chapter 1 SparkSQL Introduction
55 Code for SQL/DataFrames Section
56 Introducing SparkSQL

SparkSQL Getting Started
57 SparkSQL Getting Started

Datasets
58 Dataset Basics
59 Filters using Expressions
60 Filters using Lambdas
61 Filters using Columns

The Full SQL Syntax
62 Using a Spark Temporary View for SQL

In Memory Data
63 In Memory Data

Groupings and Aggregations
64 Groupings and Aggregations

Date Formatting
65 Date Formatting

Multiple Groupings
66 Multiple Groupings

Ordering
67 Ordering

DataFrames API
68 SQL vs DataFrames
69 DataFrame Grouping

Pivot Tables
70 How does a Pivot Table work
71 Coding a Pivot Table in Spark

More Aggregations
72 How to use the agg method in Spark

Practical Exercise
73 Building a Pivot Table with Multiple Aggregations

User Defined Functions
74 How to use a Lambda to write a UDF in Spark
75 Using more than one input parameter in a Spark UDF
76 Using a UDF in Spark SQL

SparkSQL Performance
77 Understand the SparkUI for SparkSQL
78 How does SQL and DataFrame performance compare
79 Update – Setting spark.sql.shuffle.partitions

HashAggregation
80 Explaining Execution Plans
81 How does HashAggregation work
82 How can I force Spark to use HashAggregation
83 SQL vs DataFrames Performance Results

SparkSQL Performance vs RDDs
84 SparkSQL Performance vs RDDs

Module 3 – SparkML for Machine Learning
85 Welcome to Module 3
86 What is Machine Learning
87 Coming up in this Module – and introducing Kaggle
88 Supervised vs Unsupervised Learning
89 The Model Building Process

Linear Regression Models
90 Introducing Linear Regression
91 Beginning Coding Linear Regressions
92 Assembling a Vector of Features
93 Model Fitting

Training Data
94 Training vs Test and Holdout Data
95 Using data from Kaggle
96 Practical Walkthrough
97 Splitting Training Data with Random Splits
98 Assessing Model Accuracy with R2 and RMSE

Model Fitting Parameters
99 Setting Linear Regression Parameters
100 Training, Test and Holdout Data

Feature Selection
101 Describing the Features
102 Correlation of Features
103 Identifying and Eliminating Duplicated Features
104 Data Preparation

Non-Numeric Data
105 Using OneHotEncoding
106 Understanding Vectors

Pipelines
107 Pipelines

Case Study
108 Requirements
109 Case Study – Walkthrough Part 1
110 Case Study – Walkthrough Part 2

Logistic Regression
111 Code for chapters 9-12
112 True/False Negatives and Positives
113 Coding a Logistic Regression

Decision Trees
114 Overview of Decision Trees
115 Building the Model
116 Interpreting a Decision Tree
117 Random Forests

K Means Clustering
118 K Means Clustering

Recommender Systems
119 Overview and Matrix Factorisation
120 Building the Model

Module 4 – Spark Streaming and Structured Streaming with Kafka
121 Welcome to Module 4 – Spark Streaming
122 Streaming Chapter 1 – Introduction to Streaming
123 DStreams
124 Starting a Streaming Job
125 Streaming Transformations
126 Streaming Aggregations
127 SparkUI for Streaming Jobs
128 Windowing Batches

Streaming Chapter 2 – Streaming with Apache Kafka
129 Overview of Kafka
130 Installing Kafka
131 Using a Kafka Event Simulator
132 Integrating Kafka with Spark
133 Using KafkaUtils to access a DStream
134 Writing a Kafka Aggregation
135 Adding a Window
136 Adding a Slide Interval

Streaming Chapter 3 – Structured Streaming
137 Structured Streaming Overview
138 Data Sinks
139 Structured Streaming Output Modes
140 Windows and Watermarks
141 What is the Batch Size in Structured Streaming
142 Kafka Structured Streaming Pipelines