Apache Spark: Tips, Tricks, & Techniques

English | MP4 | AVC 1920×1080 | AAC 48 kHz 2ch | 2h 26m | 556 MB

Discover proven techniques to create testable, immutable, and easily parallelizable Spark jobs

Apache Spark has been around for quite some time, but do you really know how to get the most out of it? This course aims to give you new possibilities: you will explore many aspects of Spark, some you may never have heard of and some you never knew existed.

In this course you’ll learn practical, proven techniques for improving specific aspects of programming and administration in Apache Spark. You will work through seven sections, each addressing a different aspect of Spark through five specific techniques, with clear, hands-on instructions for carrying out the corresponding Apache Spark tasks. The techniques are demonstrated using practical examples and best practices.

By the end of this course, you will have learned exciting tips, best practices, and techniques for Apache Spark, and you will be able to perform tasks and extract results from your data much faster and with ease.

This step-by-step, fast-paced guide takes a practical approach to the techniques you can use to improve your testing time, execution speed, and results. It will take your skills to the next level and get you up and running with Spark.

What You Will Learn

  • Compose Spark jobs from actions and transformations
  • Create highly concurrent Spark programs by leveraging immutability
  • Avoid the most expensive operation in the Spark API: shuffle
  • Save data for further processing by picking the proper data format
  • Parallelize keyed data; learn how to use Spark’s key/value API
  • Redesign your jobs to use reduceByKey instead of groupBy
  • Create robust processing pipelines by testing Apache Spark jobs
  • Solve recurring problems by leveraging the GraphX API

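The reduceByKey-versus-groupBy advice above rests on map-side combining: reduceByKey merges values per key inside each partition before anything crosses the network, while groupByKey ships every record. Here is a plain-Python sketch of that difference (a model of the semantics only, not Spark code; partitions are simulated as lists):

```python
from collections import defaultdict

def reduce_by_key_sketch(partitions, reduce_fn):
    """Model of reduceByKey: combine per partition first (map-side
    combine), then merge the small per-partition results."""
    locally_combined = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = reduce_fn(acc[key], value) if key in acc else value
        locally_combined.append(acc)
    # Only the already-combined pairs are "shuffled" and merged.
    merged, shuffled_records = {}, 0
    for acc in locally_combined:
        for key, value in acc.items():
            shuffled_records += 1
            merged[key] = reduce_fn(merged[key], value) if key in merged else value
    return merged, shuffled_records

def group_by_key_sketch(partitions):
    """Model of groupByKey: every record crosses the shuffle boundary."""
    groups, shuffled_records = defaultdict(list), 0
    for part in partitions:
        for key, value in part:
            shuffled_records += 1
            groups[key].append(value)
    return groups, shuffled_records

partitions = [[("a", 1), ("a", 2), ("b", 3)], [("a", 4), ("b", 5)]]
sums, shuffled = reduce_by_key_sketch(partitions, lambda x, y: x + y)
print(sums, shuffled)      # {'a': 7, 'b': 8} with only 4 shuffled records
_, grouped_shuffled = group_by_key_sketch(partitions)
print(grouped_shuffled)    # 5 records shuffled: one per input pair
```

The gap between the two shuffle counts grows with the number of values per key, which is why the course recommends the reduceByKey-style redesign.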
Table of Contents

Transformations and Actions
1 The Course Overview
2 Using Spark Transformations to Defer Computations to a Later Time
3 Avoiding Transformations
4 Using reduce and reduceByKey to Calculate Results
5 Performing Actions That Trigger Computations
6 Reusing the Same RDD for Different Actions
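The theme of this first section, deferring computation with transformations and triggering it with actions, can be approximated in plain Python with generators: nothing below runs until a terminal operation (the analogue of a Spark action) consumes the pipeline. This is only an analogy for the lazy-evaluation model, not Spark code:

```python
evaluated = []

def trace(x):
    # Record when an element is actually computed.
    evaluated.append(x)
    return x * 2

numbers = range(5)
# "Transformations": building the pipeline runs nothing yet.
pipeline = (trace(x) for x in numbers if x % 2 == 0)
print(evaluated)   # [] - nothing evaluated so far
# "Action": consuming the generator triggers the deferred work.
result = list(pipeline)
print(result)      # [0, 4, 8]
print(evaluated)   # [0, 2, 4]
```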

Immutable Design
7 Delve into the Spark RDD Parent/Child Chain
8 Using RDD in an Immutable Way
9 Using DataFrame Operations to Transform It
10 Immutability in the Highly Concurrent Environment
11 Using Dataset API in an Immutable Way
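The immutable-design section revolves around one rule that holds outside Spark as well: a transformation returns a new dataset and never mutates its input, which is what makes highly concurrent execution safe. A minimal plain-Python illustration of that pattern (tuples standing in for an RDD's immutable records; the names are illustrative, not from the course):

```python
def add_bonus(records, bonus):
    """Return a NEW tuple of records; the input is never modified,
    so concurrent readers of `records` stay safe."""
    return tuple((name, salary + bonus) for name, salary in records)

salaries = (("ann", 100), ("bob", 200))
raised = add_bonus(salaries, 50)
print(raised)    # (('ann', 150), ('bob', 250))
print(salaries)  # original unchanged: (('ann', 100), ('bob', 200))
```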

Avoid Shuffle and Reduce Operational Expenses
12 Detecting a Shuffle in Processing
13 Testing Operations That Cause Shuffle in Apache Spark
14 Changing Design of Jobs with Wide Dependencies
15 Using keyBy() Operations to Reduce Shuffle
16 Using Custom Partitioner to Reduce Shuffle
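Lessons 15 and 16 hinge on one idea: if records with the same key already sit in the same partition, a subsequent key-based operation needs no shuffle. Below is a plain-Python sketch of hash partitioning, the same scheme as Spark's default HashPartitioner, though Spark itself is not involved here:

```python
def hash_partition(records, num_partitions, key_fn):
    """Assign each record to a partition by hashing its key, so all
    records sharing a key land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        index = hash(key_fn(record)) % num_partitions
        partitions[index].append(record)
    return partitions

records = [("us", 1), ("pl", 2), ("us", 3), ("de", 4), ("pl", 5)]
parts = hash_partition(records, 4, key_fn=lambda r: r[0])
# Every key's records are co-located, whichever partition they hash to.
for key in ("us", "pl", "de"):
    homes = {i for i, p in enumerate(parts) for r in p if r[0] == key}
    print(key, homes)   # exactly one partition per key
```

A custom partitioner generalizes `key_fn` and the index computation to match your data's access pattern, which is what lesson 16 demonstrates in Spark proper.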

Saving Data in the Correct Format
17 Saving Data in Plain Text
18 Leveraging JSON as a Data Format
19 Tabular Formats – CSV
20 Using Avro with Spark
21 Columnar Formats – Parquet
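Of the formats covered in this section, JSON is the easiest to sketch with only the standard library. Spark's JSON reader expects newline-delimited JSON (one object per line), a layout a few lines of plain Python can produce and read back; this illustrates the file format itself, not Spark's own writer:

```python
import json, os, tempfile

rows = [{"id": 1, "city": "Warsaw"}, {"id": 2, "city": "Oslo"}]

path = os.path.join(tempfile.mkdtemp(), "rows.jsonl")
# Write newline-delimited JSON: one record per line.
with open(path, "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read it back line by line, mirroring how a JSON-lines reader works.
with open(path) as f:
    restored = [json.loads(line) for line in f]
print(restored == rows)  # True
```

Binary formats such as Avro and Parquet trade this human readability for schemas, compression, and (for Parquet) columnar access, which is the comparison the lessons above draw.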

Working with Spark Key/Value API
22 Available Transformations on Key/Value Pairs
23 Using aggregateByKey Instead of groupBy()
24 Actions on Key/Value Pairs
25 Available Partitioners on Key/Value Data
26 Implementing Custom Partitioner
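Lesson 23's aggregateByKey differs from reduceByKey in that the accumulated value may have a different type from the input values; it takes a zero value, a within-partition sequence op, and a cross-partition combine op. A plain-Python sketch of those three ingredients, computing a per-key average (again a model of the semantics, not the Spark API):

```python
def aggregate_by_key_sketch(partitions, zero, seq_op, comb_op):
    """Model of aggregateByKey: seq_op folds values into an accumulator
    inside each partition; comb_op merges accumulators across partitions."""
    per_partition = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = seq_op(acc.get(key, zero), value)
        per_partition.append(acc)
    merged = {}
    for acc in per_partition:
        for key, a in acc.items():
            merged[key] = comb_op(merged[key], a) if key in merged else a
    return merged

partitions = [[("a", 2), ("a", 4)], [("a", 6), ("b", 10)]]
# Accumulator is (sum, count) - a different type from the int values.
totals = aggregate_by_key_sketch(
    partitions,
    zero=(0, 0),
    seq_op=lambda acc, v: (acc[0] + v, acc[1] + 1),
    comb_op=lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
averages = {k: s / c for k, (s, c) in totals.items()}
print(averages)   # {'a': 4.0, 'b': 10.0}
```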

Testing Apache Spark Jobs
27 Separating Logic from Spark Engine – Unit Testing
28 Integration Testing Using SparkSession
29 Mocking Data Sources Using Partial Functions
30 Using ScalaCheck for Property-Based Testing
31 Testing in Different Versions of Spark
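Lesson 27's pattern of separating logic from the Spark engine means writing transformations as ordinary functions over plain values, testable with no SparkSession at all; only a thin integration layer then applies them to distributed data. A hedged sketch of the pattern in plain Python (the function names are illustrative, not from the course):

```python
def normalize_country(code):
    """Pure business logic: no Spark types, trivially unit-testable."""
    return code.strip().upper() if code else "UNKNOWN"

def transform(rows):
    """Pure pipeline step over any iterable; in production this logic
    would be handed to a distributed map over the real dataset."""
    return [normalize_country(c) for c in rows]

# Unit test with no cluster, no SparkSession, no fixtures:
cleaned = transform([" pl", "DE", None])
print(cleaned)   # ['PL', 'DE', 'UNKNOWN']
```

Integration tests (lesson 28) then cover only the thin Spark wiring, which keeps the slow, cluster-dependent test surface small.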

Leveraging Spark GraphX API
32 Creating Graph from Datasource
33 Using Vertex API
34 Using Edge API
35 Calculating the Degree of a Vertex
36 Calculating PageRank
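The last two lessons compute per-vertex measures; the degree of a vertex is simply the number of edges touching it. A plain-Python sketch over an edge list shows the quantity being computed (GraphX's degree operators do the distributed equivalent over a property graph):

```python
from collections import Counter

# Directed edges as (source, destination) pairs.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]

out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)
degree = out_degree + in_degree   # total degree = in + out

print(dict(degree))  # {'a': 3, 'b': 2, 'c': 3}
```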