Apache Spark with Python – Big Data with PySpark and Spark

English | MP4 | AVC 1280×720 | AAC 44KHz 2ch | 3h 18m | 1.15 GB

Learn Apache Spark and Python through 12+ hands-on examples of analyzing big data with PySpark and Spark.

This course covers all the fundamentals of Apache Spark with Python and teaches you everything you need to know about developing Spark applications using PySpark, the Python API for Spark. By the end of this course, you will have in-depth knowledge of Apache Spark and the general big data analysis and manipulation skills needed to help your company adopt Apache Spark for building big data processing pipelines and data analytics applications. The course covers 10+ hands-on big data examples and teaches you how to frame data analysis problems as Spark problems.

Together we will work through examples such as aggregating NASA Apache web server logs from different sources; exploring price trends in California real estate data; writing Spark applications to find the median salary of developers in different countries from the Stack Overflow survey data; and building a system to analyze how maker spaces are distributed across different regions in the United Kingdom. And much, much more.

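As a taste of what framing a problem as a Spark problem looks like, here is a minimal PySpark sketch of the median-salary example. It assumes a local PySpark installation, and the file name and the "Country"/"Salary" column names are illustrative assumptions, not the course's actual dataset layout.

    # Sketch: median developer salary per country from a survey CSV.
    # Assumes `pip install pyspark`; file and column names are illustrative only.
    import statistics
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("MedianSalary").getOrCreate()

    salaries = (spark.read.option("header", True).csv("survey_results.csv")
                .select("Country", "Salary")
                .dropna()
                .rdd
                # Crude numeric filter so float() below cannot fail on stray text.
                .filter(lambda row: row["Salary"].replace(".", "", 1).isdigit())
                .map(lambda row: (row["Country"], float(row["Salary"]))))

    # Group salaries by country and take the median of each group; the
    # collect() action is what actually triggers the computation.
    medians = salaries.groupByKey().mapValues(lambda vals: statistics.median(vals))
    for country, median_salary in medians.collect():
        print(country, median_salary)

    spark.stop()

Note that groupByKey pulls every value for a key onto a single executor, which is fine for a small sketch; the Reduce By Key Aggregation and Data Partitioning lectures are where more scalable approaches come in.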

What You Will Learn

  • Understand the architecture of Apache Spark.
  • Develop Apache Spark 2.0 applications using RDD transformations and actions and Spark SQL.
  • Work with Apache Spark's primary abstraction, resilient distributed datasets (RDDs), to process and analyze large data sets.
  • Analyze structured and semi-structured data using DataFrames, and develop a thorough understanding of Spark SQL.
  • Apply advanced techniques to optimize and tune Apache Spark jobs by partitioning, caching, and persisting RDDs.
  • Scale up Spark applications on a Hadoop YARN cluster through Amazon's Elastic MapReduce (EMR) service.
  • Share information across the nodes of an Apache Spark cluster with broadcast variables and accumulators.
  • Write Spark applications using the Python API, PySpark (a small sketch follows this list).
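To make the bullet points above concrete, here is a short, self-contained PySpark sketch touching RDD transformations and actions, a broadcast variable, and a DataFrame queried through Spark SQL. It assumes a local PySpark installation and uses made-up airport data purely for illustration.

    # Self-contained sketch of the APIs listed above; runs locally with PySpark.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("WhatYouWillLearn")
             .getOrCreate())
    sc = spark.sparkContext

    # RDD transformations (filter, map) and an action (collect).
    airports = sc.parallelize([("Heathrow", 51.47), ("Schiphol", 52.31), ("Changi", 1.36)])
    northern = airports.filter(lambda a: a[1] > 40).map(lambda a: a[0])
    print(northern.collect())

    # Broadcast variable: ship a small lookup table to every executor once.
    countries = sc.broadcast({"Heathrow": "UK", "Schiphol": "NL", "Changi": "SG"})
    print(airports.map(lambda a: (a[0], countries.value[a[0]])).collect())

    # DataFrame and Spark SQL over the same data.
    df = spark.createDataFrame(airports, ["name", "latitude"])
    df.createOrReplaceTempView("airports")
    spark.sql("SELECT name FROM airports WHERE latitude > 40").show()

    spark.stop()

Nothing here is specific to the course's datasets; it is only meant to show the shape of the APIs the bullet points refer to.
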
Table of Contents

Get Started with Apache Spark
1 Course Overview
2 Introduction to Spark
3 Install Java and Git
4 Set up Spark
5 Run our first Spark job

RDD
6 RDD Basics
7 Create RDDs
8 Map and Filter Transformation
9 Solution to Airports by Latitude Problem
10 FlatMap Transformation
11 Set Operations
12 Solution for the Same Hosts Problem
13 Actions
14 Solution to Sum of Numbers Problem
15 Important Aspects about RDD
16 Summary of RDD Operations
17 Caching and Persistence

Spark Architecture and Components
18 Spark Architecture
19 Spark Components

Pair RDD
20 Introduction to Pair RDD
21 Create Pair RDDs
22 Filter and MapValue Transformations on Pair RDD
23 Reduce By Key Aggregation
24 Solution for the Average House Problem
25 Group By Key Transformation
26 Sort By Key Transformation
27 Solution for the Sorted Word Count Problem
28 Data Partitioning
29 Join Operations

Advanced Spark Topics
30 Accumulators
31 Solution to StackOverflow Survey Follow-up Problem
32 Broadcast Variables

Spark SQL
33 Introduction to Spark SQL
34 Spark SQL in Action
35 Spark SQL Joins
36 Spark SQL practice – House Price Problem
37 Dataframe or RDD
38 Dataframe and RDD Conversion
39 Performance Tuning of Spark SQL

Running Spark in a Cluster
40 Introduction to Running Spark in a Cluster
41 Spark-submit
42 Run a Spark Application on an Amazon EMR (Elastic MapReduce) Cluster
