Apache PySpark by Example

English | MP4 | AVC 1280×720 | AAC 48KHz 2ch | 1h 58m | 263 MB

Want to get up and running with Apache Spark as soon as possible? If you’re well versed in Python, the Spark Python API (PySpark) is your ticket to accessing the power of this hugely popular big data platform. This practical, hands-on course helps you get comfortable with PySpark, explaining what it has to offer and how it can enhance your data science work. To begin, instructor Jonathan Fernandes digs into the Spark ecosystem, detailing its advantages over other data science platforms, APIs, and tool sets. Next, he looks at the DataFrame API and how it’s the platform’s answer to many big data challenges. Finally, he goes over Resilient Distributed Datasets (RDDs), the building blocks of Spark.
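
To give a flavor of what working with the DataFrame API looks like, here is a minimal PySpark sketch. It is not taken from the course; the app name and sample rows are invented for illustration. It starts a local session, builds a small DataFrame, and runs a lazy transformation followed by an action:

    from pyspark.sql import SparkSession

    # Start a local Spark session, the entry point to the DataFrame API.
    # (App name and sample data are invented for illustration.)
    spark = SparkSession.builder.appName("pyspark-by-example").getOrCreate()

    # Build a small DataFrame in memory; in practice you would load a CSV or Parquet file.
    df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

    # Transformations such as filter() are lazy; show() is an action that triggers the job.
    df.filter(df.age > 40).show()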

Topics include (several are previewed in the sketch after this list):

  • Benefits of the Apache Spark ecosystem
  • Working with the DataFrame API
  • Working with columns and rows
  • Leveraging built-in Spark functions
  • Creating your own functions in Spark
  • Working with Resilient Distributed Datasets (RDDs)
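
The short sketch below previews built-in functions, user-defined functions, and RDDs. It is not course material; the column names and sample rows are invented for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("topics-sketch").getOrCreate()

    # Hypothetical columns and values, invented for illustration.
    df = spark.createDataFrame(
        [("2019-01-15", "THEFT"), ("2019-02-20", "BATTERY")],
        ["date_str", "category"],
    )

    # Built-in functions: parse the string into a date column, then extract the month.
    df = df.withColumn("date", F.to_date("date_str", "yyyy-MM-dd"))
    df.select("category", F.month("date").alias("month")).show()

    # User-defined function: lowercase a column. Prefer built-ins where one
    # exists; UDFs run Python row by row and are slower.
    to_lower = F.udf(lambda s: s.lower(), StringType())
    df.select(to_lower("category").alias("category_lower")).show()

    # RDDs, the lower-level building blocks beneath DataFrames.
    rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
    print(rdd.map(lambda x: x * 2).collect())  # prints [2, 4, 6, 8]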

Table of Contents

Introduction
1 Apache PySpark
2 What you should know

Introduction to Apache Spark
3 The Apache Spark ecosystem
4 Why Spark?
5 Spark origins and Databricks
6 Spark components
7 Partitions, transformations, lazy evaluations, and actions

Technical Setup
8 Set up the lab environment
9 Download a dataset
10 Importing

Working with the DataFrame API
11 The DataFrame API
12 Working with DataFrames
13 Schemas
14 Working with columns
15 Working with rows
16 Challenge
17 Solution

Functions
18 Built-in functions
19 Working with dates
20 User-defined functions
21 Working with joins
22 Challenge
23 Solution

Resilient Distributed Datasets (RDDs)
24 RDDs
25 Working with RDDs

Conclusion
26 Next steps