Introduction to Spark SQL and DataFrames

English | MP4 | AVC 1280×720 | AAC 48KHz 2ch | 1h 53m | 273 MB

Explore DataFrames, a widely used data structure in Apache Spark. DataFrames allow Spark developers to perform common data operations, such as filtering and aggregation, as well as advanced data analysis on large collections of distributed data. With the addition of Spark SQL, developers have access to an even more popular and powerful query language than the built-in DataFrames API. In this course, instructor Dan Sullivan shows how to perform basic operations (loading, filtering, and aggregating data in DataFrames) with both the API and SQL, as well as more advanced techniques that are easier to perform in SQL. Dan also explains how to join data, eliminate duplicates, and deal with null or NA values. The lessons conclude with three in-depth examples of using DataFrames for data science: exploratory data analysis, time series analysis, and machine learning.

Topics include:

  • Installing Spark and PySpark
  • Setting up a Jupyter notebook
  • Loading data into DataFrames
  • Filtering, aggregating, and saving data
  • Querying and modifying DataFrames with SQL
  • Exploratory data analysis
  • Basic machine learning

Table of Contents

1 Apache Spark SQL and data analysis
2 What you should know
3 Introduction to DataFrames
4 SQL for DataFrames
5 Install Spark
6 Install PySpark
7 Using Jupyter notebooks with PySpark
8 Set up a Jupyter notebook
9 Load data into DataFrames: CSV files
10 Load data into DataFrames: JSON files
11 Basic DataFrame operations
12 Filter data with DataFrame API
13 Aggregate data with DataFrame API
14 Sample data from DataFrames
15 Save data from DataFrames
16 Querying DataFrames with SQL
17 Filtering DataFrames with SQL
18 Aggregating Data with SQL
19 Joining DataFrames with SQL
20 Eliminating duplicates in DataFrames
21 Working with NA values in DataFrames
22 Exploratory data analysis with DataFrames
23 Exploratory data analysis with Spark SQL
24 Time series analysis with DataFrames
25 Basic machine learning with DataFrames, part 1
26 Basic machine learning with DataFrames, part 2
27 Next steps