Hands-On PySpark for Big Data Analysis

English | MP4 | AVC 1920×1080 | AAC 48 kHz 2ch | 1h 52m | 664 MB

Use PySpark to productionize analytics over Big Data and easily crush messy data at scale

Data is an incredible asset, especially when there is a lot of it. Exploratory data analysis, business intelligence, and machine learning all depend on processing and analyzing Big Data at scale.

How do you go from working on prototypes on your local machine to handling messy data in production, at scale?

This is a practical, hands-on course that shows you how to use Spark and its Python API to create performant analytics over large-scale data. Don’t reinvent the wheel; wow your clients by building robust and responsible applications on Big Data.

This hands-on course is divided into clear bite-size chunks, so you can learn at your own pace and focus on the areas of most interest to you. It’s practical and packed with step-by-step instructions, working examples, and helpful advice from our expert author. You will learn how PySpark provides an easy-to-use, performant way to do data analysis with Big Data.

What You Will Learn

  • Work on real-life messy datasets with PySpark to get practical Big Data experience
  • Design for both offline and online use cases with Spark Notebooks to increase productivity
  • Analyse and discover patterns with Spark SQL to improve your business intelligence
  • Get rapid-fire feedback with PySpark’s interactive shell to speed up development time
  • Quickly iterate through your solution by setting up PySpark for your own computer
  • Use Spark Notebooks to quickly iterate through your new ideas

Table of Contents

INSTALL PYSPARK AND SETUP YOUR DEVELOPMENT ENVIRONMENT
The Course Overview
Core Concepts in Spark and PySpark
Setting Up Spark on Windows and PySpark
SparkContext, SparkConf and Spark Shell

GETTING YOUR BIG DATA INTO THE SPARK ENVIRONMENT USING RDDS
Loading Data onto Spark RDDs
Parallelization with Spark RDDs
RDD Operation Basics

BIG DATA CLEANING AND WRANGLING WITH SPARK NOTEBOOKS
Using Spark Notebooks for Quick Iteration of Ideas
Sampling/Filtering RDDs to Pick-Out Relevant Data Points
Splitting Datasets and Creating New Combinations with Set Operations

AGGREGATING AND SUMMARIZING DATA INTO USEFUL REPORTS
Calculating Averages with Map and Reduce
Faster Average Computation with Aggregate
Pivot Tabling with Key-Value Paired Data Points

POWERFUL EXPLORATORY DATA ANALYSIS WITH MLLIB
Computing Summary Statistics with MLlib
Using Pearson and Spearman to Discover Correlations
Testing Your Hypotheses on Large Datasets

PUTTING STRUCTURE ON YOUR BIG DATA WITH SPARKSQL
Manipulating DataFrames with SparkSQL Schemas
Using the Spark DSL to Build Queries for Structured Data Operations