Spark, Ray, and Python for Scalable Data Science LiveLessons

English | MP4 | AVC 1280×720 | AAC 48KHz 2ch | 7h 23m | 2.70 GB

Conceptual overviews and code-along sessions get you scaling up your data science projects using Spark, Ray, and Python.

Machine learning is moving from futuristic AI projects to data analysis on your desk. You need to go beyond following along in discussions to coding machine learning tasks. Spark, Ray, and Python for Scalable Data Science LiveLessons show you how to scale machine learning and artificial intelligence projects using Python, Spark, and Ray.

Learn How To

  • Integrate Python and distributed computing
  • Scale data processing with Spark
  • Conduct exploratory data analysis with PySpark
  • Utilize parallel computing with Ray
  • Scale machine learning and artificial intelligence applications with Ray

Lesson 1: Introduction to Distributed Computing in Python

Lesson 1 starts with an introduction to the data science process and workflow. It then turns to a bit of history on why frameworks like Spark and Ray are necessary. Next comes a short primer on distributed systems theory. Python-based distributed computing frameworks come up next. Finally, Jonathan begins to explain the Spark ecosystem as well as how Spark compares to Ray.

Lesson 2: Scaling Data Processing with Spark

Lesson 2 goes into detail on the Spark framework, beginning with a “Hello World” example of programming with Spark. Then Jonathan turns to the Spark APIs. You get some experience with one of Spark’s primary data structures, the resilient distributed dataset (RDD). Next come key-value pairs and how Spark performs operations on them in a way similar to MapReduce. The lesson finishes up with a bit of Spark internals and the overall Spark application lifecycle.
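
As a rough illustration of the MapReduce-style key-value pattern mentioned above, here is a minimal plain-Python sketch of a word count. It is not Spark code and not taken from the course: `map_phase` and `reduce_by_key` are hypothetical helper names, and the two input lines are made-up sample data. Spark's RDD API offers a `reduceByKey` operation with the same shape, spread across a cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit one (word, 1) pair per word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_by_key(pairs, func):
    """Group values by key, then fold each group with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}


# Made-up sample input for illustration only.
lines = ["spark and ray", "ray and python"]
counts = reduce_by_key(map_phase(lines), lambda a, b: a + b)
print(counts)
```

In a distributed setting the grouping step is where data moves between machines, which is why the reducing function should be associative: partial results can then be combined in any order.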

Lesson 3: Exploratory Data Analysis with PySpark

In Lesson 3, Jonathan continues using Spark, but now in the context of a larger data science workflow centered on natural language processing (NLP). He starts off with a general introduction to exploratory data analysis (EDA), followed by a quick tour of Jupyter notebooks. Next he discusses how to do EDA with Spark at scale, and then he shows you how to create statistics and data visualizations to summarize data sets. Finally, he tackles the NLP example, showing you how to transform a large corpus of text into a numerical representation suitable for machine learning.
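
To make "transforming text into a numerical representation" concrete, the following is a small plain-Python sketch of tokenization followed by bag-of-words count vectors. It is only an assumption about what such a pipeline looks like in miniature; the course itself uses Spark's MLlib for this step, and the function names and sample documents here are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(docs):
    """Map every distinct token to a column index, in sorted order."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}


def vectorize(doc, vocab):
    """Return one count per vocabulary entry for a single document."""
    counts = Counter(tokenize(doc))
    return [counts.get(tok, 0) for tok in vocab]


# Made-up sample documents for illustration only.
docs = ["Spark scales data", "Ray scales Python, Ray scales models"]
vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]
print(list(vocab))
print(vectors)
```

Each document becomes a fixed-length list of counts, one position per vocabulary word, which is the kind of input most machine learning algorithms expect.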

Lesson 4: Parallel Computing with Ray

Lesson 4 introduces the Ray programming API, with Jonathan comparing the Ray and Spark APIs. You learn how to distribute functions with Ray, as well as how to perform operations on distributed classes or objects with Ray actors. Finally, Jonathan finishes up with a large-scale simulation to highlight the strengths of the Ray framework.
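
The submit-then-collect shape of parallel function calls can be sketched with the Python standard library. The example below advances Conway's Game of Life (the simulation named in the table of contents) by computing each row as a separate task. It deliberately uses `concurrent.futures` as a stand-in rather than Ray, so it shows the pattern only; in Ray the analogous pieces are functions decorated with `ray.remote` and results gathered with `ray.get`. The helper names and the starting grid are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def next_row(grid, r):
    """Compute the next state of row r; cells beyond the edge count as dead."""
    rows, cols = len(grid), len(grid[0])
    new_row = []
    for c in range(cols):
        neighbors = sum(
            grid[rr][cc]
            for rr in range(max(0, r - 1), min(rows, r + 2))
            for cc in range(max(0, c - 1), min(cols, c + 2))
            if (rr, cc) != (r, c)
        )
        alive = grid[r][c] == 1
        # A cell lives with exactly 3 neighbors, or survives with 2.
        new_row.append(1 if neighbors == 3 or (alive and neighbors == 2) else 0)
    return new_row


def step(grid, pool):
    """Submit one task per row, then collect the results in row order."""
    futures = [pool.submit(next_row, grid, r) for r in range(len(grid))]
    return [f.result() for f in futures]


# A vertical "blinker", which oscillates with period 2.
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]

with ThreadPoolExecutor(max_workers=4) as pool:
    after_one = step(blinker, pool)
    after_two = step(after_one, pool)

print(after_one[2])
```

Every row depends only on the previous generation, so the tasks within one step are independent of each other. That independence is what makes this kind of simulation a natural fit for a task-parallel framework.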

Lesson 5: Scaling AI Applications with Ray

Lesson 5 discusses how Ray enables you to scale up machine learning and artificial intelligence applications with Python. The lesson starts with the general model training and evaluation process in Python. Then it turns to how Ray enables you to scale both the evaluation and the tuning of your models. You see how Ray makes very efficient hyperparameter tuning possible. You also see how, once you have a trained model, Ray can serve predictions from it. Finally, the lesson finishes with an introduction to how Ray can enable you both to deploy machine learning models to production and to monitor them once they are there.
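
As a hedged sketch of what hyperparameter grid search involves, the example below enumerates every combination in a small grid and keeps the best-scoring one. The scoring function `toy_validation_score` is a hypothetical stand-in for "train a model and return its validation score", and the parameter names and values are invented for illustration. This is plain Python, not Ray Tune.

```python
from itertools import product


def toy_validation_score(learning_rate, max_depth):
    """Hypothetical stand-in for training a model and scoring it.

    The score peaks at learning_rate=0.1 and max_depth=3 by construction.
    """
    return -((learning_rate - 0.1) ** 2) - 0.01 * (max_depth - 3) ** 2


# Invented search space for illustration only.
grid = {"learning_rate": [0.01, 0.1, 1.0], "max_depth": [2, 3, 5]}

# Build one parameter dictionary per combination in the grid.
names = list(grid)
candidates = [dict(zip(names, values)) for values in product(*grid.values())]

# Each evaluation is independent of the others.
scores = [toy_validation_score(**params) for params in candidates]
best_score, best_params = max(zip(scores, candidates), key=lambda pair: pair[0])
print(best_params)
```

Because no evaluation depends on another, the scoring loop is the part a distributed framework can spread across workers, and the number of candidates grows multiplicatively with each added hyperparameter.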

Table of Contents

1 Spark, Ray, and Python for Scalable Data Science – Introduction
2 Topics
3 Introduction and Materials
4 The Data Science Process
5 A Brief Historical Diversion
6 Distributed Systems Primer
7 Python Distributed Computing Frameworks
8 The What and Why of Spark
9 The Spark Platform
10 Spark versus Ray
11 Topics
12 Course Coding Setup
13 Your First PySpark Job
14 Introduction to RDDs
15 Transformations versus Actions
16 RDD Deep Dive
17 The Spark Execution Context
18 Spark versus Hadoop
19 Spark Application Lifecycle
20 Topics
21 Introduction to Exploratory Data Analysis
22 A Quick Tour of Jupyter Notebooks
23 Parsing Data at Scale
24 Spark DataFrames – Integration into Existing Workflows
25 Scaling Exploratory Data Analysis with Spark
26 Making Sense of Data – Summary Statistics and Data Visualization
27 Working with Text – Introduction to NLP
28 Tokenization and Vectorization with MLlib
29 Topics
30 The What and Why of Ray
31 The Ray Programming Model
32 Parallelizing Functions with Ray Tasks
33 Asynchronous Programming with Actors
34 Cellular Automata and the Game of Life
35 Topics
36 Introduction to Model Evaluation
37 Serializing Data for Machine Learning Applications
38 Cross Validation with scikit-learn
39 Strategies for Tuning Machine Learning Models
40 Grid Search in Python
41 Distributed Hyperparameter Optimization with Ray Tune
42 Resource Efficient Search with Principled Early Stopping
43 Diving Deeper into Ray’s Internals
44 Serving Machine Learning Models
45 Deploying AI Applications with Ray Serve
46 Monitoring Model Performance in Production
47 Spark, Ray, and Python for Scalable Data Science – Summary
