Spark in Action Video Edition


English | MP4 | AVC 1280×720 | AAC 44KHz 2ch | 15h 44m | 2.59 GB
eLearning | Skill level: All Levels

Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. You’ll get comfortable with the Spark CLI as you work through a few introductory examples. Then, you’ll start programming Spark using its core APIs. Along the way, you’ll work with structured data using Spark SQL, process near-real-time streaming data, apply machine learning algorithms, and munge graph data using Spark GraphX. To get started with minimal setup, you can download a preconfigured virtual machine and use it to try the book’s code.

Big data systems distribute datasets across clusters of machines, which makes it a challenge to query, stream, and interpret them efficiently. Spark can help. It is a processing system designed specifically for distributed data. It provides easy-to-use interfaces, along with the performance you need for production-quality analytics and machine learning. Spark 2 also adds improved programming APIs, better performance, and many other upgrades.


  • Updated for Spark 2.0
  • Real-life case studies
  • Spark DevOps with Docker
  • Examples in Scala, and online in Java and Python
Table of Contents

1 Introduction to Apache Spark
2 What Spark brings to the table
3 Spark components
4 Spark program flow
5 Setting up the spark-in-action VM
6 Spark fundamentals
7 Using the VM’s Hadoop installation
8 Using Spark shell and writing your first Spark program
9 Basic RDD actions and transformations
10 Using the distinct and flatMap transformations
11 Obtaining RDD’s elements with the sample, take, and takeSample operations
12 Double RDD functions
13 Writing Spark applications
14 Developing the application
15 Running the application from Eclipse
16 Broadcast variables
17 Submitting the application
18 Using spark-submit
19 The Spark API in depth
20 Basic pair RDD functions
21 Using the flatMapValues transformation to add values to keys
22 Understanding data partitioning and reducing data shuffling
23 Understanding and avoiding unnecessary shuffling
24 Repartitioning RDDs
25 Joining, sorting, and grouping data
26 Joining data
27 Sorting data
28 Grouping data
29 Understanding RDD dependencies
30 Using accumulators and broadcast variables to communicate with Spark executors
31 Sending data to executors using broadcast variables
32 Sparkling queries with Spark SQL
33 Creating DataFrames from RDDs
34 Creating a DataFrame from an RDD of tuples
35 DataFrame API basics
36 Using SQL functions to perform calculations on data
37 Working with missing values
38 Grouping and joining data
39 Beyond DataFrames – introducing DataSets
40 Table catalog and Hive metastore
41 Executing SQL queries
42 Saving and loading DataFrame data
43 Saving data
44 Catalyst optimizer
45 Ingesting data with Spark Streaming
46 Creating a discretized stream
47 Saving the results to a file
48 Saving the computation state over time
49 Specifying the checkpointing directory
50 Using window operations for time-limited calculations
51 Using external data sources
52 Changing the streaming application to use Kafka
53 Performance of Spark Streaming jobs
54 Structured Streaming
55 Getting smart with MLlib
56 Classification of machine-learning algorithms
57 Linear algebra in Spark
58 Distributed matrices
59 Linear regression
60 Expanding the model to multiple linear regression
61 Analyzing and preparing the data
62 Fitting and using a linear regression model
63 Tweaking the algorithm
64 Plotting residual plots
65 Optimizing linear regression
66 ML – classification and clustering
67 Logistic regression
68 Preparing data to use logistic regression in Spark
69 Training the model
70 Performing k-fold cross-validation
71 Decision trees and random forests
72 Decision trees
73 Random forests
74 Using k-means clustering
75 K-means clustering
76 Summary
77 Connecting the dots with GraphX
78 Transforming graphs
79 Graph algorithms
80 Implementing the A* search algorithm
81 Implementing the A* algorithm
82 Summary
83 Running Spark
84 Job and resource scheduling
85 Data-locality considerations
86 Configuring Spark
87 Spark web UI
88 Running Spark on the local machine
89 Running on a Spark standalone cluster
90 Starting the standalone cluster
91 Viewing Spark processes
92 Standalone cluster web UI
93 Specifying extra classpath entries and files
94 Spark History Server and event logging
95 Creating an EC2 standalone cluster
96 Using the EC2 cluster
97 Running on YARN and Mesos
98 Resource scheduling in YARN
99 Configuring Spark on YARN
100 Configuring resources for Spark jobs
101 Finding logs on YARN
102 Running Spark on Mesos
103 Installing and configuring Mesos
104 Mesos resource scheduling
105 Running Spark with Docker
106 Case study – real-time dashboard
107 Running the application
108 Starting the application manually
109 Understanding the source code
110 The StreamingLogAnalyzer project
111 Deep learning on Spark with H2O
112 Using H2O with Spark
113 Performing regression with H2O’s deep learning
114 Building and evaluating a deep-learning model using the Sparkling Water API
115 Performing classification with H2O’s deep learning