Learn By Example: Hadoop, MapReduce for Big Data problems

English | MP4 | AVC 1280×720 | AAC 44KHz 2ch | 13h 44m | 3.82 GB

A hands-on workout in Hadoop, MapReduce and the art of thinking “parallel”

This course is a zoom-in, zoom-out, hands-on workout involving Hadoop, MapReduce and the art of thinking parallel. It is both broad and deep: it covers the individual components of Hadoop in great detail and also gives you a higher-level picture of how they interact with each other. You’ll get hands-on with Hadoop very early on and learn how to set up your own cluster using both VMs and the cloud. All the major features of MapReduce are covered, including advanced topics like Total Sort and Secondary Sort. MapReduce completely changed the way people thought about processing Big Data. Breaking down any problem into parallelizable units is an art, and the examples in this course will train you to think in parallel.
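
To make "thinking in parallel" concrete, here is a minimal sketch of the classic word count split into map, shuffle and reduce steps. It is plain Python that runs on one machine and does not use Hadoop; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, not part of the course material or of any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle and sort: group all values by key, sorted by key.

    In a real cluster the framework does this between map and reduce.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reducer: collapse all the counts for one word into a total."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big deal", "Big cluster"]))
```

Each call to `map_phase` depends only on its own line, and each call to `reduce_phase` depends only on its own key, which is why both steps can be spread across many machines.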

What You Will Learn

  • Develop advanced MapReduce applications to process Big Data
  • Master the art of thinking parallel and how to break up a task into Map/Reduce transformations
  • Set up your own mini-Hadoop cluster yourself, whether it’s a single node, a physical cluster or in the cloud
  • Use Hadoop + MapReduce to solve a wide variety of problems: from NLP to Inverted Indices to Recommendations
  • Understand HDFS, MapReduce and YARN and how they interact with each other
  • Understand the basics of performance tuning and managing your own cluster

Table of Contents

Introduction
1 You, this course and Us

Input and Output Formats and Customized Partitioning
2 Introducing the File Input Format
3 Text And Sequence File Formats
4 Data partitioning using a custom partitioner
5 Make the custom partitioner real in code
6 Total Order Partitioning
7 Input Sampling, Distribution, Partitioning and configuring these
8 Secondary Sort

Recommendation Systems using Collaborative Filtering
9 Introduction to Collaborative Filtering
10 Friend recommendations using chained MR jobs
11 Get common friends for every pair of users – the first MapReduce
12 Top 10 friend recommendation for every user – the second MapReduce

Hadoop as a Database
13 Structured data in Hadoop
14 Running an SQL Select with MapReduce
15 Running an SQL Group By with MapReduce
16 A MapReduce Join – The Map Side
17 A MapReduce Join – The Reduce Side
18 A MapReduce Join – Sorting and Partitioning
19 A MapReduce Join – Putting it all together

K-Means Clustering
20 What is K-Means Clustering
21 A MapReduce job for K-Means Clustering
22 K-Means Clustering – Measuring the distance between points
23 K-Means Clustering – Custom Writables for Input/Output
24 K-Means Clustering – Configuring the Job
25 K-Means Clustering – The Mapper and Reducer
26 K-Means Clustering – The Iterative MapReduce Job

Setting up a Hadoop Cluster
27 Manually configuring a Hadoop cluster (Linux VMs)
28 Getting started with Amazon Web Services
29 Start a Hadoop Cluster with Cloudera Manager on AWS

Appendix
30 Set up a Virtual Linux Instance (For Windows users)
31 [For Linux/Mac OS Shell Newbies] Path and other Environment Variables

Why is Big Data a Big Deal
32 The Big Data Paradigm
33 Serial vs Distributed Computing
34 What is Hadoop
35 HDFS or the Hadoop Distributed File System
36 MapReduce Introduced
37 YARN or Yet Another Resource Negotiator

Installing Hadoop in a Local Environment
38 Hadoop Install Modes
39 Hadoop Standalone mode Install
40 Hadoop Pseudo-Distributed mode Install

The MapReduce ‘Hello World’
41 The basic philosophy underlying MapReduce
42 MapReduce – Visualized And Explained
43 MapReduce – Digging a little deeper at every step
44 Hello World in MapReduce
45 The Mapper
46 The Reducer
47 The Job

Run a MapReduce Job
48 Get comfortable with HDFS
49 Run your first MapReduce Job

Juicing your MapReduce – Combiners, Shuffle and Sort and The Streaming API
50 Parallelize the reduce phase – use the Combiner
51 Not all Reducers are Combiners
52 How many mappers and reducers does your MapReduce have
53 Parallelizing reduce using Shuffle And Sort
54 MapReduce is not limited to the Java language – Introducing the Streaming API
55 Python for MapReduce

HDFS and YARN
56 HDFS – Protecting against data loss using replication
57 HDFS – Name nodes and why they’re critical
58 HDFS – Checkpointing to backup name node information
59 YARN – Basic components
60 YARN – Submitting a job to YARN
61 YARN – Plug in scheduling policies
62 YARN – Configure the scheduler

MapReduce Customizations For Finer Grained Control
63 Setting up your MapReduce to accept command line arguments
64 The Tool, ToolRunner and GenericOptionsParser
65 Configuring properties of the Job object
66 Customizing the Partitioner, Sort Comparator, and Group Comparator

The Inverted Index, Custom Data Types for Keys, Bigram Counts and Unit Tests!
67 The heart of search engines – The Inverted Index
68 Generating the inverted index using MapReduce
69 Custom data types for keys – The Writable Interface
70 Represent a Bigram using a WritableComparable
71 MapReduce to count the Bigrams in input text
72 Test your MapReduce job using MRUnit