Working with Big Data in Python

English | MP4 | AVC 1920×1080 | AAC 44KHz 2ch | 2h 41m | 565 MB

Gain valuable insights from your data by streamlining unstructured data pipelines with Python, Spark, and MongoDB

This course is a comprehensive, practical guide to using MongoDB and Spark in Python: you will learn how to store and make sense of huge data sets, and how to perform basic machine learning tasks to make predictions.

MongoDB is one of the most powerful non-relational database systems available, offering robust scalability and expressive query operations. Combined with Python's data analysis libraries and with distributed computing, it forms a valuable toolkit for the modern data scientist. NoSQL databases require a new way of thinking about data and about writing scalable queries.

This course covers how to use MongoDB, particularly if you are used to SQL databases, with a focus on scaling to large datasets. pyMongo is introduced as the means of interacting with a MongoDB database from Python code, and the data structures used to do so are explored. MongoDB allows complex operations and aggregations to be run within the query itself, and we cover how to use these operators.

While MongoDB is built to scale easily across many nodes as datasets grow, Python is not. We therefore cover how to use Spark with MongoDB to apply more complex machine learning techniques to extremely large datasets. This learning is applied to several real-world datasets and analyses that can form the basis of your own pipelines, allowing you to get up and running quickly with a powerful data science toolkit.
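
For a taste of what this looks like, here is a minimal pyMongo sketch (the database, collection, and field names are illustrative only, not taken from the course) that inserts a document, filters with a cursor, and runs an aggregation inside the database:

```python
from pymongo import MongoClient

# Connect to a local MongoDB instance (default port 27017 assumed).
client = MongoClient("mongodb://localhost:27017/")
collection = client["weather_db"]["observations"]

# Insert a JSON-like document.
collection.insert_one({"city": "London", "temp_c": 14.2, "humidity": 81})

# find() returns a cursor; filters and projections are plain dicts.
for doc in collection.find({"temp_c": {"$gt": 10}}, {"_id": 0, "city": 1, "temp_c": 1}):
    print(doc)

# Aggregations run inside the database: group observations by city
# and compute an average temperature per group.
pipeline = [
    {"$group": {"_id": "$city", "avg_temp": {"$avg": "$temp_c"}}},
    {"$sort": {"avg_temp": -1}},
]
for row in collection.aggregate(pipeline):
    print(row)
```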

An exhaustive course that carefully covers the fundamental concepts of unstructured data and distributed programming before applying them to examples of typical data science workflows.

This course is divided into clear chunks, so you can learn at your own pace and focus on your own area of interest.

What You Will Learn

  • Understand MongoDB as a non-relational database built on JSON documents
  • Set up cursors in pyMongo as the connector to a MongoDB database
  • Run more complex chained and aggregation queries
  • Connect to MongoDB from pySpark
  • Write MongoDB queries using operators and chain them together into aggregation pipelines
  • Work through real-world examples of using Python and MongoDB in a data pipeline
  • Use Mongo connectors in pySpark for high-performance processing (see the sketch below)
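
The sketch below shows one way to read a MongoDB collection into a Spark DataFrame with the MongoDB Spark connector. The connector coordinates, connection URI, and field names are assumptions for illustration; the exact package version and configuration keys depend on the connector release you install, and are not taken from the course.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("mongo-spark-example")
    # Connector coordinates depend on your Spark/Scala version (assumed here).
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:10.2.1")
    .config("spark.mongodb.read.connection.uri", "mongodb://localhost:27017/reddit.posts")
    .getOrCreate()
)

# Load the collection as a distributed DataFrame.
df = spark.read.format("mongodb").load()
df.printSchema()

# Transformations stay lazy and distributed until an action is called;
# small results can be pulled into pandas for local analysis.
top_posts = df.select("title", "ups").orderBy(F.desc("ups")).limit(10)
print(top_posts.toPandas())
```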

Table of Contents

01 The Course Overview
02 What Is MongoDB and Why Should I Use It
03 From Tabular Data to JSON Documents
04 MongoDB Indices and Datatypes
05 Setting Up MongoDB and Running Our First MongoDB Query
06 Setting Up pyMongo
07 Using pyMongo Cursors
08 Inserting and Finding Documents
09 Return Codes and Exceptions
10 Using Operators, Updates, and Aggregations
11 Grabbing Weather Data via OpenWeather API
12 Ingesting Weather Data into MongoDB
13 Querying Weather Data from MongoDB
14 What Is Spark and When Do We Need It
15 Data Structures in Spark
16 Data Structures in Spark (Continued)
17 Connecting to MongoDB with PySpark
18 Making Reddit Data Available to PySpark
19 Loading Data from MongoDB in Spark, Transform into Pandas DF
20 Preparing Data for Prediction Task Using spark.ml
21 Predicting Up Votes Using pyspark.ml
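
Lessons 20 and 21 build a prediction task with spark.ml. As a rough, assumption-laden sketch of that kind of workflow (the feature columns and model choice here are illustrative, not the course's actual pipeline), a linear regression on simple post features might look like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import length
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("upvote-prediction").getOrCreate()

# Toy rows standing in for Reddit posts loaded from MongoDB (illustrative only).
df = spark.createDataFrame(
    [
        ("A short title", 12, 85),
        ("Another example", 45, 210),
        ("Yet another reddit post title", 78, 640),
        ("A much longer example title here", 230, 1400),
    ],
    ["title", "num_comments", "ups"],
)
df = df.withColumn("title_length", length("title"))

# spark.ml estimators expect all features packed into a single vector column.
assembler = VectorAssembler(inputCols=["num_comments", "title_length"], outputCol="features")
train = assembler.transform(df)

# Fit a simple regression predicting upvotes from the assembled features.
model = LinearRegression(featuresCol="features", labelCol="ups").fit(train)
print(model.coefficients, model.intercept)
```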