Scrapy: Powerful Web Scraping & Crawling with Python

English | MP4 | AVC 1280×720 | AAC 48KHz 2ch | 10.5 Hours | 3.94 GB

Python Scrapy Tutorial – Learn how to scrape websites and build a powerful web crawler using Scrapy, Splash and Python

In this Scrapy tutorial, you will learn how to install Scrapy, build basic and advanced spiders, and understand the Scrapy architecture. You will then learn about deploying spiders and logging into websites with Scrapy. We will build a generic web crawler with Scrapy, integrate Selenium with Scrapy to iterate over pages, and build an advanced spider with the option to iterate over its pages. We will close it out using the close function in Scrapy, and then discuss Scrapy arguments. Finally, you will learn how to save the output to databases: MySQL and MongoDB. There is also a dedicated section of diverse solved web scraping exercises, and the course is still being updated.

One of the main advantages of Scrapy is that it is built on top of Twisted, an asynchronous networking framework. “Asynchronous” means that Scrapy does not have to wait for one request to finish before making another, which allows it to achieve a high level of performance. Because its concurrency is implemented with non-blocking (aka asynchronous) code, Scrapy is really efficient.
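In practice, that concurrency is tuned through settings rather than explicit threading code. The fragment below is a hypothetical `settings.py` sketch: the setting names are real Scrapy settings, but the values are purely illustrative, not recommendations.

```python
# Hypothetical settings.py fragment. The names are real Scrapy settings;
# the values are illustrative only.

# How many requests the Twisted reactor may keep in flight at once.
CONCURRENT_REQUESTS = 16

# Cap the number of parallel requests sent to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Politeness delay (in seconds) between requests to the same site.
DOWNLOAD_DELAY = 0.5

# Let AutoThrottle adjust the delay based on observed server latency.
AUTOTHROTTLE_ENABLED = True
```

Because requests are non-blocking, raising `CONCURRENT_REQUESTS` increases throughput without spawning threads; the per-domain cap and download delay keep the crawler polite toward individual sites.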

It is worth noting that Scrapy tries to solve not only content extraction (called scraping), but also navigation to the relevant pages for extraction (called crawling). To achieve that, a core concept in the framework is the Spider: in practice, a Python object with a few special features, for which you write the code while the framework is responsible for triggering it.

Scrapy provides many of the functions required for downloading websites and other content on the internet, making the development process quicker and less programming-intensive. This Python Scrapy tutorial will teach you how to use Scrapy to build web crawlers and web spiders.

Even though Scrapy was originally designed for web scraping, it can also be used to extract data through APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.

Scrapy is the most popular tool for web scraping and crawling written in Python. It is simple and powerful, with lots of features and possible extensions.

Python Scrapy Tutorial Topics:

This Scrapy course starts by covering the fundamentals of using Scrapy, and then concentrates on Scrapy's advanced features for creating and automating web crawlers. The main topics of this Python Scrapy tutorial are as follows:

  • What Scrapy is, the differences between Scrapy and other Python-based web scraping libraries such as BeautifulSoup, LXML, Requests, and Selenium, and when it is better to use Scrapy.
  • How to create a Scrapy project and then build a basic spider to scrape data from a website.
  • Exploring XPath commands and how to use them with Scrapy to extract data.
  • Building a more advanced Scrapy spider to iterate multiple pages of a website and scrape data from each page.
  • Scrapy architecture: the overall layout of a Scrapy project; what each component does and how you can use it in your spider code.
  • Web Scraping best practices to avoid getting banned by the websites you are scraping.
  • In this Scrapy tutorial, you will also learn how to deploy a Scrapy web crawler to the Scrapy Cloud platform easily. Scrapy Cloud is a platform from Scrapinghub to run, automate, and manage your web crawlers in the cloud, without the need to set up your own servers.
  • This Scrapy tutorial also covers how to use Scrapy for web scraping authenticated (logged in) user sessions, i.e. on websites that require a username and password before displaying data.
  • This course concentrates mainly on how to create an advanced web crawler with Scrapy. We will cover the Scrapy CrawlSpider, the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. We will also use the LinkExtractor object, which defines how links are extracted from each crawled page; it allows us to grab all the links on a page, no matter how many of them there are.
  • Furthermore, there is a complete section in this Scrapy tutorial showing how to combine Selenium with Scrapy to create web crawlers for dynamic web pages. When you cannot fetch the data directly from the source but need to load the page, fill in a form, click somewhere, or scroll down (that is, when a website relies heavily on AJAX calls and JavaScript execution to render its pages), it is good to use Selenium along with Scrapy.
  • We will also discuss more functions that Scrapy offers once a spider is done with web scraping, and how to edit and use Scrapy arguments.
  • As the main purpose of web scraping is to extract data, you will learn how to write the output to CSV, JSON, and XML files.
  • Finally, you will learn how to store the data extracted by Scrapy into MySQL and MongoDB databases.

Table of Contents

Scrapy vs. Other Python Web Scraping Frameworks
1 Scrapy vs. Beautiful Soup vs. Selenium
2 Course Tips (Must Read)

Scrapy Installation
3 Linux Scrapy Installation
4 Mac Scrapy Installation
5 Windows Scrapy Installation
6 Scrapy Installation Instructions
7 Python Editor Sublime Text

Building Basic Spider with Scrapy
8 Scrapy Simple Spider – Part 1
9 Scrapy Simple Spider – Part 2
10 Scrapy Simple Spider – Part 3

XPath Syntax
11 Using XPath with Scrapy
12 Tools to Easily Get XPath

Q&A
13 Do you have questions so far?

Building More Advanced Spider with Scrapy
14 Scrapy Advanced Spider – Part 1
15 Scrapy Advanced Spider – Part 2
16 Scrapy Advanced Spider – Part 3
17 Scrapy Advanced Spider – Part 4
18 Scrapy Architecture

Web Scraping Best Practices
19 Avoid Getting Banned!

Do you want to scrape a specific website?
20 Do you want to scrape a specific website? Let’s do it together!

Deploying & Scheduling Scrapy Spider on ScrapingHub
21 ScrapingHub Deploying & Scheduling Scrapy Spiders (UPDATED)

Logging into Websites Using Scrapy
22 Logging into Websites Using Scrapy

Scrapy as a Standalone Script (UPDATED)
23 Scrapy as a Standalone Script (UPDATED)

Building Web Crawler with Scrapy
24 Building Web Crawler with Scrapy

Scrapy with Selenium
25 Why/When We Should Use Selenium
26 Selenium WebDriver + Scrapy Selector to Extract URLs
27 Selenium Loading Next Page for Data Extraction (usable even with JavaScript pages)
28 Getting Data

Scrapy with Splash – JavaScript Websites
29 Splash Prerequisite Install Docker (NEW)
30 Splash Installation (NEW)
31 How to use Splash with Scrapy (NEW)
32 Splash Advanced Project Scraping Baierl.com p.1 (NEW)
33 Splash Advanced Project Scraping Baierl.com p.2 (NEW)
34 Splash Advanced Project Scraping Baierl.com p.3 (NEW)

Scrapy Spider – Bookstore
35 Grabbing URLs
36 Data Extraction

More about Scrapy
37 Scrapy Arguments
38 Scrapy Close Function
39 Scrapy Items

Export Output to Files
40 Scrapy Feed Exports to CSV, JSON, or XML
41 Export Output to Excel
42 Downloading Images with Scrapy Pipelines
43 Renaming Images with Scrapy Pipelines

Scrapy Project #1 Scraping Craigslist Eng Jobs in NY
44 Craigslist Scraper – Overview
45 Creating Scrapy Craigslist Spider
46 Craigslist Scrapy Spider #1 – Titles
47 Craigslist Scrapy Spider #2 – One Page
48 Craigslist Scrapy Spider #3 – Multiple Pages
49 Craigslist Scrapy Spider #4 – Job Descriptions
50 Editing Scrapy settings.py (e.g. throttling, user agent, etc.)
51 Final Scrapy Tutorial, Craigslist Spider Code

Extracting Data to Databases – MySQL & MongoDB
52 Installing MySQL
53 MySQL Installation and Usage
54 Writing Data to MySQL
55 Installing MongoDB
56 MongoDB Installation and Usage
57 Writing Data to MongoDB

Scrapy Project #2 Web Scraping Class-Central.com
58 Scraping Class-Central – Part 1 Subjects (UPDATED)
59 Scraping Class-Central – Part 2 Courses (UPDATED)

Scrapy Advanced Topics
60 Scrapy User Agent
61 Scraping Tables (UPDATED)
62 Scraping JSON Pages
63 Scrapy FormRequest (UPDATED)
64 Using Multiple Proxies with Crawlera (Optional)

Scrapy Project #3 Web Scraping Dynamic Website eplanning.ie
65 ePlanning Scraping Project Overview
66 ePlanning Extracting Initial URLs
67 ePlanning Crawling Internal Pages
68 ePlanning Scrapy Form Requests
69 ePlanning Scraping Data
70 ePlanning Checking Data Existence
71 ePlanning Scraping Data from Table

Project #4 Scraping Shoes’ Prices from API Request
72 Scraping Product Prices from API Request p.1 (NEW)
73 Scraping Product Prices from API Request p.2 (NEW)
74 Scraping Product Prices from API Request p.3 (NEW)

Project #5 Web Scraping LinkedIn.com (UPDATED)
75 LinkedIn Scraping Project Overview & Requirements (UPDATED)
76 LinkedIn Logging in (UPDATED)
77 Finding LinkedIn Profiles Part 1 (UPDATED)
78 Finding LinkedIn Profiles Part 2 (UPDATED)
79 Scraping Data Points from LinkedIn Profiles Part 1 (UPDATED)
80 Scraping Data Points from LinkedIn Profiles Part 2 (UPDATED)
81 Connecting to LinkedIn Profiles (UPDATED)

Solved Web Scraping Exercises
82 Yield Data Items from 2 Functions
83 How to Order Exported Data
84 Xpath contains() and starts-with() functions

Bonus Data Extraction with APIs
85 Data Extraction with APIs (Free Tutorial)

Bonus Web Scraping with Beautiful Soup, Requests & Selenium Course
86 Coupon for Web Scraping with Beautiful Soup, Requests & Selenium & Other Courses