The Ultimate Web Scraping With Python Bootcamp 2023

English | MP4 | AVC 1280×720 | AAC 44KHz 2ch | 160 lectures (17h 29m) | 6.76 GB

Learn to extract data from the web with Python in a single course covering selectolax, Playwright, Scrapy, and more

Welcome to the Ultimate Web Scraping With Python Bootcamp, the only course you need to go from a complete beginner in python to a very competent web scraper.

Web scraping is the process of programmatically extracting data from the web. Scraping agents visit a web resource, extract content from it, and then process the resulting data to pull out the specific information of interest.
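
As a minimal sketch of the extract-and-process steps, the snippet below uses only Python's standard library. The HTML string stands in for the body of a fetched page, and the `LinkCollector` class and the sample markup are illustrative assumptions rather than material from the course.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href and visible text of every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []            # list of (href, text) pairs
        self._current_href = None  # href of the <a> tag being read, if any
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> tag.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


# Stand-in for the body of an HTTP response (hypothetical markup).
SAMPLE_HTML = """
<html><body>
  <h1>Books</h1>
  <a href="/books/1">Dune</a>
  <a href="/books/2">Neuromancer</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
for href, text in collector.links:
    print(href, text)
```

Dedicated parsers such as beautifulsoup or selectolax replace the hand-written class here with selector-based lookups, but the overall shape (get markup, locate elements, keep the fields of interest) stays the same.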

Scraping is the kind of programming skill that offers immediate feedback, and can be used to automate a wide variety of data collection and processing tasks.

We will methodically cover everything you need to know to write web scraping agents in python.

This bootcamp is organized in three parts of increasing difficulty, designed to help you progressively build your skills.

Part I – Begin

We’ll start by understanding how the web works by taking a closer look at HTTP, the key application layer communication protocol of the modern web. Next, we’ll explore HTML, CSS, and JavaScript from first principles to get a deeper understanding of how websites are built. Finally, we’ll learn how to use python to send HTTP requests and parse the resulting HTML, CSS, and JavaScript to extract the data we need. Our goal in the first part of the course is to build a solid foundation in both web scraping and python, and put those skills to practice by building functional web scrapers from scratch. Selected topics include:

  • a detailed overview of the request-response cycle
  • understanding user-agents, HTTP verbs, headers and statuses
  • understanding why custom headers can often be used to bypass paywalls
  • mastering the requests library to work with HTTP in python
  • what stateless means and how cookies work
  • exploring the role of proxies in modern web architectures
  • mastering beautifulsoup for parsing and data extraction
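
To make the verbs, headers, and query parameters from the list above concrete, here is a small sketch that builds (but does not send) two requests. It uses the standard library's `urllib` instead of the requests library so that it runs without any installs; the endpoint, parameters, and User-Agent string are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameters, used only for illustration.
BASE_URL = "https://example.com/search"
params = {"q": "web scraping", "page": 2}

# Query parameters are URL-encoded and appended after "?".
url = f"{BASE_URL}?{urlencode(params)}"

# Custom headers travel with every request; User-Agent identifies the client.
headers = {
    "User-Agent": "my-scraper/0.1 (learning project)",
    "Accept": "text/html",
}

# Without a body, urllib uses the GET verb.
get_request = Request(url, headers=headers)

# Supplying a body switches the verb to POST.
post_request = Request(BASE_URL, data=b"name=ada", headers=headers)

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url)
print(get_request.header_items())
```

Nothing goes over the network until a request is passed to `urllib.request.urlopen`. With the requests library, the same GET is usually written as `requests.get(BASE_URL, params=params, headers=headers)`.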

Part II – Refine

In the second part of the course, we’ll build on the foundation we’ve already laid to explore more advanced topics in web scraping. We’ll learn how to scrape dynamic websites that use JavaScript to render their content, by setting up Microsoft Playwright as a headless browser to automate this process. We’ll also learn how to identify and emulate API calls to scrape data from websites that don’t have formally public APIs. Our projects in this section will include an image scraper that can download a set number of high-resolution images given some keyword, as well as another scraping agent that extracts price and content of discounted video games from a dynamically rendered website. Topics include:

  • identifying and using hidden APIs and understanding the benefits they offer
  • emulating headers, cookies, and body content with ease
  • automatically generating python code from intercepted API requests using postman and httpie
  • working with the highly performant selectolax parsing library
  • mastering CSS selectors
  • introducing Microsoft Playwright for headless browsing and dynamic rendering
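
One benefit of hidden APIs is that they usually return structured JSON, so no HTML parsing is needed. The sketch below processes such a response; the payload, its field names, and the `Game` dataclass are all hypothetical and only loosely modeled on the discounted-games project described above.

```python
import json
from dataclasses import dataclass

# Hypothetical JSON body, shaped like what a site's internal API might return.
RAW_RESPONSE = """
{
  "results": [
    {"title": "Star Quest", "price_cents": 5999, "discount_pct": 50},
    {"title": "Pixel Farm", "price_cents": 1999, "discount_pct": 0},
    {"title": "Night Drive", "price_cents": 2999, "discount_pct": 20}
  ],
  "next_page": 2
}
"""


@dataclass
class Game:
    title: str
    price_cents: int
    discount_pct: int

    @property
    def sale_price_cents(self) -> int:
        # Integer arithmetic avoids floating-point rounding surprises.
        return self.price_cents * (100 - self.discount_pct) // 100


payload = json.loads(RAW_RESPONSE)

# Each JSON object maps directly onto the dataclass fields.
games = [Game(**entry) for entry in payload["results"]]
discounted = [game for game in games if game.discount_pct > 0]

for game in discounted:
    print(game.title, game.sale_price_cents)
```

A field like `next_page` in the payload is what typically drives pagination: the scraper keeps requesting pages until the API stops returning one.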

Part III – Master

In the final part of the course, we’ll introduce scrapy. This will give us an excellent, time-tested framework for building more complex and robust web scrapers. We’ll learn how to set up scrapy within a virtual environment and how to create spiders and pipelines to extract data from websites in a variety of formats. Having learned how to use scrapy, we’ll then explore how to integrate it with Playwright so that we can tackle the challenge of scraping dynamic websites from right within scrapy. We’ll conclude this section by building a scraping agent that executes custom JavaScript code before returning the resulting HTML to scrapy. Some topics from this section:

  • set up scrapy and explore its command line interface (“the scrapy tool”)
  • dynamically explore response objects using scrapy shell
  • understand and define item schemas and load data using itemloaders and input/output processors
  • integrate Playwright into scrapy to tackle dynamically rendered JavaScript sites
  • write PageMethods to specify highly specific instructions to the headless browser from right within scrapy
  • define custom pipelines for saving into SQL databases and highly customized output formats
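
As a rough sketch of the last topic, a Scrapy item pipeline is a plain Python class whose methods Scrapy calls by name, so one can be written and exercised without running a crawl. The class name, table layout, and sample items below are hypothetical, and the in-memory database is an assumption chosen to keep the example self-contained.

```python
import sqlite3


class SQLitePipeline:
    """Duck-typed item pipeline: Scrapy looks these methods up by name."""

    def __init__(self, db_path=":memory:"):
        # Hypothetical default; a real project would point at a file on disk.
        self.db_path = db_path
        self.connection = None

    def open_spider(self, spider):
        # Runs once when the spider starts: connect and make sure the table exists.
        self.connection = sqlite3.connect(self.db_path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS countries "
            "(name TEXT PRIMARY KEY, population INTEGER)"
        )

    def process_item(self, item, spider):
        # Runs for every scraped item; OR IGNORE skips names already stored.
        self.connection.execute(
            "INSERT OR IGNORE INTO countries (name, population) VALUES (?, ?)",
            (item["name"], item["population"]),
        )
        self.connection.commit()
        return item  # pass the item on to the next pipeline stage

    def close_spider(self, spider):
        # Runs once when the spider finishes.
        self.connection.close()


# Exercise the pipeline by hand with made-up items (no crawl involved).
pipeline = SQLitePipeline()
pipeline.open_spider(spider=None)
for item in [
    {"name": "Examplia", "population": 100},
    {"name": "Testland", "population": 250},
    {"name": "Examplia", "population": 999},  # duplicate name, ignored
]:
    pipeline.process_item(item, spider=None)

rows = pipeline.connection.execute(
    "SELECT name, population FROM countries ORDER BY name"
).fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

In a Scrapy project, a class like this would live in `pipelines.py` and be switched on through the `ITEM_PIPELINES` setting.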

In this bootcamp, I will take you step-by-step through engaging video lectures and teach you everything you need to know to get started with web scraping in python.

By the end of this course, you will have a complete toolset to conceptualize and implement scraping agents for any website you can imagine.

What you’ll learn

  • Understand the fundamentals of web scraping in python from absolute scratch
  • Scrape information from static and dynamic websites and extract it to a variety of formats
  • Intercept and emulate hidden APIs to find more productive ways of getting your data
  • Master the requests library for working with HTTP
  • Parse and extract content from HTML using beautifulsoup, selectolax, and Microsoft Playwright
  • Master complex CSS selectors, including descendant, child, and sibling combinators
  • Understand how the web works, including HTTP, HTML, CSS, and JavaScript
  • Create scrapy crawlers and practice items, itemloaders and custom pipelines
  • Integrate scrapy with playwright for highly performant, fine-tuned dynamic website crawling
  • Practice processing and extracting data to a variety of formats including csv, json, xml, and SQL

Table of Contents

Introduction
1 Prerequisites
2 A Useful Mental Model
3 All Code Resources

The HTTP Protocol
4 What Is HTTP
5 The Request-Response Cycle
6 Extra: But, This Website Remembers Me
7 User-Agents
8 HTTP Verbs
9 Status Codes
10 Headers
11 Extra: Headers Do Lie
12 Proxies

HTML, CSS, And JavaScript
13 The Ingredients
14 Markup
15 Attributes
16 Presentation
17 Some More Rules
18 Behaviour
19 More JavaScript
20 JavaScript In Web Scraping
21 Comments
22 Embedded

Web Requests In Python
23 Urllib
24 Requests
25 Setting Headers
26 Query Parameters
27 Authentication And Authorization
28 Aside From GET
29 POSTing Data

Parsing And Extraction
30 BeautifulSoup
31 Tags
32 Parents, Children, And Descendants
33 Siblings
34 Extracting Text
35 All Strings
36 Search
37 Challenge
38 Solution
39 Solution Refinement
40 An Extra: pandas
41 Functional Search Patterns
42 Text Search
43 Searching By CSS
44 Just One Tag

Project 1 – Portfolio Valuation With Google Finance
45 Scope Statement
46 An Extra: Some Finance Concepts
47 Parsing Price
48 Non-USD Prices
49 Adding Structure With Dataclasses
50 Position And Portfolio
51 Tabular Display

APIs: The Hidden Gems
52 Befriend The Network Tab
53 Case Study: Coffee Shop Locations
54 The Advantages Of APIs
55 Full Header Emulation
56 An Extra: Postman
57 Code Generation
58 Challenge
59 Solution: Interacting With The API
60 Solution: Processing The Data
61 Solution: Adding Geocode

Selectolax And Advanced CSS Selectors
62 Introduction
63 What Is selectolax
64 CSS Combinators
65 Sibling Combinators
66 Selector Types

Project 2 – Image Scraper
67 Scope Statement
68 Prospecting
69 Scraping HTML
70 Filtering Relevant URLs
71 Extracting High-Res Image URLs
72 Saving The Images
73 Stepping It Up With Logging
74 Back To The API
75 Filtered Canonical URLs
76 Pagination Prospecting
77 Wrapping Up

Tackling JavaScript With Microsoft Playwright
78 What You See vs. What You Get
79 Rendering JavaScript
80 Playwright Over Selenium
81 Case Study: Show Me The Money

Project 3 – Building A Configurable Scraping Pipeline
82 Scope Statement
83 Initial Setup
84 Fully Loaded Site
85 Selecting Game Containers
86 More Robust Render Thresholds
87 Extracting Title And Thumbnail
88 Game Category Tags
89 Release Date And Reviews
90 Original And Discount Price
91 Refactoring
92 Introducing Config
93 Configuration Integrated
94 Parsing Pipeline
95 Parameterized Extraction
96 Functional Post-Processing
97 Date Formatting
98 Regular Expressions
99 Saving To Disk
100 Integrating HTMLParser With The Generic Parser
101 Finishing Touches

The Scrapy Framework
102 Introduction
103 Virtual Environments And Scrapy
104 First Project And Spider
105 Scraping Elements
106 Extracting Specific Attributes
107 An Extra: Scrapy Shell
108 Rewriting Using XPath Selectors
109 Outputting Data
110 Defining Scrapy Items
111 Introducing Itemloaders
112 Fine-Tuned Post-Processing
113 Pipelined Data Validation
114 Saving To Databases
115 Challenge
116 Solution: Defining NoDuplicateCountryPipeline

Boosting Scrapy With scrapy-playwright
117 The JavaScript Wrench In The Works
118 Integrating scrapy-playwright
119 PageMethods
120 Pagination And Infinite Scroll
121 Playwright, Do This
122 Improved Snippet As PageMethod
123 Scraping Location, Department, And Posted Date

Project 4 – Scraping Dynamic Sites With Scrapy And Playwright
124 Scope Statement
125 New Project And Spider
126 Item And Itemloading
127 Pipelining To Database
128 Quick Fix
129 Grouped Elements JSON Export

Closing Thoughts
130 Try To Respect robots.txt
131 Thank You
132 My Other Courses

Appendix – Python Fundamentals
133 A Quick Note + Section Resources
134 Data Types
135 Variables
136 Arithmetic And Augmented Assignment Operators
137 Ints And Floats
138 Booleans And Comparison Operators
139 Strings
140 Methods
141 Containers I – Lists
142 Lists vs. Strings
143 List Methods And Functions
144 Containers II – Tuples
145 Containers III – Sets
146 Containers IV – Dictionaries
147 Dictionary Keys And Values
148 Membership Operators
149 Controlling Flow With if, else, And elif
150 Truth Value Of Non-Booleans
151 For Loops
152 The range() Immutable Sequence
153 While Loops
154 Break And Continue
155 Zipping Iterables
156 List Comprehensions
157 Defining Functions
158 Function Arguments: Positional vs Keyword
159 Lambdas
160 Importing Modules
