Explore scraping with BeautifulSoup!

Last update: Oct 05, 2022

beautifulsoup-scrape

Part One: Start from Shakespeare

My professor is a poet (yes, and he also teaches me data and databases), so he loves to give us assignments related to literature.

The starter project with BeautifulSoup is scraping the first act of William Shakespeare's The Tempest.

My notebook is shakespeare-scrape.ipynb.

The code includes:

cook a soup doc, that is, download the HTML text of a webpage and parse it
search for certain elements like div/p/ul, or certain attributes like class
locate a certain element with .parent or .find_next_sibling()
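
Here is a minimal sketch of those three steps. The HTML string is a made-up stand-in for the downloaded page (the tag layout and the class names `play`, `speaker`, and `line` are assumptions, not the markup of the real site), so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page; the real
# page is structured differently.
html = """
<div class="play">
  <h3>ACT I</h3>
  <p class="speaker">Master</p>
  <p class="line">Boatswain!</p>
  <p class="speaker">Boatswain</p>
  <p class="line">Here, master: what cheer?</p>
  <ul><li>Scene I</li><li>Scene II</li></ul>
</div>
"""

# "Cook" the soup doc with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Search by element name.
first_p = soup.find("p")          # first <p> in the document
all_items = soup.find_all("li")   # every <li>

# Search by attribute (class_ has a trailing underscore because
# "class" is a reserved word in Python).
speakers = soup.find_all("p", class_="speaker")

# Move around the tree from an element you already have.
container = first_p.parent                    # the enclosing <div>
first_line = first_p.find_next_sibling("p")   # the next <p> at the same level

print([s.get_text() for s in speakers])
print(container.name, container["class"])
print(first_line.get_text())
```

With a real page, the `html` string would come from the downloaded response text instead.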

Part Two: Develop with Supreme Court Decisions

In this part, I scrape the 2020 Supreme Court decisions.

The notebook is guardian-and-supreme-court.ipynb.

The code includes:

use a for loop to print each element in a list
find the link hidden in an attribute
save the output in a list of lists, even a three-level nested list
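
The steps above can be sketched roughly like this. The table layout, case names, dates, and paths are all invented for illustration; the real decisions page looks different.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one table row per decision, with the link to
# the opinion tucked into the href attribute of an <a> tag.
html = """
<table>
  <tr><td>2020-06-15</td><td><a href="/opinions/example.pdf">Example v. Sample</a></td></tr>
  <tr><td>2020-06-18</td><td><a href="/opinions/another.pdf">Another v. Case</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

decisions = []
for row in soup.find_all("tr"):
    date_cell = row.find("td")
    link = row.find("a")
    # The visible text is the case name; the URL lives in the attribute.
    name = link.get_text()
    href = link["href"]
    # Three levels deep: list of decisions -> one decision -> [name, href].
    decisions.append([date_cell.get_text(), [name, href]])

# Print each element of the list with a for loop.
for decision in decisions:
    print(decision)
```

Indexing goes level by level, so `decisions[1][1][1]` reaches the link of the second decision.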

Part Three: More practice with The Guardian

The webpage I scrape is the Best Non-Fiction Books of All Time listed by The Guardian.

The notebook is the same one as in Part Two!

You will find a surprise if you get the soup doc of that website. Yes! An advertisement hidden in the HTML!

The code is similar to the last project, but there is more:

list comprehension
list of liiiissssst
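
As a small sketch of both ideas, a list comprehension can build the list of lists and skip the advertisement in one go. The markup, the class names `book` and `ad`, and the placeholder titles are assumptions made up for this example, not what The Guardian actually serves.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: book entries with an advertisement mixed in.
html = """
<ol>
  <li class="book"><a href="/books/1">Book One</a> by Author A</li>
  <li class="ad">Sponsored: subscribe today!</li>
  <li class="book"><a href="/books/2">Book Two</a> by Author B</li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# One [title, link] pair per book. The "if" filter runs before the
# pair is built, so the ad (which has no <a> tag) is never touched.
books = [
    [li.a.get_text(), li.a["href"]]
    for li in soup.find_all("li")
    if "ad" not in li.get("class", [])
]

# A second comprehension pulls one column out of the list of lists.
titles = [title for title, _ in books]

print(books)
print(titles)
```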

Bonus: More Real Shakespeare

In this case, I try to pull out the first 100 lines of Twelfth Night, available here.

The notebook is the same one as in Part Two!

My professor really does love Shakespeare.

I had trouble with this project for a long time because it required each output row to contain:

a code for act.scene.line, along with whether it is a stage direction
the speaker, or the last person who spoke prior to the stage direction
a line or stage direction

I figured it out in a very complex way and I believe there is a better way to do it!
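
One possibly simpler approach is a single pass that remembers the current speaker and line number. This is only a sketch, not the code from the notebook: the markup below (the `data-act` and `data-scene` attributes and the class names `stage`, `speaker`, and `line`) is hypothetical, the text is abridged so the line numbers do not match the real play, and the `.sd` suffix for stage directions is my own assumption about the code format.

```python
from bs4 import BeautifulSoup

# Hypothetical, abridged markup; the real page is structured differently.
html = """
<div id="play">
  <h3 data-act="1" data-scene="1">ACT I, SCENE I</h3>
  <p class="stage">Enter DUKE ORSINO, CURIO, and other Lords</p>
  <p class="speaker">DUKE ORSINO</p>
  <p class="line">If music be the food of love, play on;</p>
  <p class="line">Give me excess of it, that, surfeiting,</p>
  <p class="speaker">CURIO</p>
  <p class="line">Will you go hunt, my lord?</p>
  <p class="stage">Enter VALENTINE</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
act = scene = speaker = None
line_no = 0

# Walk the headings and paragraphs in document order, carrying state along.
for tag in soup.find(id="play").find_all(["h3", "p"]):
    text = tag.get_text(strip=True)
    if tag.name == "h3":
        # A new scene resets the line counter and the remembered speaker.
        act, scene = tag["data-act"], tag["data-scene"]
        line_no, speaker = 0, None
        continue
    kind = tag["class"][0]
    if kind == "speaker":
        speaker = text
    elif kind == "line":
        line_no += 1
        rows.append([f"{act}.{scene}.{line_no}", speaker, text])
    elif kind == "stage":
        # A stage direction reuses the last line number and last speaker
        # (None if nobody has spoken yet in the scene).
        rows.append([f"{act}.{scene}.{line_no}.sd", speaker, text])

for row in rows:
    print(row)
```

To keep only the first 100 rows, slicing the result with `rows[:100]` would be enough.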

Owner

Chuqin
