🥫 The simple, fast, and modern web scraping library

Overview

gazpacho

About

gazpacho is a simple, fast, and modern web scraping library. The library is stable, actively maintained, and installed with zero dependencies.

Install

Install with pip at the command line:

pip install -U gazpacho

Quickstart

Give this a try:

from gazpacho import get, Soup

url = 'https://scrape.world/books'
html = get(url)
soup = Soup(html)
books = soup.find('div', {'class': 'book-'}, partial=True)

def parse(book):
    name = book.find('h4').text
    price = float(book.find('p').text[1:].split(' ')[0])
    return name, price

[parse(book) for book in books]

Tutorial

Import

Import gazpacho following the convention:

from gazpacho import get, Soup

get

Use the get function to download raw HTML:

url = 'https://scrape.world/soup'
html = get(url)
print(html[:50])
# '<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <met'

Adjust get requests with optional params and headers:

get(
    url='https://httpbin.org/anything',
    params={'foo': 'bar', 'bar': 'baz'},
    headers={'User-Agent': 'gazpacho'}
)

Soup

Use the Soup wrapper on raw HTML to enable parsing:

soup = Soup(html)

Soup objects can alternatively be initialized with the .get classmethod:

soup = Soup.get(url)

.find

Use the .find method to target and extract HTML tags:

h1 = soup.find('h1')
print(h1)
# <h1 id="firstHeading" class="firstHeading" lang="en">Soup</h1>

attrs=

Use the attrs argument to isolate tags that contain specific HTML element attributes:

soup.find('div', attrs={'class': 'section-'})

partial=

Element attributes are partially matched by default. Turn this off by setting partial to False:

soup.find('div', {'class': 'soup'}, partial=False)

mode=

Override the mode argument {'auto', 'first', 'all'} to guarantee return behaviour:

print(soup.find('span', mode='first'))
# <span class="navbar-toggler-icon"></span>
len(soup.find('span', mode='all'))
# 8

dir()

Soup objects have html, tag, attrs, and text attributes:

dir(h1)
# ['attrs', 'find', 'get', 'html', 'strip', 'tag', 'text']

Use them accordingly:

print(h1.html)
# '<h1 id="firstHeading" class="firstHeading" lang="en">Soup</h1>'
print(h1.tag)
# h1
print(h1.attrs)
# {'id': 'firstHeading', 'class': 'firstHeading', 'lang': 'en'}
print(h1.text)
# Soup

Support

If you use gazpacho, consider adding the scraper: gazpacho badge to your project README.md:

[![scraper: gazpacho](https://img.shields.io/badge/scraper-gazpacho-C6422C)](https://github.com/maxhumber/gazpacho)

Contribute

For feature requests or bug reports, please use GitHub Issues.

For PRs, please read the CONTRIBUTING.md document.

Comments
  • .text is empty on Soup creation

    Describe the bug

    When I create a soup object...

    To Reproduce

    Calling .text returns an empty string:

    from gazpacho import Soup
    
    html = """<p>&pound;682m</p>"""
    
    soup = Soup(html)
    print(soup.text)
    # ''
    

    Expected behavior

    Should output:

    print(soup.text)
    # '£682m'
    

    Environment:

    • OS: macOS
    • Version: 1.1

    Additional context

    Inspired by this Stack Overflow question

    bug hacktoberfest 
    opened by maxhumber 15
  • API suggestion: soup.all("div") and soup.first("div")

    The default auto behavior of .find() doesn't work for me, because it means I can't trust my code not to start throwing errors if the page I am scraping adds another matching element, or drops the number of elements down to one (triggering a change in return type).

    I know I can do this:

    div = soup.find("div", mode="first")
    # Or this:
    divs = soup.find("div", mode="all")
    

    But having function parameters that change the return type is still a bit weird - not great for code hinting and suchlike.

    Changing how .find() works would be a backwards incompatible change, which isn't good now that you're past the 1.0 release. I suggest adding two new methods instead:

    div = soup.first("div") # Returns a single element
    # Or:
    divs = soup.all("div") # Returns a list of elements
    

    This would be consistent with your existing API design (promoting the mode arguments to first class method names) and could be implemented without breaking existing code.
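
    A minimal sketch of how these could be layered on top of the existing method (the first and all helpers here are hypothetical, not part of gazpacho's current API):

    from gazpacho import Soup

    def first(soup, tag, attrs=None):
        # always returns a single Soup element (or None if nothing matches)
        return soup.find(tag, attrs, mode='first')

    def all_(soup, tag, attrs=None):
        # always returns a list of Soup elements (possibly empty)
        return soup.find(tag, attrs, mode='all')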

    opened by simonw 6
  • A select function similar to BeautifulSoup's

    Is your feature request related to a problem? Please describe.

    It's great to be able to run find and then find within the initial result, but it seems more readable to be able to find based on CSS selectors.

    Describe the solution you'd like

    selector = '.foo img.bar'
    soup.select(selector) # this would return any img item with the class "bar" inside of an object with the class "foo"
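
    A rough sketch of how a limited version might be built on top of find (this select helper is hypothetical and only handles descendant combinators of tag.class tokens; gazpacho's find needs a tag name, so a bare '.foo' isn't handled):

    from gazpacho import Soup

    def select(soup, selector):
        # walk each space-separated token, e.g. 'div.foo img.bar'
        nodes = [soup]
        for token in selector.split():
            tag, _, klass = token.partition('.')
            attrs = {'class': klass} if klass else None
            nodes = [hit for node in nodes
                     for hit in node.find(tag, attrs, mode='all')]
        return nodes

    # select(soup, 'div.foo img.bar') would return every img with class
    # "bar" nested anywhere inside a div with class "foo"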
    
    opened by kjaymiller 5
  • separate find into find and find_one

    Is your feature request related to a problem? Please describe.

    Right now it's hard to reason about the behaviour of the find method: if it finds one element it will return a Soup object; if it finds more than one it will return a list of Soup objects.

    Describe the solution you'd like

    Separate find into a find method and a find_one method.

    Describe alternatives you've considered

    Keep it and YOLO?

    Additional context

    Conversation with Michael Kennedy:

    If I were designing the api, i'd have that always return a List[Node] (or whatever the class is). Then add two methods:

    • find() -> List[Node]
    • find_one() -> Optional[Node]
    • one() -> Node (exception if there are zero, or two or more, nodes)
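
    A hedged sketch of the strictest variant, built on the current mode= behaviour (the one helper is hypothetical):

    def one(soup, tag, attrs=None):
        # return exactly one node; raise if there are zero, or two or more
        matches = soup.find(tag, attrs, mode='all')
        if len(matches) != 1:
            raise ValueError(f'expected exactly 1 <{tag}>, found {len(matches)}')
        return matches[0]
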
    enhancement hacktoberfest 
    opened by maxhumber 5
  • Format/Pretty Print can't handle void tags

    Describe the bug

    Soup can handle and format matched tags no problem:

    from gazpacho import Soup
    html = """<ul><li>Item 1</li><li>Item 2</li></ul>"""
    Soup(html)
    

    Which correctly formats to:

    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
    </ul>
    

    But it can't handle void tags (like img)...

    To Reproduce

    For example, this bit of html:

    html = """<ul><li>Item 1</li><li>Item 2</li></ul><img src="image.png">"""
    Soup(html)
    

    Will fail to format on print:

    <ul><li>Item 1</li><li>Item 2</li></ul><img src="image.png">
    

    Expected behavior

    Ideally Soup formats it as:

    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
    </ul>
    <img src="image.png">
    

    Environment:

    • OS: macOS
    • Version: 1.1

    Additional context

    The problem has to do with the underlying parseString function being unable to handle void tags:

    from xml.dom.minidom import parseString as string_to_dom
    string_to_dom(html)
    

    Possible solution: turn void tags into self-closing tags on input, and then transform them back to void tags on print (see the sketch below).
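
    A sketch of the input half of that idea (the close_void_tags helper is hypothetical; converting back to void tags on print is left out):

    import re
    from xml.dom.minidom import parseString as string_to_dom

    VOID_TAGS = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
                 'link', 'meta', 'param', 'source', 'track', 'wbr']

    def close_void_tags(html):
        # rewrite <img src="..."> as <img src="..."/> so minidom can parse it;
        # the (?<!/) lookbehind skips tags that are already self-closing
        pattern = rf'<({"|".join(VOID_TAGS)})(\b[^>]*?)(?<!/)>'
        return re.sub(pattern, r'<\1\2/>', html)

    html = """<ul><li>Item 1</li><li>Item 2</li></ul><img src="image.png">"""
    dom = string_to_dom(f'<root>{close_void_tags(html)}</root>')  # minidom needs a single root
    print(dom.toprettyxml(indent='  '))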

    help wanted hacktoberfest 
    opened by maxhumber 4
  • Add release versions to GitHub?

    $ git tag v0.7.2 && git push --tags 🎉 🎈

    I really like this project. I think that adding releases to the repository can help the project grow in popularity. I'd like to see that!

    opened by naltun 4
  • User Agent Rotation / Faking

    Is your feature request related to a problem? Please describe.

    It might be nice if gazpacho had the ability to rotate/fake a user agent

    Describe the solution you'd like

    Sort of like this, but more primitive (importantly, gazpacho does not want to take on any dependencies).

    Additional context

    Right now gazpacho just spoofs the latest Firefox User Agent
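
    A dependency-free sketch of what rotation could look like (the agent strings are illustrative, and the rotation itself is the feature being requested):

    import random
    from gazpacho import get

    USER_AGENTS = [
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:82.0) Gecko/20100101 Firefox/82.0',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    ]

    html = get(
        url='https://httpbin.org/anything',
        headers={'User-Agent': random.choice(USER_AGENTS)}
    )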

    enhancement hacktoberfest 
    opened by maxhumber 3
  • Enable strict matching for find

    Describe the bug

    Right now match has the ability to be strict. This functionality is presently not enabled for find.

    To Reproduce

    Code to reproduce the behaviour:

    from gazpacho import Soup, match
    
    match({'foo': 'bar'}, {'foo': 'bar baz'})
    # True
    
    match({'foo': 'bar'}, {'foo': 'bar baz'}, strict=True)
    # False
    

    Expected behavior

    The find method should be forgiving (partial match) to protect ease of use and maintain backwards compatibility, but there should be an argument to enable strict/exact matching that piggybacks on match.

    Environment:

    • OS: macOS
    • Version: 0.7.2
    hacktoberfest 
    opened by maxhumber 3
  • Get all the child elements of a Soup object

    Is your feature request related to a problem? Please describe.

    I would like to try adding a .children() method to the Soup object that can list all of its child elements.

    Describe the solution you'd like

    I would make a regex pattern to match each inner element and return a list of Soup() objects with those elements. I might also try to make recursion optional (see the sketch below).

    Describe alternatives you've considered

    All I can think of is doing the same thing mentioned above in the scraping code itself.

    Additional context None
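
    A minimal sketch of the idea using the standard library's HTMLParser for depth tracking instead of regex (the ChildFinder class is hypothetical, collects tag names rather than Soup objects, and void tags like <img> would need extra handling):

    from html.parser import HTMLParser
    from gazpacho import Soup

    class ChildFinder(HTMLParser):
        # collect the tag names of direct children (depth == 1)
        def __init__(self):
            super().__init__()
            self.depth = 0
            self.children = []

        def handle_starttag(self, tag, attrs):
            if self.depth == 1:
                self.children.append(tag)
            self.depth += 1

        def handle_endtag(self, tag):
            self.depth -= 1

    soup = Soup('<ul><li>Item 1</li><li>Item 2</li></ul>')
    finder = ChildFinder()
    finder.feed(soup.html)
    print(finder.children)
    # ['li', 'li']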

    opened by Vthechamp22 2
  • Improve issue and feature request templates

    Is your feature request related to a problem? Please describe.

    Improve the .github issue and feature request templates.

    Describe the solution you'd like

    I would like a better issue and feature request template in the .github folder. The bolded headings should become proper sections, and the help lines below them should become comments.

    Describe alternatives you've considered None

    Additional context

    Instead of:

    ---
    name: Bug report
    about: Create a report to help gazpacho improve
    title: ''
    labels: ''
    assignees: ''
    ---
    
    **Describe the bug**
    A clear and concise description of what the bug is.
    
    **To Reproduce**
    Code to reproduce the behaviour:
    
    ```python
    
    ```
    
    **Expected behavior**
    A clear and concise description of what you expected to happen.
    
    **Environment:**
     - OS: [macOS, Linux, Windows]
     - Version: [e.g. 0.8.1]
    
    **Additional context**
    Add any other context about the problem here.
    

    It should be something like:

    ---
    name: Bug report
    about: Create a report to help gazpacho improve
    title: ''
    labels: ''
    assignees: ''
    ---
    
    ## Describe the bug
    <!-- A clear and concise description of what the bug is. -->
    
    ## To Reproduce
    <!-- Code to reproduce the behaviour: -->
    
    ```python
    # code
    ```
    
    ## Expected behavior
    <!-- A clear and concise description of what you expected to happen. -->
    
    **Environment:**
     - OS: [macOS, Linux, Windows]
     - Version: [e.g. 0.8.1]
    
    ## Additional context
    <!-- Add any other context about the problem here. Delete this section if not applicable -->
    

    Or something like this

    opened by Vthechamp22 2
  • Needs a Render method (like Requests-Html) to allow pulling text rendered by Javascript...

    Need support for dynamic text rendering...

    Need a method that triggers the JavaScript on a page to fire (see https://github.com/psf/requests-html, r.html.render()).
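
    For reference, the requests-html pattern the issue points to looks roughly like this (requests-html pulls in heavy dependencies, including a headless Chromium, which gazpacho deliberately avoids):

    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get('https://example.com')
    r.html.render()  # downloads Chromium on first run and executes the page's JavaScript
    print(r.html.html[:50])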

    opened by jasonvogel 0
  • Can't parse some HTML entries

    Describe the bug

    Some entries can't be parsed: there should be 40 entries on every page, but some are not parsed correctly.

    Steps to reproduce the issue

    from gazpacho import get, Soup
    
    for i in range(1, 15):
        link = f'https://1337x.to/category-search/aladdin/Movies/{i}/'
        html = get(link)
        soup = Soup(html)
        body = soup.find("tbody")
    
        # extract all the entries in the body;
        # there are 40 entries on every page (the last page may have fewer)
        entries = body.find("tr", mode='all')[::-1]

        # but for some pages it can't retrieve all the entries for some reason
        print(f'{len(entries)} entries -> {link}')
    

    Expected behavior

    See 40 entries for every page

    Environment:

    • OS: Arch Linux 5.13.10-arch1-1
    • Python: 3.9.6
    • Version: gazpacho 1.1

    opened by NicKoehler 0
  • Finding tags returns the entire html

    Describe the bug

    Using soup.find on particular website(s) returns the entire html instead of the matching tag(s).

    Steps to reproduce the issue

    Look for ul tag with attribute class="cves" (<ul class="cves">) on https://mariadb.com/kb/en/security/

    from gazpacho import get, Soup
    endpoint = "https://mariadb.com/kb/en/security/"
    html_dump = Soup.get(endpoint)
    sample = html_dump.find('ul', attrs={'class': 'cves'}, mode='all')
    

    sample contains the contents of the entire html

    Expected behavior

    sample should contain the contents of the tag <ul class="cves">, which in this case would be rows of <li>-s, listing the CVEs and corresponding fixed versions in MariaDB, something like:

    <ul class="cves">
      <li>..</li>
      ...
      <li>..</li>
    </ul>
    

    Environment:

    • OS: Ubuntu Linux 18.04
    • Version: gazpacho 1.1, python 3.6.9

    Additional information

    Using BeautifulSoup on the same html_dump did get the job done, although the <li>-tags are weirdly nested together.

    from bs4 import BeautifulSoup
    # html_dump from above Soup.get(endpoint)
    bs_soup = BeautifulSoup(html_dump.html, 'html.parser')
    ul_cves = bs_soup.find_all('ul','cves')
    

    ul_cves contains strangely nested <li>-s, from which it was still possible to extract the rows of <li>-s I was looking for.

    <ul class="cves">
      <li>
        <li>
        ...
      </li></li>
    </ul>
    
    opened by jz-ang 0
  • Support non-UTF-8 encodings

    Thank you for your nice project!

    Please add an encoding argument to decode pages that are not UTF-8 encoded. https://github.com/maxhumber/gazpacho/blob/ecd53aff4e3d8bdf9eaaea4e0244a75cbabf6259/gazpacho/get.py#L51

    I tried EUC-KR encoded page and got an error message.

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbd in position 95: invalid start byte
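
    Until such an argument exists, one workaround is to fetch the raw bytes with the standard library and decode them explicitly (a sketch; the URL is a placeholder):

    from urllib.request import urlopen
    from gazpacho import Soup

    with urlopen('https://example.com/euc-kr-page') as response:
        html = response.read().decode('euc-kr')

    soup = Soup(html)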
    
    opened by KwangYeol 0
  • attrs method output is changed when using find

    find changes the content of attrs

    When using the find method on a Soup object, the content of attrs is overwritten by the parameter attrs in find.

    Steps to reproduce the issue

    Try the following:

    from gazpacho import Soup
    
    div = Soup("<div id='my_id' />").find("div")
    print(div.attrs)
    div.find("span", {"id": "invalid_id"})
    print(div.attrs)
    

    The expected output would be the following, because we print the attributes of the same div twice:

    {'id': 'my_id'}
    {'id': 'my_id'}
    

    But instead you actually receive:

    {'id': 'my_id'}
    {'id': 'invalid_id'}
    

    which is wrong.

    Environment:

    • OS: Linux
    • Version: 1.1

    My current workaround is to save the attributes before I execute find.
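
    As a sketch, the workaround looks like this (copying attrs before the nested find overwrites it):

    from gazpacho import Soup

    div = Soup("<div id='my_id' />").find("div")
    saved_attrs = dict(div.attrs)  # copy before find mutates it
    div.find("span", {"id": "invalid_id"})
    print(saved_attrs)
    # {'id': 'my_id'}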

    opened by cfrahnow 1
  • Can't install whl files

    Describe the bug

    Hi,

    There was a pull request (https://github.com/maxhumber/gazpacho/pull/48) to add whl publishing but it appears to have been lost somewhere in a merge on October 31st, 2020. (https://github.com/maxhumber/gazpacho/compare/v1.1...master). Therefore, no wheels have been published for 1.1.

    This causes the installation error on my system that the PR was meant to address.

    Expected behavior

    Install gazpacho with a wheel, not a tar.gz. Please re-add the whl publishing.

    Environment:

    • OS: Windows 10
    opened by daddycocoaman 0
Releases (v1.1)
  • v1.1 (Oct 9, 2020)

  • v1.0 (Sep 24, 2020)

    1.0 (2020-09-24)

    • Feature: gazpacho is now fully baked with type hints (thanks for the suggestion @ju-sh!)
    • Feature: Soup.get("url") alternative initializer
    • Fixed: .find is now able to capture malformed void tags (<img /> vs. <img>) (thanks for the Issue @mallegrini!)
    • Renamed: .find(..., strict=) is now find(..., partial=)
    • Renamed: .remove_tags is now .strip
  • v0.9.4 (Jul 7, 2020)

    0.9.4 (2020-07-07)

    • Feature: automagical json-to-dictionary return behaviour for get
    • Improvement: automatic missing URL protocol inference for get
    • Improvement: condensed HTTPError Exceptions
  • v0.9.3 (Apr 29, 2020)

  • v0.9.2 (Apr 21, 2020)

  • v0.9.1 (Feb 16, 2020)

  • v0.9 (Nov 25, 2019)

  • v0.8.1 (Oct 11, 2019)

  • v0.8 (Oct 7, 2019)

    Changelog

    • Added mode argument to the find method to adjust return behaviour (defaults to mode='auto')
    • Enabled strict attribute matching for the find method (defaults to strict=False)