Find thumbnails and original images from URL or HTML file.

Last update: Oct 15, 2022

Haul

Demo

Hauler on Heroku

Installation

on Ubuntu

$ sudo apt-get install build-essential python-dev libxml2-dev libxslt1-dev
$ pip install haul

on Mac OS X

$ pip install haul

Failed to install haul? The most likely cause is lxml, which needs the libxml2 and libxslt development headers to build.

Usage

Find images from img src, a href and even background-image:

import haul

url = 'http://gibuloto.tumblr.com/post/62525699435/fuck-yeah'
result = haul.find_images(url)

print(result.image_urls)
"""
output:
[
    'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_500.png',
    ...
    'http://24.media.tumblr.com/avatar_a3a119b674e2_16.png',
    'http://25.media.tumblr.com/avatar_9b04f54875e1_16.png',
    'http://31.media.tumblr.com/avatar_0acf8f9b4380_16.png',
]
"""

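To make the three sources concrete, here is a minimal standard-library sketch of what collecting URLs from img src, a href and inline background-image can look like. It is not haul's implementation: ImageCollector, collect_image_urls and both regular expressions are hypothetical names introduced for illustration, and it only looks at inline style attributes, not external stylesheets.

```python
import re
from html.parser import HTMLParser

# Assumption: an <a href> counts as an image link only if it ends in a known extension.
IMAGE_EXT = re.compile(r'\.(?:png|jpe?g|gif|webp)(?:\?.*)?$', re.IGNORECASE)
# Matches url(...) inside an inline "background" or "background-image" declaration.
BACKGROUND_URL = re.compile(
    r'background(?:-image)?\s*:[^;]*?url\(\s*[\'"]?([^\'")]+)[\'"]?\s*\)',
    re.IGNORECASE,
)


class ImageCollector(HTMLParser):
    """Hypothetical helper (not part of haul) that gathers candidate image URLs."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <img src="...">
        if tag == 'img' and attrs.get('src'):
            self.image_urls.append(attrs['src'])
        # <a href="..."> that points directly at an image file
        if tag == 'a' and attrs.get('href') and IMAGE_EXT.search(attrs['href']):
            self.image_urls.append(attrs['href'])
        # style="background-image: url(...)" on any tag
        style = attrs.get('style') or ''
        self.image_urls.extend(BACKGROUND_URL.findall(style))


def collect_image_urls(html):
    """Return image URLs in the order their tags appear in the HTML."""
    collector = ImageCollector()
    collector.feed(html)
    return collector.image_urls
```

Calling collect_image_urls() on a page fragment returns a flat list of URLs in document order, which is the same shape as result.image_urls above.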
Find original (or bigger size) images with extend=True:

import haul

url = 'http://gibuloto.tumblr.com/post/62525699435/fuck-yeah'
result = haul.find_images(url, extend=True)

print(result.image_urls)
"""
output:
[
    'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_500.png',
    ...
    'http://24.media.tumblr.com/avatar_a3a119b674e2_16.png',
    'http://25.media.tumblr.com/avatar_9b04f54875e1_16.png',
    'http://31.media.tumblr.com/avatar_0acf8f9b4380_16.png',
    # bigger size, extended from above urls
    'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_1280.png',
    ...
    'http://24.media.tumblr.com/avatar_a3a119b674e2_128.png',
    'http://25.media.tumblr.com/avatar_9b04f54875e1_128.png',
    'http://31.media.tumblr.com/avatar_0acf8f9b4380_128.png',
]
"""

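The sample output suggests that extending amounts to rewriting the size suffix of a known URL pattern (_500 to _1280 for post images, _16 to _128 for avatars). The sketch below illustrates that idea only. It is an assumption drawn from the output above, not haul's actual extender code, and extend_tumblr_url, extend_image_urls and TUMBLR_SIZE are hypothetical names.

```python
import re

# Assumed from the sample output: Tumblr media URLs end in _<size>.<ext>
TUMBLR_SIZE = re.compile(r'_(\d+)\.(png|jpe?g|gif)$', re.IGNORECASE)


def extend_tumblr_url(url, post_size=1280, avatar_size=128):
    """Hypothetical helper: return a larger variant of a Tumblr image URL, or None."""
    if 'media.tumblr.com' not in url:
        return None
    # Avatars and post images use different maximum sizes in the sample output.
    target = avatar_size if '/avatar_' in url else post_size
    bigger = TUMBLR_SIZE.sub(lambda m: '_%d.%s' % (target, m.group(2)), url)
    # No change means the URL was already at the target size or did not match.
    return bigger if bigger != url else None


def extend_image_urls(image_urls):
    """Keep the original URLs and append any larger variants after them."""
    extended = [extend_tumblr_url(u) for u in image_urls]
    return list(image_urls) + [u for u in extended if u]
```

As in the output above, the original URLs come first and the larger variants are appended after them, so callers can still fall back to the thumbnail if the larger file does not exist.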
Advanced Usage

Custom finder / extender pipeline:

from haul import Haul
from haul.compat import str


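def img_data_src_finder(pipeline_index,
                        soup,
                        finder_image_urls=None,
                        *args, **kwargs):
    """
    Find image URL in <img>'s data-src attribute
    """

    # URLs collected by earlier finders in the pipeline (None if there are none yet)
    finder_image_urls = finder_image_urls or []
    now_finder_image_urls = []

    for img in soup.find_all('img'):
        src = img.get('data-src', None)
        if src:
            src = str(src)
            now_finder_image_urls.append(src)

    # Return the accumulated list so the next pipeline step can build on it
    output = {}
    output['finder_image_urls'] = finder_image_urls + now_finder_image_urls

    return output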

MY_FINDER_PIPELINE = (
    'haul.finders.pipeline.html.img_src_finder',
    'haul.finders.pipeline.css.background_image_finder',
    img_data_src_finder,
)

GOOGLE_SITES_EXTENDER_PIPELINE = (
    'haul.extenders.pipeline.google.blogspot_s1600_extender',
    'haul.extenders.pipeline.google.ggpht_s1600_extender',
    'haul.extenders.pipeline.google.googleusercontent_s1600_extender',
)

url = 'http://fashion-fever.nl/dressing-up/'
h = Haul(parser='lxml',
         finder_pipeline=MY_FINDER_PIPELINE,
         extender_pipeline=GOOGLE_SITES_EXTENDER_PIPELINE)
result = h.find_images(url, extend=True)
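
The pipelines above mix dotted-path strings with plain callables, and each step returns a dict that later steps receive as keyword arguments. A runner for that pattern might look roughly like the sketch below. This is a guess at the general technique, not haul's internal code: resolve and run_pipeline are hypothetical names, and the merge-by-dict-update behaviour is an assumption inferred from the finder's signature and return value.

```python
from importlib import import_module


def resolve(step):
    """Turn a dotted path such as 'package.module.func' into a callable."""
    if callable(step):
        return step
    module_path, _, name = step.rpartition('.')
    return getattr(import_module(module_path), name)


def run_pipeline(pipeline, **state):
    """Hypothetical runner: call each step in order and merge the dict it returns."""
    for index, step in enumerate(pipeline):
        # Each step gets its position plus everything produced so far.
        output = resolve(step)(index, **state)
        if output:
            state.update(output)
    return state
```

Under these assumptions, a step such as img_data_src_finder receives soup and the finder_image_urls gathered so far, and whatever it returns under 'finder_image_urls' replaces the previous list for the next step.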

Run Tests

$ python setup.py test

Owner

Vinta Chen

API which uses discord to scrape NameMC searches/droptime/dropping status of minecraft names

A dead simple crawler to get books information from Douban.

Minimal set of tools to conduct stealthy scraping.

Scrape the 42 Intranet's e-learning videos in a single click

Instagram profile scraper with Python

Dailyiptvlist.com Scraper With Python

Half-price flash-sale grabber for Taobao and Tmall: grab TVs, grab Moutai, beat the scalpers

Web Crawlers for Data Labelling of Malicious Domain Detection & IP Reputation Evaluation

Scrapes the Sun Life of Canada Philippines web site for historical prices of their investment funds and then saves them as CSV files.

👁️ Tool for Data Extraction and Web Requests.

This is python to scrape overview and reviews of companies from Glassdoor.

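jd_maotai rpa: a Selenium-driven RPA bot for JD flash-sale purchases

Automatic off-campus check-in for Henan University of Technology's Wanmei Campus (完美校园) app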

Example of scraping a paginated API endpoint and dumping the data into a DB

Danbooru scraper with python

A Python library for automating interaction with websites.

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster

WebScraper - A script that prints out a list of all EXTERNAL references in the HTML response to an HTTP/S request

CreamySoup - a helper script for automated SourceMod plugin updates management.

A Very simple free proxy list scraper.
