Web crawling framework based on asyncio.

Last update: Jan 05, 2023

Overview

Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp.

Requirements

Python 3.5+

Installation

pip install gain

pip install uvloop (Linux only)

Usage

Write spider.py:

from gain import Css, Item, Parser, Spider
import aiofiles  # third-party package for asynchronous file I/O

class Post(Item):
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'])


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    # Raw strings keep regex escapes such as \d from being read as string escapes.
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()
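
Each Parser above takes a regular expression describing the URLs to follow. The sketch below uses only the standard re module to show which links those two patterns would pick out of a page. It assumes the patterns are applied to the page's HTML; it does not reproduce gain's internals. SAMPLE_HTML, its URLs, and find_links are hypothetical names made up for the illustration.

```python
import re

# Patterns copied from the parsers list above, written as raw strings.
PAGE_PATTERN = r'https://blog.scrapinghub.com/page/\d+/'
POST_PATTERN = r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/'

# Hypothetical HTML fragment standing in for a fetched page.
SAMPLE_HTML = '''
<a href="https://blog.scrapinghub.com/page/2/">Older posts</a>
<a href="https://blog.scrapinghub.com/2016/10/27/example-post/">A post</a>
<a href="https://blog.scrapinghub.com/about/">About</a>
'''


def find_links(pattern, html):
    """Return every substring of html that matches pattern, in document order."""
    return re.findall(pattern, html)


page_links = find_links(PAGE_PATTERN, SAMPLE_HTML)
post_links = find_links(POST_PATTERN, SAMPLE_HTML)
print(page_links)
print(post_links)
```

The first pattern only matches pagination links, and the second only matches dated post URLs, which is why the Post item is attached to the second parser. The "about" link matches neither pattern.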

Or use XPathParser:

from gain import Css, Item, Parser, XPathParser, Spider


class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        print(self.title)


class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    parsers = [
               XPathParser('//span[@class="category-name"]/a/@href'),
               XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
               XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post)
              ]
    proxy = 'https://localhost:1234'

MySpider.run()

You can add a proxy setting to the spider, as shown above.
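
Both spiders set concurrency = 5. One common way to cap the number of simultaneous requests in asyncio code is a semaphore. The following is a minimal sketch of that general technique, not gain's actual implementation: fetch, crawl, and the example.com URLs are hypothetical, asyncio.sleep stands in for a real request, and asyncio.run requires Python 3.7+.

```python
import asyncio


async def fetch(url, limiter, state):
    """Pretend to download url while holding one slot of the limiter."""
    async with limiter:
        state['active'] += 1
        state['peak'] = max(state['peak'], state['active'])
        await asyncio.sleep(0.01)  # stands in for a network request
        state['active'] -= 1
    return url


async def crawl(urls, concurrency=5):
    """Run fetch for every URL, with at most `concurrency` running at once."""
    limiter = asyncio.Semaphore(concurrency)
    state = {'active': 0, 'peak': 0}
    results = await asyncio.gather(*(fetch(u, limiter, state) for u in urls))
    return results, state['peak']


# Hypothetical URLs, used only to create work for the sketch.
urls = ['https://example.com/page/%d/' % i for i in range(20)]
results, peak = asyncio.run(crawl(urls, concurrency=5))
print(len(results), peak)
```

All 20 tasks are created up front, but the semaphore lets only five of them past the async with line at any moment; the rest wait until a slot is released.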

Run python spider.py

Example

The examples are in the /example/ directory.

Contribution

Open a pull request.
Open an issue.

Owner

Jiuli Gao
