A high-performance, lightweight, and human-friendly serving engine for Scrapy.


scrapy-x (X)

A distributed, scalable, and lightweight environment for deploying and running Scrapy spiders/projects with no hassle on commodity hardware. It is also compatible with the scrapyd /schedule.json and /daemonstatus.json endpoints.

Installation

$ pip install -U git+https://github.com/speakol-ads/scrapy-x.git

Usage

Let's assume you have a project called TestCrawler:

  • cd into TestCrawler
  • run scrapy x
  • that's all!
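
Once the server is up, you can verify that it is running via the scrapyd-compatible status endpoint. A minimal smoke test, assuming the default host and port from the settings below:

import requests

# /daemonstatus.json is the scrapyd-compatible status endpoint;
# 6800 is the default X_SERVER_LISTEN_PORT (see settings below)
resp = requests.get('http://localhost:6800/daemonstatus.json')
print(resp.json())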

Default Settings

It reads its configuration from your project's default settings.py file; the available settings and their defaults are shown below.

import os

# whether to enable debug mode or not
X_DEBUG = True

# the default queue name that the system will use
# actually it will be used as a prefix for its internal
# queues, currently there is only one queue called `X_QUEUE_NAME + '.BACKLOG'`
# which holds all jobs that should be crawled.
X_QUEUE_NAME = 'SCRAPY_X_QUEUE'

# the queue workers
# by default it uses the cpu cores count
# try to adjust it based on your resources & needs
X_QUEUE_WORKERS_COUNT = os.cpu_count()

# the web server workers count
# the number of workers uvicorn is asked to spawn
# defaults to the available cpu count
# try to adjust it based on your resources & needs
X_SERVER_WORKERS_COUNT = os.cpu_count()

# the port the http server should listen on
X_SERVER_LISTEN_PORT = 6800

# the host used by the http server to listen on
X_SERVER_LISTEN_HOST = '0.0.0.0'

# whether to enable access log or not
X_ENABLE_ACCESS_LOG = True

# redis host
X_REDIS_HOST = 'localhost'

# redis port
X_REDIS_PORT = 6379

# redis db
X_REDIS_DB = 0

# redis password
X_REDIS_PASSWORD = ''

# the maximum allowed wait time (in seconds) for a running task
# the task will be killed after that time.
X_TASK_TIMEOUT = 25
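
Since jobs live in Redis, you can inspect the backlog queue directly. A minimal sketch, assuming the default X_REDIS_* settings above and that the backlog is stored as a plain Redis list (an assumption; the engine's internal encoding is not documented here):

import redis

# connect using the default X_REDIS_* settings
r = redis.Redis(host='localhost', port=6379, db=0)

# 'SCRAPY_X_QUEUE.BACKLOG' follows the X_QUEUE_NAME + '.BACKLOG'
# convention described above; LLEN assumes a plain Redis list
print(r.llen('SCRAPY_X_QUEUE.BACKLOG'))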

Available Endpoints

In addition to the core scrapyd endpoints (/schedule.json and /daemonstatus.json), the following endpoints are available:

GET /

Returns information about the engine, such as the available spiders and the backlog queue length.
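
For example (a sketch; the exact JSON keys are not documented here and may differ):

import requests

info = requests.get('http://localhost:6800/').json()
print(info)  # e.g. available spiders and backlog queue length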

GET|POST /run/{spider_name}

Executes the spider specified by {spider_name} and waits for it to return its result. Note: any query parameter and JSON POST data will be passed to the spider as arguments (-a key=value).
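
For instance, to run a spider synchronously while passing it an argument (quotes and category are hypothetical names used for illustration):

import requests

# both query params and the JSON body are forwarded to the
# spider as `-a key=value` arguments
result = requests.post(
    'http://localhost:6800/run/quotes',
    json={'category': 'books'},
)
print(result.json())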

GET|POST /enqueue/{spider_name}

Adds the spider specified by {spider_name} to the backlog queue to be executed later. Note: any query parameter and JSON POST data will be used as spider arguments.
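
Enqueueing works the same way as /run (e.g. requests.post('http://localhost:6800/enqueue/quotes', json={'category': 'books'})), but returns without waiting for the crawl. On the spider side, arguments passed through either endpoint arrive like standard Scrapy -a arguments, i.e. as constructor keyword arguments. A sketch using the same hypothetical names:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'  # matches {spider_name} in the URL

    def __init__(self, category=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # `category` arrives via ?category=... or the JSON body
        self.start_urls = [f'https://quotes.toscrape.com/tag/{category}/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}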

Author

I'm Mohamed, a software engineer who enjoys writing code in his free time. I speak Python, PHP, Go, Rust, and JS.

P.S.: star the project if you liked it ^_^
