A high-level distributed crawling framework.

Related tags

Web Crawlingcola
Overview

Cola: high-level distributed crawling framework

Overview

Cola is a high-level distributed crawling framework, used to crawl pages and extract structured data from websites. It provides simple and fast yet flexible way to achieve your data acquisition objective. Users only need to write one piece of code which can run under both local and distributed mode.

Requirements

  • Python2.7 (Python3+ will be supported later)
  • Work on Linux, Windows and Mac OSX

Install

The quick way:

pip install cola

Or, download source code, then run:

python setup.py install

Write applications

Documents will update soon, now just refer to the wiki or weibo application.

Run applications

For the wiki or weibo app, please ensure the installation of dependencies, weibo as an example:

pip install -r /path/to/cola/app/weibo/requirements.txt

Local mode

In order to let your application support local mode, just add code to the entrance as below.

from cola.context import Context
ctx = Context(local_mode=True)
ctx.run_job(os.path.dirname(os.path.abspath(__file__)))

Then run the application:

python __init__.py

Stop the local job by CTRL+C.

Distributed mode

Start master:

coca master -s [ip:port]

Start one or more workers:

coca worker -s -m [ip:port]

Then run the application(weibo as an example):

coca job -u /path/to/cola/app/weibo -r

Coca command

Coca is a convenient command-line tool for the whole cola environment.

master

Kill master to stop the whole cluster:

coca master -k

job

List all jobs:

coca job -m [ip:port] -l

Example as:

list jobs at master: 10.211.55.2:11103
====> job id: 8ZcGfAqHmzc, job description: sina weibo crawler, status: stopped

You can run a job which shown in the list above:

coca job -r 8ZcGfAqHmzc

Actually, you don't have to input the complete job name:

coca job -r 8Z

Part of the job name is fine if there's no conflict.

You can know the status of a running job by:

coca job -t 8Z

The status like counters during running and so on will be output to the terminal.

You can kill a job by the kill command:

coca job -k 8Z

startproject

You can create an application by this command:

coca startproject colatest

Remember, help command will always be helpful:

coca -h

or

coca master -h

Notes

Chinese docs(wiki).

Donation

Cola is a non-profit project and by now maintained by myself, thus any donation will be encouragement for the further improvements of cola project.

Alipay & Paypal: [email protected]

You might also like...
Web Scraping Framework

Grab Framework Documentation Installation $ pip install -U grab See details about installing Grab on different platforms here http://docs.grablib.

Async Python 3.6+ web scraping micro-framework based on asyncio
Async Python 3.6+ web scraping micro-framework based on asyncio

Ruia 🕸️ Async Python 3.6+ web scraping micro-framework based on asyncio. ⚡ Write less, run faster. Overview Ruia is an async web scraping micro-frame

Transistor, a Python web scraping framework for intelligent use cases.
Transistor, a Python web scraping framework for intelligent use cases.

Web data collection and storage for intelligent use cases. transistor About The web is full of data. Transistor is a web scraping framework for collec

PyQuery-based scraping micro-framework.

demiurge PyQuery-based scraping micro-framework. Supports Python 2.x and 3.x. Documentation: http://demiurge.readthedocs.org Installing demiurge $ pip

Crawler do site Fundamentus.com com o uso do framework scrapy, tanto da aba detalhada como a de resumo.

Crawler do site Fundamentus.com com o uso do framework scrapy, tanto da aba detalhada como a de resumo. (Todas as infomações)

A simple django-rest-framework api using web scraping

Apicell You can use this api to search in google, bing, pypi and subscene and get results Method : POST Parameter : query Example import request url =

Python framework to scrape Pastebin pastes and analyze them
Python framework to scrape Pastebin pastes and analyze them

pastepwn - Paste-Scraping Python Framework Pastebin is a very helpful tool to store or rather share ascii encoded data online. In the world of OSINT,

This Spider/Bot is developed using Python and based on Scrapy Framework to Fetch some items information from Amazon

- Hello, This Project Contains Amazon Web-bot. - I've developed this bot for fething some items information on Amazon. - Scrapy Framework in Python is

This is a web scraper, using Python framework Scrapy, built to extract data  from the Deals of the Day section on Mercado Livre website.
This is a web scraper, using Python framework Scrapy, built to extract data from the Deals of the Day section on Mercado Livre website.

Deals of the Day This is a web scraper, using the Python framework Scrapy, built to extract data such as price and product name from the Deals of the

Comments
  • docs: Fix a few typos

    docs: Fix a few typos

    There are small typos in:

    • cola/cluster/master.py
    • cola/core/bloomfilter/init.py
    • cola/core/opener.py

    Fixes:

    • Should read experimentally rather than experimently.
    • Should read entries rather than enteries.
    • Should read continuously rather than continously.

    Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

    opened by timgates42 0
  • 任务执行完成后为什么始终不退出

    任务执行完成后为什么始终不退出

    Task类的run方法内有两个循环,最外面循环只有在stop事件出现后才出退出, 为什么?

    def run(self):
            try:
                curr_priority = 0
                while not self.stopped.is_set():
                    priority_name = 'inc' if curr_priority == self.n_priorities \
                                        else curr_priority
                    is_inc = priority_name == 'inc'
                    
                    while not self.nonsuspend.wait(5):
                        continue
                    if self.stopped.is_set():
                        break
                    
                    self.logger.debug('start to process priority: %s' % priority_name)
                    
                    last = self.priorities_secs[curr_priority]
                    clock = Clock()
                    runnings = []
                    try:
                        no_budgets_times = 0
                        while not self.stopped.is_set():
                            if clock.clock() >= last:
                                break
                            
                            if not is_inc:
                                status = self._apply(no_budgets_times)
                                if status == CANNOT_APPLY:
                                    break
                                elif status == APPLY_FAIL:
                                    no_budgets_times += 1
                                    if not self._has_not_finished(curr_priority) and \
                                        len(runnings) == 0:
                                        continue
                                    
                                    if self._has_not_finished(curr_priority) and \
                                        len(runnings) == 0:
                                        self._get_unit(curr_priority, runnings)
                                else:
                                    no_budgets_times = 0
                                    self._get_unit(curr_priority, runnings)
                            else:
                                self._get_unit(curr_priority, runnings)
                                
                            if len(runnings) == 0:
                                break
                            if self.is_bundle:
                                self.logger.debug(
                                    'process bundle from priority %s' % priority_name)
                                rest = min(last - clock.clock(), MAX_BUNDLE_RUNNING_SECONDS)
                                if rest <= 0:
                                    break
                                obj = self.executor.execute(runnings.pop(), rest, is_inc=is_inc)
                            else:
                                obj = self.executor.execute(runnings.pop(), is_inc=is_inc)
                                
                            if obj is not None:
                                runnings.insert(0, obj)  
                    finally:
                        self.priorities_objs[curr_priority].extend(runnings)
                        
                    curr_priority = (curr_priority+1) % self.full_priorities
            finally:
                self.counter_client.sync()
                self.save()
    
    opened by brightgems 5
  • 看了下,和上一个issues的log是一样的,应该是mq没有保护好的问题把

    看了下,和上一个issues的log是一样的,应该是mq没有保护好的问题把

    Exception in thread Thread-2: Traceback (most recent call last): File "/usr/local/lib/python2.7/threading.py", line 551, in *bootstrap_inner self.run() File "/usr/local/lib/python2.7/threading.py", line 504, in run self.__target(_self.__args, _self.__kwargs) File "/usr/crawl/code/cola-code/cola/core/mq/__init.py", line 103, in _init_process self.put(objs, flush=flush) File "/usr/crawl/code/cola-code/cola/core/mq/node.py", line 407, in put self._remote_or_local_batch_put(addr, self.caches[addr]) File "/usr/crawl/code/cola-code/cola/core/mq/node.py", line 348, in _remote_or_local_batch_put self.mq_node.batch_put(objs) File "/usr/crawl/code/cola-code/cola/core/mq/node.py", line 151, in batch_put self.put(obs, force=force, priority=priority) File "/usr/crawl/code/cola-code/cola/core/mq/node.py", line 125, in put priority_store.put(objs, force=force) File "/usr/crawl/code/cola-code/cola/core/mq/store.py", line 291, in put result = self.put_one(obj, force, commit=False) File "/usr/crawl/code/cola-code/cola/core/mq/store.py", line 266, in put_one pos = self._seek_writable_pos(m) File "/usr/crawl/code/cola-code/cola/core/mq/store.py", line 228, in _seek_writable_pos size, = struct.unpack('I', map_handle[pos:pos+4]) TypeError: 'NoneType' object has no attribute 'getitem'

    opened by tottilin 0
Releases(0.1.0beta)
Owner
Xuye (Chris) Qin
Core developer and architect of Mars which is a tensor-based unified framework for large scale data computation, also worked on PyODPS and cola.
Xuye (Chris) Qin
A webdriver-based script for reserving Tsinghua badminton courts.

AutoReserve A webdriver-based script for reserving badminton courts. 使用说明 下载 chromedriver 选择当前Chrome对应版本 安装 selenium pip install selenium 更改场次、金额信息dat

Payne Zhang 4 Nov 09, 2021
Universal Reddit Scraper - A comprehensive Reddit scraping command-line tool written in Python.

Universal Reddit Scraper - A comprehensive Reddit scraping command-line tool written in Python.

Joseph Lai 543 Jan 03, 2023
This script is intended to crawl license information of repositories through the GitHub API.

GithubLicenseCrawler This script is intended to crawl license information of repositories through the GitHub API. Taking a csv file with requirements.

schutera 4 Oct 25, 2022
The first public repository that provides free BUBT website scraping API script on Github.

BUBT WEBSITE SCRAPPING SCRIPT I think this is the first public repository that provides free BUBT website scraping API script on github. When I was do

Md Imam Hossain 3 Feb 10, 2022
This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.

IST Research 1.1k Jan 06, 2023
Crawler in Python 3.7, 3.8. 3.9. Pypy3

Description Python Crawler written Python 3. (Supports major Python releases Python3.6, Python3.7 and Python 3.8) Installation and Use Setup VirtualEn

Vinit Kumar 2 Mar 12, 2022
Web crawling framework based on asyncio.

Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp. Requirements Python3.5+ Installation pip install gain pip install uvloo

Jiuli Gao 2k Jan 05, 2023
用python爬取江苏几大高校的就业网站,并提供3种方式通知给用户,分别是通过微信发送、命令行直接输出、windows气泡通知。

crawler_for_university 用python爬取江苏几大高校的就业网站,并提供3种方式通知给用户,分别是通过微信发送、命令行直接输出、windows气泡通知。 环境依赖 wxpy,requests,bs4等库 功能描述 该项目基于python,通过爬虫爬各高校的就业信息网,爬取招聘信

8 Aug 16, 2021
This tool can be used to extract information from any website

WEB-INFO- This tool can be used to extract information from any website Install Termux and run the command --- $ apt-get update $ apt-get upgrade $ pk

1 Oct 24, 2021
Goblyn is a Python tool focused to enumeration and capture of website files metadata.

Goblyn Metadata Enumeration What's Goblyn? Goblyn is a tool focused to enumeration and capture of website files metadata. How it works? Goblyn will se

Gustavo 46 Nov 22, 2022
A module for CME that spiders hashes across the domain with a given hash.

hash_spider A module for CME that spiders hashes across the domain with a given hash. Installation Simply copy hash_spider.py to your CME module folde

37 Sep 08, 2022
Dude is a very simple framework for writing web scrapers using Python decorators

Dude is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, was to easily build a web scraper in just a few lines of code. Dude has an easy-to-lea

Ronie Martinez 326 Dec 15, 2022
Scrapes proxies and saves them to a text file

Proxy Scraper Scrapes proxies from https://proxyscrape.com and saves them to a file. Also has a customizable theme system Made by nell and Lamp

nell 2 Dec 22, 2021
Twitter Claimer / Swapper / Turbo - Proxyless - Multithreading

Twitter Turbo / Auto Claimer / Swapper Version: 1.0 Last Update: 01/26/2022 Use this at your own descretion. I've only used this on test accounts and

Underscores 6 May 02, 2022
Generate a repository with mirror links for DriveDroid app

DriveDroid Repository Generator Generate a repository for the app that allow boot a PC using ISO files stored on your Android phone Check also an offi

Evgeny 11 Nov 19, 2022
AssistScraper - program for /r/nba to use to find list of all players a player assisted and how many assists each player recieved

AssistScraper - program for /r/nba to use to find list of all players a player assisted and how many assists each player recieved

5 Nov 25, 2021
Dictionary - Application focused on word search through web scraping

Dictionary - Application focused on word search through web scraping, in addition to other functions such as dictation, spell and conjugation of syllables.

Juan Manuel 2 May 09, 2022
Crawler job that scrapes comments from social media posts and saves them in a S3 bucket.

Toxicity comments crawler Crawler job that scrapes comments from social media posts and saves them in a S3 bucket. Twitter Tweets and replies are scra

Douglas Trajano 2 Jan 24, 2022
Use Flask API to wrap Facebook data. Grab the wapper of Facebook public pages without an API key.

Facebook Scraper Use Flask API to wrap Facebook data. Grab the wapper of Facebook public pages without an API key. (Currently working 2021) Setup Befo

Encore Shao 2 Dec 27, 2021
Telegram Group Scrapper

this programe is make your work so much easy on telegrame. do you want to send messages on everyone to your group or others group. use this script it will do your work automatically with one click. a

HackArrOw 3 Dec 03, 2022