A high-level distributed crawling framework.

Related tags

Web Crawlingcola
Overview

Cola: high-level distributed crawling framework

Overview

Cola is a high-level distributed crawling framework, used to crawl pages and extract structured data from websites. It provides simple and fast yet flexible way to achieve your data acquisition objective. Users only need to write one piece of code which can run under both local and distributed mode.

Requirements

  • Python2.7 (Python3+ will be supported later)
  • Work on Linux, Windows and Mac OSX

Install

The quick way:

pip install cola

Or, download source code, then run:

python setup.py install

Write applications

Documents will update soon, now just refer to the wiki or weibo application.

Run applications

For the wiki or weibo app, please ensure the installation of dependencies, weibo as an example:

pip install -r /path/to/cola/app/weibo/requirements.txt

Local mode

In order to let your application support local mode, just add code to the entrance as below.

from cola.context import Context
ctx = Context(local_mode=True)
ctx.run_job(os.path.dirname(os.path.abspath(__file__)))

Then run the application:

python __init__.py

Stop the local job by CTRL+C.

Distributed mode

Start master:

coca master -s [ip:port]

Start one or more workers:

coca worker -s -m [ip:port]

Then run the application(weibo as an example):

coca job -u /path/to/cola/app/weibo -r

Coca command

Coca is a convenient command-line tool for the whole cola environment.

master

Kill master to stop the whole cluster:

coca master -k

job

List all jobs:

coca job -m [ip:port] -l

Example as:

list jobs at master: 10.211.55.2:11103
====> job id: 8ZcGfAqHmzc, job description: sina weibo crawler, status: stopped

You can run a job which shown in the list above:

coca job -r 8ZcGfAqHmzc

Actually, you don't have to input the complete job name:

coca job -r 8Z

Part of the job name is fine if there's no conflict.

You can know the status of a running job by:

coca job -t 8Z

The status like counters during running and so on will be output to the terminal.

You can kill a job by the kill command:

coca job -k 8Z

startproject

You can create an application by this command:

coca startproject colatest

Remember, help command will always be helpful:

coca -h

or

coca master -h

Notes

Chinese docs(wiki).

Donation

Cola is a non-profit project and by now maintained by myself, thus any donation will be encouragement for the further improvements of cola project.

Alipay & Paypal: [email protected]

You might also like...
Web Scraping Framework

Grab Framework Documentation Installation $ pip install -U grab See details about installing Grab on different platforms here http://docs.grablib.

Async Python 3.6+ web scraping micro-framework based on asyncio
Async Python 3.6+ web scraping micro-framework based on asyncio

Ruia 🕸️ Async Python 3.6+ web scraping micro-framework based on asyncio. ⚡ Write less, run faster. Overview Ruia is an async web scraping micro-frame

Transistor, a Python web scraping framework for intelligent use cases.
Transistor, a Python web scraping framework for intelligent use cases.

Web data collection and storage for intelligent use cases. transistor About The web is full of data. Transistor is a web scraping framework for collec

PyQuery-based scraping micro-framework.

demiurge PyQuery-based scraping micro-framework. Supports Python 2.x and 3.x. Documentation: http://demiurge.readthedocs.org Installing demiurge $ pip

Crawler do site Fundamentus.com com o uso do framework scrapy, tanto da aba detalhada como a de resumo.

Crawler do site Fundamentus.com com o uso do framework scrapy, tanto da aba detalhada como a de resumo. (Todas as infomações)

A simple django-rest-framework api using web scraping

Apicell You can use this api to search in google, bing, pypi and subscene and get results Method : POST Parameter : query Example import request url =

Python framework to scrape Pastebin pastes and analyze them
Python framework to scrape Pastebin pastes and analyze them

pastepwn - Paste-Scraping Python Framework Pastebin is a very helpful tool to store or rather share ascii encoded data online. In the world of OSINT,

This Spider/Bot is developed using Python and based on Scrapy Framework to Fetch some items information from Amazon

- Hello, This Project Contains Amazon Web-bot. - I've developed this bot for fething some items information on Amazon. - Scrapy Framework in Python is

This is a web scraper, using Python framework Scrapy, built to extract data  from the Deals of the Day section on Mercado Livre website.
This is a web scraper, using Python framework Scrapy, built to extract data from the Deals of the Day section on Mercado Livre website.

Deals of the Day This is a web scraper, using the Python framework Scrapy, built to extract data such as price and product name from the Deals of the

Comments
  • docs: Fix a few typos

    docs: Fix a few typos

    There are small typos in:

    • cola/cluster/master.py
    • cola/core/bloomfilter/init.py
    • cola/core/opener.py

    Fixes:

    • Should read experimentally rather than experimently.
    • Should read entries rather than enteries.
    • Should read continuously rather than continously.

    Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

    opened by timgates42 0
  • 任务执行完成后为什么始终不退出

    任务执行完成后为什么始终不退出

    Task类的run方法内有两个循环,最外面循环只有在stop事件出现后才出退出, 为什么?

    def run(self):
            try:
                curr_priority = 0
                while not self.stopped.is_set():
                    priority_name = 'inc' if curr_priority == self.n_priorities \
                                        else curr_priority
                    is_inc = priority_name == 'inc'
                    
                    while not self.nonsuspend.wait(5):
                        continue
                    if self.stopped.is_set():
                        break
                    
                    self.logger.debug('start to process priority: %s' % priority_name)
                    
                    last = self.priorities_secs[curr_priority]
                    clock = Clock()
                    runnings = []
                    try:
                        no_budgets_times = 0
                        while not self.stopped.is_set():
                            if clock.clock() >= last:
                                break
                            
                            if not is_inc:
                                status = self._apply(no_budgets_times)
                                if status == CANNOT_APPLY:
                                    break
                                elif status == APPLY_FAIL:
                                    no_budgets_times += 1
                                    if not self._has_not_finished(curr_priority) and \
                                        len(runnings) == 0:
                                        continue
                                    
                                    if self._has_not_finished(curr_priority) and \
                                        len(runnings) == 0:
                                        self._get_unit(curr_priority, runnings)
                                else:
                                    no_budgets_times = 0
                                    self._get_unit(curr_priority, runnings)
                            else:
                                self._get_unit(curr_priority, runnings)
                                
                            if len(runnings) == 0:
                                break
                            if self.is_bundle:
                                self.logger.debug(
                                    'process bundle from priority %s' % priority_name)
                                rest = min(last - clock.clock(), MAX_BUNDLE_RUNNING_SECONDS)
                                if rest <= 0:
                                    break
                                obj = self.executor.execute(runnings.pop(), rest, is_inc=is_inc)
                            else:
                                obj = self.executor.execute(runnings.pop(), is_inc=is_inc)
                                
                            if obj is not None:
                                runnings.insert(0, obj)  
                    finally:
                        self.priorities_objs[curr_priority].extend(runnings)
                        
                    curr_priority = (curr_priority+1) % self.full_priorities
            finally:
                self.counter_client.sync()
                self.save()
    
    opened by brightgems 5
  • 看了下,和上一个issues的log是一样的,应该是mq没有保护好的问题把

    看了下,和上一个issues的log是一样的,应该是mq没有保护好的问题把

    Exception in thread Thread-2: Traceback (most recent call last): File "/usr/local/lib/python2.7/threading.py", line 551, in *bootstrap_inner self.run() File "/usr/local/lib/python2.7/threading.py", line 504, in run self.__target(_self.__args, _self.__kwargs) File "/usr/crawl/code/cola-code/cola/core/mq/__init.py", line 103, in _init_process self.put(objs, flush=flush) File "/usr/crawl/code/cola-code/cola/core/mq/node.py", line 407, in put self._remote_or_local_batch_put(addr, self.caches[addr]) File "/usr/crawl/code/cola-code/cola/core/mq/node.py", line 348, in _remote_or_local_batch_put self.mq_node.batch_put(objs) File "/usr/crawl/code/cola-code/cola/core/mq/node.py", line 151, in batch_put self.put(obs, force=force, priority=priority) File "/usr/crawl/code/cola-code/cola/core/mq/node.py", line 125, in put priority_store.put(objs, force=force) File "/usr/crawl/code/cola-code/cola/core/mq/store.py", line 291, in put result = self.put_one(obj, force, commit=False) File "/usr/crawl/code/cola-code/cola/core/mq/store.py", line 266, in put_one pos = self._seek_writable_pos(m) File "/usr/crawl/code/cola-code/cola/core/mq/store.py", line 228, in _seek_writable_pos size, = struct.unpack('I', map_handle[pos:pos+4]) TypeError: 'NoneType' object has no attribute 'getitem'

    opened by tottilin 0
Releases(0.1.0beta)
Owner
Xuye (Chris) Qin
Core developer and architect of Mars which is a tensor-based unified framework for large scale data computation, also worked on PyODPS and cola.
Xuye (Chris) Qin
A simple flask application to scrape gogoanime website.

gogoanime-api-flask A simple flask application to scrape gogoanime website. Used for demo and learning purposes only. How to use the API The base api

1 Oct 29, 2021
Rottentomatoes, Goodreads and IMDB sites crawler. Semantic Web final project.

Crawler Rottentomatoes, Goodreads and IMDB sites crawler. Crawler written by beautifulsoup, selenium and lxml to gather books and films information an

Faeze Ghorbanpour 1 Dec 30, 2021
A package designed to scrape data from Yahoo Finance.

yahoostock A package designed to scrape data from Yahoo Finance. Installation The most simple installation method is through PIP. pip install yahoosto

Rohan Singh 2 May 28, 2022
An experiment to deploy a serverless infrastructure for a scrapy project.

Serverless Scrapy project This project aims to evaluate the feasibility of an architecture based on serverless technology for a web crawler using scra

José Ferraz Neto 5 Jul 08, 2022
A Python Covid-19 cases tracker that scrapes data off the web and presents the number of Cases, Recovered Cases, and Deaths that occurred because of the pandemic.

A Python Covid-19 cases tracker that scrapes data off the web and presents the number of Cases, Recovered Cases, and Deaths that occurred because of the pandemic.

Alex Papadopoulos 1 Nov 13, 2021
薅薅乐 - JD 测试脚本

薅薅乐 安裝 使用docker docker一键安装: docker run -d --name jd classmatelin/hhl:latest. 使用 进入容器: docker exec -it jd bash 获取JD_COOKIES: python get_jd_cookies.py,

ClassmateLin 575 Dec 28, 2022
A python module to parse the Open Graph Protocol

OpenGraph is a module of python for parsing the Open Graph Protocol, you can read more about the specification at http://ogp.me/ Installation $ pip in

Erik Rivera 213 Nov 12, 2022
Divar.ir Ads scrapper

Divar.ir Ads Scrapper Introduction This project first asynchronously grab Divar.ir Ads and then save to .csv and .xlsx files named data.csv and data.x

Iman Kermani 4 Aug 29, 2022
京东茅台抢购最新优化版本,京东秒杀,添加误差时间调整,优化了茅台抢购进程队列

京东茅台抢购最新优化版本,京东秒杀,添加误差时间调整,优化了茅台抢购进程队列

776 Jul 28, 2021
This script is intended to crawl license information of repositories through the GitHub API.

GithubLicenseCrawler This script is intended to crawl license information of repositories through the GitHub API. Taking a csv file with requirements.

schutera 4 Oct 25, 2022
A simple Discord scraper for discord bots

A simple Discord scraper for discord bots. That includes sending an guild members ids to an file, Mass inviter for joining servers your bot is in and Fetching all the servers of the bot (w/MemberCoun

3zg 1 Jan 06, 2022
Collection of code files to scrap different kinds of websites.

STW-Collection Scrap The Web Collection; blog posts. This repo contains Scrapy sample code to scrap the following kind of websites: Do you want to lea

Tapasweni Pathak 15 Jun 08, 2022
Binance harvester - A Python 3 script to harvest data from the Binance socket stream and calculate popular TA indicators and produce lists of top trending coins

Binance harvester - A Python 3 script to harvest data from the Binance socket stream and calculate popular TA indicators and produce lists of top trending coins

68 Oct 08, 2022
A scalable frontier for web crawlers

Frontera Overview Frontera is a web crawling framework consisting of crawl frontier, and distribution/scaling primitives, allowing to build a large sc

Scrapinghub 1.2k Jan 02, 2023
A Scrapper with python

Scrapper-en-python Scrapper des données signifie récuperer des données pour les traiter ou les analyser. En python, il y'a 2 grands moyens de scrapper

Lun4rIum 1 Dec 05, 2021
Library to scrape and clean web pages to create massive datasets.

lazynlp A straightforward library that allows you to crawl, clean up, and deduplicate webpages to create massive monolingual datasets. Using this libr

Chip Huyen 2.1k Jan 06, 2023
热搜榜-python爬虫+正则re+beautifulsoup+xpath

仓库简介 微博热搜榜, 参数wb 百度热搜榜, 参数bd 360热点榜, 参数360 csdn热榜接口, 下方查看 其他热搜待加入 如何使用? 注册vercel fork到你的仓库, 右上角 点击这里完成部署(一键部署) 请求参数 vercel配置好的地址+api?tit=+参数(仓库简介有参数信息

Harry 3 Jul 08, 2022
A scrapy pipeline that provides an easy way to store files and images using various folder structures.

scrapy-folder-tree This is a scrapy pipeline that provides an easy way to store files and images using various folder structures. Supported folder str

Panagiotis Simakis 7 Oct 23, 2022
Deep Web Miner Python | Spyder Crawler

Webcrawler written in Python. This crawler does dig in till the 3 level of inside addressed and mine the respective data accordingly

Karan Arora 17 Jan 24, 2022
Danbooru scraper with python

Danbooru Version: 0.0.1 License under: MIT License Dependencies Python: = 3.9.7 beautifulsoup4 cloudscraper Example of use Danbooru from danbooru imp

Sugarbell 2 Oct 27, 2022