A high-level distributed crawling framework.

Cola: high-level distributed crawling framework

Overview

Cola is a high-level distributed crawling framework, used to crawl pages and extract structured data from websites. It provides a simple, fast, yet flexible way to meet your data acquisition goals. You only need to write one piece of code, which runs in both local and distributed mode.

Requirements

  • Python 2.7 (Python 3+ support is planned)
  • Works on Linux, Windows and Mac OS X

Install

The quick way:

pip install cola

Or download the source code, then run:

python setup.py install

Write applications

The documentation will be updated soon; for now, refer to the wiki or weibo application.

Run applications

For the wiki or weibo app, make sure its dependencies are installed. Taking weibo as an example:

pip install -r /path/to/cola/app/weibo/requirements.txt

Local mode

To make your application support local mode, add the following code to its entry point.

import os
from cola.context import Context

# Run the job that lives in this file's directory in local mode.
ctx = Context(local_mode=True)
ctx.run_job(os.path.dirname(os.path.abspath(__file__)))

Then run the application:

python __init__.py

Stop the local job with CTRL+C.

Distributed mode

Start master:

coca master -s [ip:port]

Start one or more workers:

coca worker -s -m [ip:port]

Then run the application (weibo as an example):

coca job -u /path/to/cola/app/weibo -r
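
If you want to script these steps, one possible starting point is to assemble the three command lines in Python first. The sketch below only builds and prints argument lists that mirror the commands above; the helper name `cluster_commands` and the example address are assumptions made for illustration, not part of cola, and nothing is executed.

```python
def cluster_commands(master_addr, app_dir):
    """Return the coca invocations from the steps above as argv lists.

    Hypothetical helper, not part of cola: it only mirrors the commands
    documented in this README.
    """
    return [
        ["coca", "master", "-s", master_addr],        # start the master
        ["coca", "worker", "-s", "-m", master_addr],  # start one worker
        ["coca", "job", "-u", app_dir, "-r"],         # run the application
    ]


# Example address and path; replace them with your own values.
commands = cluster_commands("10.211.55.2:11103", "/path/to/cola/app/weibo")
for argv in commands:
    print(" ".join(argv))
```

Each list can be handed to `subprocess.Popen` on the machine where that process should run; keeping the arguments as a list avoids shell quoting problems with paths.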

Coca command

Coca is a convenient command-line tool for the whole cola environment.

master

Kill master to stop the whole cluster:

coca master -k

job

List all jobs:

coca job -m [ip:port] -l

Example output:

list jobs at master: 10.211.55.2:11103
====> job id: 8ZcGfAqHmzc, job description: sina weibo crawler, status: stopped
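
If you need the job ids in a script, the listing can be parsed with a regular expression. This is a minimal sketch that assumes every job line has exactly the shape of the example above; the function name `parse_job_list` is hypothetical and the real output may contain other line formats.

```python
import re

# Pattern for one job line, modelled on the example listing above.
JOB_LINE = re.compile(
    r"^====> job id: (?P<id>[^,\s]+), "
    r"job description: (?P<description>.*), "
    r"status: (?P<status>\S+)$"
)


def parse_job_list(output):
    """Return one dict per job line; lines that do not match are skipped."""
    jobs = []
    for line in output.splitlines():
        match = JOB_LINE.match(line.strip())
        if match:
            jobs.append(match.groupdict())
    return jobs


sample = (
    "list jobs at master: 10.211.55.2:11103\n"
    "====> job id: 8ZcGfAqHmzc, job description: sina weibo crawler, "
    "status: stopped"
)
jobs = parse_job_list(sample)
print(jobs)
```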

You can run any job shown in the list above:

coca job -r 8ZcGfAqHmzc

You don't have to type the complete job id:

coca job -r 8Z

Part of the job id is enough as long as it does not conflict with another job.
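
To illustrate what "no conflict" means, here is a small sketch of unambiguous matching. It assumes the partial id is matched as a prefix, which the text above does not state, and `resolve_job_id` is a hypothetical function, not cola's own lookup code.

```python
def resolve_job_id(partial, job_ids):
    """Return the single id that starts with `partial`, else raise ValueError."""
    matches = [job_id for job_id in job_ids if job_id.startswith(partial)]
    if not matches:
        raise ValueError("no job matches %r" % partial)
    if len(matches) > 1:
        # More than one candidate: the partial id conflicts.
        raise ValueError(
            "%r is ambiguous: %s" % (partial, ", ".join(sorted(matches)))
        )
    return matches[0]


# The second id is made up for the example.
known = ["8ZcGfAqHmzc", "8aXk2PqLw0d"]
print(resolve_job_id("8Z", known))
```

With these two ids, `8Z` resolves to one job, while `8` alone would raise an error because both ids start with it.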

You can check the status of a running job with:

coca job -t 8Z

Status information, such as the counters collected during the run, is printed to the terminal.

You can kill a job with the kill command:

coca job -k 8Z

startproject

You can create an application with this command:

coca startproject colatest

Remember, the help command is always available:

coca -h

or

coca master -h

Notes

Chinese docs (wiki).

Donation

Cola is a non-profit project, currently maintained by me alone, so any donation is an encouragement to keep improving it.

Alipay & Paypal: [email protected]


Comments
  • docs: Fix a few typos

    There are small typos in:

    • cola/cluster/master.py
    • cola/core/bloomfilter/__init__.py
    • cola/core/opener.py

    Fixes:

    • Should read experimentally rather than experimently.
    • Should read entries rather than enteries.
    • Should read continuously rather than continously.

    Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

    opened by timgates42 0
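  • Why does the job never exit after the task has finished?

    The run method of the Task class contains two loops, and the outermost loop exits only after the stop event is set. Why?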

    def run(self):
            try:
                curr_priority = 0
                while not self.stopped.is_set():
                    priority_name = 'inc' if curr_priority == self.n_priorities \
                                        else curr_priority
                    is_inc = priority_name == 'inc'
                    
                    while not self.nonsuspend.wait(5):
                        continue
                    if self.stopped.is_set():
                        break
                    
                    self.logger.debug('start to process priority: %s' % priority_name)
                    
                    last = self.priorities_secs[curr_priority]
                    clock = Clock()
                    runnings = []
                    try:
                        no_budgets_times = 0
                        while not self.stopped.is_set():
                            if clock.clock() >= last:
                                break
                            
                            if not is_inc:
                                status = self._apply(no_budgets_times)
                                if status == CANNOT_APPLY:
                                    break
                                elif status == APPLY_FAIL:
                                    no_budgets_times += 1
                                    if not self._has_not_finished(curr_priority) and \
                                        len(runnings) == 0:
                                        continue
                                    
                                    if self._has_not_finished(curr_priority) and \
                                        len(runnings) == 0:
                                        self._get_unit(curr_priority, runnings)
                                else:
                                    no_budgets_times = 0
                                    self._get_unit(curr_priority, runnings)
                            else:
                                self._get_unit(curr_priority, runnings)
                                
                            if len(runnings) == 0:
                                break
                            if self.is_bundle:
                                self.logger.debug(
                                    'process bundle from priority %s' % priority_name)
                                rest = min(last - clock.clock(), MAX_BUNDLE_RUNNING_SECONDS)
                                if rest <= 0:
                                    break
                                obj = self.executor.execute(runnings.pop(), rest, is_inc=is_inc)
                            else:
                                obj = self.executor.execute(runnings.pop(), is_inc=is_inc)
                                
                            if obj is not None:
                                runnings.insert(0, obj)  
                    finally:
                        self.priorities_objs[curr_priority].extend(runnings)
                        
                    curr_priority = (curr_priority+1) % self.full_priorities
            finally:
                self.counter_client.sync()
                self.save()
    
    opened by brightgems 5
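  • The log is the same as in the previous issue; probably the mq is not properly protected

    I took a look, and the log is the same as in the previous issue. The problem is probably that the mq is not properly protected.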

    Exception in thread Thread-2:
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/threading.py", line 551, in __bootstrap_inner
        self.run()
      File "/usr/local/lib/python2.7/threading.py", line 504, in run
        self.__target(*self.__args, **self.__kwargs)
      File "/usr/crawl/code/cola-code/cola/core/mq/__init__.py", line 103, in _init_process
        self.put(objs, flush=flush)
      File "/usr/crawl/code/cola-code/cola/core/mq/node.py", line 407, in put
        self._remote_or_local_batch_put(addr, self.caches[addr])
      File "/usr/crawl/code/cola-code/cola/core/mq/node.py", line 348, in _remote_or_local_batch_put
        self.mq_node.batch_put(objs)
      File "/usr/crawl/code/cola-code/cola/core/mq/node.py", line 151, in batch_put
        self.put(obs, force=force, priority=priority)
      File "/usr/crawl/code/cola-code/cola/core/mq/node.py", line 125, in put
        priority_store.put(objs, force=force)
      File "/usr/crawl/code/cola-code/cola/core/mq/store.py", line 291, in put
        result = self.put_one(obj, force, commit=False)
      File "/usr/crawl/code/cola-code/cola/core/mq/store.py", line 266, in put_one
        pos = self._seek_writable_pos(m)
      File "/usr/crawl/code/cola-code/cola/core/mq/store.py", line 228, in _seek_writable_pos
        size, = struct.unpack('I', map_handle[pos:pos+4])
    TypeError: 'NoneType' object has no attribute '__getitem__'

    opened by tottilin 0
Releases(0.1.0beta)
Owner
Xuye (Chris) Qin
Core developer and architect of Mars which is a tensor-based unified framework for large scale data computation, also worked on PyODPS and cola.