PyQuery-based scraping micro-framework.

Last update: Jul 20, 2022

Related tags

Web Crawling demiurge

Overview

demiurge

PyQuery-based scraping micro-framework. Supports Python 2.x and 3.x.

Documentation: http://demiurge.readthedocs.org

Installing demiurge

$ pip install demiurge

Quick start

Define items to be scraped using a declarative (Django-inspired) syntax:

import demiurge

class TorrentDetails(demiurge.Item):
    label = demiurge.TextField(selector='strong')
    value = demiurge.TextField()

    def clean_value(self, value):
        unlabel = value[value.find(':') + 1:]
        return unlabel.strip()

    class Meta:
        selector = 'div#specifications p'

class Torrent(demiurge.Item):
    url = demiurge.AttributeValueField(
        selector='td:eq(2) a:eq(1)', attr='href')
    name = demiurge.TextField(selector='td:eq(2) a:eq(2)')
    size = demiurge.TextField(selector='td:eq(3)')
    details = demiurge.RelatedItem(
        TorrentDetails, selector='td:eq(2) a:eq(2)', attr='href')

    class Meta:
        selector = 'table.maintable:gt(0) tr:gt(0)'
        base_url = 'http://www.mininova.org'


>>> t = Torrent.one('/search/ubuntu/seeds')
>>> t.name
'Ubuntu 7.10 Desktop Live CD'
>>> t.size
u'695.81\xa0MB'
>>> t.url
'/get/1053846'
>>> t.html
u'<td>19\xa0Dec\xa007</td><td><a href="/cat/7">Software</a></td><td>...'

>>> results = Torrent.all('/search/ubuntu/seeds')
>>> len(results)
116
>>> for t in results[:3]:
...     print t.name, t.size
...
Ubuntu 7.10 Desktop Live CD 695.81 MB
Super Ubuntu 2008.09 - VMware image 871.95 MB
Portable Ubuntu 9.10 for Windows 559.78 MB
...

>>> t = Torrent.one('/search/ubuntu/seeds')
>>> for detail in t.details:
...     print detail.label, detail.value
... 
Category: Software > GNU/Linux
Total size: 695.81 megabyte
Added: 2467 days ago by Distribution
Share ratio: 17 seeds, 2 leechers
Last updated: 35 minutes ago
Downloads: 29,085

See documentation for details: http://demiurge.readthedocs.org

Why demiurge?

Plato, as the speaker Timaeus, refers to the Demiurge frequently in the Socratic dialogue Timaeus, c. 360 BC. The main character refers to the Demiurge as the entity who "fashioned and shaped" the material world. Timaeus describes the Demiurge as unreservedly benevolent, and hence desirous of a world as good as possible. The world remains imperfect, however, because the Demiurge created the world out of a chaotic, indeterminate non-being.

http://en.wikipedia.org/wiki/Demiurge

Contributors

Martín Gaitán (@mgaitan)

Comments

Reausable cleaning functions
You can now add a "clean" kwarg containing a function to a field.

This makes it easy to use quick filtering (I want this data to be an int) and to re-use functions such as parsedatetime.

score = demiurge.TextField(selector=".score .upvoted", clean=int)
opened by traverseda 5
proof of concept: subitem field

short rationale: Sometimes I need to scrap a page to retrieve the actual links where the items are. I would like a way to nest Item classes, analog (in some way) to a in ForeignKey / ManyToManyField in Django.

This is a first PR as a proof of concept, to discuss the idea and its API.

opened by mgaitan 5
RelatedItems only work across urls

An obvious use of RelatedItems (or a similar construct) is recursively mapping a comment tree. Right now there's no elegant way to do that.

An example

http://pastebin.com/WDL4RjkE

Reading through the actual code, I think I might be wrong about this. I'll try and make the docs clearer.

opened by traverseda 2
Use lib "requests" for downloading

I'm right now making use of https://pypi.python.org/pypi/requests-cache which creates a cache of the downloaded stuff magically, and it's awesome. So, I would like to be able to take advantage of it using demiurge.

I don't know if just as an option or as a replacement of pyquery downloader.

What do you think?

opened by jmansilla 2
docs: fix simple typo, ocurrence -> occurrence

There is a small typo in docs/index.rst.

Should read occurrence rather than ocurrence.

Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

opened by timgates42 1
Fix when no selector defined
the default selector is the whole page ('html') but this is applied through PyQuery.find wich traverses down. example:

In [2]: PyQuery('<html>hello</html>').find('html') Out[2]: [] In [3]: PyQuery('<html>hello</html>')('html') Out[3]: [<html>]
opened by mgaitan 1
support self reference in RelatedItem

RelatedItem('self'). Also, the relateditem's item class could be given by its name (i.eRelatedItem("ItemClass")` ) A typical use case is a listing page with a "next page" link.

opened by mgaitan 0

Releases(v0.2)

v0.2(Sep 20, 2014)
Added docs.

Added RelatedItem.

Added field clean support.

Source code(tar.gz)
Source code(zip)

Owner

Matias Bordese

GitHub Repository http://demiurge.readthedocs.org

Haphazard scripts for scraping bitcoin/bitcoin data from GitHub

This is a quick-and-dirty tool used to scrape bitcoin/bitcoin pull request and commentary data. Each output/pr number folder contains comments.json:

8 Oct 12, 2022

Searching info from Google using Python Scrapy

Python-Search-Engine-Scrapy || Python-爬虫-索引/利用爬虫获取谷歌信息**/ Searching info from Google using Python Scrapy /* 利用 PYTHON 爬虫获取天气信息，以及城市信息和资料**/ translatio

1 Jan 06, 2022

Instagram profile scrapper with python

IG Profile Scrapper Instagram profile Scrapper Just type the username, and boo! :D Instalation clone this repo to your computer git clone https://gith

6 Nov 07, 2022

抖音批量下载用户所有无水印视频

Douyincrawler 抖音批量下载用户所有无水印视频 Run 安装python3，安装依赖

28 Dec 08, 2022

Library to scrape and clean web pages to create massive datasets.

lazynlp A straightforward library that allows you to crawl, clean up, and deduplicate webpages to create massive monolingual datasets. Using this libr

2.1k Jan 06, 2023

Simple tool to scrape and download cross country ski timings and results from live.skidor.com

LiveSkidorDownload Simple tool to scrape and download cross country ski timings and results from live.skidor.com Usage: Put the python file in a dedic

0 Jan 07, 2022

Bigdata - This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster

Scrapy Cluster This Scrapy project uses Redis and Kafka to create a distributed

0 Jan 06, 2022

A Happy and lightweight Python Package that searches Google News RSS Feed and returns a usable JSON response and scrap complete article - No need to write scrappers for articles fetching anymore

GNews 🚩 A Happy and lightweight Python Package that searches Google News RSS Feed and returns a usable JSON response 🚩 As well as you can fetch full

273 Dec 31, 2022

Dex-scrapper - Hobby project for scrapping dex data on VeChain

Folders /zumo_abis # abi extracted from zumo repo /zumo_pools # runtime e

3 Jan 20, 2022

Collection of code files to scrap different kinds of websites.

STW-Collection Scrap The Web Collection; blog posts. This repo contains Scrapy sample code to scrap the following kind of websites: Do you want to lea

15 Jun 08, 2022

Web-scraping - Program that scrapes a website for a collection of quotes, picks one at random and displays it

web-scraping Program that scrapes a website for a collection of quotes, picks on

1 Jan 07, 2022

Divar.ir Ads scrapper

Divar.ir Ads Scrapper Introduction This project first asynchronously grab Divar.ir Ads and then save to .csv and .xlsx files named data.csv and data.x

4 Aug 29, 2022

IGLS - Instagram Like Scraper CLI tool

IGLS - Instagram Like Scraper It's a web scraping command line tool based on python and selenium. Description This is a trial tool for learning purpos

5 Oct 29, 2021

This is a module that I had created along with my friend. It's a basic web scraping module

QuickInfo PYPI link : https://pypi.org/project/quickinfo/ This is the library that you've all been searching for, it's built for developers and allows

2 Dec 13, 2021

A simple app to scrap data from Twitter.

Twitter-Scraping-App A simple app to scrap data from Twitter. Available Features Search query. Select number of data you want to fetch from twitter. C

2 Oct 31, 2022

Python scraper to check for earlier appointments in Clalit Health Services

clalit-appt-checker Python scraper to check for earlier appointments in Clalit Health Services Some background If you ever needed to schedule a doctor

16 Sep 17, 2022

WebScraping - Scrapes Job website for python developer jobs and exports the data to a csv file

WebScraping Web scraping Pyton program that scrapes Job website for python devel

2 Jul 22, 2022

This app will let you continuously scrape certain parts of LeasePlan and extract data of cars becoming available for lease.

LeasePlan - Scraper This app will let you continuously scrape certain parts of LeasePlan and extract data of cars becoming available for lease. It has

4 Nov 18, 2022

A leetcode scraper to compile all questions in leetcode free tier to text file. pdf also available.

A leetcode scraper to compile all questions in leetcode free tier to text file, pdf also available. if new questions get added, run again to get new questions.

3 Dec 07, 2021

Web-Scraping using Selenium Master

Web-Scraping using Selenium What is the need of Selenium? Some websites don't like to be scrapped and in that case you need to disguise your webscrapi

1 Oct 26, 2021