A pure-python HTML screen-scraping library

Scrapely

Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.

Overview

Scrapinghub wrote a nice blog post explaining how scrapely works and how it's used in Portia.

Installation

Scrapely works in Python 2.7 or 3.3+. It requires the numpy and w3lib Python packages.

To install scrapely on any platform use:

pip install scrapely

If you're using Ubuntu (9.10 or above), you can install scrapely from the Scrapy Ubuntu repos. Just add the Ubuntu repos as described here: http://doc.scrapy.org/en/latest/topics/ubuntu.html

And then install scrapely with:

aptitude install python-scrapely

Usage (API)

Scrapely has a powerful API that you can use to build very capable scrapers. It includes a template format that can be edited externally.

What follows is a quick example of the simplest possible usage, which you can run in a Python shell.

Start by importing and instantiating the Scraper class:

>>> from scrapely import Scraper
>>> s = Scraper()

Then train the scraper by adding a page and the data you expect to scrape from it (note that all keys and values in the data you pass must be strings):

>>> url1 = 'http://pypi.python.org/pypi/w3lib/1.1'
>>> data = {'name': 'w3lib 1.1', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}
>>> s.train(url1, data)
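
Because every key and value passed to train() must be a string, it can help to normalize records before training. The following is a minimal sketch under that assumption, written for Python 3; `to_training_data` is a hypothetical helper and not part of Scrapely, and the `downloads` and `homepage` fields are made up for illustration.

```python
def to_training_data(record):
    """Return a copy of `record` with every key and value converted to str.

    Hypothetical helper for illustration; it is not part of Scrapely.
    """
    cleaned = {}
    for key, value in record.items():
        if value is None:
            # Skip missing fields instead of training on the literal text "None".
            continue
        cleaned[str(key)] = str(value)
    return cleaned


# Made-up record mixing strings, a number, and a missing value.
raw = {"name": "w3lib 1.1", "downloads": 1200, "homepage": None}
data = to_training_data(raw)
print(data)  # {'name': 'w3lib 1.1', 'downloads': '1200'}
```

The resulting dict can then be passed as the second argument to train().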

Finally, tell the scraper to scrape any other similar page and it will return the results:

>>> url2 = 'http://pypi.python.org/pypi/Django/1.3'
>>> s.scrape(url2)
[{u'author': [u'Django Software Foundation <foundation at djangoproject com>'],
  u'description': [u'A high-level Python Web framework that encourages rapid development and clean, pragmatic design.'],
  u'name': [u'Django 1.3']}]

That's it! No XPaths, regular expressions, or hacky Python code.
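
As the output above shows, scrape() returns a list of items, and each field maps to a list of extracted values. If you only need one value per field, a small post-processing step can flatten the result. This is a sketch for Python 3; `first_values` is a hypothetical helper, not part of Scrapely, and the sample result is copied by hand rather than scraped live.

```python
def first_values(items):
    """Collapse each field's list of values to its first value.

    Hypothetical helper for illustration; fields with an empty list are dropped.
    """
    flattened = []
    for item in items:
        flattened.append(
            {field: values[0] for field, values in item.items() if values}
        )
    return flattened


# Same shape as the result returned by s.scrape(url2) above.
scraped = [{"author": ["Django Software Foundation"], "name": ["Django 1.3"]}]
print(first_values(scraped))
# [{'author': 'Django Software Foundation', 'name': 'Django 1.3'}]
```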

Usage (command line tool)

There is also a simple script to create and manage Scrapely scrapers.

It supports both a command-line interface and an interactive prompt. All commands supported in the interactive prompt are also supported in the command-line interface.

To enter the interactive prompt, run the tool with only the scraper file name and no command:

python -m scrapely.tool myscraper.json

Example:

$ python -m scrapely.tool myscraper.json
scrapely> help

Documented commands (type help <topic>):
========================================
a  al  s  ta  td  tl

scrapely>

To create a scraper and add a template:

scrapely> ta http://pypi.python.org/pypi/w3lib/1.1
[0] http://pypi.python.org/pypi/w3lib/1.1

This is equivalent to typing the following in one command:

python -m scrapely.tool myscraper.json ta http://pypi.python.org/pypi/w3lib/1.1
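
Since every prompt command can also be given on the command line, the tool can be driven from a script. The sketch below only builds and runs the same `python -m scrapely.tool <file> <command> ...` invocation shown above; `tool_command` and `run_tool` are hypothetical helper names, and running the command assumes Scrapely is installed for the current interpreter.

```python
import subprocess
import sys


def tool_command(scraper_file, *args):
    """Build the argv list for one non-interactive scrapely.tool call."""
    return [sys.executable, "-m", "scrapely.tool", scraper_file] + list(args)


def run_tool(scraper_file, *args):
    """Run one tool command and raise CalledProcessError if it fails."""
    return subprocess.run(tool_command(scraper_file, *args), check=True)


# Example (not executed here): add a template, then list templates.
# run_tool("myscraper.json", "ta", "http://pypi.python.org/pypi/w3lib/1.1")
# run_tool("myscraper.json", "tl")
```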

To list available templates from a scraper:

scrapely> tl
[0] http://pypi.python.org/pypi/w3lib/1.1

To add a new annotation, you usually test the selection criteria first:

scrapely> t 0 w3lib 1.1
[0] u'<h1>w3lib 1.1</h1>'
[1] u'<title>Python Package Index : w3lib 1.1</title>'

You can also quote the text, if you need to specify an arbitrary number of spaces, for example:

scrapely> t 0 "w3lib 1.1"

You can refine by position. To take the one in position [0]:

scrapely> a 0 w3lib 1.1 -n 0
[0] u'<h1>w3lib 1.1</h1>'

To annotate some fields on the template:

scrapely> a 0 w3lib 1.1 -n 0 -f name
[new] (name) u'<h1>w3lib 1.1</h1>'
scrapely> a 0 Scrapy project -n 0 -f author
[new] u'<span>Scrapy project</span>'

To list annotations on a template:

scrapely> al 0
[0-0] (name) u'<h1>w3lib 1.1</h1>'
[0-1] (author) u'<span>Scrapy project</span>'

To scrape another similar page with the already added templates:

scrapely> s http://pypi.python.org/pypi/Django/1.3
[{u'author': [u'Django Software Foundation'], u'name': [u'Django 1.3']}]

Tests

tox is the preferred way to run tests. Just run tox from the root directory.

Support

Scrapely is created and maintained by the Scrapy group, so you can get help through the usual support channels described in the Scrapy community page.

Architecture

Unlike most scraping libraries, Scrapely doesn't work with DOM trees or XPaths, so it doesn't depend on libraries such as lxml or libxml2. Instead, it uses an internal pure-Python parser, which can accept poorly formed HTML. The HTML is converted into an array of token ids, which is used for matching the items to be extracted.
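
To make the idea of "an array of token ids" concrete, here is a rough illustration written for Python 3 with the standard library's html.parser. It is not Scrapely's actual parser or token scheme: the choice to map each tag to an id and every text run to a single `#text` placeholder is an assumption made purely for this sketch.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Collect a flat list of tokens: tags plus a '#text' placeholder."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.tokens.append("</%s>" % tag)

    def handle_data(self, data):
        if data.strip():
            # Assumption for this sketch: text content is ignored, only its
            # position in the page structure is recorded.
            self.tokens.append("#text")


def to_token_ids(html, vocabulary):
    """Convert HTML into a list of integer ids, growing `vocabulary` as needed."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    # setdefault assigns the next free id the first time a token is seen.
    return [vocabulary.setdefault(tok, len(vocabulary)) for tok in collector.tokens]


vocabulary = {}
template_ids = to_token_ids("<h1>w3lib 1.1</h1>", vocabulary)
page_ids = to_token_ids("<h1>Django 1.3</h1>", vocabulary)
print(template_ids, page_ids)  # [0, 1, 2] [0, 1, 2]
```

In this toy version, two pages with the same structure produce the same id sequence, which is what makes it possible to locate an annotated region on a similar page by comparing integer arrays instead of walking a DOM tree.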

Scrapely extraction is based upon the Instance Based Learning algorithm [1] and the matched items are combined into complex objects (it supports nested and repeated objects), using a tree of parsers, inspired by A Hierarchical Approach to Wrapper Induction [2].

[1] Yanhong Zhai, Bing Liu, Extracting Web Data Using Instance-Based Learning, World Wide Web, v.10 n.2, p.113-132, June 2007
[2] Ion Muslea, Steve Minton, Craig Knoblock, A hierarchical approach to wrapper induction, Proceedings of the third annual conference on Autonomous Agents, p.190-197, April 1999, Seattle, Washington, United States

Known Issues

The training implementation is currently very simple and is provided only for reference purposes, to make it easier to test Scrapely and play with it. The extraction code, on the other hand, is reliable and production-ready. So, if you want to use Scrapely in production, use train() with caution and make sure it annotates the area of the page you intended.
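
One simple way to check what train() annotated is to scrape the training page itself and compare the result with the data you trained on. The sketch below, written for Python 3, shows only the comparison step; `missing_fields` is a hypothetical helper, not part of Scrapely, and the sample result is made up to show a mismatch.

```python
def missing_fields(expected, scraped_items):
    """Return the fields whose expected text was not extracted exactly.

    `expected` is the training data dict; `scraped_items` has the shape
    returned by scrape(): a list of dicts mapping field -> list of values.
    The returned dict maps each problem field to what was found instead.
    """
    problems = {}
    for field, wanted in expected.items():
        found = [
            value.strip()
            for item in scraped_items
            for value in item.get(field, [])
        ]
        if wanted not in found:
            problems[field] = found
    return problems


expected = {"name": "w3lib 1.1", "author": "Scrapy project"}
# Made-up result in which the author annotation covered too much text.
scraped = [{"name": ["w3lib 1.1"], "author": ["Scrapy project <info at scrapy org>"]}]
print(missing_fields(expected, scraped))
# {'author': ['Scrapy project <info at scrapy org>']}
```

An empty result means every trained field was extracted back unchanged from the training page; anything else points at an annotation worth inspecting by hand.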

Alternatively, you can use the Scrapely command line tool to annotate pages, which provides more manual control for higher accuracy.

How does Scrapely relate to Scrapy?

Despite the similarity in their names, Scrapely and Scrapy are quite different things. All they have in common is that both depend on w3lib and both are maintained by the same group of developers (which is why both are hosted on the same GitHub account).

Scrapy is an application framework for building web crawlers, while Scrapely is a library for extracting structured data from HTML pages. If anything, Scrapely is more similar to BeautifulSoup or lxml than to Scrapy.

Neither project depends on the other. In fact, it is quite common to use Scrapy without Scrapely, and vice versa.

If you are looking for a complete crawler-scraper solution, there is (at least) one project called Slybot that integrates both, but you can definitely use Scrapely on other web crawlers since it's just a library.

Scrapy has a built-in extraction mechanism called selectors which (unlike Scrapely) is based on XPaths.

License

The Scrapely library is licensed under the BSD license.

Owner
Scrapy project