A universal package of scraper scripts for humans

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. Sponsors
  6. License
  7. Contact
  8. Acknowledgements

About The Project

Scrapera is a completely Chromedriver-free package that provides a variety of scraper scripts for the most commonly used machine learning and data science domains. Scrapera scrapes directly and asynchronously from public API endpoints, which removes the heavy browser overhead and makes it fast and robust to DOM changes. Currently, Scrapera supports the following crawlers:

  • Images
  • Text
  • Audio
  • Videos
  • Miscellaneous

  The main aim of this package is to cluster common scraping tasks so that ML researchers and engineers can focus on their models rather than on the data collection process.
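The asynchronous, browser-free approach described above can be illustrated with a minimal sketch. This is not Scrapera's actual implementation: `fetch_all`, `fetch_with_urllib`, and the `max_concurrency` parameter are hypothetical names chosen for illustration, and the sketch assumes only the Python standard library.

```python
import asyncio
import urllib.request
from typing import Awaitable, Callable, Dict, List


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> Dict[str, str]:
    """Fetch every URL concurrently and return a {url: body} mapping.

    `fetch` is any coroutine function that takes a URL and returns its body.
    """
    # Cap the number of requests in flight so the endpoint is not flooded.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> str:
        async with semaphore:
            return await fetch(url)

    # gather() preserves input order, so bodies line up with urls.
    bodies = await asyncio.gather(*(fetch_one(url) for url in urls))
    return dict(zip(urls, bodies))


async def fetch_with_urllib(url: str) -> str:
    """Hypothetical fetcher: runs a blocking stdlib request in a worker thread."""

    def blocking_get() -> str:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")

    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_get)
```

A call such as `asyncio.run(fetch_all(list_of_endpoint_urls, fetch_with_urllib))` would then download all responses without starting a browser. Passing the fetcher in as an argument keeps the concurrency logic separate from the HTTP client, so a different client could be swapped in.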

    DISCLAIMER: The owner and contributors take no responsibility for misuse of data obtained through Scrapera. Contact the owner if any module provided by Scrapera violates copyright terms.

    Prerequisites

    Prerequisites can be installed separately from the requirements.txt file:

    pip install -r requirements.txt

    Installation

    Scrapera is built with Python 3 and can be installed directly with pip:

    pip install scrapera

    Alternatively, to install the latest version directly from GitHub, run:

    pip install git+https://github.com/DarshanDeshpande/Scrapera.git

    Usage

    To use any sub-module, import it, instantiate the scraper, and call it:

    from scrapera.video.vimeo import VimeoScraper
    scraper = VimeoScraper()
    scraper.scrape('https://vimeo.com/191955190', '540p')
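Building on the example above, the following sketch shows one way to run the same call over several URLs while recording failures instead of stopping at the first one. `scrape_many` is a hypothetical helper, not part of Scrapera. It assumes only the `scrape(url, quality)` call shown above, and the parameter name `quality` is a guess at what the second argument means. Other scrapers may take different arguments.

```python
from typing import Dict, Iterable, Optional


def scrape_many(
    scraper, urls: Iterable[str], quality: str = "540p"
) -> Dict[str, Optional[Exception]]:
    """Call scraper.scrape(url, quality) for each URL.

    Returns a mapping from each URL to None on success, or to the
    exception that was raised for that URL.
    """
    results: Dict[str, Optional[Exception]] = {}
    for url in urls:
        try:
            scraper.scrape(url, quality)
            results[url] = None
        except Exception as exc:  # record the failure and keep going
            results[url] = exc
    return results


# Usage (requires Scrapera to be installed):
# from scrapera.video.vimeo import VimeoScraper
# outcome = scrape_many(VimeoScraper(), ["https://vimeo.com/191955190"])
# failed = {url: exc for url, exc in outcome.items() if exc is not None}
```

Collecting the exceptions rather than re-raising them makes it easier to include the failing URLs when raising an issue.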

    For more examples, refer to the test folders in the respective modules.

    Contributing

    Scrapera welcomes any and all contributions and scraper requests. Please raise an issue if a scraper fails. Feel free to fork the repository and add your own scrapers to help the community!
    For more guidelines, refer to CONTRIBUTING.

    License

    Distributed under the MIT License. See LICENSE for more information.

    Sponsors

    Contact

    Feel free to reach out with any issues or requests related to Scrapera.

    Darshan Deshpande (Owner) - Email | LinkedIn

    Acknowledgements
