Video Games Web Scraper is a project that crawls websites and APIs and extracts video game related data from their pages.

Overview

Video Games Web Scraper

Video Games Web Scraper is a project that crawls websites and APIs and extracts video game related data from their pages.

This project uses an open-source and collaborative framework named Scrapy.

Sources

Installation

I strongly recommend that you install this project in a dedicated virtual environment to avoid conflicting with your system packages.

See Virtual Environments and Packages on how to create and use your virtual environment.

Use the package manager pip to install the requirements of this project.

pip install -r requirements.txt

Usage

You can start crawling a source using a spider.

scrapy crawl <spider>

VideoGameGeek

Spiders

  • vgg-games
  • vgg-hotitems

Developer Resources

Initialize your Development Environment

pip install -r requirements.txt

Create and Run Tests

See the Spiders Contracts for more instructions on how to create tests for spiders and then run:

scrapy check

Scrapy Documentation

See the Scrapy Documentation for more instructions on how to create and modify spiders.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

Support

If you enjoy this repository, please star this repository. By starring a repository, it shows appreciation to the repository maintainer for their work. Many of GitHub's repository rankings depend on the number of stars a repository has.

License

MIT

You might also like...
Automated data scraper for Thailand COVID-19 data

The Researcher COVID data Automated data scraper for Thailand COVID-19 data Accessing the Data 1st Dose Provincial Vaccination Data 2nd Dose Provincia

A Smart, Automatic, Fast and Lightweight Web Scraper for Python
A Smart, Automatic, Fast and Lightweight Web Scraper for Python

AutoScraper: A Smart, Automatic, Fast and Lightweight Web Scraper for Python This project is made for automatic web scraping to make scraping easy. It

A Web Scraper built with beautiful soup, that fetches udemy course information. Get udemy course information and convert it to json, csv or xml file
A Web Scraper built with beautiful soup, that fetches udemy course information. Get udemy course information and convert it to json, csv or xml file

Udemy Scraper A Web Scraper built with beautiful soup, that fetches udemy course information. Installation Virtual Environment Firstly, it is recommen

A web Scraper for CSrankings.com that scrapes University and Faculty list for a particular country

A web Scraper for CSrankings.com that scrapes University and Faculty list for a particular country To run the file: Open terminal

Web and PDF Scraper Refactoring

Web and PDF Scraper Refactoring This repository contains the example code of the Web and PDF scraper code roast. Here are the links to the videos: Par

Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc)
Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc)

Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc).

A web scraper that exports your entire WhatsApp chat history.
A web scraper that exports your entire WhatsApp chat history.

WhatSoup 🍲 A web scraper that exports your entire WhatsApp chat history. Table of Contents Overview Demo Prerequisites Instructions Frequen

A simple python web scraper.

Dissec A simple python web scraper. It gets a website and its contents and parses them with the help of bs4. Installation To install the requirements,

Web scraper for Zillow

Zillow-Scraper Instructions All terminal commands are highlighted. Make sure you first have python 3 installed. You can check this by running "python

Comments
  • Bump scrapy from 2.6.1 to 2.6.2

    Bump scrapy from 2.6.1 to 2.6.2

    Bumps scrapy from 2.6.1 to 2.6.2.

    Release notes

    Sourced from scrapy's releases.

    2.6.2

    Fixes a security issue around HTTP proxy usage, and addresses a few regressions introduced in Scrapy 2.6.0.

    See the changelog.

    Changelog

    Sourced from scrapy's changelog.

    Scrapy 2.6.2 (2022-07-25)

    Security bug fix:

    • When :class:~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware processes a request with :reqmeta:proxy metadata, and that :reqmeta:proxy metadata includes proxy credentials, :class:~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware sets the Proxy-Authentication header, but only if that header is not already set.

      There are third-party proxy-rotation downloader middlewares that set different :reqmeta:proxy metadata every time they process a request.

      Because of request retries and redirects, the same request can be processed by downloader middlewares more than once, including both :class:~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware and any third-party proxy-rotation downloader middleware.

      These third-party proxy-rotation downloader middlewares could change the :reqmeta:proxy metadata of a request to a new value, but fail to remove the Proxy-Authentication header from the previous value of the :reqmeta:proxy metadata, causing the credentials of one proxy to be sent to a different proxy.

      To prevent the unintended leaking of proxy credentials, the behavior of :class:~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware is now as follows when processing a request:

      • If the request being processed defines :reqmeta:proxy metadata that includes credentials, the Proxy-Authorization header is always updated to feature those credentials.

      • If the request being processed defines :reqmeta:proxy metadata without credentials, the Proxy-Authorization header is removed unless it was originally defined for the same proxy URL.

        To remove proxy credentials while keeping the same proxy URL, remove the Proxy-Authorization header.

      • If the request has no :reqmeta:proxy metadata, or that metadata is a falsy value (e.g. None), the Proxy-Authorization header is removed.

        It is no longer possible to set a proxy URL through the :reqmeta:proxy metadata but set the credentials through the Proxy-Authorization header. Set proxy credentials through the :reqmeta:proxy metadata instead.

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Bump pillow from 9.0.0 to 9.0.1

    Bump pillow from 9.0.0 to 9.0.1

    Bumps pillow from 9.0.0 to 9.0.1.

    Release notes

    Sourced from pillow's releases.

    9.0.1

    https://pillow.readthedocs.io/en/stable/releasenotes/9.0.1.html

    Changes

    • In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [@​radarhere, @​hugovk]
    • Restrict builtins within lambdas for ImageMath.eval. CVE-2022-22817 #6009 [radarhere]
    Changelog

    Sourced from pillow's changelog.

    9.0.1 (2022-02-03)

    • In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [radarhere, hugovk]

    • Restrict builtins within lambdas for ImageMath.eval. CVE-2022-22817 #6009 [radarhere]

    Commits
    • 6deac9e 9.0.1 version bump
    • c04d812 Update CHANGES.rst [ci skip]
    • 4fabec3 Added release notes for 9.0.1
    • 02affaa Added delay after opening image with xdg-open
    • ca0b585 Updated formatting
    • 427221e In show_file, use os.remove to remove temporary images
    • c930be0 Restrict builtins within lambdas for ImageMath.eval
    • 75b69dd Dont need to pin for GHA
    • cd938a7 Autolink CWE numbers with sphinx-issues
    • 2e9c461 Add CVE IDs
    • See full diff in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Bump pillow from 9.0.0 to 9.0.1

    Bump pillow from 9.0.0 to 9.0.1

    Bumps pillow from 9.0.0 to 9.0.1.

    Release notes

    Sourced from pillow's releases.

    9.0.1

    https://pillow.readthedocs.io/en/stable/releasenotes/9.0.1.html

    Changes

    • In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [@​radarhere, @​hugovk]
    • Restrict builtins within lambdas for ImageMath.eval. CVE-2022-22817 #6009 [radarhere]
    Changelog

    Sourced from pillow's changelog.

    9.0.1 (2022-02-03)

    • In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [radarhere, hugovk]

    • Restrict builtins within lambdas for ImageMath.eval. CVE-2022-22817 #6009 [radarhere]

    Commits
    • 6deac9e 9.0.1 version bump
    • c04d812 Update CHANGES.rst [ci skip]
    • 4fabec3 Added release notes for 9.0.1
    • 02affaa Added delay after opening image with xdg-open
    • ca0b585 Updated formatting
    • 427221e In show_file, use os.remove to remove temporary images
    • c930be0 Restrict builtins within lambdas for ImageMath.eval
    • 75b69dd Dont need to pin for GHA
    • cd938a7 Autolink CWE numbers with sphinx-issues
    • 2e9c461 Add CVE IDs
    • See full diff in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Bump scrapy from 2.5.1 to 2.6.0

    Bump scrapy from 2.5.1 to 2.6.0

    Bumps scrapy from 2.5.1 to 2.6.0.

    Release notes

    Sourced from scrapy's releases.

    2.6.0

    • Security fixes for cookie handling (see details below)
    • Python 3.10 support
    • asyncio support is no longer considered experimental, and works out-of-the-box on Windows regardless of your Python version
    • Feed exports now support pathlib.Path output paths and per-feed item filtering and post-processing

    See the full changelog

    Security bug fixes

    • When a Request object with cookies defined gets a redirect response causing a new Request object to be scheduled, the cookies defined in the original Request object are no longer copied into the new Request object.

      If you manually set the Cookie header on a Request object and the domain name of the redirect URL is not an exact match for the domain of the URL of the original Request object, your Cookie header is now dropped from the new Request object.

      The old behavior could be exploited by an attacker to gain access to your cookies. Please, see the cjvr-mfj7-j4j8 security advisory for more information.

      Note: It is still possible to enable the sharing of cookies between different domains with a shared domain suffix (e.g. example.com and any subdomain) by defining the shared domain suffix (e.g. example.com) as the cookie domain when defining your cookies. See the documentation of the Request class for more information.

    • When the domain of a cookie, either received in the Set-Cookie header of a response or defined in a Request object, is set to a public suffix <https://publicsuffix.org/>_, the cookie is now ignored unless the cookie domain is the same as the request domain.

      The old behavior could be exploited by an attacker to inject cookies from a controlled domain into your cookiejar that could be sent to other domains not controlled by the attacker. Please, see the mfjm-vh54-3f96 security advisory for more information.

    Changelog

    Sourced from scrapy's changelog.

    Scrapy 2.6.0 (2022-03-01)

    Highlights:

    • :ref:Security fixes for cookie handling <2.6-security-fixes>

    • Python 3.10 support

    • :ref:asyncio support <using-asyncio> is no longer considered experimental, and works out-of-the-box on Windows regardless of your Python version

    • Feed exports now support :class:pathlib.Path output paths and per-feed :ref:item filtering <item-filter> and :ref:post-processing <post-processing>

    .. _2.6-security-fixes:

    Security bug fixes

    
    -   When a :class:`~scrapy.http.Request` object with cookies defined gets a
        redirect response causing a new :class:`~scrapy.http.Request` object to be
        scheduled, the cookies defined in the original
        :class:`~scrapy.http.Request` object are no longer copied into the new
        :class:`~scrapy.http.Request` object.
    
    If you manually set the ``Cookie`` header on a
    :class:`~scrapy.http.Request` object and the domain name of the redirect
    URL is not an exact match for the domain of the URL of the original
    :class:`~scrapy.http.Request` object, your ``Cookie`` header is now dropped
    from the new :class:`~scrapy.http.Request` object.
    

    The old behavior could be exploited by an attacker to gain access to your cookies. Please, see the cjvr-mfj7-j4j8 security advisory_ for more information.

    .. _cjvr-mfj7-j4j8 security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-cjvr-mfj7-j4j8

    .. note:: It is still possible to enable the sharing of cookies between different domains with a shared domain suffix (e.g. example.com and any subdomain) by defining the shared domain suffix (e.g. example.com) as the cookie domain when defining your cookies. See the documentation of the :class:~scrapy.http.Request class for more information.

    • When the domain of a cookie, either received in the Set-Cookie header of a response or defined in a :class:~scrapy.http.Request object, is set to a public suffix &lt;https://publicsuffix.org/&gt;_, the cookie is now </tr></table>

... (truncated)

Commits

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
  • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
  • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
  • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
  • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the Security Alerts page.

dependencies 
opened by dependabot[bot] 0
Releases(v0.0.3)
Owner
Albert Marrero
Albert Marrero
Web Scraping images using Selenium and Python

Web Scraping images using Selenium and Python A propos de ce document This is a markdown document about Web scraping images and videos using Selenium

Nafaa BOUGRAINE 3 Jul 01, 2022
PyQuery-based scraping micro-framework.

demiurge PyQuery-based scraping micro-framework. Supports Python 2.x and 3.x. Documentation: http://demiurge.readthedocs.org Installing demiurge $ pip

Matias Bordese 109 Jul 20, 2022
An Automated udemy coupons scraper which scrapes coupons and autopost the result in blogspot post

Autoscraper-n-blogger An Automated udemy coupons scraper which scrapes coupons and autopost the result in blogspot post and notifies via Telegram bot

GOKUL A.P 13 Dec 21, 2022
Download images from forum threads

Forum Image Scraper Downloads images from forum threads Only works with forums which doesn't require a login to view and have an incremental paginatio

9 Nov 16, 2022
Anonymously scrapes onlinesim.ru for new usable phone numbers.

phone-scraper Anonymously scrapes onlinesim.ru for new usable phone numbers. Usage Clone the repository $ git clone https://github.com/thomasgruebl/ph

16 Oct 08, 2022
一个m3u8视频流下载脚本

一个Python的m3u8流视频下载脚本 介绍 m3u8流视频日益常见,目前好用的下载器也有很多,我把之前自己写的一个小脚本分享出来,供广大网友使用。写此程序的目的在于给视频下载爱好者提供一个下载样例,可直接调用,勿再重复造轮子。 使用方法 在python中直接运行程序或进行外部调用 import

Nchu 0 Oct 10, 2021
OSTA web scraper, for checking the status of school buses in Ottawa

OSTA-La-Vista OSTA web scraper, for checking the status of school buses in Ottawa. Getting Started Using a Raspberry Pi, download Python 3, and option

1 Jan 28, 2022
A Smart, Automatic, Fast and Lightweight Web Scraper for Python

AutoScraper: A Smart, Automatic, Fast and Lightweight Web Scraper for Python This project is made for automatic web scraping to make scraping easy. It

Mika 4.8k Jan 04, 2023
Generate a repository with mirror links for DriveDroid app

DriveDroid Repository Generator Generate a repository for the app that allow boot a PC using ISO files stored on your Android phone Check also an offi

Evgeny 11 Nov 19, 2022
Library to scrape and clean web pages to create massive datasets.

lazynlp A straightforward library that allows you to crawl, clean up, and deduplicate webpages to create massive monolingual datasets. Using this libr

Chip Huyen 2.1k Jan 06, 2023
DaProfiler allows you to get emails, social medias, adresses, works and more on your target using web scraping and google dorking techniques

DaProfiler allows you to get emails, social medias, adresses, works and more on your target using web scraping and google dorking techniques, based in France Only. The particularity of this program i

Dalunacrobate 347 Jan 07, 2023
Ebay Webscraper for Getting Average Product Price

Ebay-Webscraper-for-Getting-Average-Product-Price The code in this repo is used to determine the average price of an item on Ebay given a valid search

17 Jan 05, 2023
Open Crawl Vietnamese Text

Open Crawl Vietnamese Text This repo contains crawled Vietnamese text from multiple sources. This list of a topic-centric public data sources in high

QAI Research 4 Jan 05, 2022
Web scrapper para cotizar articulos

WebScrapper Este web scrapper esta desarrollado en python 3.10.0 para buscar en la pagina de cyber puerta articulos dentro del catalogo. El programa t

Jordan Gaona 1 Oct 27, 2021
Web scrapping

Project Setup Table of Contents Project Setup Table of Contents Run project locally Install Requirements Run script Run project locally Install Requir

Charles 3 Feb 04, 2022
Semplice scraper realizzato in Python tramite la libreria BeautifulSoup

Semplice scraper realizzato in Python tramite la libreria BeautifulSoup

2 Nov 22, 2021
Script used to download data for stocks.

This script is useful for downloading stock market data for a wide range of companies specified by their respective tickers. The script reads in the d

Carmelo Gonzales 71 Oct 04, 2022
A Python module to bypass Cloudflare's anti-bot page.

cloudflare-scrape A simple Python module to bypass Cloudflare's anti-bot page (also known as "I'm Under Attack Mode", or IUAM), implemented with Reque

3k Jan 04, 2023
A crawler of doubamovie

豆瓣电影 A crawler of doubamovie 一个小小的入门级scrapy框架的应用,选取豆瓣电影对排行榜前1000的电影数据进行爬取。 spider.py start_requests方法为scrapy的方法,我们对它进行重写。 def start_requests(self):

Cats without dried fish 1 Oct 05, 2021
对于有验证码的站点爆破,用于安全合法测试

使用方法 python3 main.py + 配置好的文件 python3 main.py Verify.json python3 main.py NoVerify.json 以上分别对应有验证码的demo和无验证码的demo Tips: 你可以以域名作为配置文件名字加载:python3 main

47 Nov 09, 2022