Crawler job that scrapes comments from social media posts and saves them in an S3 bucket.

Last update: Jan 24, 2022

Overview

Toxicity comments crawler


Twitter

Tweets and replies are scraped from the Twitter API for a given list of users.

Twitch

Coming soon.

YouTube

Coming soon.

Facebook

Coming soon.

Instagram

Coming soon.

The toxicity level of a given comment is calculated using the Perspective API.
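As a rough illustration of how a toxicity score could be read and compared against a threshold such as `PERSPECTIVE_THRESHOLD`, the sketch below builds a request body for the Perspective API `comments:analyze` method and extracts the `TOXICITY` summary score from a response. The helper names (`build_analyze_request`, `toxicity_score`, `is_toxic`) are hypothetical and not taken from the crawler's source, the sample response is hand-written rather than real API output, and the "score at or above the threshold counts as toxic" rule is an assumption about how the threshold is applied.

```python
from typing import Any, Dict


def build_analyze_request(text: str) -> Dict[str, Any]:
    """Build the JSON body for a Perspective `comments:analyze` call."""
    return {
        "comment": {"text": text},
        # Ask only for the TOXICITY attribute.
        "requestedAttributes": {"TOXICITY": {}},
    }


def toxicity_score(response: Dict[str, Any]) -> float:
    """Read the TOXICITY summary score (a value between 0 and 1) from a response."""
    return float(response["attributeScores"]["TOXICITY"]["summaryScore"]["value"])


def is_toxic(response: Dict[str, Any], threshold: float = 0.5) -> bool:
    """Assumed rule: a comment is toxic when its score reaches the threshold."""
    return toxicity_score(response) >= threshold


# Hand-written response in the shape the API returns (not real output).
sample_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.92, "type": "PROBABILITY"}}
    }
}

print(build_analyze_request("example comment"))
print(toxicity_score(sample_response), is_toxic(sample_response))
```

Sending the request is left out because it needs network access and a valid `PERSPECTIVE_API_KEY`; only the request and response handling is shown.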

Architecture

Usage

To run the crawler, you need to provide the following environment variables:

| Variable | Description | Default | Required |
| --- | --- | --- | --- |
| `AWS_ROLE_ARN` | AWS Role ARN | `None` | Optional |
| `AWS_WEB_IDENTITY_TOKEN_FILE` | AWS Web Identity Token File | `None` | Optional |
| `AWS_ACCESS_KEY_ID` | AWS Access Key ID | `None` | Optional |
| `AWS_SECRET_ACCESS_KEY` | AWS Secret Access Key | `None` | Optional |
| `AWS_S3_BUCKET` | AWS S3 Bucket | `None` | Required |
| `AWS_S3_BUCKET_PREFIX` | AWS S3 Bucket Prefix | `None` | Required |
| `LOG_LEVEL` | Log level | `INFO` | Optional |
| `PERSPECTIVE_API_KEY` | Perspective API Key | `None` | Required |
| `PERSPECTIVE_THRESHOLD` | Perspective Threshold | `0.5` | Required |
| `FILTER_TOXIC_COMMENTS` | Filter Toxic Comments | `True` | Required |
| `TWITTER_CONSUMER_KEY` | Twitter Consumer Key | `None` | Required |
| `TWITTER_CONSUMER_SECRET` | Twitter Consumer Secret | `None` | Required |
| `TWITTER_ACCESS_TOKEN` | Twitter Access Token | `None` | Required |
| `TWITTER_ACCESS_TOKEN_SECRET` | Twitter Access Token Secret | `None` | Required |
| `TWITTER_MAX_TWEETS` | Twitter Max Tweets or replies | `None` | Required |

If `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE` are both provided, the crawler uses them to assume a role and ignores `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.
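The precedence rule above, together with a check for unset variables, can be sketched as follows. This is a minimal illustration and not the crawler's own code: `REQUIRED_VARS`, `missing_required`, and `credential_mode` are hypothetical names, the list only includes variables that are documented as Required with a default of `None`, and the `"unspecified"` fallback label is an assumption for the case where neither credential pair is set.

```python
from typing import List, Mapping

# Variables documented as Required with no default (interpretation of the table).
REQUIRED_VARS = [
    "AWS_S3_BUCKET",
    "AWS_S3_BUCKET_PREFIX",
    "PERSPECTIVE_API_KEY",
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
    "TWITTER_MAX_TWEETS",
]


def missing_required(env: Mapping[str, str]) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def credential_mode(env: Mapping[str, str]) -> str:
    """Pick the AWS credential source, preferring the role/web-identity pair."""
    if env.get("AWS_ROLE_ARN") and env.get("AWS_WEB_IDENTITY_TOKEN_FILE"):
        return "web_identity"
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "static_keys"
    # Hypothetical label: neither pair was provided.
    return "unspecified"


# Placeholder values for illustration only.
example_env = {
    "AWS_ROLE_ARN": "arn:aws:iam::123456789012:role/example",
    "AWS_WEB_IDENTITY_TOKEN_FILE": "/var/run/secrets/token",
    "AWS_ACCESS_KEY_ID": "placeholder-key-id",
    "AWS_SECRET_ACCESS_KEY": "placeholder-secret",
    "AWS_S3_BUCKET": "example-bucket",
}

print("credential mode:", credential_mode(example_env))
print("missing:", missing_required(example_env))
```

Passing `os.environ` instead of `example_env` would run the same checks against the real process environment.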

Running

Prerequisites

Docker

Once Docker is installed, run the crawler with the following command, which reads the environment variables from a local `.env` file:

docker run --env-file .env -d dougtrajano/toxicity-crawler:latest
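Before starting the container, it can help to confirm that the `.env` file passed to `--env-file` defines the variables the crawler expects. The sketch below is a simplified reader for `NAME=value` lines that skips blank lines and `#` comments; it is an illustration, not Docker's exact parsing. `parse_env_file` and `EXPECTED` are hypothetical names, and the sample file contents are placeholders.

```python
from typing import Dict, List


def parse_env_file(text: str) -> Dict[str, str]:
    """Collect NAME=value pairs, skipping blank lines and '#' comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        name, separator, value = line.partition("=")
        if separator:  # lines without '=' are ignored in this sketch
            values[name.strip()] = value
    return values


# A few of the documented variables, used here as a spot check.
EXPECTED: List[str] = ["AWS_S3_BUCKET", "PERSPECTIVE_API_KEY", "TWITTER_MAX_TWEETS"]

# Placeholder .env contents; real values come from your own accounts.
sample_env_text = """\
# AWS
AWS_S3_BUCKET=example-bucket
AWS_S3_BUCKET_PREFIX=comments/

# Perspective
PERSPECTIVE_API_KEY=placeholder-key
PERSPECTIVE_THRESHOLD=0.5
"""

parsed = parse_env_file(sample_env_text)
absent = [name for name in EXPECTED if name not in parsed]
print("defined:", sorted(parsed))
print("absent:", absent)
```

To check a real file, read it with `open(".env").read()` and pass the text to `parse_env_file`.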

License

The project is licensed under the Apache 2.0 License.


Releases (0.2.1)

0.2.1 (Dec 27, 2021)
What's Changed

Add wait_on_rate_limit in TwitterAPI by @DougTrajano in https://github.com/DougTrajano/toxicity-crawler/pull/29

Full Changelog: https://github.com/DougTrajano/toxicity-crawler/compare/0.2.0...0.2.1

0.2.0 (Dec 25, 2021)
What's Changed

Fixed an issue with tweet content in TwitterAPI by @DougTrajano

Added an exploratory notebook to test TwitterAPI by @DougTrajano

Bump pyyaml from 5.4.1 to 6.0 by @dependabot in https://github.com/DougTrajano/toxicity-crawler/pull/12

Bump google-api-python-client from 2.22.0 to 2.33.0 by @dependabot in https://github.com/DougTrajano/toxicity-crawler/pull/26

Bump metaflow from 2.3.6 to 2.4.7 by @dependabot in https://github.com/DougTrajano/toxicity-crawler/pull/28

Full Changelog: https://github.com/DougTrajano/toxicity-crawler/compare/0.1.4...0.2.0

0.1.4 (Sep 26, 2021)
Changes

Bump google-api-python-client from 2.21.0 to 2.22.0 #3

Fix Python path in Dockerfile

0.1.3 (Sep 24, 2021)
Changes

Updated GitHub Action

Fix error in Docker execution

0.1.2 (Sep 24, 2021)

Updated GitHub Action

0.1.1 (Sep 24, 2021)

Updated GitHub Action

0.1.0 (Sep 24, 2021)

Initial version

Owner

Douglas Trajano

Data Scientist

