A Telegram crawler to search groups and channels automatically and collect any type of data from them.

Overview

Introduction

This is a crawler I wrote in Python using the APIs of Telethon months ago. This tool was not intended to be publicly available for a number of reasons, but eventually I decided to distribute it "as it is". Any contribution to the project is more than welcome :)

Installation

Python 3.8.2 and Telethon 1.21.1 are required (along with other common packages, just read the imports), I don't guarantee it works with newer versions of Telethon.

To install Telethon just read their documentation, while to install this repository just git clone and run python3.8 scraper.py only after configurated the script properly (next section).

Configuration

To use this tool you have to first obtain an API ID and an API HASH from Telegram: you can do this by following this page. Once done the ID and the HASH can be inserted into the code and the script can be launched. The first time it runs, it will ask to insert the telephone number.

Usage

In the code, there are two methods to initialize the crawler: init_empty() and init(). The former is used for the very first time that the script has been launched, while the latter is needed only in specific situations (read the code for details). Once the crawler has been launched with init_empty() and terminated, it basically processed all the groups/channels where the account is already in, collecting all the links shared in the chats along with a number of other data such that:

  1. Name of the group/channel
  2. Username
  3. List of members (just for groups)
  4. List of the last n messages
  5. Other metadata...

groups

These information have been saved in a pickle file called groups. Other files given in output are to_be_processed and edges. The former is a list of links that will be processed in the next iteration (see later) and the other one is a list of tuples of the form (group id, [group id list]) where the first entry represents the so called destination vertex and the list represents the origin vertices. Indeed, this uncommon data structure is the edge list of the search graph produced by the crawler (this is useful for data mining purposes, for instance I used it to perform link prediction between groups/channels exploiting both the graph structure and the messages). Probably is not the best data structure, since you have to reverse it later on if you want to actually use it for other tasks, but it is faster than other solutions to update.

Once the initialization is complete, you can comment init_empty() and uncomment start() in main() to process the new links collected before. This will generate three new files: groups2, to_be_processed2 and edges2. Now you have to merge the old files with the new ones (I actually have a script that does that, but it is customized according to my environment so I will publish it as soon as I have time to generalize it): pay attention to the to_be_processed file, because you don't want to process it entirely in each run. You need to divide it since it will take too long, indeed there are some limitations to process a lot of groups, therefore you want to play smarter... next section.

Limitations

Telegram doesn't want that you play with users' data therefore for each account, if I recall correctly, you can join 25 groups per hour, then Telegram will stop you. The script handles this, so if your usage is not so intensive it will not make your task unfeasible. But if you need hundreds or thousands of groups, well, you have to parallelize the script. I won't publish the code for doing this, but it is not so difficult to make this idea practical.

Owner
Computer Science student at "La Sapienza"
This is a sport analytics project that combines the knowledge of OOP and Webscraping

This is a sport analytics project that combines the knowledge of Object Oriented Programming (OOP) and Webscraping, the weekly scraping of the English Premier league table is carried out to assess th

Dolamu Oludare 1 Nov 26, 2021
Scrape all the media from an OnlyFans account - Updated regularly

Scrape all the media from an OnlyFans account - Updated regularly

CRIMINAL 3.2k Dec 29, 2022
Instagram_scrapper - This project allow you to scrape the list of followers, following or both from a public Instagram account, and create a csv or excel file easily.

Instagram_scrapper This project allow you to scrape the list of followers, following or both from a public Instagram account, and create a csv or exce

Lakhdar Belkharroubi 5 Oct 17, 2022
A spider for Universal Online Judge(UOJ) system, converting problem pages to PDFs.

Universal Online Judge Spider Introduction This is a spider for Universal Online Judge (UOJ) system (https://uoj.ac/). It also works for all other Onl

TriNitroTofu 1 Dec 07, 2021
A package designed to scrape data from Yahoo Finance.

yahoostock A package designed to scrape data from Yahoo Finance. Installation The most simple installation method is through PIP. pip install yahoosto

Rohan Singh 2 May 28, 2022
This program will help you to properly scrape all data from a specific website

This program will help you to properly scrape all data from a specific website

MD. MINHAZ 0 May 15, 2022
京东茅台抢购最新优化版本,京东茅台秒杀,优化了茅台抢购进程队列

京东茅台抢购最新优化版本,京东茅台秒杀,优化了茅台抢购进程队列

MaoTai 129 Dec 14, 2022
a Scrapy spider that utilizes Postgres as a DB, Squid as a proxy server, Redis for de-duplication and Splash to render JavaScript. All in a microservices architecture utilizing Docker and Docker Compose

This is George's Scraping Project To get started cd into the theZoo file and run: chmod +x script.sh then: ./script.sh This will spin up a Postgres co

George Reyes 7 Nov 27, 2022
Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors

Parsel Parsel is a BSD-licensed Python library to extract and remove data from HTML and XML using XPath and CSS selectors, optionally combined with re

Scrapy project 859 Dec 29, 2022
Scrape puzzle scrambles from csTimer.net

Scroodle Selenium script to scrape scrambles from csTimer.net csTimer runs locally in your browser, so this doesn't strain the servers any more than i

Jason Nguyen 1 Oct 29, 2021
Scrap the 42 Intranet's elearning videos in a single click

42intra_scraper Scrap the 42 Intranet's elearning videos in a single click. Why you would want to use it ? Adjust speed at your convenience. (The intr

Noufel 5 Oct 27, 2022
腾讯课堂,模拟登陆,获取课程信息,视频下载,视频解密。

腾讯课堂脚本 要学一些东西,但腾讯课堂不支持自定义变速,播放时有水印,且有些老师的课一遍不够看,于是这个脚本诞生了。 时间比较紧张,只会不定时修复重大bug。多线程下载之类的功能更新短期内不会有,如果你想一起完善这个脚本,欢迎pr 2020.5.22测试可用 使用方法 很简单,三部完成 下载代码,

163 Dec 30, 2022
一款利用Python来自动获取QQ音乐上某个歌手所有歌曲歌词的爬虫软件

QQ音乐歌词爬虫 一款利用Python来自动获取QQ音乐上某个歌手所有歌曲歌词的爬虫软件,默认去除了所有演唱会(Live)版本的歌曲。 使用方法 直接运行python run.py即可,然后输入你想获取的歌手名字,然后静静等待片刻。 output目录下保存生成的歌词和歌名文件。以周杰伦为例,会生成两

Yang Wei 11 Jul 27, 2022
京东茅台抢购

截止 2021/2/1 日,该项目已无法使用! 京东:约满即止,仅限京东实名认证用户APP端抢购,2月1日10:00开始预约,2月1日12:00开始抢购(京东APP需升级至8.5.6版本及以上) 写在前面 本项目来自 huanghyw - jd_seckill,作者的项目地址我找不到了,找到了再贴上

abee 73 Dec 03, 2022
Html Content / Article Extractor, web scrapping lib in Python

Python-Goose - Article Extractor Intro Goose was originally an article extractor written in Java that has most recently (Aug2011) been converted to a

Xavier Grangier 3.8k Jan 02, 2023
Poolbooru gelscraper - a simple python script for scraping images off gelbooru pools.

poolbooru_gelscraper a simple python script for scraping images off gelbooru pools. modules required:requests_html, and os by default saves files with

savantshuia 1 Jan 02, 2022
WebScraper - A script that prints out a list of all EXTERNAL references in the HTML response to an HTTP/S request

Project A: WebScraper A script that prints out a list of all EXTERNAL references

2 Apr 26, 2022
Scrapy, a fast high-level web crawling & scraping framework for Python.

Scrapy Overview Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pag

Scrapy project 45.5k Jan 07, 2023
An Web Scraping API for MDL(My Drama List) for Python.

PyMDL An API for MyDramaList(MDL) based on webscraping for python. Description An API for MDL to make your life easier in retriving and working on dat

6 Dec 10, 2022
Raspi-scraper is a configurable python webscraper that checks raspberry pi stocks from verified sellers

Raspi-scraper is a configurable python webscraper that checks raspberry pi stocks from verified sellers.

Louie Cai 13 Oct 15, 2022